Speech recognition system using machine learning to classify phone posterior context information and estimate boundaries in speech from combined boundary posteriors

US10424289B2 · US · B2

Patent metadata
Publication number: US-10424289-B2
Application number: US-201816103251-A
Country: US
Kind code: B2
Filing date: Aug 14, 2018
Priority date: Nov 29, 2012
Publication date: Sep 24, 2019
Grant date: Sep 24, 2019

Abstract

Official abstract text for this publication.

A speech recognition system includes a phone classifier and a boundary classifier. The phone classifier generates combined boundary posteriors from a combination of auditory attention features and phone posteriors by feeding phone posteriors of neighboring frames of an audio signal into a machine learning algorithm to classify phone posterior context information. The boundary classifier estimates boundaries in speech contained in the audio signal from the combined boundary posteriors.
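As an illustration of the two ideas in the abstract, the sketch below stacks the phone posteriors of neighboring frames into one context vector per frame, then picks boundary frames from a sequence of combined boundary posteriors. It is a minimal sketch under stated assumptions, not the patented implementation: the function names, the edge handling, the context radius, and the threshold-plus-local-maximum rule are all hypothetical choices, and the machine learning algorithm that would map context vectors to boundary posteriors is left out.

```python
from typing import List, Sequence


def stack_context(posteriors: Sequence[Sequence[float]], radius: int) -> List[List[float]]:
    """Concatenate each frame's phone posteriors with those of its neighbors.

    Assumption: frames beyond the signal edges are filled by repeating the
    first or last frame. The source does not say how edges are handled.
    """
    n = len(posteriors)
    stacked: List[List[float]] = []
    for t in range(n):
        row: List[float] = []
        for offset in range(-radius, radius + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index at the edges
            row.extend(posteriors[idx])
        stacked.append(row)
    return stacked


def pick_boundaries(boundary_posteriors: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return frame indices that are local maxima at or above the threshold.

    Assumption: peak picking is one plausible way to turn posteriors into
    boundary estimates; the source does not prescribe this rule.
    """
    picks: List[int] = []
    last = len(boundary_posteriors) - 1
    for t, p in enumerate(boundary_posteriors):
        left = boundary_posteriors[t - 1] if t > 0 else float("-inf")
        right = boundary_posteriors[t + 1] if t < last else float("-inf")
        # Strict on the left, non-strict on the right: a flat peak is
        # reported once, at its first frame.
        if p >= threshold and p > left and p >= right:
            picks.append(t)
    return picks
```

With a radius of 1, each context vector holds three frames' worth of posteriors, so a classifier trained on it can see how the phone distribution changes across a candidate boundary rather than only its value at one frame.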

First claim

Opening claim text (preview).

What is claimed is:

1. A speech recognition system, comprising: a phone classifier configured to generate combined boundary posteriors from a combination of auditory attention features and phone posteriors by feeding phone posteriors of neighboring frames of an audio signal into a machine learning algorithm to classify phone posterior context information; and a boundary classifier configured to estimate boundaries in speech contained in the audio signal from the combined boundary posteriors.

2. The system of claim 1, wherein the phone classifier is configured to feed both auditory attention features and the phone posteriors into a machine learning algorithm of a boundary classifier to output the combined boundary posteriors.

3. The system of claim 2, wherein the machine learning algorithm of the boundary classifier is a three layer neural network.

4. The system of claim 1, wherein the boundary classifier includes first and second boundary classifiers, wherein the phone classifier is configured to feed auditory attention features into a machine learning algorithm of the first boundary classifier to output a first set of boundary posteriors; and feed the phone posteriors into a machine learning algorithm of the second boundary classifier to output a second set of boundary posteriors.

5. The system of claim 4, wherein the machine learning algorithm of the first boundary classifier is a three layer neural network, and the machine learning algorithm of the second boundary classifier is a three layer neural network.

6. The system of claim 4, further comprising a calibration unit configured to calibrate the first set of boundary posteriors and the second set of boundary posteriors to determine relative weights for the first set of boundary posteriors and a second set of boundary posteriors; and assign the relative weights to the first set of boundary posteriors and the second set of boundary posteriors to output the combined boundary posteriors.

7. The system of claim 6, wherein said calibration unit determines the relative weights using regression.

8. The method of claim 6, wherein said calibration unit determines the relative weights using machine learning.

9. The system of claim 1, further comprising one or more extracting units configured to extract acoustic features from each of the frames of the audio signal; and feed the acoustic features into a machine learning algorithm of a phone classifier to output the phone posteriors.

10. The system of claim 9, wherein the machine learning algorithm of the phone classifier is a deep belief network.

11. The system of claim 9, wherein the one or more extracting units extract log-Mel spectrum features.

12. The system of claim 9, wherein the one or more extraction units are configured to extract auditory attention features from each of the frames of the audio signal by: determining an auditory spectrum for an input window of the audio signal; extracting one or more multi-scale features from the auditory spectrum, wherein each multi-scale feature is extracted using a separate two-dimensional spectro-temporal receptive filter; generating one or more feature maps corresponding to the one or more multi-scale features; extracting an auditory gist vector from each of the one or more feature maps; obtaining a cumulative gist vector through augmentation of each auditory gist vector extracted from the one or more feature maps; and generating the auditory attention features from the cumulative gist vector.

13. The system of claim 1, further comprising a window unit configured to generate an input window of the audio signal by digitally sampling the audio signal for a segment of time corresponding to the input window.

14. The system of claim 1 wherein: the input window of the audio signal is generated from acoustic waves.

15. The system of claim 1, wherein the estimated boundaries are syllable boundaries, vowel boundaries, phoneme boundaries, or a combination thereof.

16. The system of claim 1, further comprising which a microphone converts acoustic waves that characterize an input window of sound into an electric signal to generate the audio signal.

17. The system of claim 1, wherein the machine learning algorithm includes a neural network, support vector network (svn), Hidden Markov Model (HMM), or deep belief network (DBN).

18. The system of claim 1, wherein the boundary classifier includes a machine learning algorithm configured to classify boundaries.

19. The system of claim 18, wherein the machine learning algorithm included in the boundary classifier includes a neural network, nearest neighbor classifier, or decision tree.

20. The system of claim 1, comprising: a processor; a memory; and computer coded instructions embodied in the memory and executable by the processor, wherein the computer coded instructions are configured to perform the functions of the phone classifier and boundary classifier when executed.
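Claims 6 and 7 describe a calibration unit that uses regression to determine relative weights for two sets of boundary posteriors and then combines them. One way this could look is sketched below, assuming ordinary least squares without an intercept, fitted against reference boundary labels. The claims name only "regression", so the specific regression form, the function names, and the clamping of the combined score to [0, 1] are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def fit_relative_weights(
    first: Sequence[float], second: Sequence[float], labels: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares weights (w1, w2) so that w1*first + w2*second ~ labels.

    Solves the 2x2 normal equations directly. Assumption: no intercept term,
    and the two posterior streams are not perfectly collinear.
    """
    saa = sum(a * a for a in first)
    sbb = sum(b * b for b in second)
    sab = sum(a * b for a, b in zip(first, second))
    say = sum(a * y for a, y in zip(first, labels))
    sby = sum(b * y for b, y in zip(second, labels))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("posterior streams are collinear; weights are not unique")
    w1 = (say * sbb - sby * sab) / det
    w2 = (sby * saa - say * sab) / det
    return w1, w2


def combine_posteriors(
    first: Sequence[float], second: Sequence[float], w1: float, w2: float
) -> List[float]:
    """Weighted sum of the two streams, clamped to [0, 1] (an assumption)."""
    return [min(1.0, max(0.0, w1 * a + w2 * b)) for a, b in zip(first, second)]


# Hypothetical toy data: per-frame posteriors from two boundary classifiers
# and reference labels that happen to equal 0.25*first + 0.75*second.
first_set = [0.0, 1.0, 0.0, 1.0]
second_set = [0.0, 0.0, 1.0, 1.0]
reference = [0.0, 0.25, 0.75, 1.0]

w1, w2 = fit_relative_weights(first_set, second_set, reference)
combined = combine_posteriors(first_set, second_set, w1, w2)
```

On this toy data the fit recovers weights of about 0.25 and 0.75, so the second stream dominates the combined boundary posteriors. Claim 8 covers determining the weights with machine learning instead, which this sketch does not attempt.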

Assignees

Sony Interactive Entertainment Inc

Inventors

Classifications

  • characterised by the type of extracted parameters · CPC title

  • G10L15/04 (Primary)

    Segmentation; Word boundary detection · CPC title

  • using neural networks · CPC title

  • using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10424289B2 cover?
A speech recognition system includes a phone classifier and a boundary classifier. The phone classifier generates combined boundary posteriors from a combination of auditory attention features and phone posteriors by feeding phone posteriors of neighboring frames of an audio signal into a machine learning algorithm to classify phone posterior context information. The boundary classifier estimat…
Who is the assignee on this patent?
Sony Interactive Entertainment Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/04. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 24, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).