US10424289B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10424289-B2 |
| Application number | US-201816103251-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 14, 2018 |
| Priority date | Nov 29, 2012 |
| Publication date | Sep 24, 2019 |
| Grant date | Sep 24, 2019 |
Official abstract text for this publication.
A speech recognition system includes a phone classifier and a boundary classifier. The phone classifier generates combined boundary posteriors from a combination of auditory attention features and phone posteriors by feeding phone posteriors of neighboring frames of an audio signal into a machine learning algorithm to classify phone posterior context information. The boundary classifier estimates boundaries in speech contained in the audio signal from the combined boundary posteriors.
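The abstract's step of feeding phone posteriors of neighboring frames into a classifier can be pictured as building one context vector per frame. The following is a minimal sketch, not the patented implementation: the function name `stack_context`, the `context` parameter, and the edge handling (repeating the first or last frame) are assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List


def stack_context(posteriors: List[List[float]], context: int = 2) -> List[List[float]]:
    """Concatenate each frame's phone posteriors with those of its neighbors.

    `posteriors[t]` holds the phone posterior vector for frame t.
    Frames beyond the signal edges are filled by repeating the first or
    last frame (an assumption; the source does not specify edge handling).
    """
    n = len(posteriors)
    stacked = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index to a valid frame
            window.extend(posteriors[idx])
        stacked.append(window)
    return stacked


# Toy example: three frames, two phone classes, one neighbor on each side.
frames = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
stacked = stack_context(frames, context=1)
```

Each stacked vector would then serve as the input to whatever machine learning algorithm classifies the phone posterior context; the choice of that model is left open here.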
Opening claim text (preview).
What is claimed is:
1. A speech recognition system, comprising: a phone classifier configured to generate combined boundary posteriors from a combination of auditory attention features and phone posteriors by feeding phone posteriors of neighboring frames of an audio signal into a machine learning algorithm to classify phone posterior context information; and a boundary classifier configured to estimate boundaries in speech contained in the audio signal from the combined boundary posteriors.
2. The system of claim 1, wherein the phone classifier is configured to feed both auditory attention features and the phone posteriors into a machine learning algorithm of a boundary classifier to output the combined boundary posteriors.
3. The system of claim 2, wherein the machine learning algorithm of the boundary classifier is a three layer neural network.
4. The system of claim 1, wherein the boundary classifier includes first and second boundary classifiers, wherein the phone classifier is configured to feed auditory attention features into a machine learning algorithm of the first boundary classifier to output a first set of boundary posteriors; and feed the phone posteriors into a machine learning algorithm of the second boundary classifier to output a second set of boundary posteriors.
5. The system of claim 4, wherein the machine learning algorithm of the first boundary classifier is a three layer neural network, and the machine learning algorithm of the second boundary classifier is a three layer neural network.
6. The system of claim 4, further comprising a calibration unit configured to calibrate the first set of boundary posteriors and the second set of boundary posteriors to determine relative weights for the first set of boundary posteriors and a second set of boundary posteriors; and assign the relative weights to the first set of boundary posteriors and the second set of boundary posteriors to output the combined boundary posteriors.
7. The system of claim 6, wherein said calibration unit determines the relative weights using regression.
8. The method of claim 6, wherein said calibration unit determines the relative weights using machine learning.
9. The system of claim 1, further comprising one or more extracting units configured to extract acoustic features from each of the frames of the audio signal; and feed the acoustic features into a machine learning algorithm of a phone classifier to output the phone posteriors.
10. The system of claim 9, wherein the machine learning algorithm of the phone classifier is a deep belief network.
11. The system of claim 9, wherein the one or more extracting units extract log-Mel spectrum features.
12. The system of claim 9, wherein the one or more extraction units are configured to extract auditory attention features from each of the frames of the audio signal by: determining an auditory spectrum for an input window of the audio signal; extracting one or more multi-scale features from the auditory spectrum, wherein each multi-scale feature is extracted using a separate two-dimensional spectro-temporal receptive filter; generating one or more feature maps corresponding to the one or more multi-scale features; extracting an auditory gist vector from each of the one or more feature maps; obtaining a cumulative gist vector through augmentation of each auditory gist vector extracted from the one or more feature maps; and generating the auditory attention features from the cumulative gist vector.
13. The system of claim 1, further comprising a window unit configured to generate an input window of the audio signal by digitally sampling the audio signal for a segment of time corresponding to the input window.
14. The system of claim 1 wherein: the input window of the audio signal is generated from acoustic waves.
15. The system of claim 1, wherein the estimated boundaries are syllable boundaries, vowel boundaries, phoneme boundaries, or a combination thereof.
16. The system of claim 1, further comprising which a microphone converts acoustic waves that characterize an input window of sound into an electric signal to generate the audio signal.
17. The system of claim 1, wherein the machine learning algorithm includes a neural network, support vector network (svn), Hidden Markov Model (HMM), or deep belief network (DBN).
18. The system of claim 1, wherein the boundary classifier includes a machine learning algorithm configured to classify boundaries.
19. The system of claim 18, wherein the machine learning algorithm included in the boundary classifier includes a neural network, nearest neighbor classifier, or decision tree.
20. The system of claim 1, comprising: a processor; a memory; and computer coded instructions embodied in the memory and executable by the processor, wherein the computer coded instructions are configured to perform the functions of the phone classifier and boundary classifier when executed.
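Claims 4, 6, and 7 describe two sets of boundary posteriors whose relative weights are determined by regression and then used to produce combined boundary posteriors. The sketch below illustrates one way this could work under stated assumptions: plain two-variable least squares stands in for the unspecified regression, and a thresholded local-maximum rule stands in for the boundary classifier. The function names (`fit_weights`, `combine`, `pick_boundaries`), the threshold value, and the toy numbers are all hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def fit_weights(p1: List[float], p2: List[float], labels: List[float]) -> Tuple[float, float]:
    """Least-squares weights (w1, w2) minimizing sum((w1*p1 + w2*p2 - label)^2).

    Solves the 2x2 normal equations directly; falls back to equal weights
    when the system is (near-)singular. This is an assumed stand-in for the
    regression named in claim 7.
    """
    a11 = sum(x * x for x in p1)
    a12 = sum(x * y for x, y in zip(p1, p2))
    a22 = sum(y * y for y in p2)
    b1 = sum(x * t for x, t in zip(p1, labels))
    b2 = sum(y * t for y, t in zip(p2, labels))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        return 0.5, 0.5
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def combine(p1: List[float], p2: List[float], w1: float, w2: float) -> List[float]:
    """Weighted sum of the two boundary posterior sets, frame by frame."""
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]


def pick_boundaries(combined: List[float], threshold: float = 0.5) -> List[int]:
    """Return frame indices that are local maxima at or above the threshold."""
    found = []
    for t, v in enumerate(combined):
        left = combined[t - 1] if t > 0 else float("-inf")
        right = combined[t + 1] if t + 1 < len(combined) else float("-inf")
        if v >= threshold and v >= left and v > right:
            found.append(t)
    return found


# Toy data: per-frame boundary posteriors from the two classifiers,
# plus reference labels (1 = boundary frame) used only for calibration.
p_attention = [0.1, 0.8, 0.2, 0.1, 0.7, 0.1]
p_phone = [0.2, 0.9, 0.1, 0.2, 0.6, 0.2]
labels = [0, 1, 0, 0, 1, 0]

w1, w2 = fit_weights(p_attention, p_phone, labels)
combined = combine(p_attention, p_phone, w1, w2)
boundaries = pick_boundaries(combined)
```

Unconstrained least squares can yield a negative weight when the two posterior sets are strongly correlated, as they are in this toy data; a practical calibration unit might constrain the weights to be non-negative or to sum to one, which the claims leave open.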