Compressed finite state transducers for automatic speech recognition
US-9865254-B1 · Jan 9, 2018 · US
US10403268B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10403268-B2 |
| Application number | US-201615260021-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 8, 2016 |
| Priority date | Sep 8, 2016 |
| Publication date | Sep 3, 2019 |
| Grant date | Sep 3, 2019 |
Official abstract text for this publication.
A system, article, and method include techniques of automatic speech recognition using posterior confidence scores.
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method of automatic speech recognition, comprising: receiving a plurality of frames of audio data, wherein individual frames of the plurality of frames each have a frame term with a frame value representing a segment of human speech, generating, by at least one processor, activation values comprising inputting individual frame terms into an acoustic model and resulting in generation of a plurality of activation values for each frame term and generated at individual acoustic model states, wherein a plurality of activation values from a plurality of different frame terms is used to form a single phoneme acoustic score to be input to a language model; selecting, by at least one processor, those of the frame terms and those of the acoustic model states that are associated with a candidate utterance that has already been output by the language model and determined and selected after the already output utterance was formed; and generating, by at least one processor, confidence scores indicating whether the utterance is likely to be the language in the frame terms associated with the already output utterance and comprising generating a plurality of probabilities each formed by using at least one of the activation values of the selected frame terms and selected acoustic model states, and combining the probabilities; wherein generating a plurality of probabilities comprises at least determining a combination of versions of: (1) a current activation value of a current one of the frame terms; and (2) a difference between the current activation value and a maximum activation value among multiple activation values of multiple acoustic model states of the acoustic model receiving the frame value of the current frame term as an input.
2. The method of claim 1 comprising generating one of the plurality of probabilities for each frame term of the utterance.
3. The method of claim 1 comprising omitting the generation of probabilities, or omitting the probability in the combining of probabilities of frame terms, associated with silence or non-speech audio.
4. The method of claim 1 wherein generating a plurality of probabilities comprises generating the probability of a single acoustic model state given a single frame term by using a sum of a version of the activation values of the multiple acoustic model states for the single frame term, and repeated for multiple probabilities.
5. The method of claim 1 wherein generating the probability comprises a softmax normalization.
6. The method of claim 1 wherein determining the confidence score for a single utterance comprises using multiple frame terms each associated with multiple probabilities.
7. The method of claim 1 wherein the utterances are only those utterances that are located on a language model output word lattice.
8. The method of claim 1 comprising: obtaining the order of phonemes of the utterance from at least one language model; determining the association of frame terms to the phonemes of the utterance; and using frame values each corresponding to a frame term in an order to perform the generating of probabilities depending on the associations between the frame terms and the phonemes.
9. The method of claim 1 wherein the confidence score of the utterance is an average per frame probability related value.
10. The method of claim 1 wherein the confidence score of the utterance is an average per phoneme probability related value.
11. The method of claim 1 wherein the values of full phoneme acoustic scores that are associated with multiple frame terms are not used directly or indirectly to determine the confidence score; and wherein output score values of the language model are not used in equations to compute the confidence scores.
12. The method of claim 1 wherein generating a plurality of probabilities comprises determining the log of individual probabilities comprising combining: (1) an activation value of the individual acoustic model state of the acoustic model receiving an input of a frame value of a current frame term; (2) a log of the sum of the log's base to the exponent being the difference between the activation value and a maximum activation value among multiple activation values of multiple acoustic model states of the acoustic model receiving the frame value of the current frame term as an input, and summed for each of the acoustic model states in the multiple acoustic model states of the acoustic model of the current frame term; and (3) the maximum activation value.
13. The method of claim 1 wherein generating a plurality of probabilities comprises determining a combination of versions of: (1) the current activation value; (2) the difference between the activation value and the maximum activation value; and (3) the maximum activation value.
14. The method of claim 1, wherein the probabilities and confidence scores are determined separately from the determination of which candidate utterances are output from the language model so that the probabilities and confidence scores do not contribute directly or indirectly to the outputting of candidate utterances from the language model.
15. A computer-implemented system of automatic speech recognition comprising: at least one acoustic signal receiving unit to receive an audio signal to be divided into a plurality of frames of audio data, wherein individual frames of the plurality of frames each have a frame term with a frame value that is a segment of human speech wherein one or more frame terms form a single phoneme; at least one processor communicatively connected to the acoustic signal receiving unit; at least one memory communicatively coupled to the at least one processor; and a posterior confidence score unit operated by the at least one processor and to operate by: generating, by at least one processor, activation values comprising inputting individual frame terms into an acoustic model and resulting in generation of a plurality of activation values for each frame term and generated at the individual acoustic model states, wherein a plurality of activation values from a plurality of different frame terms is used to form a single phoneme acoustic score to be input to a language model; selecting, by at least one processor, those of the frame terms and those of the acoustic model states that are associated with a candidate utterance that has already been output by the language model and determined and selected after the already output utterance was formed; and generating, by at least one processor, confidence scores indicating whether the utterance is likely to be the language in the frame terms associated with the already output utterance and comprising generating a plurality of probabilities each formed by using at least one of the activation values of the selected frame terms and selected acoustic model states, and combining the probabilities wherein generating a plurality of probabilities comprises at least determining a combination of versions of: (1) a current activation value of a current one of the frame terms; and (2) a difference between the current activation value and a maximum activation value among multiple activation values of multiple acoustic model states of the acoustic model receiving the frame value of the current frame term as an input.
16. The system of claim 15 wherein the confidence score is at least one of: an average probability per frame associated with the utterance, a value determined by initially determining an average per-frame probability of individual phonemes associated with the
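
The log-probability terms listed in claims 12 and 13, the softmax normalization of claim 5, the silence handling of claim 3, and the per-frame average of claim 9 can be illustrated with a minimal Python sketch. This is not the patent's implementation. It assumes the three terms combine as in a standard numerically stable log-softmax (activation minus maximum minus log of the summed exponentials of the differences), uses the natural logarithm, and introduces hypothetical names (`log_posterior`, `utterance_confidence`, `skip_states`) that do not appear in the source.

```python
import math

def log_posterior(activations, state):
    """Log softmax posterior of one acoustic model state for one frame.

    Assumed combination of the claimed terms:
    a[state] - a_max - log(sum_j exp(a[j] - a_max)).
    Subtracting the maximum first keeps exp() from overflowing.
    """
    a_max = max(activations)
    log_sum = math.log(sum(math.exp(a - a_max) for a in activations))
    return activations[state] - a_max - log_sum

def utterance_confidence(frames, skip_states=frozenset()):
    """Average per-frame log posterior over frames aligned to an utterance.

    frames: list of (activations, aligned_state) pairs, one per frame.
    skip_states: hypothetical set of state indices treated as silence or
    non-speech; frames aligned to them are left out of the average.
    """
    scores = [log_posterior(acts, s) for acts, s in frames
              if s not in skip_states]
    if not scores:
        return float("-inf")  # no speech frames to score
    return sum(scores) / len(scores)

# Toy usage: three frames, three states each; state 2 plays the role of silence.
frames = [
    ([2.0, 0.5, -1.0], 0),
    ([1000.0, 1000.0, 0.0], 1),  # large activations, still stable
    ([0.0, 0.0, 5.0], 2),        # silence frame, skipped below
]
confidence = utterance_confidence(frames, skip_states={2})
```

Because the score is computed only from the acoustic model activations of the frames aligned to an already decoded utterance, it stays independent of the language model scores, which matches the separation described in claims 11 and 14. A per-phoneme average (claim 10) would group the per-frame values by phoneme before averaging.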
for discriminating voice from noise · CPC title
Speech classification or search · CPC title
Hidden Markov Models [HMMs] · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title