Method and system of automatic speech recognition using posterior confidence scores

US10403268B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10403268-B2
Application number: US-201615260021-A
Country: US
Kind code: B2
Filing date: Sep 8, 2016
Priority date: Sep 8, 2016
Publication date: Sep 3, 2019
Grant date: Sep 3, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system, article, and method include techniques of automatic speech recognition using posterior confidence scores.

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method of automatic speech recognition, comprising: receiving a plurality of frames of audio data, wherein individual frames of the plurality of frames each have a frame term with a frame value representing a segment of human speech, generating, by at least one processor, activation values comprising inputting individual frame terms into an acoustic model and resulting in generation of a plurality of activation values for each frame term and generated at individual acoustic model states, wherein a plurality of activation values from a plurality of different frame terms is used to form a single phoneme acoustic score to be input to a language model; selecting, by at least one processor, those of the frame terms and those of the acoustic model states that are associated with a candidate utterance that has already been output by the language model and determined and selected after the already output utterance was formed; and generating, by at least one processor, confidence scores indicating whether the utterance is likely to be the language in the frame terms associated with the already output utterance and comprising generating a plurality of probabilities each formed by using at least one of the activation values of the selected frame terms and selected acoustic model states, and combining the probabilities; wherein generating a plurality of probabilities comprises at least determining a combination of versions of: (1) a current activation value of a current one of the frame terms; and (2) a difference between the current activation value and a maximum activation value among multiple activation values of multiple acoustic model states of the acoustic model receiving the frame value of the current frame term as an input.

2. The method of claim 1 comprising generating one of the plurality of probabilities for each frame term of the utterance.

3. The method of claim 1 comprising omitting the generation of probabilities, or omitting the probability in the combining of probabilities of frame terms, associated with silence or non-speech audio.

4. The method of claim 1 wherein generating a plurality of probabilities comprises generating the probability of a single acoustic model state given a single frame term by using a sum of a version of the activation values of the multiple acoustic model states for the single frame term, and repeated for multiple probabilities.

5. The method of claim 1 wherein generating the probability comprises a softmax normalization.

6. The method of claim 1 wherein determining the confidence score for a single utterance comprises using multiple frame terms each associated with multiple probabilities.

7. The method of claim 1 wherein the utterances are only those utterances that are located on a language model output word lattice.

8. The method of claim 1 comprising: obtaining the order of phonemes of the utterance from at least one language model; determining the association of frame terms to the phonemes of the utterance; and using frame values each corresponding to a frame term in an order to perform the generating of probabilities depending on the associations between the frame terms and the phonemes.

9. The method of claim 1 wherein the confidence score of the utterance is an average per frame probability related value.

10. The method of claim 1 wherein the confidence score of the utterance is an average per phoneme probability related value.

11. The method of claim 1 wherein the values of full phoneme acoustic scores that are associated with multiple frame terms are not used directly or indirectly to determine the confidence score; and wherein output score values of the language model are not used in equations to compute the confidence scores.

12. The method of claim 1 wherein generating a plurality of probabilities comprises determining the log of individual probabilities comprising combining: (1) an activation value of the individual acoustic model state of the acoustic model receiving an input of a frame value of a current frame term; (2) a log of the sum of the log's base to the exponent being the difference between the activation value and a maximum activation value among multiple activation values of multiple acoustic model states of the acoustic model receiving the frame value of the current frame term as an input, and summed for each of the acoustic model states in the multiple acoustic model states of the acoustic model of the current frame term; and (3) the maximum activation value.

13. The method of claim 1 wherein generating a plurality of probabilities comprises determining a combination of versions of: (1) the current activation value; (2) the difference between the activation value and the maximum activation value; and (3) the maximum activation value.

14. The method of claim 1, wherein the probabilities and confidence scores are determined separately from the determination of which candidate utterances are output from the language model so that the probabilities and confidence scores do not contribute directly or indirectly to the outputting of candidate utterances from the language model.

15. A computer-implemented system of automatic speech recognition comprising: at least one acoustic signal receiving unit to receive an audio signal to be divided into a plurality of frames of audio data, wherein individual frames of the plurality of frames each have a frame term with a frame value that is a segment of human speech wherein one or more frame terms form a single phoneme; at least one processor communicatively connected to the acoustic signal receiving unit; at least one memory communicatively coupled to the at least one processor; and a posterior confidence score unit operated by the at least one processor and to operate by: generating, by at least one processor, activation values comprising inputting individual frame terms into an acoustic model and resulting in generation of a plurality of activation values for each frame term and generated at the individual acoustic model states, wherein a plurality of activation values from a plurality of different frame terms is used to form a single phoneme acoustic score to be input to a language model; selecting, by at least one processor, those of the frame terms and those of the acoustic model states that are associated with a candidate utterance that has already been output by the language model and determined and selected after the already output utterance was formed; and generating, by at least one processor, confidence scores indicating whether the utterance is likely to be the language in the frame terms associated with the already output utterance and comprising generating a plurality of probabilities each formed by using at least one of the activation values of the selected frame terms and selected acoustic model states, and combining the probabilities wherein generating a plurality of probabilities comprises at least determining a combination of versions of: (1) a current activation value of a current one of the frame terms; and (2) a difference between the current activation value and a maximum activation value among multiple activation values of multiple acoustic model states of the acoustic model receiving the frame value of the current frame term as an input.

16. The system of claim 15 wherein the confidence score is at least one of: an average probability per frame associated with the utterance, a value determined by initially determining an average per-frame probability of individual phonemes associated with the
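
Illustrative sketch (not part of the patent text). Claims 5 and 12 describe a softmax posterior computed in the log domain from three terms: the state's activation, the log of a sum of exponentials of (activation minus maximum), and the maximum itself. Claims 3 and 9 describe skipping silence frames and averaging per frame. The Python below is one possible reading of those claims, assuming a natural-log base and a precomputed frame-to-state alignment. The names log_posterior, utterance_confidence, and silence_states are hypothetical and do not come from the patent.

```python
import math
from typing import Iterable, Sequence, Tuple


def log_posterior(activations: Sequence[float], state: int) -> float:
    """Log softmax probability of one acoustic-model state for one frame.

    Assumed reading of claim 12: log p_i = a_i - log(sum_j exp(a_j - a_max)) - a_max.
    """
    a_max = max(activations)
    # Every exponent (a_j - a_max) is <= 0, so exp() cannot overflow.
    log_sum = math.log(sum(math.exp(a - a_max) for a in activations))
    return activations[state] - log_sum - a_max


def utterance_confidence(
    frames: Iterable[Tuple[Sequence[float], int]],
    silence_states: frozenset = frozenset(),
) -> float:
    """Average per-frame log posterior over the frames aligned to an utterance.

    Each item in `frames` is (activations for that frame, aligned state index).
    Frames aligned to a silence state are left out of the average (claim 3).
    """
    logs = [
        log_posterior(acts, state)
        for acts, state in frames
        if state not in silence_states
    ]
    if not logs:
        raise ValueError("no speech frames to score")
    return sum(logs) / len(logs)


if __name__ == "__main__":
    # Made-up activations for three frames of a three-state model.
    frames = [([2.0, 0.5, -1.0], 0), ([0.1, 3.0, 0.2], 1), ([5.0, 0.0, 0.0], 2)]
    score = utterance_confidence(frames, silence_states=frozenset({2}))
    # exp(score) is the geometric mean of the per-frame posteriors.
    print(score, math.exp(score))
```

Subtracting the maximum activation before exponentiating is the usual way to keep a softmax numerically stable. In this sketch it is why large activations (for example, values near 1000) still produce finite log probabilities.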

Assignees

Intel IP Corp

Inventors

Classifications

  • for discriminating voice from noise · CPC title

  • Speech classification or search · CPC title

  • G10L15/142 · Primary

    Hidden Markov Models [HMMs] · CPC title

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10403268B2 cover?
A system, article, and method include techniques of automatic speech recognition using posterior confidence scores.
Who is the assignee on this patent?
Intel IP Corp
What technology area does this patent fall under?
Primary CPC classification G10L15/142. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 3, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).