Automatic speech recognition
US12354597B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12354597-B2 |
| Application number | US-202217822673-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 26, 2022 |
| Priority date | Oct 6, 2021 |
| Publication date | Jul 8, 2025 |
| Grant date | Jul 8, 2025 |
Official abstract text for this publication.
A method includes receiving a sequence of acoustic frames characterizing one or more utterances. At each of a plurality of output steps, the method also includes generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame of the sequence of acoustic frames, generating, by a prediction network of the speech recognition model, a hidden representation for a corresponding sequence of non-blank symbols output by a final softmax layer of the speech recognition model, and generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the dense representation generated by the prediction network, a probability distribution that the corresponding time step corresponds to a pause and an end of speech.
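The joint-network step in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the three-way label layout (`none`, `pause`, `end_of_speech`), the additive `tanh` fusion, the layer sizes, the random weights, and the reading of "satisfies a threshold" as `>=`. The abstract speaks of a pause where the claims speak of a disfluency; the sketch uses the abstract's wording.

```python
import numpy as np

# Hypothetical label layout for the first joint network's output.
LABELS = ("none", "pause", "end_of_speech")


def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def init_params(rng, enc_dim, pred_dim, joint_dim, out_dim):
    """Random weights for one joint network (illustrative only)."""
    return (
        rng.standard_normal((enc_dim, joint_dim)) * 0.1,
        rng.standard_normal((pred_dim, joint_dim)) * 0.1,
        rng.standard_normal((joint_dim, out_dim)) * 0.1,
    )


def joint_network(enc_feat, pred_hidden, params):
    """Fuse the encoder feature and the prediction-network hidden
    representation, then return a probability distribution."""
    w_enc, w_pred, w_out = params
    fused = np.tanh(enc_feat @ w_enc + pred_hidden @ w_pred)
    return softmax(fused @ w_out)


def events_for_step(dist, pause_threshold=0.5, eos_threshold=0.5):
    """Compare the per-step probabilities against thresholds
    (assumed semantics: an event fires when probability >= threshold)."""
    p = dict(zip(LABELS, dist))
    return {
        "pause": bool(p["pause"] >= pause_threshold),
        "close_microphone": bool(p["end_of_speech"] >= eos_threshold),
    }


rng = np.random.default_rng(0)
enc_feat = rng.standard_normal(8)     # higher order feature for one frame
pred_hidden = rng.standard_normal(4)  # hidden representation of past symbols
first_joint = init_params(rng, 8, 4, 16, len(LABELS))
dist = joint_network(enc_feat, pred_hidden, first_joint)
print(dict(zip(LABELS, dist.round(3))), events_for_step(dist))
```

A second joint network over the recognition vocabulary could reuse `joint_network` with its own parameters and a larger `out_dim`, since both networks receive the same two inputs.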
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames characterizing one or more utterances; and at each of a plurality of time steps: generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; outputting, by a final softmax layer of the speech recognition model, a sequence of non-blank output symbols; generating, by a prediction network of the speech recognition model, a hidden representation for the sequence of non-blank output symbols; generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a first probability distribution that the corresponding time step corresponds to a disfluency and an end of speech; and generating, by a second joint network of the speech recognition model different from the first joint network that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a second probability distribution over possible speech recognition hypotheses, wherein the final softmax layer outputs, based on the second probability distribution, a next non-blank output symbol.
2. The computer-implemented method of claim 1, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the end of speech satisfies an end of speech threshold; and in response to determining that the probability that the corresponding time step corresponds to the end of speech satisfies the end of speech threshold, triggering a microphone closing event.
3. The computer-implemented method of claim 1, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the disfluency satisfies a disfluency threshold; and emitting a disfluency token at the corresponding time step based on the determining that the probability of the corresponding time step corresponds to the disfluency satisfies the disfluency threshold.
4. The computer-implemented method of claim 1, wherein the speech recognition model is trained by a two-stage training process, the two-stage training process comprising: a first stage that trains the encoder network, the prediction network, and the second joint network on a speech recognition task; and a second stage that initializes and fine-tunes the first joint network to learn how to predict pause and end of speech locations in utterances.
5. The computer-implemented method of claim 4, wherein parameters of the encoder network, the prediction network, and the second joint network are frozen during the second stage of the two-stage training process.
6. The computer-implemented method of claim 4, wherein the two-stage training process trains the speech recognition model on a plurality of transcribed training utterances having labels indicating pause and end of speech locations.
7. The computer-implemented method of claim 1, wherein the encoder network comprises a stack of self-attention blocks.
8. The computer-implemented method of claim 7, wherein the stack of self-attention blocks comprises a stack of conformer blocks or a stack of transformer blocks.
9. The computer-implemented method of claim 1, wherein generating the hidden representation for the sequence of non-blank output symbols comprises: for each non-blank output symbol in the sequence of non-blank output symbols received as input at the corresponding time step: generating, by the prediction network, using a shared embedding matrix, an embedding of the non-blank output symbol; assigning, by the prediction network, a respective position vector to the non-blank output symbol; and weighting, by the prediction network, the embedding proportional to a similarity between the embedding and the respective position vector; and generating, as output from the prediction network, a single embedding vector at the corresponding time step, the single embedding vector based on a weighted average of the weighted embeddings, the single embedding vector comprising the hidden representation.
10. The computer-implemented method of claim 9, wherein the prediction network comprises a multi-headed attention mechanism, the multi-headed attention mechanism sharing the shared embedding matrix across each head of the multi-headed attention mechanism.
11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, causes the data processing hardware to perform operations comprising: receiving a sequence of acoustic frames characterizing one or more utterances; and at each of a plurality of output steps: generating, by an encoder network of a speech recognition model, a higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames; outputting, by a final softmax layer of the speech recognition model, a sequence of non-blank output symbols; generating, by a prediction network of the speech recognition model, a hidden representation for the sequence of non-blank output symbols; generating, by a first joint network of the speech recognition model that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a first probability distribution that the corresponding time step corresponds to a disfluency and an end of speech; and generating, by a second joint network of the speech recognition model different from the first joint network that receives the higher order feature representation generated by the encoder network and the hidden representation generated by the prediction network, a second probability distribution over possible speech recognition hypotheses, wherein the final softmax layer outputs, based on the second probability distribution, a next non-blank output symbol.
12. The system of claim 11, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the end of speech satisfies an end of speech threshold; and in response to determining that the probability that the corresponding time step corresponds to the end of speech satisfies the end of speech threshold, triggering a microphone closing event.
13. The system of claim 11, wherein the operations further comprise: determining that the probability that the corresponding time step corresponds to the disfluency satisfies a disfluency threshold; and emitting a disfluency token at the corresponding time step based on the determining that the probability of the corresponding time step corresponds to the disfluency satisfies the disfluency threshold.
14. The system of claim 11, wherein the speech recognition model is trained by a two-stage training process, the two-stage training process comprising: a first stage that trains the encoder network, the prediction network, and the second joint network on a speech recognition task; and a second stage that initializes and fine-tunes the first joint network to learn how to predict pause and end of speech locations in utterances.
15. The system of claim 14, wherein parameters of the enco
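The prediction-network steps recited in claims 9 and 10 can be sketched roughly as follows. This is one possible reading, not the patented implementation: the dot product as the similarity measure, the uniform average over symbols and heads, and the array shapes are all assumptions, and the function name `prediction_hidden` is hypothetical.

```python
import numpy as np


def prediction_hidden(symbol_ids, embedding_matrix, position_vectors):
    """Return a single embedding vector for a history of non-blank symbols.

    symbol_ids:        sequence of n symbol indices
    embedding_matrix:  (vocab, d) matrix shared by every head
    position_vectors:  (heads, max_n, d), one vector per head and position
    """
    # Look up each symbol in the shared embedding matrix: (n, d).
    emb = embedding_matrix[np.asarray(symbol_ids)]
    n = emb.shape[0]

    # Assign a position vector to each symbol, per head: (heads, n, d).
    pos = position_vectors[:, :n, :]

    # Similarity between each embedding and its position vector,
    # taken here as a dot product (assumption): (heads, n).
    sim = np.einsum("nd,hnd->hn", emb, pos)

    # Weight each embedding in proportion to its similarity: (heads, n, d).
    weighted = sim[:, :, None] * emb[None, :, :]

    # Average the weighted embeddings over heads and positions: (d,).
    return weighted.mean(axis=(0, 1))


# Tiny worked example with one head and a 3-symbol vocabulary.
embedding_matrix = np.eye(3)
position_vectors = np.array([[[2.0, 0.0, 0.0], [0.0, 0.0, 4.0]]])
hidden = prediction_hidden([0, 2], embedding_matrix, position_vectors)
print(hidden)  # [1. 0. 2.]
```

Because every head indexes the same `embedding_matrix`, adding heads only adds position vectors, which is one way to read the parameter sharing described in claim 10.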
Recognition networks (G10L15/142, G10L15/16 take precedence) · CPC title
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Training · CPC title
using natural language modelling · CPC title