Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
US-2019189111-A1 · Jun 20, 2019 · US
US12254865B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12254865-B2 |
| Application number | US-202418418246-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 20, 2024 |
| Priority date | Nov 21, 2018 |
| Publication date | Mar 18, 2025 |
| Grant date | Mar 18, 2025 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method of jointly performing speech recognition and language prediction using a sequence-to-sequence speech recognition model, the method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving audio data characterizing a spoken utterance; processing, using the sequence-to-sequence speech recognition model, the audio data to generate, at each of a plurality of time steps: a probability distribution over a predetermined set of linguistic units; and a predicted language of the spoken utterance among multiple different languages the speech recognition model has been trained to recognize; and providing, as an output from the sequence-to-sequence speech recognition model, a transcription of the utterance based on the probability distribution over the predetermined set of linguistic units and the predicted language generated at each of the plurality of time steps, wherein the speech recognition model is trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.

2. The computer-implemented method of claim 1, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning to teach the sequence-to-sequence speech recognition model to learn how to jointly predict linguistic units and language from input audio data.

3. The computer-implemented method of claim 1, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning by: obtaining training data comprising: training audio data characterizing training utterances each spoken in one of a plurality of different languages; and for each training utterance, a corresponding target output label sequence corresponding to a transcription of the corresponding training utterance, wherein each target output label sequence is annotated with a special language symbol indicating the language of the corresponding training utterance; and training the sequence-to-sequence speech recognition model on the training data to learn how to jointly predict the target output label sequence and the special language symbol for each training utterance.

4. The computer-implemented method of claim 3, wherein training the speech recognition model on the training data causes the speech recognition model to: output scores indicative of special language symbols representing the multiple different languages of the utterances; and generate output sequences that include one of the special language symbols representing the multiple different languages.

5. The computer-implemented method of claim 1, wherein the sequence-to-sequence speech recognition model comprises an encoder and a decoder.

6. The computer-implemented method of claim 5, wherein the encoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

7. The computer-implemented method of claim 5, wherein the decoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

8. The computer-implemented method of claim 1, wherein the linguistic units are word pieces.

9. The computer-implemented method of claim 1, wherein the linguistic units are graphemes.

10. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware causes the data processing hardware to perform operations comprising: receiving audio data characterizing a spoken utterance; processing, using the sequence-to-sequence speech recognition model, the audio data to generate, at each of a plurality of time steps: a probability distribution over a predetermined set of linguistic units; and a predicted language of the spoken utterance among multiple different languages the speech recognition model has been trained to recognize; and providing, as an output from the sequence-to-sequence speech recognition model, a transcription of the utterance based on the probability distribution over the predetermined set of linguistic units and the predicted language generated at each of the plurality of time steps, wherein the speech recognition model is trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.

11. The system of claim 10, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning to teach the sequence-to-sequence speech recognition model to learn how to jointly predict linguistic units and language from input audio data.

12. The system of claim 10, wherein the sequence-to-sequence speech recognition model is trained using multi-task learning by: obtaining training data comprising: training audio data characterizing training utterances each spoken in one of a plurality of different languages; and for each training utterance, a corresponding target output label sequence corresponding to a transcription of the corresponding training utterance, wherein each target output label sequence is annotated with a special language symbol indicating the language of the corresponding training utterance; and training the sequence-to-sequence speech recognition model on the training data to learn how to jointly predict the target output label sequence and the special language symbol for each training utterance.

13. The system of claim 12, wherein training the speech recognition model on the training data causes the speech recognition model to: output scores indicative of special language symbols representing the multiple different languages of the utterances; and generate output sequences that include one of the special language symbols representing the multiple different languages.

14. The system of claim 10, wherein the sequence-to-sequence speech recognition model comprises an encoder and a decoder.

15. The system of claim 14, wherein the encoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

16. The system of claim 14, wherein the decoder comprises one or more neural network layers that have parameters learned through training using training examples representing speech in the multiple different languages.

17. The system of claim 10, wherein the linguistic units are word pieces.

18. The system of claim 10, wherein the linguistic units are graphemes.
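For readers who find the claim language abstract, the two training ideas in claims 1 and 3 can be illustrated with a minimal Python sketch. It is not the patentee's implementation. The function names, the choice to prepend the language symbol, the per-step averaging, the `(1 - w, w)` weighting form, and the default weight are all assumptions made for illustration; the claims only require that a language symbol annotates the target sequence and that the two objective functions are weighted.

```python
import math


def annotate_with_language(graphemes, language_symbol):
    """Return a target label sequence annotated with a language symbol.

    Assumption: the symbol is prepended. The claims do not fix its position.
    """
    return [language_symbol] + list(graphemes)


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the target class."""
    return -math.log(probs[target_index])


def multitask_loss(grapheme_probs, grapheme_targets,
                   language_probs, language_target,
                   language_weight=0.1):
    """Weighted sum of a grapheme loss and a language-classification loss.

    grapheme_probs: one distribution over linguistic units per time step.
    language_probs: one distribution over languages per time step.
    language_weight: hypothetical interpolation weight in [0, 1].
    """
    steps = len(grapheme_targets)
    # First objective: grapheme prediction, averaged over time steps.
    grapheme_loss = sum(
        cross_entropy(p, t) for p, t in zip(grapheme_probs, grapheme_targets)
    ) / steps
    # Second objective: language classification, averaged over time steps.
    language_loss = sum(
        cross_entropy(p, language_target) for p in language_probs
    ) / len(language_probs)
    return (1.0 - language_weight) * grapheme_loss + language_weight * language_loss


# Toy usage with made-up numbers (two time steps, two units, two languages).
target = annotate_with_language(["h", "i"], "<en>")
loss = multitask_loss(
    grapheme_probs=[[0.5, 0.5], [0.25, 0.75]],
    grapheme_targets=[0, 1],
    language_probs=[[0.5, 0.5], [0.5, 0.5]],
    language_target=0,
    language_weight=0.5,
)
```

Setting `language_weight` to 0 reduces the sketch to plain grapheme training, while larger values push the shared hidden representations toward language discrimination, which is the trade-off the weighting in claim 1 refers to.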
to the speaker · CPC title
using artificial neural networks · CPC title
Creating reference templates; Clustering · CPC title
Training · CPC title
Language recognition · CPC title