US11646011B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11646011-B2 |
| Application number | US-202217846287-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 22, 2022 |
| Priority date | Nov 28, 2018 |
| Publication date | May 9, 2023 |
| Grant date | May 9, 2023 |
Official abstract text for this publication.
Methods and systems for training and/or using a language selection model for use in determining a particular language of a spoken utterance captured in audio data. Features of the audio data can be processed using the trained language selection model to generate a predicted probability for each of N different languages, and a particular language selected based on the generated probabilities. Speech recognition results for the particular language can be utilized responsive to selecting the particular language of the spoken utterance. Many implementations are directed to training the language selection model utilizing tuple losses in lieu of traditional cross-entropy losses. Training the language selection model utilizing the tuple losses can result in more efficient training and/or can result in a more accurate and/or robust model—thereby mitigating erroneous language selections for spoken utterances.
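The abstract contrasts tuple losses with cross-entropy losses but does not define either. As a rough sketch only, and assuming a pairwise form in which the labeled language's score is compared against each other language's score in turn, the two losses could look like the following. The function names and the pairwise form are assumptions made for illustration, not the patent's definition.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cross_entropy_loss(logits, label):
    """Standard softmax cross-entropy over all N language scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def pairwise_tuple_loss(logits, label):
    """Assumed pairwise tuple loss (hypothetical form).

    Averages, over every other language k, the two-way softmax loss
    -log(exp(z_label) / (exp(z_label) + exp(z_k))),
    which equals softplus(z_k - z_label).
    """
    others = [z for k, z in enumerate(logits) if k != label]
    if not others:
        raise ValueError("need at least two languages")
    return sum(softplus(z - logits[label]) for z in others) / len(others)


# Hypothetical scores for three languages; index 0 is the labeled language.
scores = [2.0, 0.5, -1.0]
print(cross_entropy_loss(scores, 0), pairwise_tuple_loss(scores, 0))
```

Under this assumed form, the two losses coincide when only two languages are scored. With more languages, the pairwise version evaluates the labeled language against one competitor at a time.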
Opening claim text (preview).
What is claimed is:

1. A method implemented by one or more processors, the method comprising:

generating a plurality of training examples, wherein generating each of the training examples is based on corresponding audio data that captures a corresponding human utterance, and a corresponding label that indicates a corresponding spoken language of the corresponding human utterance, the corresponding spoken language being one of N different languages to be recognized, wherein N is an integer that is greater than ten, and wherein each of the training examples comprises: corresponding training example input comprising: corresponding features of the corresponding audio data, and corresponding training example output comprising: a corresponding labeled probability metric for each of the N different languages to be recognized, wherein the corresponding labeled probability metrics include, based on the corresponding label, a corresponding positive probability metric label that corresponds to the corresponding spoken language, and a corresponding negative probability metric label for all other of the corresponding labeled probability metrics;

training a language selection model based on the training examples, training the language selection model comprising: processing the corresponding features of the corresponding training example inputs of the training examples using the language selection model to generate corresponding predicted probabilities for each of the N different languages, generating corresponding tuple losses based on the generated corresponding predicted probabilities and the corresponding labeled probability metrics, and updating weights of the language selection model to generate an updated language selection model by using the generated corresponding tuple losses; and

subsequent to training the language selection model:

receiving, via at least one microphone of a computing device, current audio data that captures a current spoken utterance from a user;

extracting one or more features of the current audio data that captures the current spoken utterance;

processing, using the updated language selection model, the one or more features of the current audio data to generate current predicted probabilities for each of the N different languages;

identifying M candidate languages for the spoken utterance based on data associated with the current audio data, the computing device, and/or the user, wherein the M candidate languages comprise two or more languages, and are a subset of the N different languages;

selecting, from the M candidate languages, a current spoken language, wherein selecting the current spoken language is based on comparison of the current predicted probabilities for the M candidate languages; and

performing speech-to-text processing of the current audio data based on the selected current spoken language.

2. The method of claim 1, further comprising: receiving, in a transmission with the current audio data, an indication of the M candidate languages, wherein identifying the M candidate languages is based on the data associated with the current audio data, and wherein the data includes the indication of the M candidate languages that is received in the transmission with the current audio data.

3. The method of claim 1, wherein identifying the M candidate languages for the spoken utterance is based on data associated with the current audio data.

4. The method of claim 3, wherein identifying the M candidate languages for the spoken utterance is further based on data associated with the computing device and/or the user.

5. The method of claim 1, wherein identifying the M candidate languages for the spoken utterance is based on data associated with the computing device.

6. The method of claim 5, wherein identifying the M candidate languages for the spoken utterance is further based on data associated with the current audio data and/or the user.

7. The method of claim 1, wherein identifying the M candidate languages for the spoken utterance is based on data associated with the user.

8. The method of claim 7, wherein identifying the M candidate languages for the spoken utterance is further based on data associated with the current audio data and/or the computing device.

9. The method of claim 1, further comprising: in response to identifying M candidate languages for the spoken utterance: initiating first speech-to-text processing of the current audio data using a first speech recognition model for a first candidate language of the M candidate languages, and initiating second speech-to-text processing of the current audio data using a second speech recognition model for a second candidate language of the M candidate languages.

10. The method of claim 9, wherein selecting the current spoken language from the M candidate languages is simultaneous with the first speech-to-text processing and the second speech-to-text processing.

11. The method of claim 10, wherein selecting the current spoken language from the M candidate languages occurs prior to completion of the first speech-to-text processing and the second speech-to-text-processing.

12. The method of claim 11, wherein performing speech-to-text processing of the current audio data based on the selected current spoken language comprises: in response to determining that the current spoken utterance is in the first candidate language: completing the first speech-to-text processing; and halting the second speech-to-text processing prior to completion of the second speech-to-text processing.

13. The method of claim 11, wherein performing speech-to-text processing of the current audio data based on the selected current spoken language comprises: in response to determining that the current spoken utterance is in the second candidate language: completing the second speech-to-text processing; and halting the first speech-to-text processing prior to completion of the first speech-to-text processing.

14. The method of claim 1, further comprising: causing output generated during the speech-to-text processing to be utilized in generating content.

15. The method of claim 14, wherein causing the output generated during the speech-to-text processing to be utilized in generating the content comprises causing the output to be provided for visual presentation via a display of the computing device.

16. The method of claim 14, wherein causing the output generated during the speech-to-text processing to be utilized in generating the content comprises causing the output to be utilized in generating responsive content that is responsive to the current spoken utterance captured in the current audio data.

17. A system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the at least processor to: generate a plurality of training examples, wherein generating each of the training examples is based on corresponding audio data that captures a corresponding human utterance, and a corresponding label that indicates a corresponding spoken language of the corresponding human utterance, the corresponding spoken language being one of N different languages to be recognized, wherein N is an integer that is greater than ten, and wherein each of the training examples comprises: corresponding training example input comprising: corresponding features of the corresponding audio data, and corresponding training example output comprising: a corresponding labeled probability metric for each of the N different languages to be recognized,
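The selection step recited in claim 1 can be illustrated with a small sketch: probabilities are produced for all N languages, but the comparison is restricted to the M candidate languages. This is a minimal illustration under stated assumptions, not the claimed implementation; `select_language`, the language codes, and the probability values are all hypothetical.

```python
def select_language(predicted, candidates):
    """Pick the candidate language with the highest predicted probability.

    predicted:  dict mapping each of the N languages to its probability
                (hypothetical output of a language selection model).
    candidates: the M candidate languages, a subset of the N languages.
    """
    if len(candidates) < 2:
        raise ValueError("expected two or more candidate languages")
    unknown = [lang for lang in candidates if lang not in predicted]
    if unknown:
        raise KeyError(f"candidates missing from model output: {unknown}")
    # Compare probabilities only among the candidates, not all N languages.
    return max(candidates, key=lambda lang: predicted[lang])


# Hypothetical model output over four languages (N would exceed ten in claim 1).
predicted = {"en": 0.30, "es": 0.25, "fr": 0.40, "de": 0.05}

# Hypothetical candidates, e.g. languages configured for the user or device.
print(select_language(predicted, ["en", "es"]))
```

In this made-up example, `fr` has the highest probability overall, but it is not a candidate, so the comparison between `en` and `es` decides the result.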
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Language recognition · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title