Training and/or using a language selection model for automatically determining language for speech recognition of spoken utterance

US11646011B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11646011-B2
Application number  US-202217846287-A
Country             US
Kind code           B2
Filing date         Jun 22, 2022
Priority date       Nov 28, 2018
Publication date    May 9, 2023
Grant date          May 9, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for training and/or using a language selection model for use in determining a particular language of a spoken utterance captured in audio data. Features of the audio data can be processed using the trained language selection model to generate a predicted probability for each of N different languages, and a particular language selected based on the generated probabilities. Speech recognition results for the particular language can be utilized responsive to selecting the particular language of the spoken utterance. Many implementations are directed to training the language selection model utilizing tuple losses in lieu of traditional cross-entropy losses. Training the language selection model utilizing the tuple losses can result in more efficient training and/or can result in a more accurate and/or robust model—thereby mitigating erroneous language selections for spoken utterances.
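
The abstract contrasts tuple losses with cross-entropy losses but does not define either on this page. As an illustration only, the sketch below assumes a pairwise form of tuple loss: the average, over every non-target language, of the two-way cross-entropy between the target language and that one competitor. The function names, the use of raw logits, and the pairwise form itself are assumptions, not the formula given in the patent.

```python
import math


def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cross_entropy_loss(logits, target):
    """Softmax cross-entropy over all N languages (the baseline loss)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[target]


def pairwise_tuple_loss(logits, target):
    """Assumed pairwise tuple loss (hypothetical form, see lead-in).

    For each other language k, compute the cross-entropy of a two-way
    softmax over {target, k}, then average over all such pairs.
    """
    others = [k for k in range(len(logits)) if k != target]
    total = 0.0
    for k in others:
        # -log(e^{z_t} / (e^{z_t} + e^{z_k})) == log(1 + e^{z_k - z_t})
        total += _softplus(logits[k] - logits[target])
    return total / len(others)


if __name__ == "__main__":
    logits = [0.0] * 12  # N = 12 languages, all equally likely
    print(cross_entropy_loss(logits, target=3))
    print(pairwise_tuple_loss(logits, target=3))
```

Under this assumed form, the loss depends on how the target language ranks against each competitor taken one at a time, which matches the later use of the model to choose among a small candidate set rather than among all N languages. With only two languages the two losses coincide.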

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented by one or more processors, the method comprising:

  generating a plurality of training examples, wherein generating each of the training examples is based on corresponding audio data that captures a corresponding human utterance, and a corresponding label that indicates a corresponding spoken language of the corresponding human utterance, the corresponding spoken language being one of N different languages to be recognized, wherein N is an integer that is greater than ten, and wherein each of the training examples comprises: corresponding training example input comprising: corresponding features of the corresponding audio data, and corresponding training example output comprising: a corresponding labeled probability metric for each of the N different languages to be recognized, wherein the corresponding labeled probability metrics include, based on the corresponding label, a corresponding positive probability metric label that corresponds to the corresponding spoken language, and a corresponding negative probability metric label for all other of the corresponding labeled probability metrics;

  training a language selection model based on the training examples, training the language selection model comprising: processing the corresponding features of the corresponding training example inputs of the training examples using the language selection model to generate corresponding predicted probabilities for each of the N different languages, generating corresponding tuple losses based on the generated corresponding predicted probabilities and the corresponding labeled probability metrics, and updating weights of the language selection model to generate an updated language selection model by using the generated corresponding tuple losses; and

  subsequent to training the language selection model:

  receiving, via at least one microphone of a computing device, current audio data that captures a current spoken utterance from a user;

  extracting one or more features of the current audio data that captures the current spoken utterance;

  processing, using the updated language selection model, the one or more features of the current audio data to generate current predicted probabilities for each of the N different languages;

  identifying M candidate languages for the spoken utterance based on data associated with the current audio data, the computing device, and/or the user, wherein the M candidate languages comprise two or more languages, and are a subset of the N different languages;

  selecting, from the M candidate languages, a current spoken language, wherein selecting the current spoken language is based on comparison of the current predicted probabilities for the M candidate languages; and

  performing speech-to-text processing of the current audio data based on the selected current spoken language.

2. The method of claim 1, further comprising: receiving, in a transmission with the current audio data, an indication of the M candidate languages, wherein identifying the M candidate languages is based on the data associated with the current audio data, and wherein the data includes the indication of the M candidate languages that is received in the transmission with the current audio data.

3. The method of claim 1, wherein identifying the M candidate languages for the spoken utterance is based on data associated with the current audio data.

4. The method of claim 3, wherein identifying the M candidate languages for the spoken utterance is further based on data associated with the computing device and/or the user.

5. The method of claim 1, wherein identifying the M candidate languages for the spoken utterance is based on data associated with the computing device.

6. The method of claim 5, wherein identifying the M candidate languages for the spoken utterance is further based on data associated with the current audio data and/or the user.

7. The method of claim 1, wherein identifying the M candidate languages for the spoken utterance is based on data associated with the user.

8. The method of claim 7, wherein identifying the M candidate languages for the spoken utterance is further based on data associated with the current audio data and/or the computing device.

9. The method of claim 1, further comprising: in response to identifying M candidate languages for the spoken utterance: initiating first speech-to-text processing of the current audio data using a first speech recognition model for a first candidate language of the M candidate languages, and initiating second speech-to-text processing of the current audio data using a second speech recognition model for a second candidate language of the M candidate languages.

10. The method of claim 9, wherein selecting the current spoken language from the M candidate languages is simultaneous with the first speech-to-text processing and the second speech-to-text processing.

11. The method of claim 10, wherein selecting the current spoken language from the M candidate languages occurs prior to completion of the first speech-to-text processing and the second speech-to-text-processing.

12. The method of claim 11, wherein performing speech-to-text processing of the current audio data based on the selected current spoken language comprises: in response to determining that the current spoken utterance is in the first candidate language: completing the first speech-to-text processing; and halting the second speech-to-text processing prior to completion of the second speech-to-text processing.

13. The method of claim 11, wherein performing speech-to-text processing of the current audio data based on the selected current spoken language comprises: in response to determining that the current spoken utterance is in the second candidate language: completing the second speech-to-text processing; and halting the first speech-to-text processing prior to completion of the first speech-to-text processing.

14. The method of claim 1, further comprising: causing output generated during the speech-to-text processing to be utilized in generating content.

15. The method of claim 14, wherein causing the output generated during the speech-to-text processing to be utilized in generating the content comprises causing the output to be provided for visual presentation via a display of the computing device.

16. The method of claim 14, wherein causing the output generated during the speech-to-text processing to be utilized in generating the content comprises causing the output to be utilized in generating responsive content that is responsive to the current spoken utterance captured in the current audio data.

17. A system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the at least processor to: generate a plurality of training examples, wherein generating each of the training examples is based on corresponding audio data that captures a corresponding human utterance, and a corresponding label that indicates a corresponding spoken language of the corresponding human utterance, the corresponding spoken language being one of N different languages to be recognized, wherein N is an integer that is greater than ten, and wherein each of the training examples comprises: corresponding training example input comprising: corresponding features of the corresponding audio data, and corresponding training example output comprising: a corresponding labeled probability metric for each of the N different languages to be recognized,
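
The flow recited in claims 1 and 9 through 13 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the language codes, the 1.0/0.0 label values, the function names, and the simulated recognizer `fake_speech_to_text` are all hypothetical, and `asyncio` task cancellation stands in for "halting" a recognizer.

```python
import asyncio

# Hypothetical language codes; the claims only require N greater than ten.
LANGUAGES = ["en", "es", "fr", "de", "it", "pt", "nl", "sv", "pl", "tr", "ja", "ko"]


def make_labels(spoken, languages=LANGUAGES):
    """Training output: a positive label for the spoken language and a
    negative label for every other language (1.0 and 0.0 are assumed values)."""
    return {lang: (1.0 if lang == spoken else 0.0) for lang in languages}


def select_language(probs, candidates):
    """Compare predicted probabilities only among the M candidate languages."""
    if len(candidates) < 2:
        raise ValueError("the claims recite two or more candidate languages")
    return max(candidates, key=lambda lang: probs[lang])


async def fake_speech_to_text(lang, audio):
    """Stand-in for a per-language recognizer (hypothetical)."""
    await asyncio.sleep(0.05)  # simulate recognition latency
    return f"[{lang}] transcript of {len(audio)} bytes"


async def transcribe(audio, probs, candidates):
    # Start recognition for every candidate language up front.
    tasks = {
        lang: asyncio.create_task(fake_speech_to_text(lang, audio))
        for lang in candidates
    }
    await asyncio.sleep(0)  # let the recognizers begin running

    # Select the language while recognition is still in progress.
    chosen = select_language(probs, candidates)

    # Halt the recognizers for the languages that were not selected.
    losers = [task for lang, task in tasks.items() if lang != chosen]
    for task in losers:
        task.cancel()
    await asyncio.gather(*losers, return_exceptions=True)

    # Complete recognition only for the selected language.
    return chosen, await tasks[chosen]


if __name__ == "__main__":
    probs = {lang: 0.01 for lang in LANGUAGES}
    probs.update({"ja": 0.6, "es": 0.2, "en": 0.1})
    print(asyncio.run(transcribe(b"\x00" * 320, probs, ["en", "es"])))
```

In the usage example the model's top language overall ("ja") is not among the candidates, so the comparison is confined to "en" and "es". Restricting the comparison to the candidate subset is what lets device, user, or request data narrow the decision.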

Assignees

Google LLC

Inventors

Classifications

  • Supervised learning · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • G10L15/005 · Primary

    Language recognition · CPC title

  • Backpropagation, e.g. using gradient descent · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11646011B2 cover?
Methods and systems for training and/or using a language selection model for use in determining a particular language of a spoken utterance captured in audio data. Features of the audio data can be processed using the trained language selection model to generate a predicted probability for each of N different languages, and a particular language selected based on the generated probabilities. Sp…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L15/005. Mapped technology areas include Physics.
When was this patent published?
Publication date May 9, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).