US9324323B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9324323-B1 |
| Application number | US-201213715139-A |
| Country | US |
| Kind code | B1 |
| Filing date | Dec 14, 2012 |
| Priority date | Jan 13, 2012 |
| Publication date | Apr 26, 2016 |
| Grant date | Apr 26, 2016 |
Official abstract text for this publication.
Speech recognition techniques may include: receiving audio; identifying one or more topics associated with audio; identifying language models in a topic space that correspond to the one or more topics, where the language models are identified based on proximity of a representation of the audio to representations of other audio in the topic space; using the language models to generate recognition candidates for the audio, where the recognition candidates have scores associated therewith that are indicative of a likelihood of a recognition candidate matching the audio; and selecting a recognition candidate for the audio based on the scores.
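The proximity-based lookup described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three topic dimensions, the item names, the language model names, the use of Euclidean distance, and the helper `proximate_items` do not come from the patent text, which only says that language models are identified based on proximity in a topic space.

```python
import math

def euclidean(a, b):
    """Distance between two equal-length topic vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def proximate_items(audio_vec, items, k=None, max_distance=None):
    """Return (distance, item_id) pairs sorted from nearest to farthest.

    Hypothetical helper: optionally keep only items within max_distance,
    and/or only the k closest items.
    """
    ranked = sorted(
        (euclidean(audio_vec, vec), item_id) for item_id, vec in items.items()
    )
    if max_distance is not None:
        ranked = [(d, i) for d, i in ranked if d <= max_distance]
    if k is not None:
        ranked = ranked[:k]
    return ranked

# Hypothetical topic space with dimensions (sports, finance, cooking).
items = {
    "doc_sports": (0.9, 0.1, 0.0),
    "doc_finance": (0.1, 0.9, 0.0),
    "doc_cooking": (0.0, 0.1, 0.9),
}

# Hypothetical association of each item with two or more language models.
language_models = {
    "doc_sports": ["lm_sports", "lm_news"],
    "doc_finance": ["lm_finance", "lm_news"],
    "doc_cooking": ["lm_cooking", "lm_lifestyle"],
}

audio_vec = (0.8, 0.2, 0.0)  # assumed topic vector for the incoming audio
nearest = proximate_items(audio_vec, items, k=1)
selected_models = language_models[nearest[0][1]]
print(selected_models)
```

Passing `max_distance` instead of `k` keeps every item within a fixed range of the audio rather than a fixed number of closest items.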
Opening claim text (preview).
What is claimed is:

1. A method comprising: receiving audio; determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic; determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content; identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content; obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score; generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.

2. The method of claim 1, further comprising: classifying documents by topic; classifying other audio by topic based on transcriptions of the other audio; and using the documents and the transcriptions of the other audio as training data to train at least the language models that are each associated with a different topic.

3. The method of claim 1, wherein determining that the representation of the one or more features of the audio is proximate to the representation of the one or more corresponding features of the other item of content comprises: mapping the representation of the one or more features of the audio into the vector space; and identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on a distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space.

4. The method of claim 3, wherein identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on the distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space comprises: determining that the representation of the one or more features of the audio is within a range of the representation of the one or more corresponding features of the other item of content.

5. The method of claim 3, wherein identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on the distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space comprises: determining that the distance is one of a predetermined number of closest distances between the representation of the one or more features of the audio and representations of one or more corresponding features of other items of content, wherein the representations of one or more corresponding features of other items of content include the representation of the one or more corresponding features of the other item of content.

6. The method of claim 3, wherein the vector space is an n-dimensional topic space, and wherein the representation of the one or more features of the audio is an n-dimensional vector.

7. The method of claim 6, wherein each of the dimensions of the n-dimensional topic space corresponds to a topic.

8. The method of claim 1, comprising identifying one or more topics associated with the audio.

9. The method of claim 8, wherein the one or more topics associated with the audio are identified based on metadata associated with the audio.

10. The method of claim 8, wherein the one or more topics associated with the audio are identified based on a transcription of the audio that is generated using a general language model that is not topic-specific.

11. The method of claim 1, wherein the representation of the one or more features of the audio comprises a vector representation of the one or more features of the audio, and wherein the representation of the one or more corresponding features of the other content comprises a vector representation of the one or more corresponding features of the other content.

12. The method of claim 1, wherein the other item of content is audio content or written language content.

13. The method of claim 1, wherein the topics that are each associated with a different language model are part of a topic hierarchy, at least one of the topics associated with a language model being at a higher level in the topic hierarchy than another one of the topics associated with a language model.

14. The method of claim 1, wherein the representation of the one or more features of the audio comprises a vector representation of the one or more features of the audio in which the elements of the vector representation of the one or more features of the audio each indicate a relevance of the audio to a different topic, and wherein the representation of the one or more corresponding features of the other content comprises a vector representation of the one or more corresponding features of the other content in which the elements of the vector representation of the one or more corresponding features of the other content each indicate a relevance of the other item of content to a different topic.

15. One or more non-transitory machine-readable media storing instructions that are executable by one or more processing devices to perform operations comprising: receiving audio; determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to e
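The final steps of claim 1, generating an aggregated score per transcription and selecting among transcriptions, can be sketched as follows. The claim names the three inputs (recognizer confidence, topic relevance, and proximity) but does not state how they are combined, so the product used here is only an assumption. The `Candidate` class, the function names, and the example values are likewise hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    transcription: str
    confidence: float       # speech recognizer confidence score
    topic_relevance: float  # relevance of the language model's topic to the nearby item
    proximity: float        # closeness of the audio to that item in the vector space

def aggregated_score(c: Candidate) -> float:
    """Combine the three claimed inputs; multiplication is an assumed choice."""
    return c.confidence * c.topic_relevance * c.proximity

def select_transcription(candidates):
    """Pick the transcription whose aggregated score is highest."""
    return max(candidates, key=aggregated_score).transcription

# Hypothetical transcriptions of the same audio from two topic-specific models.
candidates = [
    Candidate("the giants won the pennant", 0.70, 0.9, 0.8),
    Candidate("the giant swan the pendant", 0.80, 0.3, 0.8),
]

best = select_transcription(candidates)
print(best)
```

In this made-up example the second candidate has the higher raw recognizer confidence, but the first is selected because its language model's topic is more relevant to the nearby item of content. A weighted sum or a log-linear combination would be equally consistent with the claim wording.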
using context dependencies, e.g. language models · CPC title
Probabilistic grammars, e.g. word n-grams · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title