Speech recognition using topic-specific language models

US9324323B1 · US · B1

Patent metadata
Publication number: US-9324323-B1
Application number: US-201213715139-A
Country: US
Kind code: B1
Filing date: Dec 14, 2012
Priority date: Jan 13, 2012
Publication date: Apr 26, 2016
Grant date: Apr 26, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Speech recognition techniques may include: receiving audio; identifying one or more topics associated with audio; identifying language models in a topic space that correspond to the one or more topics, where the language models are identified based on proximity of a representation of the audio to representations of other audio in the topic space; using the language models to generate recognition candidates for the audio, where the recognition candidates have scores associated therewith that are indicative of a likelihood of a recognition candidate matching the audio; and selecting a recognition candidate for the audio based on the scores.
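
The proximity step described in the abstract can be illustrated with a minimal Python sketch. It assumes the audio and the other items of content are already represented as topic-relevance vectors, and it uses Euclidean distance with a k-nearest cutoff. The distance metric, the names `TOPIC_SPACE` and `nearest_items`, and the sample vectors are all hypothetical choices made for illustration, not details taken from the patent.

```python
import math

# Hypothetical topic space: each item of content maps to a vector whose
# elements give its relevance to three made-up topics
# (sports, finance, cooking).
TOPIC_SPACE = {
    "doc_sports": [0.9, 0.1, 0.0],
    "doc_finance": [0.1, 0.8, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_items(audio_vec, space, k=2):
    """Return the k items closest to audio_vec as (name, distance) pairs."""
    ranked = sorted(space.items(), key=lambda kv: euclidean(audio_vec, kv[1]))
    return [(name, euclidean(audio_vec, vec)) for name, vec in ranked[:k]]


# Assumed topic vector for the incoming audio (mostly sports, some finance).
audio_vec = [0.7, 0.3, 0.0]
neighbors = nearest_items(audio_vec, TOPIC_SPACE, k=2)
for name, dist in neighbors:
    print(f"{name}: distance {dist:.3f}")
```

In this sketch the language models attached to the nearest items would then be the ones used to generate recognition candidates. A fixed distance threshold could replace the k-nearest cutoff.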

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    receiving audio;
    determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to each of the representations of one or more corresponding features of other items of content, wherein each of the representations of one or more corresponding features of other items of content is associated with two or more language models that are each associated with a different topic;
    determining, based at least on the proximities in the vector space of the representation of the one or more features of the audio to the representations of one or more corresponding features of other items of content, that the representation of the one or more features of the audio is proximate to a representation of one or more corresponding features of another item of content;
    identifying (i) the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, and, (ii) for each language model that is associated with the representation of the one or more corresponding features of the other item of content, a relevance of the topic associated with the language model to the other item of content;
    obtaining, for each of the language models that are associated with the representation of the one or more corresponding features of the other item of content that is indicated as proximate to the representation of the one or more features of the audio, (i) a transcription of the audio, and (ii) a speech recognizer confidence score;
    generating, for each transcription, an aggregated score based at least on (i) the speech recognizer confidence score for the transcription, (ii) the relevance of the topic associated with the language model for which the transcription was obtained to the other item of content, and (iii) the proximity of the representation of the one or more features of the audio to the representation of the one or more corresponding features of the other item of content; and
    selecting a particular transcription of the audio, from among the transcriptions, based at least on the aggregated scores.

2. The method of claim 1, further comprising: classifying documents by topic; classifying other audio by topic based on transcriptions of the other audio; and using the documents and the transcriptions of the other audio as training data to train at least the language models that are each associated with a different topic.

3. The method of claim 1, wherein determining that the representation of the one or more features of the audio is proximate to the representation of the one or more corresponding features of the other item of content comprises: mapping the representation of the one or more features of the audio into the vector space; and identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on a distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space.

4. The method of claim 3, wherein identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on the distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space comprises: determining that the representation of the one or more features of the audio is within a range of the representation of the one or more corresponding features of the other item of content.

5. The method of claim 3, wherein identifying the representation of the one or more features of the audio as proximate to the representation of the one or more corresponding features of the other item of content based at least on the distance between the representation of the one or more features of the audio and the representation of the one or more corresponding features of the other item of content in the vector space comprises: determining that the distance is one of a predetermined number of closest distances between the representation of the one or more features of the audio and representations of one or more corresponding features of other items of content, wherein the representations of one or more corresponding features of other items of content include the representation of the one or more corresponding features of the other item of content.

6. The method of claim 3, wherein the vector space is an n-dimensional topic space, and wherein the representation of the one or more features of the audio is an n-dimensional vector.

7. The method of claim 6, wherein each of the dimensions of the n-dimensional topic space corresponds to a topic.

8. The method of claim 1, comprising identifying one or more topics associated with the audio.

9. The method of claim 8, wherein the one or more topics associated with the audio are identified based on metadata associated with the audio.

10. The method of claim 8, wherein the one or more topics associated with the audio are identified based on a transcription of the audio that is generated using a general language model that is not topic-specific.

11. The method of claim 1, wherein the representation of the one or more features of the audio comprises a vector representation of the one or more features of the audio, and wherein the representation of the one or more corresponding features of the other content comprises a vector representation of the one or more corresponding features of the other content.

12. The method of claim 1, wherein the other item of content is audio content or written language content.

13. The method of claim 1, wherein the topics that are each associated with a different language model are part of a topic hierarchy, at least one of the topics associated with a language model being at a higher level in the topic hierarchy than another one of the topics associated with a language model.

14. The method of claim 1, wherein the representation of the one or more features of the audio comprises a vector representation of the one or more features of the audio in which the elements of the vector representation of the one or more features of the audio each indicate a relevance of the audio to a different topic, and wherein the representation of the one or more corresponding features of the other content comprises a vector representation of the one or more corresponding features of the other content in which the elements of the vector representation of the one or more corresponding features of the other content each indicate a relevance of the other item of content to a different topic.

15. One or more non-transitory machine-readable media storing instructions that are executable by one or more processing devices to perform operations comprising: receiving audio; determining, based at least on comparing a representation of one or more features of the audio to a set of representations of one or more corresponding features of other items of content, a proximity in a vector space of the representation of the one or more features of the audio to e
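
The scoring and selection steps at the end of claim 1 can be sketched in Python as follows. The claim only requires that the aggregated score be based at least on the recognizer confidence, the topic relevance, and the proximity; combining them as a plain product is an assumption made here for illustration. The `Candidate` class, the function names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    transcription: str
    confidence: float       # speech recognizer confidence score
    topic_relevance: float  # relevance of the model's topic to the nearby item
    proximity: float        # closeness of the audio to the nearby item, in (0, 1]


def aggregated_score(c: Candidate) -> float:
    # Assumption: a simple product of the three factors. The claim does not
    # specify the combining function.
    return c.confidence * c.topic_relevance * c.proximity


def select_transcription(candidates):
    """Pick the transcription whose aggregated score is highest."""
    return max(candidates, key=aggregated_score).transcription


# Hypothetical outputs from three topic-specific language models.
candidates = [
    Candidate("the bears won the game", 0.80, 0.9, 0.7),
    Candidate("the bares one the game", 0.85, 0.2, 0.7),
    Candidate("the bears want the gain", 0.60, 0.9, 0.7),
]
best = select_transcription(candidates)
print(best)
```

In this example the second candidate has the highest raw recognizer confidence but loses because its language model's topic has low relevance to the nearby item, which is the effect the aggregated score is meant to capture.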

Assignees

Google Inc

Inventors

Classifications

  • G10L15/183 (Primary)

    using context dependencies, e.g. language models · CPC title

  • Probabilistic grammars, e.g. word n-grams · CPC title

  • G10L15/26 (Primary)

    Speech to text systems (G10L15/08 takes precedence) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9324323B1 cover?
It covers speech recognition techniques that identify one or more topics associated with received audio, identify topic-specific language models based on the proximity of a representation of the audio to representations of other audio in a topic space, use those language models to generate scored recognition candidates, and select a recognition candidate based on the scores.
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/183. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 26, 2016 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).