Using context information with end-to-end models for speech recognition

US11545142B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11545142-B2
Application number	US-202016827937-A
Country	US
Kind code	B2
Filing date	Mar 24, 2020
Priority date	May 10, 2019
Publication date	Jan 3, 2023
Grant date	Jan 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving audio data encoding an utterance, processing, using a speech recognition model, the audio data to generate speech recognition scores for speech elements, and determining context scores for the speech elements based on context data indicating a context for the utterance. The method also includes executing, using the speech recognition scores and the context scores, a beam search decoding process to determine one or more candidate transcriptions for the utterance. The method also includes selecting a transcription for the utterance from the one or more candidate transcriptions.
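The score combination described in the abstract can be illustrated with a toy example. The Python sketch below is not taken from the patent: it assumes scores are log-probabilities, that the context score is simply added to the model score, and that speech elements are whole words. The names `beam_search`, `contact_bias`, `CONTACTS`, and all of the numbers are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# (speech elements so far, total log score)
Hypothesis = Tuple[Tuple[str, ...], float]


def beam_search(
    step_log_probs: List[Dict[str, float]],
    context_score: Callable[[Tuple[str, ...], str], float],
    beam_size: int = 2,
) -> List[Hypothesis]:
    """Toy beam search that adds a context score to each model score.

    The context score is applied before pruning, so a biased candidate
    can survive a step it would otherwise have lost.
    """
    beams: List[Hypothesis] = [((), 0.0)]
    for log_probs in step_log_probs:
        expanded: List[Hypothesis] = []
        for prefix, score in beams:
            for element, log_prob in log_probs.items():
                total = score + log_prob + context_score(prefix, element)
                expanded.append((prefix + (element,), total))
        # Prune: keep only the best `beam_size` candidates.
        expanded.sort(key=lambda hyp: hyp[1], reverse=True)
        beams = expanded[:beam_size]
    return beams


# Invented per-step model scores, for illustration only.
STEP_LOG_PROBS = [
    {"call": math.log(0.9), "tall": math.log(0.1)},
    {"john": math.log(0.6), "jon": math.log(0.4)},
]
CONTACTS = {"jon"}  # hypothetical personalized data collection


def no_bias(prefix: Tuple[str, ...], element: str) -> float:
    return 0.0


def contact_bias(prefix: Tuple[str, ...], element: str) -> float:
    # Assumed rule: boost contact names that directly follow "call".
    if prefix and prefix[-1] == "call" and element in CONTACTS:
        return 1.0
    return 0.0


baseline = beam_search(STEP_LOG_PROBS, no_bias)[0][0]
biased = beam_search(STEP_LOG_PROBS, contact_bias)[0][0]
print(baseline, biased)
```

With these invented numbers, the unbiased search ranks "call john" first, while the biased search ranks "call jon" first.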

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: receiving, at data processing hardware, audio data encoding an utterance; processing, by the data processing hardware, using a speech recognition model, the audio data to generate speech recognition scores for speech elements; determining, by the data processing hardware, that a preliminary transcription for the utterance comprises a word that represents a prefix element, the prefix element indicating that a next element corresponds to a context for the utterance; selecting, by the data processing hardware, a first contextual finite-state transducer (FST) from a plurality of contextual FSTs based on a particular context corresponding to the first contextual FST matching the context for the utterance indicated by the word of the preliminary transcription determined to represent the prefix element of the utterance, wherein each contextual FST in the plurality of contextual FSTs corresponds to a respective different particular context for a same user that spoke the utterance; determining, by the data processing hardware, using the first contextual FST, context scores for the speech elements; executing, by the data processing hardware, using the speech recognition scores and the context scores, a beam search decoding process to determine one or more candidate transcriptions for the utterance; and selecting, by the data processing hardware, a transcription for the utterance from the one or more candidate transcriptions.
2. The method of claim 1, wherein, during execution of the beam search decoding process, the context scores are configured to adjust a likelihood of the one or more candidate transcriptions before pruning any of the one or more candidate transcriptions from evaluation.
3. The method of claim 1, wherein executing the beam search decoding process comprises using the context scores to prune paths through a speech recognition lattice to determine the one or more candidate transcriptions for the utterance.
4. The method of claim 1, further comprising, prior to receiving the audio data encoding the utterance: generating, by the data processing hardware, the plurality of contextual FSTs to each represent a different set of words or phrases in a personalized data collection of the same user that spoke the utterance; and storing, by the data processing hardware, the plurality of contextual FSTs in memory hardware in communication with the data processing hardware.
5. The method of claim 4, wherein the personalized data collection comprises a contacts list for the same user.
6. The method of claim 4, wherein the personalized data collection comprises a media library for the same user.
7. The method of claim 4, wherein the personalized data collection comprises a list of applications installed on a user device associated with the same user.
8. The method of claim 4, further comprising, for each of at least one contextual FST in the plurality of contextual FSTs: generating, by the data processing hardware, a corresponding prefix FST comprising a set of one or more prefixes each corresponding to the respective different particular context of the corresponding contextual FST; and storing, by the data processing hardware, the corresponding prefix FST generated for the at least one contextual FST in the plurality of contextual FSTs.
9. The method of claim 1, wherein the data processing hardware: resides on a user device associated with the same user that spoke the utterance; and executes the speech recognition model.
10. The method of claim 1, wherein the speech recognition model comprises an end-to-end speech recognition model.
11. The method of claim 10, wherein the end-to-end speech recognition model comprises a recurrent neural network-transducer (RNN-T).
12. The method of claim 1, wherein the plurality of contextual FSTs represent contextual terms using elements representing subword units.
13. The method of claim 1, wherein the plurality of contextual FSTs comprise: transition weights configured to bias transitions between subword units of a contextual term; and backoff arcs having offsetting weights configured to undo the biasing effect of the transition weight.
14. The method of claim 1, wherein the speech elements comprise wordpieces or graphemes.
15. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving audio data encoding an utterance; processing, using a speech recognition model, the audio data to generate speech recognition scores for speech elements; determining that a preliminary transcription for the utterance comprises a word that represents a prefix element, the prefix element indicating that a next element corresponds to a context for the utterance; selecting a first contextual finite-state transducer (FST) from a plurality of contextual FSTs based on a particular context corresponding to the first contextual FST matching the context for the utterance indicated by the word of the preliminary transcription determined to represent the prefix element of the utterance, wherein each contextual FST in the plurality of contextual FSTs corresponds to a respective different particular context for a same user that spoke the utterance; determining, using the first FST, context scores for the speech elements; executing, using the speech recognition scores and the context scores, a beam search decoding process to determine one or more candidate transcriptions for the utterance; and selecting a transcription for the utterance from the one or more candidate transcriptions.
16. The system of claim 15, wherein, during execution of the beam search decoding process, the context scores are configured to adjust a likelihood of the one or more candidate transcriptions before pruning any of the one or more candidate transcriptions from evaluation.
17. The system of claim 15, wherein executing the beam search decoding process comprises using the context scores to prune paths through a speech recognition lattice to determine the one or more candidate transcriptions for the utterance.
18. The system of claim 15, wherein the operations further comprise, prior to receiving the audio data encoding the utterance: generating the plurality of contextual FSTs to each represent a different set of words or phrases in a personalized data collection of the same user that spoke the utterance; and storing the plurality of contextual FSTs in the memory hardware in communication with the data processing hardware.
19. The system of claim 18, wherein the personalized data collection comprises a contacts list for the same user.
20. The system of claim 18, wherein the personalized data collection comprises a media library for the same user.
21. The system of claim 18, wherein the personalized data collection comprises a list of applications installed on a user device associated with the same user.
22. The system of claim 18, wherein the operations further comprise, for each of at least one contextual FST in the plurality of contextual FSTs: generating a corresponding prefix FST comprising a set of one or more prefixes each corresponding to the respective different particular context of the corresponding contextual FST; and storing the corresponding prefix FST generated for the a …
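Two ideas in the claims above lend themselves to a small illustration: selecting a context from a prefix word in the preliminary transcription (claims 1 and 15), and biasing subword transitions with backoff arcs whose offsetting weights undo the bias when a match fails (claim 13). The Python sketch below is a simplified stand-in, not the patented implementation: it uses a plain trie instead of a real weighted FST, a fixed per-unit bonus, and a hard-coded prefix table. `ContextualTrie`, `PREFIX_TO_CONTEXT`, `select_context`, `score_sequence`, and the sample phrases are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

State = Tuple[int, float]  # (trie node, bonus awarded but not yet confirmed)


class ContextualTrie:
    """Toy stand-in for a contextual FST over subword units."""

    def __init__(self, phrases: Sequence[Sequence[str]], bonus: float = 1.0):
        self.bonus = bonus
        self.children: List[Dict[str, int]] = [{}]  # node -> {unit: next node}
        self.final: List[bool] = [False]
        for units in phrases:
            node = 0
            for unit in units:
                if unit not in self.children[node]:
                    self.children[node][unit] = len(self.children)
                    self.children.append({})
                    self.final.append(False)
                node = self.children[node][unit]
            self.final[node] = True

    def step(self, state: State, unit: str) -> Tuple[State, float]:
        """Consume one subword unit; return (new state, context score)."""
        node, pending = state
        nxt = self.children[node].get(unit)
        if nxt is not None:
            pending += self.bonus
            if self.final[nxt]:
                pending = 0.0  # phrase completed: the bonus is kept
            if not self.children[nxt]:
                nxt = 0  # nothing left to match: return to the start
            return (nxt, pending), self.bonus
        if node == 0:
            return (0, 0.0), 0.0
        # Backoff arc: an offsetting weight undoes the unconfirmed bonus,
        # then the unit is retried from the start state.
        new_state, score = self.step((0, 0.0), unit)
        return new_state, score - pending


def score_sequence(trie: ContextualTrie, units: Sequence[str]) -> float:
    state: State = (0, 0.0)
    total = 0.0
    for unit in units:
        state, score = trie.step(state, unit)
        total += score
    return total


# Hypothetical prefix words and per-user contexts.
PREFIX_TO_CONTEXT = {"call": "contacts", "play": "media", "open": "apps"}
contexts = {
    "contacts": ContextualTrie([("jo", "n"), ("ma", "ria")]),
    "media": ContextualTrie([("jo", "lene")]),
}


def select_context(
    preliminary: str, available: Dict[str, ContextualTrie]
) -> Optional[ContextualTrie]:
    """Pick the trie whose context matches the most recent prefix word."""
    for word in reversed(preliminary.lower().split()):
        name = PREFIX_TO_CONTEXT.get(word)
        if name in available:
            return available[name]
    return None
```

In this sketch, a complete match such as `["jo", "n"]` in the contacts trie keeps its full bonus, while a partial match such as `["jo", "hn"]` nets to zero because the backoff arc cancels the bonus given for `"jo"`.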

Assignees

Google LLC

Inventors

Classifications

G10L15/22
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
G06F18/2113
by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation · CPC title
G06N3/08
Learning methods · CPC title
G10L2015/228
of application context · CPC title
G10L15/16 (Primary)
using artificial neural networks · CPC title

Patent family

Related publications grouped by family.

View patent family 70286001

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11545142B2 cover? A method includes receiving audio data encoding an utterance, processing, using a speech recognition model, the audio data to generate speech recognition scores for speech elements, and determining context scores for the speech elements based on context data indicating a context for the utterance. The method also includes executing, using the speech recognition scores and the context scores, a …
Who is the assignee on this patent? Google LLC
What technology area does this patent fall under? Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published? Publication date Jan 3, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

System and Method for End-to-End speech recognition

Speech parsing with intelligent assistant

Customized speech processing language models

Fine-grained natural language understanding

Applying neural network language models to weighted finite state transducers for automatic speech recognition

Speech recognition with combined grammar and statistical language models

Systems and methods for speech transcription
