Speech to entity

US12315495B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12315495-B2
Application number	US-202117644970-A
Country	US
Kind code	B2
Filing date	Dec 17, 2021
Priority date	Dec 17, 2021
Publication date	May 27, 2025
Grant date	May 27, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are provided for extracting entities from received speech. The systems and methods perform operations comprising receiving an audio file comprising speech input and processing, by a speech recognition engine, the audio file comprising the speech input to generate an initial character-based representation of the speech input. The operations further comprise processing, by an entity extractor, the initial character-based representation of the speech input to generate an estimated set of entities of the speech input. The operations further comprise generating, by the speech recognition engine, a textual representation of the speech input based on the estimated set of entities of the speech input.
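
The order of operations in the abstract can be sketched as a three-stage data flow: a first pass over the audio, entity extraction on that first-pass output, and a final text decode that takes the entities into account. This is only an illustrative sketch. The class name, field names, type alias, and toy stand-in functions below are hypothetical and do not come from the patent, which does not specify a programming interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: one row of character probabilities per time step.
CharProbs = List[List[float]]


@dataclass
class SpeechToEntityPipeline:
    """Illustrative wiring of the three stages named in the abstract."""
    first_pass: Callable[[bytes], CharProbs]                 # audio -> initial character-based representation
    extract_entities: Callable[[CharProbs], List[str]]       # representation -> estimated set of entities
    decode_text: Callable[[CharProbs, Sequence[str]], str]   # representation + entities -> textual representation

    def run(self, audio: bytes) -> Tuple[List[str], str]:
        initial = self.first_pass(audio)
        # Entities are estimated from the initial representation,
        # before the final text is produced.
        entities = self.extract_entities(initial)
        text = self.decode_text(initial, entities)
        return entities, text


# Toy stand-ins (assumptions, not real models) so the sketch runs end to end.
def toy_first_pass(audio: bytes) -> CharProbs:
    return [[0.9, 0.1], [0.2, 0.8]]


def toy_extract(initial: CharProbs) -> List[str]:
    return ["pizza"]


def toy_decode(initial: CharProbs, entities: Sequence[str]) -> str:
    return "order a " + entities[0]


pipeline = SpeechToEntityPipeline(toy_first_pass, toy_extract, toy_decode)
entities, text = pipeline.run(b"\x00\x01")
```

In a real system the three callables would be trained models. The sketch only shows that the entity set is available before, and is an input to, the final text generation.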

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by one or more processors, an audio file comprising speech input; processing the audio file comprising the speech input to generate an initial character-based representation of the speech input; processing the initial character-based representation of the speech input to generate an estimated set of entities of the speech input, the estimated set of entities of the speech input being generated based on different linear combinations of probabilities of characters of the speech input over time, the different linear combinations comprising a matrix, rows of the matrix representing different time points of the speech input, and columns of the matrix representing different characters of a transcript of the speech input; and generating a textual representation of the speech input based on the estimated set of entities of the speech input, the initial character-based representation of the speech input being generated by a first neural network and the estimated set of entities of the speech input being generated by a second neural network, the first and second neural networks being arranged as a cascade neural network and trained in an end-to-end manner.

2. The method of claim 1, further comprising: processing, by an intent classifier, the speech input to generate an estimated intent of the speech input.

3. The method of claim 2, wherein processing the speech input by the intent classifier comprises: processing the initial character-based representation of the speech input to generate the estimated intent of the speech input.

4. The method of claim 2, wherein the intent classifier comprises a third neural network, wherein the first, second and third neural networks are arranged as the cascade neural network trained in an end-to-end manner.

5. The method of claim 2, further comprising: generating a list of possible entities associated with the estimated intent of the speech input.

6. The method of claim 1, wherein for a first row of the rows of the matrix corresponding to a first time point in the speech input, each of the characters in the columns of matrix, associated with the first row, is associated with a respective probability representing a likelihood that the speech input corresponds to a respective one of the characters.

7. The method of claim 1, further comprising streaming a linear combination of a subset of the rows of the matrix.

8. The method of claim 1, further comprising: streaming a first linear combination of characters comprising a first subset of characters associated with a first set of respective probabilities for a first time point; and streaming a second linear combination of characters comprising a second subset of characters associated with a second set of respective probabilities for a second time point.

9. The method of claim 8, further comprising generating the estimated set of entities of the speech input based on the first and second linear combinations of the characters.

10. The method of claim 8, further comprising: computing a first likelihood that a first sequence of characters from the first and second linear combinations corresponds to a given entity; computing a second likelihood that a second sequence of characters from the first and second linear combinations corresponds to the given entity; comparing the first and second likelihoods to a threshold associated with the given entity; and determining that the first sequence of characters corresponds to the given entity in response to determining that the first likelihood transgresses the threshold and the second likelihood fails to transgress the threshold.

11. The method of claim 1, further comprising: computing a loss function that comprises first and second cost functions, the first cost function being based on the estimated set of entities of the speech input, and the second cost function being based on the textual representation of the speech input; and updating parameters of the first and second neural networks based on the loss function, wherein gradients of the first cost function are back-propagated to the second cost function.

12. The method of claim 1, further comprising: receiving training data comprising a plurality of training audio files, each of the plurality of training audio files being associated with a ground-truth set of entities and a ground-truth transcription, the ground-truth set of entities each being associated with a threshold indicating a likelihood that a combination of characters corresponds to a respective entity in the ground-truth set of entities; processing a first training audio file of the plurality of training audio files by the first neural network to generate a first initial character-based representation of the first training audio file; processing, by the second neural network, the first initial character-based representation of the first training audio file to generate a first estimated entity set using the thresholds associated with the ground-truth set of entities; and comparing the first estimated entity set to the ground-truth set of entities associated with the first training audio file to generate a first loss.

13. The method of claim 12, further comprising processing a second training audio file of the plurality of training audio files to further update parameters of the first and second neural networks.

14. The method of claim 12, further comprising updating parameters of the first neural network and the second neural network based on the first loss.

15. The method of claim 1, wherein the second neural network generates the estimated set of entities of the speech input before a transcription of the speech input is generated.

16. A system comprising: at least one processor configured to perform operations comprising: receiving an audio file comprising speech input; processing the audio file comprising the speech input to generate an initial character-based representation of the speech input; processing the initial character-based representation of the speech input to generate an estimated set of entities of the speech input, the estimated set of entities of the speech input being generated based on different linear combinations of probabilities of characters of the speech input over time, the different linear combinations comprising a matrix, rows of the matrix representing different time points of the speech input, and columns of the matrix representing different characters of a transcript of the speech input; and generating a textual representation of the speech input based on the estimated set of entities of the speech input, the initial character-based representation of the speech input being generated by a first neural network and the estimated set of entities of the speech input being generated by a second neural network, the first and second neural networks being arranged as a cascade neural network and trained in an end-to-end manner.

17. A non-transitory machine-readable storage medium that includes instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving an audio file comprising speech input; processing the audio file comprising the speech input to generate an initial character-based representation of the speech input; processing the initial character-based representation of the speech input to generate an estimated set of entities of the speech input, the estimated set of entities of the speech input being generated based on different linear combinations of proba…
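
To make the matrix in claim 1 and the threshold test in claim 10 more concrete, here is a small sketch. The matrix layout (rows as time points, columns as characters) follows the claim language. Everything else is an assumption made for illustration: the alphabet, the probability values, the scoring rule (one character per time point, probabilities multiplied), the threshold value, and the reading of "transgresses" as "strictly greater than". The claims do not specify how the likelihoods are computed.

```python
import math

# Hypothetical column order for the matrix.
ALPHABET = ["c", "a", "t", "r"]

# Rows are time points, columns are characters, as recited in claim 1.
# The probability values are made up for this example.
MATRIX = [
    [0.80, 0.10, 0.05, 0.05],  # t0: mostly "c"
    [0.10, 0.70, 0.10, 0.10],  # t1: mostly "a"
    [0.05, 0.05, 0.60, 0.30],  # t2: "t" more likely than "r"
]


def sequence_likelihood(matrix, alphabet, chars):
    """Assumed scoring rule: align one character per time point and
    multiply the corresponding probabilities."""
    if len(chars) != len(matrix):
        raise ValueError("this toy alignment needs one character per time point")
    column = {ch: i for i, ch in enumerate(alphabet)}
    return math.prod(row[column[ch]] for row, ch in zip(matrix, chars))


def first_matches_entity(first_likelihood, second_likelihood, threshold):
    """Decision in the style of claim 10: the first likelihood passes the
    entity's threshold and the second does not. Treating "transgresses"
    as "strictly greater than" is an assumption."""
    return first_likelihood > threshold and not (second_likelihood > threshold)


THRESHOLD = 0.25  # hypothetical threshold for the entity "cat"
p_cat = sequence_likelihood(MATRIX, ALPHABET, "cat")  # 0.8 * 0.7 * 0.6
p_car = sequence_likelihood(MATRIX, ALPHABET, "car")  # 0.8 * 0.7 * 0.3
is_cat = first_matches_entity(p_cat, p_car, THRESHOLD)
```

With these made-up numbers, the candidate "cat" scores about 0.336 and "car" about 0.168, so only "cat" clears the 0.25 threshold and is accepted as the entity.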

Assignees

Snap Inc

Inventors

Classifications

G10L21/10 · Transforming into visible information
G10L15/063 · Training
G06N3/08 · Learning methods
G06N3/045 · Combinations of networks
G10L2015/088 · Word spotting

Patent family

Related publications grouped by family.

View patent family 85036785

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12315495B2 cover?

Systems and methods are provided for extracting entities from received speech. The systems and methods perform operations comprising receiving an audio file comprising speech input and processing, by a speech recognition engine, the audio file comprising the speech input to generate an initial character-based representation of the speech input. The operations further comprise processing, by an …

Who is the assignee on this patent?

Snap Inc

What technology area does this patent fall under?

Primary CPC classification G10L15/16. Mapped technology areas include Physics.

When was this patent published?

Publication date May 27, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).