Storage medium, communication terminal, and display method
US-2016234149-A1 · Aug 11, 2016 · US
US12315495B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12315495-B2 |
| Application number | US-202117644970-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 17, 2021 |
| Priority date | Dec 17, 2021 |
| Publication date | May 27, 2025 |
| Grant date | May 27, 2025 |
Official abstract text for this publication.
Systems and methods are provided for extracting entities from received speech. The systems and methods perform operations comprising receiving an audio file comprising speech input and processing, by a speech recognition engine, the audio file comprising the speech input to generate an initial character-based representation of the speech input. The operations further comprise processing, by an entity extractor, the initial character-based representation of the speech input to generate an estimated set of entities of the speech input. The operations further comprise generating, by the speech recognition engine, a textual representation of the speech input based on the estimated set of entities of the speech input.
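The flow in the abstract can be sketched on toy data. This is a minimal illustration, not the patented implementation: the character set, the `toy_row` helper, the argmax decoding, and the substring-lookup "extractor" are all hypothetical stand-ins chosen for brevity. The matrix layout (rows as time points, columns as characters) follows the wording of claim 1.

```python
from typing import Dict, List, Tuple

# Hypothetical character set; each matrix column corresponds to one entry.
ALPHABET = ["a", "c", "l", "m", "o", " "]


def toy_row(char: str, peak: float = 0.9) -> List[float]:
    """Build one time-point row that puts `peak` probability on `char`."""
    rest = (1.0 - peak) / (len(ALPHABET) - 1)
    return [peak if c == char else rest for c in ALPHABET]


def most_likely_chars(matrix: List[List[float]]) -> str:
    """Pick the highest-probability character at every time point (row)."""
    return "".join(ALPHABET[row.index(max(row))] for row in matrix)


def find_entities(chars: str, known: Dict[str, str]) -> List[Tuple[str, str]]:
    """Stand-in for the entity extractor: substring lookup in a toy lexicon."""
    return [(name, label) for name, label in known.items() if name in chars]


# Stand-in for the speech recognition engine's output for "call mom":
# rows are time points, columns are characters of ALPHABET.
char_matrix = [toy_row(c) for c in "call mom"]

initial_chars = most_likely_chars(char_matrix)
entities = find_entities(initial_chars, {"mom": "contact", "dad": "contact"})
print(initial_chars, entities)
```

In the disclosure both stages are neural networks trained together; the lookup above only illustrates the data passed between them.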
Opening claim text (preview).
What is claimed is:
1. A method comprising: receiving, by one or more processors, an audio file comprising speech input; processing the audio file comprising the speech input to generate an initial character-based representation of the speech input; processing the initial character-based representation of the speech input to generate an estimated set of entities of the speech input, the estimated set of entities of the speech input being generated based on different linear combinations of probabilities of characters of the speech input over time, the different linear combinations comprising a matrix, rows of the matrix representing different time points of the speech input, and columns of the matrix representing different characters of a transcript of the speech input; and generating a textual representation of the speech input based on the estimated set of entities of the speech input, the initial character-based representation of the speech input being generated by a first neural network and the estimated set of entities of the speech input being generated by a second neural network, the first and second neural networks being arranged as a cascade neural network and trained in an end-to-end manner.
2. The method of claim 1, further comprising: processing, by an intent classifier, the speech input to generate an estimated intent of the speech input.
3. The method of claim 2, wherein processing the speech input by the intent classifier comprises: processing the initial character-based representation of the speech input to generate the estimated intent of the speech input.
4. The method of claim 2, wherein the intent classifier comprises a third neural network, wherein the first, second and third neural networks are arranged as the cascade neural network trained in an end-to-end manner.
5. The method of claim 2, further comprising: generating a list of possible entities associated with the estimated intent of the speech input.
6. The method of claim 1, wherein for a first row of the rows of the matrix corresponding to a first time point in the speech input, each of the characters in the columns of matrix, associated with the first row, is associated with a respective probability representing a likelihood that the speech input corresponds to a respective one of the characters.
7. The method of claim 1, further comprising streaming a linear combination of a subset of the rows of the matrix.
8. The method of claim 1, further comprising: streaming a first linear combination of characters comprising a first subset of characters associated with a first set of respective probabilities for a first time point; and streaming a second linear combination of characters comprising a second subset of characters associated with a second set of respective probabilities for a second time point.
9. The method of claim 8, further comprising generating the estimated set of entities of the speech input based on the first and second linear combinations of the characters.
10. The method of claim 8, further comprising: computing a first likelihood that a first sequence of characters from the first and second linear combinations corresponds to a given entity; computing a second likelihood that a second sequence of characters from the first and second linear combinations corresponds to the given entity; comparing the first and second likelihoods to a threshold associated with the given entity; and determining that the first sequence of characters corresponds to the given entity in response to determining that the first likelihood transgresses the threshold and the second likelihood fails to transgress the threshold.
11. The method of claim 1, further comprising: computing a loss function that comprises first and second cost functions, the first cost function being based on the estimated set of entities of the speech input, and the second cost function being based on the textual representation of the speech input; and updating parameters of the first and second neural networks based on the loss function, wherein gradients of the first cost function are back-propagated to the second cost function.
12. The method of claim 1, further comprising: receiving training data comprising a plurality of training audio files, each of the plurality of training audio files being associated with a ground-truth set of entities and a ground-truth transcription, the ground-truth set of entities each being associated with a threshold indicating a likelihood that a combination of characters corresponds to a respective entity in the ground-truth set of entities; processing a first training audio file of the plurality of training audio files by the first neural network to generate a first initial character-based representation of the first training audio file; processing, by the second neural network, the first initial character-based representation of the first training audio file to generate a first estimated entity set using the thresholds associated with the ground-truth set of entities; and comparing the first estimated entity set to the ground-truth set of entities associated with the first training audio file to generate a first loss.
13. The method of claim 12, further comprising processing a second training audio file of the plurality of training audio files to further update parameters of the first and second neural networks.
14. The method of claim 12, further comprising updating parameters of the first neural network and the second neural network based on the first loss.
15. The method of claim 1, wherein the second neural network generates the estimated set of entities of the speech input before a transcription of the speech input is generated.
16. A system comprising: at least one processor configured to perform operations comprising: receiving an audio file comprising speech input; processing the audio file comprising the speech input to generate an initial character-based representation of the speech input; processing the initial character-based representation of the speech input to generate an estimated set of entities of the speech input, the estimated set of entities of the speech input being generated based on different linear combinations of probabilities of characters of the speech input over time, the different linear combinations comprising a matrix, rows of the matrix representing different time points of the speech input, and columns of the matrix representing different characters of a transcript of the speech input; and generating a textual representation of the speech input based on the estimated set of entities of the speech input, the initial character-based representation of the speech input being generated by a first neural network and the estimated set of entities of the speech input being generated by a second neural network, the first and second neural networks being arranged as a cascade neural network and trained in an end-to-end manner.
17. A non-transitory machine-readable storage medium that includes instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: receiving an audio file comprising speech input; processing the audio file comprising the speech input to generate an initial character-based representation of the speech input; processing the initial character-based representation of the speech input to generate an estimated set of entities of the speech input, the estimated set of entities of the speech input being generated based on different linear combinations of proba
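The threshold comparison recited in claim 10 can be illustrated with a small sketch. Several things here are assumptions and not taken from the claims: the likelihood is scored as a product of per-time-point character probabilities, "transgresses" is read as "strictly exceeds", and the alphabet, matrix values, and threshold are made up for the example.

```python
from math import prod
from typing import List, Optional

# Hypothetical character set and probability matrix
# (rows = time points, columns = characters of ALPHABET).
ALPHABET = ["m", "o", "n"]
MATRIX = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.6, 0.1, 0.3],
]


def sequence_likelihood(matrix: List[List[float]], candidate: str) -> float:
    """Assumed scoring: multiply the probability of each candidate
    character at its corresponding time point."""
    return prod(row[ALPHABET.index(ch)] for row, ch in zip(matrix, candidate))


def choose_sequence(first: str, second: str, threshold: float) -> Optional[str]:
    """Return `first` only if its likelihood exceeds the entity threshold
    while the likelihood of `second` does not; otherwise return None."""
    first_lik = sequence_likelihood(MATRIX, first)
    second_lik = sequence_likelihood(MATRIX, second)
    if first_lik > threshold and not second_lik > threshold:
        return first
    return None


# Illustrative per-entity threshold for a hypothetical "contact" entity.
chosen = choose_sequence("mom", "mon", threshold=0.3)
print(chosen)
```

With these toy values the first candidate scores about 0.432 and the second about 0.216, so only the first clears the 0.3 threshold.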
Transforming into visible information · CPC title
Training · CPC title
Learning methods · CPC title
Combinations of networks · CPC title
Word spotting · CPC title