Speech recognition device, search device, speech recognition method, search method, and program
US-2022108699-A1 · Apr 7, 2022 · US
US12579983B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12579983-B2 |
| Application number | US-202318453338-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 22, 2023 |
| Priority date | Sep 20, 2022 |
| Publication date | Mar 17, 2026 |
| Grant date | Mar 17, 2026 |
Official abstract text for this publication.
A speech recognition device includes: an acquisition part, acquiring a speech signal; a speech feature amount calculation part, calculating a speech feature amount; a first speech recognition part, based on the speech feature amount, performing speech recognition using a learned first E2E model, attaching a first tag to a vocabulary portion of a specific class in text that is a recognition result, and outputting the same; a second speech recognition part, based on the speech feature amount, performing speech recognition using a learned second E2E model, attaching a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and outputting the same; a phoneme replacement part, replacing a vocabulary with the first tag with a phoneme with the second tag; and an output part, converting the phoneme with the second tag into text and outputting the same.
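The phoneme replacement step in the abstract can be illustrated with a minimal Python sketch. The tag syntax (`<c1>…</c1>` for the first tag, `<c2>…</c2>` for the second tag) and the function name are hypothetical, since the abstract does not say how tags are written. The two tagged strings stand in for the outputs of the first and second E2E models, which are not modeled here.

```python
import re

# Hypothetical tag syntax: the abstract does not specify how tags are encoded.
TEXT_TAG = re.compile(r"<c1>(.*?)</c1>")      # first tag, on the text output
PHONEME_TAG = re.compile(r"<c2>(.*?)</c2>")   # second tag, on the phoneme output


def replace_tagged_vocabulary(tagged_text: str, tagged_phonemes: str) -> str:
    """Swap the first first-tag span in the text for the first second-tag phoneme span."""
    phoneme_match = PHONEME_TAG.search(tagged_phonemes)
    if phoneme_match is None:
        # No tagged phoneme span was recognized, so the text is left untouched.
        return tagged_text
    replacement = f"<c2>{phoneme_match.group(1)}</c2>"
    # A callable replacement avoids backslash-escape processing in re.sub.
    return TEXT_TAG.sub(lambda _match: replacement, tagged_text, count=1)


text_output = "call <c1>Danaka</c1> now"
phoneme_output = "k o l <c2>t a n a k a</c2> n a u"
print(replace_tagged_vocabulary(text_output, phoneme_output))
# prints: call <c2>t a n a k a</c2> now
```

The result still contains phonemes inside the second tag; converting them back to text is the job of the output part.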
Opening claim text (preview).
What is claimed is:
1. A speech recognition device, comprising: an acquisition part, acquiring a speech signal; a speech feature amount calculation part, calculating a speech feature amount of the acquired speech signal; a first speech recognition part, based on the speech feature amount, performing speech recognition using a first end-to-end model that has been learned, attaching a first tag to a vocabulary portion of a specific class in a text that is a recognition result, and outputting the vocabulary portion with the first tag; a second speech recognition part, based on the speech feature amount, performing speech recognition using a second end-to-end model that has been learned, attaching a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and outputting the phoneme with the second tag; a phoneme replacement part, replacing the vocabulary portion with the first tag in the text recognized by the first speech recognition part with the phoneme with the second tag; and an output part, converting the phoneme with the second tag obtained by replacement by the phoneme replacement part into a text and outputting the converted text, wherein, in response to a text with a highest similarity in a language model in which text and phonemes are associated has the similarity greater than a threshold, the output part converts and outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part; and in response to the text with highest similarity in a language model stored in a language model storage part has the similarity equal to or less than a threshold, the output part outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part as the text with the first tag recognized by the first speech recognition part as it is.
2. The speech recognition device according to claim 1, wherein the output part converts the phoneme with the second tag obtained by replacement by the phoneme replacement part into a text with a highest similarity in a language model in which text and phonemes are associated.
3. The speech recognition device according to claim 1, wherein the first end-to-end model is learned using a speech signal and text data for each utterance unit; and the second end-to-end model is learned using a speech signal and phoneme data for each utterance unit.
4. The speech recognition device according to claim 2, wherein the first end-to-end model is learned using a speech signal and text data for each utterance unit; and the second end-to-end model is learned using a speech signal and phoneme data for each utterance unit.
5. The speech recognition device according to claim 1, wherein, in response to there being a plurality of vocabulary portions of the specific class with the first tag in the text outputted by the first speech recognition part, the phoneme replacement part replaces a first vocabulary portion of the specific class with the first tag with the phoneme with the second tag.
6. The speech recognition device according to claim 2, wherein, in response to there being a plurality of vocabulary portions of the specific class with the first tag in the text outputted by the first speech recognition part, the phoneme replacement part replaces the first vocabulary portion of the specific class with the first tag with a phoneme with the second tag.
7. The speech recognition device according to claim 1, wherein the vocabulary portion of the specific class is at least one proper noun of a person's name, a department name, a product name, a model name, a part name, and a place name.
8. The speech recognition device according to claim 2, wherein the vocabulary portion of the specific class is at least one proper noun of a person's name, a department name, a product name, a model name, a part name, and a place name.
9. A speech recognition method, comprising: by an acquisition part, acquiring a speech signal; by a speech feature amount calculation part, calculating a speech feature amount of the acquired speech signal; by a first speech recognition part, based on the speech feature amount, performing speech recognition using a first end-to-end model that has been learned, attaching a first tag to a vocabulary portion of a specific class in a text that is a recognition result, and outputting the vocabulary portion with the first tag; by a second speech recognition part, based on the speech feature amount, performing speech recognition using a second end-to-end model that has been learned, attaching a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and outputting the phoneme with the second tag; by a phoneme replacement part, replacing the vocabulary portion with the first tag in the text recognized by the first speech recognition part with a phoneme with the second tag; and by an output part, converting the phoneme with the second tag obtained by replacement by the phoneme replacement part into a text and outputting the converted text, wherein, in response to a text with a highest similarity in a language model in which text and phonemes are associated has the similarity greater than a threshold, the output part converts and outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part; and in response to the text with highest similarity in a language model stored in a language model storage part has the similarity equal to or less than a threshold, the output part outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part as the text with the first tag recognized by the first speech recognition part as it is.
10. A non-transitory computer-readable medium storing a program, the program causing a computer to: acquire a speech signal; calculate a speech feature amount of the acquired speech signal; based on the speech feature amount, perform speech recognition using a first end-to-end model that has been learned, attach a first tag to a vocabulary portion of a specific class in a text that is a recognition result, and output the vocabulary portion with the first tag; based on the speech feature amount, perform speech recognition using a second end-to-end model that has been learned, attach a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and output the phoneme with the second tag; replace the vocabulary portion with the first tag in the text recognized using the first end-to-end model with the phoneme with the second tag; and convert the phoneme with the second tag obtained by replacement into a text and output the converted text, wherein, in response to a text with a highest similarity in a language model in which text and phonemes are associated has the similarity greater than a threshold, the computer converts and outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part; and in response to the text with highest similarity in a language model stored in a language model storage part has the similarity equal to or less than a threshold, the computer outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part as the text with the first tag recognized by the first speech recognition part as it is.
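The threshold condition recited in the independent claims can be read as a lookup with a fallback, sketched below in Python. Several parts are assumptions for illustration only: the language model is reduced to a dictionary mapping text to a phoneme list, similarity is computed with `difflib.SequenceMatcher` from the standard library (the claims do not define the similarity measure), and the names `best_match`, `resolve_tagged_phonemes`, and the threshold value 0.8 are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(phonemes, lexicon):
    """Return (text, similarity) for the lexicon entry closest to the phoneme sequence."""
    best_text, best_score = None, 0.0
    for text, entry_phonemes in lexicon.items():
        # Assumed similarity measure; the claims leave the measure unspecified.
        score = SequenceMatcher(None, phonemes, entry_phonemes).ratio()
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score


def resolve_tagged_phonemes(phonemes, first_pass_text, lexicon, threshold=0.8):
    """Convert tagged phonemes to text, or fall back to the first recognizer's text."""
    text, score = best_match(phonemes, lexicon)
    if text is not None and score > threshold:
        # Similarity greater than the threshold: output the converted text.
        return text
    # Similarity equal to or less than the threshold: keep the first-pass text as it is.
    return first_pass_text


# Toy stand-in for a language model in which text and phonemes are associated.
lexicon = {
    "Tanaka": ["t", "a", "n", "a", "k", "a"],
    "Takana": ["t", "a", "k", "a", "n", "a"],
}
print(resolve_tagged_phonemes(["t", "a", "n", "a", "k", "a"], "Danaka", lexicon))
print(resolve_tagged_phonemes(["x", "y", "z"], "Xyz", lexicon))
```

In this sketch the first call finds an exact phoneme match and returns the lexicon text, while the second call finds no entry above the threshold and returns the first recognizer's text unchanged.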
Editing, e.g. inserting or deleting · CPC title
Phonemes, fenemes or fenones being the recognition units · CPC title
Formal grammars, e.g. finite state automata, context free grammars or word networks · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems · CPC title