Speech recognition device, speech recognition method, and non-transitory computer-readable medium

US12579983B2 · US · B2

Patent metadata
Publication number: US-12579983-B2
Application number: US-202318453338-A
Country: US
Kind code: B2
Filing date: Aug 22, 2023
Priority date: Sep 20, 2022
Publication date: Mar 17, 2026
Grant date: Mar 17, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech recognition device includes: an acquisition part, acquiring a speech signal; a speech feature amount calculation part, calculating a speech feature amount; a first speech recognition part, based on the speech feature amount, performing speech recognition using a learned first E2E model, attaching a first tag to a vocabulary portion of a specific class in text that is a recognition result, and outputting the same; a second speech recognition part, based on the speech feature amount, performing speech recognition using a learned second E2E model, attaching a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and outputting the same; a phoneme replacement part, replacing a vocabulary with the first tag with a phoneme with the second tag; and an output part, converting the phoneme with the second tag into text and outputting the same.
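The replacement step in the abstract can be pictured with a minimal sketch. It assumes the two recognizers emit strings with inline tags; the `<c1>`/`<c2>` markup, the function name, and the sample phoneme string are hypothetical, since the publication does not specify a tag syntax.

```python
import re

# Hypothetical tag markup; the patent does not specify how tags are written.
FIRST_TAG = re.compile(r"<c1>(.*?)</c1>")
SECOND_TAG = re.compile(r"<c2>(.*?)</c2>")


def replace_with_phonemes(tagged_text: str, tagged_phonemes: str) -> str:
    """Swap the first-tagged span in the text for the second-tagged phoneme span."""
    phoneme_match = SECOND_TAG.search(tagged_phonemes)
    if phoneme_match is None:
        return tagged_text  # no tagged phonemes, so nothing to substitute
    phoneme_span = "<c2>" + phoneme_match.group(1) + "</c2>"
    # count=1 replaces only the first tagged vocabulary portion.
    return FIRST_TAG.sub(lambda _m: phoneme_span, tagged_text, count=1)


text_result = "please call <c1>sato</c1> now"
phoneme_result = "p l i z k o l <c2>s a i t o</c2> n a u"
print(replace_with_phonemes(text_result, phoneme_result))
```

The text recognizer's guess for the tagged word is discarded here and the phoneme recognizer's span takes its place, ready for conversion back to text by the output part.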

First claim

Opening claim text (preview).

What is claimed is:

1. A speech recognition device, comprising: an acquisition part, acquiring a speech signal; a speech feature amount calculation part, calculating a speech feature amount of the acquired speech signal; a first speech recognition part, based on the speech feature amount, performing speech recognition using a first end-to-end model that has been learned, attaching a first tag to a vocabulary portion of a specific class in a text that is a recognition result, and outputting the vocabulary portion with the first tag; a second speech recognition part, based on the speech feature amount, performing speech recognition using a second end-to-end model that has been learned, attaching a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and outputting the phoneme with the second tag; a phoneme replacement part, replacing the vocabulary portion with the first tag in the text recognized by the first speech recognition part with the phoneme with the second tag; and an output part, converting the phoneme with the second tag obtained by replacement by the phoneme replacement part into a text and outputting the converted text, wherein, in response to a text with a highest similarity in a language model in which text and phonemes are associated has the similarity greater than a threshold, the output part converts and outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part; and in response to the text with highest similarity in a language model stored in a language model storage part has the similarity equal to or less than a threshold, the output part outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part as the text with the first tag recognized by the first speech recognition part as it is.

2. The speech recognition device according to claim 1, wherein the output part converts the phoneme with the second tag obtained by replacement by the phoneme replacement part into a text with a highest similarity in a language model in which text and phonemes are associated.

3. The speech recognition device according to claim 1, wherein the first end-to-end model is learned using a speech signal and text data for each utterance unit; and the second end-to-end model is learned using a speech signal and phoneme data for each utterance unit.

4. The speech recognition device according to claim 2, wherein the first end-to-end model is learned using a speech signal and text data for each utterance unit; and the second end-to-end model is learned using a speech signal and phoneme data for each utterance unit.

5. The speech recognition device according to claim 1, wherein, in response to there being a plurality of vocabulary portions of the specific class with the first tag in the text outputted by the first speech recognition part, the phoneme replacement part replaces a first vocabulary portion of the specific class with the first tag with the phoneme with the second tag.

6. The speech recognition device according to claim 2, wherein, in response to there being a plurality of vocabulary portions of the specific class with the first tag in the text outputted by the first speech recognition part, the phoneme replacement part replaces the first vocabulary portion of the specific class with the first tag with a phoneme with the second tag.

7. The speech recognition device according to claim 1, wherein the vocabulary portion of the specific class is at least one proper noun of a person's name, a department name, a product name, a model name, a part name, and a place name.

8. The speech recognition device according to claim 2, wherein the vocabulary portion of the specific class is at least one proper noun of a person's name, a department name, a product name, a model name, a part name, and a place name.

9. A speech recognition method, comprising: by an acquisition part, acquiring a speech signal; by a speech feature amount calculation part, calculating a speech feature amount of the acquired speech signal; by a first speech recognition part, based on the speech feature amount, performing speech recognition using a first end-to-end model that has been learned, attaching a first tag to a vocabulary portion of a specific class in a text that is a recognition result, and outputting the vocabulary portion with the first tag; by a second speech recognition part, based on the speech feature amount, performing speech recognition using a second end-to-end model that has been learned, attaching a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and outputting the phoneme with the second tag; by a phoneme replacement part, replacing the vocabulary portion with the first tag in the text recognized by the first speech recognition part with a phoneme with the second tag; and by an output part, converting the phoneme with the second tag obtained by replacement by the phoneme replacement part into a text and outputting the converted text, wherein, in response to a text with a highest similarity in a language model in which text and phonemes are associated has the similarity greater than a threshold, the output part converts and outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part; and in response to the text with highest similarity in a language model stored in a language model storage part has the similarity equal to or less than a threshold, the output part outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part as the text with the first tag recognized by the first speech recognition part as it is.

10. A non-transitory computer-readable medium storing a program, the program causing a computer to: acquire a speech signal; calculate a speech feature amount of the acquired speech signal; based on the speech feature amount, perform speech recognition using a first end-to-end model that has been learned, attach a first tag to a vocabulary portion of a specific class in a text that is a recognition result, and output the vocabulary portion with the first tag; based on the speech feature amount, perform speech recognition using a second end-to-end model that has been learned, attach a second tag to a vocabulary portion of a specific class in a phoneme that is a recognition result, and output the phoneme with the second tag; replace the vocabulary portion with the first tag in the text recognized using the first end-to-end model with the phoneme with the second tag; and convert the phoneme with the second tag obtained by replacement into a text and output the converted text, wherein, in response to a text with a highest similarity in a language model in which text and phonemes are associated has the similarity greater than a threshold, the computer converts and outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part; and in response to the text with highest similarity in a language model stored in a language model storage part has the similarity equal to or less than a threshold, the computer outputs the phoneme with the second tag obtained by replacement by the phoneme replacement part as the text with the first tag recognized by the first speech recognition part as it is.
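The threshold condition shared by the independent claims can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `LEXICON` dictionary stands in for the language model in which text and phonemes are associated, the similarity measure (a `difflib` ratio over phoneme tokens) and the 0.8 threshold are arbitrary choices, and all names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical lexicon standing in for the language model that associates
# text with phonemes; the entries are illustrative only.
LEXICON = {
    "Saito": "s a i t o",
    "Sato": "s a t o",
    "Kato": "k a t o",
}


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure: difflib ratio over phoneme tokens."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def convert_or_fall_back(phonemes: str, first_text: str, threshold: float = 0.8) -> str:
    """Return the best-matching lexicon text if its similarity exceeds the
    threshold; otherwise keep the first recognizer's text as it is."""
    best_text, best_score = max(
        ((text, similarity(phonemes, entry)) for text, entry in LEXICON.items()),
        key=lambda pair: pair[1],
    )
    return best_text if best_score > threshold else first_text


print(convert_or_fall_back("s a i t o", "Sato"))  # exact lexicon match is used
print(convert_or_fall_back("m o r i", "Mori"))    # no close match, text is kept
```

The strict `>` comparison mirrors the claim wording: a best similarity equal to or less than the threshold falls back to the text from the first speech recognition part.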

Assignees

Honda Motor Co Ltd

Inventors

Classifications

  • Editing, e.g. inserting or deleting · CPC title

  • Phonemes, fenemes or fenones being the recognition units · CPC title

  • Formal grammars, e.g. finite state automata, context free grammars or word networks · CPC title

  • G10L15/26 · Primary

    Speech to text systems (G10L15/08 takes precedence) · CPC title

  • G10L15/32 · Primary

    Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12579983B2 cover?
A speech recognition device includes: an acquisition part, acquiring a speech signal; a speech feature amount calculation part, calculating a speech feature amount; a first speech recognition part, based on the speech feature amount, performing speech recognition using a learned first E2E model, attaching a first tag to a vocabulary portion of a specific class in text that is a recognition resu…
Who is the assignee on this patent?
Honda Motor Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/26. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 17, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).