Method and apparatus for detecting voice end point using acoustic and language modeling information for robust voice recognition
US-2022230627-A1 · Jul 21, 2022 · US
US12482459B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12482459-B2 |
| Application number | US-202217897352-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 29, 2022 |
| Priority date | Aug 29, 2022 |
| Publication date | Nov 25, 2025 |
| Grant date | Nov 25, 2025 |
Official abstract text for this publication.
The disclosed speech recognition system analyzes an acoustic feature for each subframe of an audio signal; provides a first model configured to determine, on the basis of the acoustic feature, a hidden state for each frame, each frame consisting of multiple subframes; provides a second model configured to determine, on the basis of the hidden states, whether each block is a voice segment or a non-voice segment; and provides a third model configured to determine an utterance content on the basis of a sequence of the hidden states of each block, each block consisting of multiple frames belonging to a voice segment.
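The three time scales named in the abstract (subframe, frame, block) can be pictured with a minimal sketch. The group sizes and the helper name `group` are hypothetical and chosen only for illustration; the publication does not state these values here, and integers stand in for real acoustic features.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group(items: Sequence[T], size: int) -> List[List[T]]:
    """Split items into consecutive groups of `size`; the last group may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


# Hypothetical sizes, not taken from the patent.
SUBFRAMES_PER_FRAME = 4
FRAMES_PER_BLOCK = 3

# Stand-in for one acoustic feature per subframe.
subframe_features = list(range(24))

# A frame consists of multiple subframes; a block consists of multiple frames.
frames = group(subframe_features, SUBFRAMES_PER_FRAME)
blocks = group(frames, FRAMES_PER_BLOCK)
```

With these assumed sizes, 24 subframes yield 6 frames and 2 blocks. In the disclosed system the first model would produce one hidden state per frame, and the second and third models would operate on blocks of those hidden states.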
Opening claim text (preview).
What is claimed is:
1. A speech recognition system comprising: a processor and a memory, the processor coupled to the memory, the processor is configured to: input an audio signal; calculate an acoustic feature for each subframe of the audio signal; calculate, by using a first model, a hidden state series for each frame consisting of multiple subframes on the basis of the acoustic feature; specify, by using a second model, whether a voice segment or a non-voice segment for each block on the basis of the hidden state series, the block consisting of a plurality of frames; calculate, by using a third model, a probability for an utterance content candidate on the basis of a sequence of the hidden state series provided for each block having a single voice segment to specify an utterance content; and train the third model to calculate the probability for the utterance content candidate on the basis of hidden state series; wherein the processor is configured to: specify a first frame subsequent to the non-voice segment as a beginning of the voice segment, specify a second frame prior to a succeeding non-voice segment as an end of the voice segment, adjust block arrangement of the audio signal, by concatenating one or more frames up to the end of the voice segment in a first block with the end of the voice segment to a second block preceding the first block, concatenating one or more frames from the beginning of the voice segment in a third block with the beginning of the voice segment to a fourth block subsequent to the third block, search for recognition results indicating an utterance content for each block arrangement of the audio signal based on the probability for the utterance content candidate calculated by using the third model, and output the recognition results to an external device.
2. The speech recognition system according to claim 1, wherein the processor is configured to divide, by using the second model, a block comprising two or more voice segments into two or more blocks, the two or more blocks containing respective voice segments.
3. The speech recognition system according to claim 2, wherein the processor is configured to calculate for each frame a probability that the frame belongs to a voice segment on the basis of the hidden state, the probability being a voice segment probability; specify a segment having consecutive inactive frames in which the number of inactive frames is more than a predetermined threshold frame number as the non-voice segment, each of the inactive frames having a voice segment probability equal to or less than a predetermined probability threshold; and specify a segment having the consecutive inactive frames as voice segments.
4. The speech recognition system according to claim 1, wherein the first model comprises a first-stage model and a second-stage model, the first-stage model being designed for converting an acoustic feature for each subframe to a frame feature for each frame, and the second-stage model being designed for specifying the hidden state series on the basis of the frame feature.
5. The speech recognition system according to claim 1, wherein the third model is designed for calculating an estimated probability for each candidate of the utterance content corresponding to a hidden state series up to the latest block forming a voice segment, and specifying the utterance content with the highest estimated probability.
6. A non-transitory computer-readable medium storing instructions at a speech recognition system, the instructions, when executed by a processor, cause the speech recognition system to: input an audio signal; calculate an acoustic feature for each subframe of the audio signal; calculate, by using a first model, a hidden state series for each frame consisting of multiple subframes on the basis of the acoustic feature; specify, by using a second model, whether a voice segment or a non-voice segment for each block on the basis of the hidden state series, the block consisting of a plurality of frames; calculate, by using a third model, a probability for an utterance content candidate on the basis of a sequence of the hidden state series provided for each block having a single voice segment to specify an utterance content; and train the third model to calculate the probability for the utterance content candidate on the basis of hidden state series; wherein the instructions cause the speech recognition system to: specify a first frame subsequent to the non-voice segment as a beginning of the voice segment, specify a second frame prior to a succeeding non-voice segment as an end of the voice segment, adjust block arrangement of the audio signal, by concatenating one or more frames up to the end of the voice segment in a first block with the end of the voice segment to a second block preceding the first block, concatenating one or more frames from the beginning of the voice segment in a third block with the beginning of the voice segment to a fourth block subsequent to the third block, search for recognition results indicating an utterance content for each block arrangement of the audio signal based on the probability for the utterance content candidate calculated by using the third model, and output the recognition results to an external device.
7. A method for speech recognition, comprising the steps of: inputting an audio signal; calculating an acoustic feature for each subframe of the audio signal; calculating, by using a first model, a hidden state series for each frame consisting of multiple subframes on the basis of the acoustic feature; specifying, by using a second model, whether a voice segment or a non-voice segment for each block on the basis of the hidden state series, the block consisting of a plurality of frames; calculating, by using a third model, a probability for an utterance content candidate on the basis of a sequence of the hidden state series provided for each block having a single voice segment to specify an utterance content; and training the third model to calculate the probability for the utterance content candidate on the basis of hidden state series; further comprising the steps of: specifying a first frame subsequent to the non-voice segment as a beginning of the voice segment, specifying a second frame prior to a succeeding non-voice segment as an end of the voice segment, adjusting block arrangement of the audio signal, by concatenating one or more frames up to the end of the voice segment in a first block with the end of the voice segment to a second block preceding the first block, concatenating one or more frames from the beginning of the voice segment in a third block with the beginning of the voice segment to a fourth block subsequent to the third block, searching for recognition results indicating an utterance content for each of the arranged blocks of the audio signal based on the probability calculated by using the third model, and outputting the recognition results to an external device.
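The thresholding recited in claim 3, together with the begin and end rules of claims 1, 6, and 7, can be illustrated with a rough sketch. This is not the patented implementation: the function name `find_voice_segments`, the parameter names, and the default threshold values are hypothetical. The sketch also assumes that every frame outside a non-voice segment counts as voice (so short inactive runs stay inside a voice segment) and that the start and end of the signal act as segment boundaries.

```python
from typing import List, Optional, Tuple


def find_voice_segments(
    voice_probs: List[float],
    prob_threshold: float = 0.5,   # hypothetical "predetermined probability threshold"
    min_silence_frames: int = 3,   # hypothetical "predetermined threshold frame number"
) -> List[Tuple[int, int]]:
    """Return inclusive (begin, end) frame indices of each voice segment."""
    n = len(voice_probs)
    is_non_voice = [False] * n

    # A frame is inactive when its voice segment probability is <= the threshold.
    # A run of MORE than min_silence_frames inactive frames is a non-voice segment.
    run_start: Optional[int] = None
    for i in range(n + 1):  # one extra step flushes a trailing inactive run
        inactive = i < n and voice_probs[i] <= prob_threshold
        if inactive:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > min_silence_frames:
                for j in range(run_start, i):
                    is_non_voice[j] = True
            run_start = None

    # Begin: first frame after a non-voice segment.
    # End: last frame before the succeeding non-voice segment.
    segments: List[Tuple[int, int]] = []
    begin: Optional[int] = None
    for i in range(n):
        if not is_non_voice[i]:
            if begin is None:
                begin = i
        elif begin is not None:
            segments.append((begin, i - 1))
            begin = None
    if begin is not None:
        segments.append((begin, n - 1))
    return segments


# Illustrative per-frame voice segment probabilities (made-up values).
probs = [0.1, 0.1, 0.1, 0.1, 0.9, 0.8, 0.2, 0.9,
         0.1, 0.1, 0.1, 0.1, 0.1, 0.7, 0.9]
print(find_voice_segments(probs))  # [(4, 7), (13, 14)]
```

In the example, the single inactive frame at index 6 is shorter than the assumed run length, so it remains part of the first voice segment, while the runs at indices 0 to 3 and 8 to 12 are long enough to be treated as non-voice segments. The resulting begin and end frames are what the block rearrangement step of the independent claims would then use to move partial segments into neighboring blocks.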
Segmentation; Word boundary detection · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Detection of presence or absence of voice signals (switching of direction of transmission by voice frequency in two-way loud-speaking telephone systems H04M9/10) · CPC title
using artificial neural networks · CPC title
Probabilistic grammars, e.g. word n-grams · CPC title