What technology area does this patent fall under?

Primary CPC classification G10L17/02. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 16, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Voice recognition device and method

US11961522B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11961522-B2
Application number	US-201917296806-A
Country	US
Kind code	B2
Filing date	Nov 22, 2019
Priority date	Nov 28, 2018
Publication date	Apr 16, 2024
Grant date	Apr 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure relates to an electronic apparatus for recognizing user voice and a method of recognizing, by the electronic apparatus, the user voice. According to an embodiment, the method of recognizing the user voice includes obtaining an audio signal segmented into a plurality of frame units, determining an energy component for each filter bank by applying a filter bank distributed according to a preset scale to a frequency spectrum of the audio signal segmented into the frame units, smoothing the determined energy component for each filter bank, extracting a feature vector of the audio signal based on the smoothed energy component for each filter bank, and recognizing the user voice in the audio signal by inputting the extracted feature vector to a voice recognition model.
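The steps named in the abstract (framing, filter bank energies, smoothing, feature vector) can be sketched roughly as follows. This is a minimal illustration and not the patented implementation. It assumes that the "preset scale" is the mel scale, that the filters are triangular, that a Hann window is used, and that the feature vector is a set of DCT coefficients. All function names and default parameters are hypothetical. The smoothing step is left as a pluggable `smooth` argument that defaults to the identity.

```python
import cmath
import math


def frame_signal(samples, window_length, hop):
    """Split samples into overlapping frames of window_length, advancing by hop."""
    return [samples[i:i + window_length]
            for i in range(0, len(samples) - window_length + 1, hop)]


def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of a Hann-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale (assumed scale)."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    edges = [hz / nyquist * (num_bins - 1) for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for b in range(num_bins):
            if lo < b <= mid:
                weights.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                weights.append((hi - b) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def log_filter_bank_energies(spectrum, bank, floor=1e-10):
    """Weighted spectral energy per filter, converted to a log scale."""
    return [math.log(max(sum(w * p for w, p in zip(weights, spectrum)), floor))
            for weights in bank]


def dct_ii(values, num_coeffs):
    """Unnormalized type-II discrete cosine transform, first num_coeffs terms."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(num_coeffs)]


def extract_features(samples, sample_rate, window_length, hop,
                     num_filters=8, num_coeffs=4, smooth=lambda e: e):
    """Return one feature vector per frame; `smooth` is a placeholder smoothing step."""
    bank = mel_filter_bank(num_filters, window_length // 2 + 1, sample_rate)
    features = []
    for frame in frame_signal(samples, window_length, hop):
        energies = log_filter_bank_energies(power_spectrum(frame), bank)
        features.append(dct_ii(smooth(energies), num_coeffs))
    return features
```

For example, calling `extract_features(samples, 16000, 64, 32)` on 256 samples yields one four-element vector per overlapping 64-sample frame. A real system would feed these vectors to a voice recognition model, which is outside the scope of this sketch.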

First claim

Opening claim text (preview).

The invention claimed is:
1. A method of recognizing a user voice, the method comprising: obtaining an audio signal segmented into a plurality of frame units; determining an energy component for each filter bank by applying a filter bank distributed according to a preset scale to a frequency spectrum of the audio signal segmented into the plurality of frame units; smoothing the determined energy component for each filter bank; extracting a feature vector of the audio signal based on the smoothed energy component for each filter bank; and recognizing the user voice in the audio signal by inputting the extracted feature vector to a voice recognition model, wherein the obtaining of the audio signal comprises determining a window length of a window to segment into the plurality of frame units, and wherein the window length of a window used when a vocal tract length perturbation (VTLP) is applied to the frequency spectrum of a training signal is different from the window length of a window used when a room impulse (RIR) filter is adopted and the window length of a window used when actually obtained user voice in the audio signal is recognized.
2. The method of claim 1, wherein the obtaining of the audio signal comprises: overlapping windows having the determined window length at predetermined window intervals; and segmenting the audio signal into the plurality of frame units by using the overlapped windows.
3. The method of claim 1, wherein the determining of the energy component for each filter bank comprises: applying the distributed filter bank to the frequency spectrum of the audio signal; converting a value of the frequency spectrum to which the filter bank is applied, to a log-scale; and determining the energy component for each filter bank by using the value of the frequency spectrum that is converted to the log-scale.
4. The method of claim 1, wherein the smoothing of the determined energy component for each filter bank comprises: training, for each filter bank, a smoothing coefficient to smooth the energy component for each filter bank based on a uniformly distributed target energy component; and smoothing the energy component for each filter bank by using the smoothing coefficient trained for each filter bank.
5. The method of claim 1, wherein the smoothing of the determined energy component for each filter bank comprises: generating a histogram related to a size of the energy component for each filter bank of the audio signal; determining a mapping function to map the generated histogram to a target histogram in which the size of the energy component for each filter bank is uniformly distributed; and smoothing the energy component for each filter bank by converting the energy component for each filter bank of an audio signal by using the determined mapping function.
6. The method of claim 1, wherein the extracting of the feature vector of the audio signal comprises: determining a discrete cosine transform (DCT) coefficient by performing discrete cosine transform on the smoothed energy component for each filter bank; and extracting a feature vector comprising at least one of the determined DCT coefficients as an element.
7. The method of claim 1, wherein the voice recognition model is pre-trained, based on the feature vector of an audio training signal re-synthesized by using the frequency spectrum of the audio training signal in which frequency axis of the frequency spectrum of the audio training signal obtained for each frame unit is transformed, to represent a variation of different vocal tract lengths of a plurality of speakers.
8. The method of claim 7, wherein the frequency axis of the frequency spectrum of the audio training signal is transformed based on a warping coefficient randomly generated for each frame and a warping function to transform the frequency axis on the frequency spectrum of the audio training signal based on the warping coefficient.
9. The method of claim 7, wherein the voice recognition model is pre-trained, based on the re-synthesized audio training signal to which a room impulse filter indicating an acoustic feature of the audio signal for each transfer path in a room in which the audio signal is transmitted is applied.
10. The method of claim 7, wherein the re-synthesized audio training signal is generated by performing inverse fast Fourier transform on the frequency spectrum of the audio training signal in which the frequency axis is transformed and overlapping, on a time axis, the frequency spectrum of the audio training signal that is inverse fast Fourier transformed.
11. An electronic apparatus for recognizing a user voice, the electronic apparatus comprising: a memory storing one or more instructions; and a processor configured to execute the one or more instructions, wherein the processor is further configured to, by executing the one or more instructions: obtain an audio signal segmented into a plurality of frame units, determine an energy component for each filter bank by applying a filter bank distributed according to a preset scale to a frequency spectrum of the audio signal segmented into the plurality of frame units, smooth the determined energy component for each filter bank, extract a feature vector of the audio signal based on the smoothed energy component for each filter bank, and recognize the user voice in the audio signal by inputting the extracted feature vector to a voice recognition model, wherein the processor is further configured to determine a window length of a window to segment into the plurality of frame units, and wherein the window length of a window used when a vocal tract length perturbation (VTLP) is applied to the frequency spectrum of a training signal is different from the window length of a window used when a room impulse (RIR) filter is adopted and the window length of a window used when actually obtained user voice in the audio signal is recognized.
12. The electronic apparatus of claim 11, wherein the processor is further configured to, by executing the one or more instructions: train, for each filter bank, a smoothing coefficient to smooth the energy component for each filter bank based on a uniformly distributed target energy component, and smooth the energy component for each filter bank by using the smoothing coefficient trained for each filter bank.
13. The electronic apparatus of claim 11, wherein the processor is further configured to, by executing the one or more instructions: generate a histogram related to a size of the energy component for each filter bank of the audio signal, determine a mapping function to map the generated histogram to a target histogram in which the size of the energy component for each filter bank is uniformly distributed, and smooth the energy component for each filter bank by converting the energy component for each filter bank of an audio signal by using the determined mapping function.
14. The electronic apparatus of claim 11, wherein the voice recognition model is pre-trained, based on the feature vector of an audio training signal re-synthesized by using the frequency spectrum of the audio training signal in which frequency axis of the frequency spectrum of the audio training signal obtained for each frame unit is transformed, to represent a variation of different vocal tract lengths of a plurality of speakers.
15. A non-transitory computer-readable recording medium having recorded thereon a program for executing, on a computer, the method defined in claim 1.
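The histogram-based smoothing recited in claims 5 and 13 can be illustrated with a short sketch. This is one plausible reading and not the patent's own procedure. It assumes that the mapping function is the empirical cumulative distribution of each filter bank's energies, which maps an arbitrary histogram onto an approximately uniform target range. The function names, the bin count, and the target range `[0, 1]` are hypothetical choices.

```python
def build_uniform_mapping(energies, num_bins=32, target_low=0.0, target_high=1.0):
    """Return a function that maps an energy value onto a uniform target range.

    The mapping is the cumulative histogram (empirical CDF) of `energies`,
    rescaled to [target_low, target_high].
    """
    low, high = min(energies), max(energies)
    width = (high - low) / num_bins or 1.0  # avoid a zero bin width for constant input
    counts = [0] * num_bins
    for e in energies:
        idx = min(int((e - low) / width), num_bins - 1)
        counts[idx] += 1

    total = len(energies)
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)

    def mapping(value):
        # Clamp unseen values into the histogram range before the lookup.
        idx = min(max(int((value - low) / width), 0), num_bins - 1)
        return target_low + cdf[idx] * (target_high - target_low)

    return mapping


def smooth_per_filter_bank(frames):
    """Equalize each filter bank channel independently.

    `frames` is a list of per-frame energy lists, one value per filter bank.
    """
    num_filters = len(frames[0])
    mappings = [build_uniform_mapping([frame[m] for frame in frames])
                for m in range(num_filters)]
    return [[mappings[m](frame[m]) for m in range(num_filters)] for frame in frames]
```

Because the cumulative histogram is non-decreasing, the mapping preserves the ordering of energies within a filter bank while spreading heavily skewed values more evenly across the target range.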

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

G10L17/02 · Primary
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
G10L17/04
Training, enrolment or model building · CPC title
G10L25/21
the extracted parameters being power information · CPC title
G10L17/18
Artificial neural networks; Connectionist approaches · CPC title
G10L17/20
Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions · CPC title

Patent family

Related publications grouped by family.

View patent family 70851868

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11961522B2 cover?: The disclosure relates to an electronic apparatus for recognizing user voice and a method of recognizing, by the electronic apparatus, the user voice. According to an embodiment, the method of recognizing the user voice includes obtaining an audio signal segmented into a plurality of frame units, determining an energy component for each filter bank by applying a filter bank distributed accordin…
Who is the assignee on this patent?: Samsung Electronics Co Ltd