The invention claimed is:
1. A method of recognizing a user voice, the method comprising:
obtaining an audio signal segmented into a plurality of frame units;
determining an energy component for each filter bank by applying a filter bank distributed according to a preset scale to a frequency spectrum of the audio signal segmented into the plurality of frame units;
smoothing the determined energy component for each filter bank;
extracting a feature vector of the audio signal based on the smoothed energy component for each filter bank; and
recognizing the user voice in the audio signal by inputting the extracted feature vector to a voice recognition model,
wherein the obtaining of the audio signal comprises determining a window length of a window to segment into the plurality of frame units, and
wherein the window length of a window used when a vocal tract length perturbation (VTLP) is applied to the frequency spectrum of a training signal is different from the window length of a window used when a room impulse response (RIR) filter is adopted and the window length of a window used when actually obtained user voice in the audio signal is recognized.
2. The method of claim 1, wherein the obtaining of the audio signal comprises:
overlapping windows having the determined window length at predetermined window intervals; and
segmenting the audio signal into the plurality of frame units by using the overlapped windows.
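The framing step of claim 2 can be illustrated with a minimal sketch. The function name `frame_signal`, the parameter names, and the choice of a Hann window are assumptions made for illustration; the claim only requires windows of a determined length overlapped at predetermined intervals.

```python
import math

def frame_signal(samples, window_length, hop_length):
    """Segment samples into overlapping frames (hypothetical helper).

    window_length: number of samples per window.
    hop_length: interval between window starts; a value smaller than
    window_length makes consecutive windows overlap.
    """
    if window_length <= 0 or hop_length <= 0:
        raise ValueError("window_length and hop_length must be positive")
    # Periodic Hann window (an assumed window shape, not taken from the claims).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / window_length)
              for n in range(window_length)]
    frames = []
    # Only full windows are kept; trailing samples that do not fill a window are dropped.
    for start in range(0, len(samples) - window_length + 1, hop_length):
        frames.append([samples[start + n] * window[n] for n in range(window_length)])
    return frames

# Ten samples, windows of 4 samples placed every 2 samples (50% overlap).
frames = frame_signal([1.0] * 10, window_length=4, hop_length=2)
```

In this sketch a different `window_length` could be passed for training-time and recognition-time processing, which is the kind of variation recited in claim 1.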
3. The method of claim 1, wherein the determining of the energy component for each filter bank comprises:
applying the distributed filter bank to the frequency spectrum of the audio signal;
converting a value of the frequency spectrum to which the filter bank is applied, to a log-scale; and
determining the energy component for each filter bank by using the value of the frequency spectrum that is converted to the log-scale.
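The steps of claim 3 can be sketched as follows, assuming that the preset scale is the mel scale and that the filters are triangular. Both are assumptions for illustration, as are the helper names and the `floor` parameter that guards against taking the logarithm of zero.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_filters, num_bins, sample_rate):
    """Build triangular filters whose edges are evenly spaced on the mel scale.

    num_bins is the number of spectrum bins from 0 Hz to the Nyquist frequency.
    """
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    edges_hz = [mel_to_hz(max_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * nyquist / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        low, center, high = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        weights = []
        for f in bin_hz:
            if low < f <= center:
                weights.append((f - low) / (center - low))   # rising slope
            elif center < f < high:
                weights.append((high - f) / (high - center))  # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def log_filter_bank_energies(power_spectrum, bank, floor=1e-10):
    """Apply each filter to the spectrum, then convert the result to a log scale."""
    return [math.log(max(sum(w * p for w, p in zip(weights, power_spectrum)), floor))
            for weights in bank]

bank = mel_filter_bank(num_filters=4, num_bins=33, sample_rate=16000)
energies = log_filter_bank_energies([1.0] * 33, bank)
```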
4. The method of claim 1, wherein the smoothing of the determined energy component for each filter bank comprises:
training, for each filter bank, a smoothing coefficient to smooth the energy component for each filter bank based on a uniformly distributed target energy component; and
smoothing the energy component for each filter bank by using the smoothing coefficient trained for each filter bank.
5. The method of claim 1, wherein the smoothing of the determined energy component for each filter bank comprises:
generating a histogram related to a size of the energy component for each filter bank of the audio signal;
determining a mapping function to map the generated histogram to a target histogram in which the size of the energy component for each filter bank is uniformly distributed; and
smoothing the energy component for each filter bank by converting the energy component for each filter bank of an audio signal by using the determined mapping function.
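The histogram mapping of claim 5 resembles histogram equalization. The sketch below is one possible reading, not the claimed implementation: it uses the empirical cumulative distribution of the observed energies (a histogram with one bin per distinct value) as the mapping function, so that the mapped values are spread approximately uniformly over an assumed target range. The function name and the target range are hypothetical.

```python
from bisect import bisect_right

def build_uniform_mapping(energies, target_min=0.0, target_max=1.0):
    """Return a function that maps an energy value toward a uniform target range."""
    ordered = sorted(energies)
    count = len(ordered)
    if count == 0:
        raise ValueError("at least one energy value is required")

    def mapping(value):
        # Fraction of observed energies that are less than or equal to value.
        cdf = bisect_right(ordered, value) / count
        return target_min + cdf * (target_max - target_min)

    return mapping

# Energies of one filter bank channel, concentrated at small values.
channel_energies = [0.1, 0.1, 0.2, 5.0]
mapping = build_uniform_mapping(channel_energies)
smoothed = [mapping(e) for e in channel_energies]  # [0.5, 0.5, 0.75, 1.0]
```

The outlier 5.0 is pulled in to 1.0 while the clustered small values are spread apart, which is the flattening effect a uniform target histogram aims for.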
6. The method of claim 1, wherein the extracting of the feature vector of the audio signal comprises:
determining a discrete cosine transform (DCT) coefficient by performing discrete cosine transform on the smoothed energy component for each filter bank; and
extracting a feature vector comprising at least one of the determined DCT coefficients as an element.
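The feature extraction of claim 6 can be sketched with an unnormalized type-II discrete cosine transform. The DCT variant, the absence of scaling, and the function names are assumptions for illustration; the claim does not specify them.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II of a sequence of smoothed filter bank energies."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(n)]

def extract_feature_vector(smoothed_energies, num_coefficients):
    """Keep the first num_coefficients DCT coefficients as the feature vector."""
    return dct_ii(smoothed_energies)[:num_coefficients]

# A flat energy profile has all of its content in coefficient 0.
features = extract_feature_vector([1.0, 1.0, 1.0, 1.0], num_coefficients=2)
```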
7. The method of claim 1, wherein the voice recognition model is pre-trained, based on the feature vector of an audio training signal re-synthesized by using the frequency spectrum of the audio training signal in which a frequency axis of the frequency spectrum of the audio training signal obtained for each frame unit is transformed, to represent a variation of different vocal tract lengths of a plurality of speakers.
8. The method of claim 7, wherein the frequency axis of the frequency spectrum of the audio training signal is transformed based on a warping coefficient randomly generated for each frame and a warping function to transform the frequency axis on the frequency spectrum of the audio training signal based on the warping coefficient.
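A warping function of the kind recited in claim 8 can be sketched as a piecewise-linear mapping commonly used for vocal tract length perturbation. The specific function, the `boundary` frequency, and the coefficient range of 0.9 to 1.1 are assumptions for illustration and are not taken from the claims.

```python
import random

def warp_frequency(freq, alpha, nyquist, boundary):
    """Piecewise-linear warp of a frequency by warping coefficient alpha.

    Frequencies below a knee are scaled by alpha; frequencies above it are
    mapped linearly so that the Nyquist frequency stays fixed.
    """
    scaled_boundary = boundary * min(alpha, 1.0)
    knee = scaled_boundary / alpha
    if freq <= knee:
        return freq * alpha
    slope = (nyquist - scaled_boundary) / (nyquist - knee)
    return nyquist - slope * (nyquist - freq)

def random_warping_coefficient(rng, low=0.9, high=1.1):
    """Draw one warping coefficient; called once per frame in this sketch."""
    return rng.uniform(low, high)

rng = random.Random(0)
alpha = random_warping_coefficient(rng)
warped = warp_frequency(1000.0, alpha, nyquist=8000.0, boundary=4800.0)
```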
9. The method of claim 7, wherein the voice recognition model is pre-trained, based on the re-synthesized audio training signal to which a room impulse filter indicating an acoustic feature of the audio signal for each transfer path in a room in which the audio signal is transmitted is applied.
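Applying a room impulse filter to a signal is conventionally done by convolution. The sketch below assumes that the filter is given as a finite sequence of impulse response samples for one transfer path; the function name and the example values are hypothetical.

```python
def apply_room_impulse_filter(signal, impulse_response):
    """Convolve a time-domain signal with an impulse response (direct form)."""
    if not signal or not impulse_response:
        return []
    output = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, sample in enumerate(signal):
        for j, tap in enumerate(impulse_response):
            output[i + j] += sample * tap
    return output

# A direct path followed by one echo at half amplitude, one sample later.
reverberant = apply_room_impulse_filter([1.0, 2.0, 3.0], [1.0, 0.5])
# reverberant == [1.0, 2.5, 4.0, 1.5]
```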
10. The method of claim 7, wherein the re-synthesized audio training signal is generated by performing inverse fast Fourier transform on the frequency spectrum of the audio training signal in which the frequency axis is transformed and overlapping, on a time axis, the frequency spectrum of the audio training signal that is inverse fast Fourier transformed.
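The re-synthesis of claim 10 can be sketched as an inverse transform of each frame followed by overlap-add on the time axis. For brevity this sketch substitutes a direct inverse discrete Fourier transform for the inverse fast Fourier transform; the two produce the same result and differ only in computational cost. The function names and the hop length are assumptions.

```python
import cmath

def inverse_dft(spectrum):
    """Direct inverse DFT of one frame, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def overlap_add(frames, hop_length):
    """Sum equal-length time-domain frames placed hop_length samples apart."""
    if not frames:
        return []
    frame_length = len(frames[0])
    output = [0.0] * (hop_length * (len(frames) - 1) + frame_length)
    for index, frame in enumerate(frames):
        for n, sample in enumerate(frame):
            output[index * hop_length + n] += sample
    return output

def resynthesize(frame_spectra, hop_length):
    """Inverse-transform each frame spectrum and overlap the results in time."""
    return overlap_add([inverse_dft(s) for s in frame_spectra], hop_length)

# Two frames whose spectra contain only a DC component of 4 (four ones in time).
signal = resynthesize([[4, 0, 0, 0], [4, 0, 0, 0]], hop_length=2)
```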
11. An electronic apparatus for recognizing a user voice, the electronic apparatus comprising:
a memory storing one or more instructions; and
a processor configured to execute the one or more instructions,
wherein the processor is further configured to, by executing the one or more instructions:
obtain an audio signal segmented into a plurality of frame units,
determine an energy component for each filter bank by applying a filter bank distributed according to a preset scale to a frequency spectrum of the audio signal segmented into the plurality of frame units,
smooth the determined energy component for each filter bank,
extract a feature vector of the audio signal based on the smoothed energy component for each filter bank, and
recognize the user voice in the audio signal by inputting the extracted feature vector to a voice recognition model,
wherein the processor is further configured to determine a window length of a window to segment into the plurality of frame units, and
wherein the window length of a window used when a vocal tract length perturbation (VTLP) is applied to the frequency spectrum of a training signal is different from the window length of a window used when a room impulse response (RIR) filter is adopted and the window length of a window used when actually obtained user voice in the audio signal is recognized.
12. The electronic apparatus of claim 11, wherein the processor is further configured to, by executing the one or more instructions:
train, for each filter bank, a smoothing coefficient to smooth the energy component for each filter bank based on a uniformly distributed target energy component, and
smooth the energy component for each filter bank by using the smoothing coefficient trained for each filter bank.
13. The electronic apparatus of claim 11, wherein the processor is further configured to, by executing the one or more instructions:
generate a histogram related to a size of the energy component for each filter bank of the audio signal,
determine a mapping function to map the generated histogram to a target histogram in which the size of the energy component for each filter bank is uniformly distributed, and
smooth the energy component for each filter bank by converting the energy component for each filter bank of an audio signal by using the determined mapping function.
14. The electronic apparatus of claim 11, wherein the voice recognition model is pre-trained, based on the feature vector of an audio training signal re-synthesized by using the frequency spectrum of the audio training signal in which a frequency axis of the frequency spectrum of the audio training signal obtained for each frame unit is transformed, to represent a variation of different vocal tract lengths of a plurality of speakers.
15. A non-transitory computer-readable recording medium having recorded thereon a program for executing, on a computer, the method defined in claim 1.