Adaptive audio enhancement for multichannel speech recognition
US-2018197534-A1 · Jul 12, 2018 · US
US2016284347A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016284347-A1 |
| Application number | US-201615080927-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 25, 2016 |
| Priority date | Mar 27, 2015 |
| Publication date | Sep 29, 2016 |
| Grant date | — |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing audio waveforms. In some implementations, a time-frequency feature representation is generated based on audio data. The time-frequency feature representation is input to an acoustic model comprising a trained artificial neural network. The trained artificial neural network comprises a frequency convolution layer, a memory layer, and one or more hidden layers. An output that is based on output of the trained artificial neural network is received. A transcription is provided, where the transcription is determined based on the output of the acoustic model.
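The layer ordering named in the abstract (frequency convolution, then a memory layer, then hidden layers) can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the implementation disclosed in the publication: the weights are random and untrained, the memory layer is a single LSTM layer, the final softmax is an assumed way to produce per-unit likelihoods, and every name and dimension (`run_model`, `n_hidden`, `n_units`, the 3-tap kernel) is hypothetical.

```python
import math
import random

def freq_conv(features, kernel):
    """Valid 1-D correlation along the frequency axis, followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(features[i + j] * kernel[j] for j in range(k)))
            for i in range(len(features) - k + 1)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h, c, W):
    """One LSTM step; W maps each gate name to weights over the concatenation [x, h]."""
    z = x + h  # list concatenation of input and previous hidden state
    i = [sigmoid(a) for a in matvec(W["i"], z)]    # input gate
    f = [sigmoid(a) for a in matvec(W["f"], z)]    # forget gate
    o = [sigmoid(a) for a in matvec(W["o"], z)]    # output gate
    g = [math.tanh(a) for a in matvec(W["g"], z)]  # candidate cell values
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def run_model(frames, n_hidden=4, n_units=3, seed=0):
    """Toy forward pass: frequency convolution -> LSTM -> hidden layer -> softmax."""
    rng = random.Random(seed)
    kernel = [rng.uniform(-0.5, 0.5) for _ in range(3)]  # hypothetical 3-tap kernel
    conv_dim = len(frames[0]) - len(kernel) + 1
    W_lstm = {gate: rand_matrix(rng, n_hidden, conv_dim + n_hidden) for gate in "ifog"}
    W_hid = rand_matrix(rng, n_hidden, n_hidden)
    W_out = rand_matrix(rng, n_units, n_hidden)
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    outputs = []
    for frame in frames:
        x = freq_conv(frame, kernel)                    # frequency convolution layer
        h, c = lstm_step(x, h, c, W_lstm)               # memory layer (state carried across frames)
        hid = [max(0.0, a) for a in matvec(W_hid, h)]   # one hidden layer with ReLU
        outputs.append(softmax(matvec(W_out, hid)))     # assumed per-unit likelihoods
    return outputs

# Five made-up frames of eight feature values each.
frames = [[0.1 * (t + k) for k in range(8)] for t in range(5)]
posteriors = run_model(frames)
```

Each entry of `posteriors` is a distribution over `n_units` hypothetical phonetic units for one frame; a real system would pass such outputs to a decoder to produce the transcription.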
Opening claim text (preview).
What is claimed is:
1. A system comprising: one or more computers and one or more data storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: generating a time-frequency feature representation based on audio data; inputting the time-frequency feature representation to an acoustic model comprising a trained artificial neural network, the trained artificial neural network comprising a frequency convolution layer, a memory layer, and one or more hidden layers; receiving, from the acoustic model, an output that is based on output of the trained artificial neural network and that is indicative of a likelihood that the audio data corresponds to a phonetic unit; and providing a transcription for the audio data that is determined based on the output of the acoustic model.
2. The system of claim 1, wherein generating the time-frequency feature representation based on audio data comprises generating feature values by convolving samples of audio waveform data with one or more filters in the time domain; and wherein the memory layer comprises a long short-term memory layer.
3. The system of claim 2, wherein the acoustic model comprises multiple long short-term memory layers, and wherein the trained artificial neural network is configured such that output of at least one of the long short-term memory layers is input to another of the long short-term memory layers.
4. The system of claim 1, wherein the artificial neural network is an artificial neural network in which: a first long short-term memory layer receives input from the frequency convolution layer, the first long short-term memory layer provides output to a series of one or more other long short-term memory layers, and the output from the series of one or more other long short-term memory layers is provided to a series of multiple hidden neural network layers.
5. The system of claim 1, wherein the operations further comprise receiving the audio data from a client device over a network; wherein providing the transcription for the audio data comprises providing the transcription to the client device over the network, for display at the client device.
6. The system of claim 1, wherein generating the time-frequency feature representation comprises: convolving time-domain features of audio waveform samples with each of a plurality of finite impulse response filters; and time averaging the results of the convolution over a particular time window.
7. The system of claim 1, wherein generating the time-frequency feature representation comprises: generating the time-frequency feature representation using a set of multiple learned filters that were trained jointly with the artificial neural network of the acoustic model.
8. The system of claim 1, wherein the operations further comprise: obtaining audio data that includes a plurality of audio waveform samples; and identifying a particular set of the audio waveform samples that occur within a time window; wherein generating the time-frequency representation comprises generating the time-frequency representation based on the particular set of audio waveform samples.
9. The system of claim 8, wherein identifying the particular set of the audio waveform samples that occur within the time window comprises identifying the audio waveform samples corresponding to a frame; and wherein generating the time-frequency feature representation based on the particular set of audio waveform samples comprises: convolving the audio waveform samples corresponding to the frame with each filter in a set of multiple finite impulse response filters in a filterbank; collapsing outputs of the filterbank using a pooling function to discard short-term phase information and generate an output for each of the filters with respect to the frame; applying a non-linear rectifying function to the collapsed filterbank outputs; applying a stabilized logarithm compression function to the rectified outputs; and determining, as the time-frequency feature representation, a frame-level feature vector comprising the outputs of the stabilized logarithm compression function.
10. The system of claim 8, wherein the operations further comprise: determining log-mel features based on the audio waveform samples that occur within the time window; and providing data indicating the log-mel features to the acoustic model; wherein receiving an output from the trained artificial neural network of the acoustic model comprises receiving an output from the trained artificial neural network that is based on (i) the time-frequency feature representation and (ii) the log-mel features.
11. The system of claim 1, wherein the output of the acoustic model indicates a likelihood that a portion of the utterance corresponding to the identified features represents a particular context-dependent state.
12. The system of claim 11, wherein the context-dependent state is a context-dependent hidden Markov model state corresponding to a phoneme or a portion of a phoneme.
13. The system of claim 1, wherein the artificial neural network has been trained using sequence training, cross-entropy training, or truncated backpropagation through time.
14. The system of claim 1, wherein the operations further comprise identifying, in the audio data, multiple different sets of audio waveform samples that occur in different consecutive time windows; and repeating the generating, inputting, and receiving steps for each of the multiple different sets of audio waveform samples to obtain an output of the artificial neural network for each of the different consecutive time windows; wherein determining the transcription for the utterance comprises determining the transcription for the utterance based on the outputs of the trained artificial neural network for each of the different consecutive time windows.
15. The system of claim 1, wherein obtaining audio data corresponding to an utterance comprises receiving, over a computer network and from a client device, audio data representing an utterance detected by a microphone of the client device; and wherein providing the transcription comprises providing, over the computer network and to the client device, data indicating the transcription for display at a screen of the client device.
16. The system of claim 1, wherein the time-frequency feature representation is not a log-mel feature.
17. A method performed by data processing apparatus, the method comprising: generating a time-frequency feature representation based on audio data; inputting the time-frequency feature representation to an acoustic model comprising a trained artificial neural network, the trained artificial neural network comprising a frequency convolution layer, a memory layer, and one or more hidden layers; receiving, from the acoustic model, an output that is based on output of the trained artificial neural network and that is indicative of a likelihood that the audio data corresponds to a phonetic unit; and providing a transcription for the audio data that is determined based on the output of the acoustic model.
18. The method of claim 17, wherein the trained artificial neural network comprises multiple long short-term memory layers, and wherein the output of at least one of the long short-term memory layers is input to another of the long short-term memory layers.
19. A computer-readable storage de
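The per-frame feature steps recited in claim 9 (filterbank convolution, pooling, rectification, stabilized log compression), repeated over consecutive windows as in claim 14, can be sketched as follows. This is an illustrative sketch only: the claims say "a pooling function" without naming one, so max pooling over the whole frame is an assumption here, as are the offset `eps`, the two toy filters, the frame length, and the hop size. In the claimed system the filters may be learned jointly with the network (claim 7); the fixed taps below are placeholders.

```python
import math

def convolve_valid(samples, taps):
    """Valid-mode convolution of one frame with one FIR filter."""
    n, k = len(samples), len(taps)
    return [sum(samples[i + j] * taps[k - 1 - j] for j in range(k))
            for i in range(n - k + 1)]

def frame_features(frame, filterbank, eps=0.01):
    """Frame-level feature vector with one value per filter.

    eps is a hypothetical stabilizing offset that keeps the log finite at zero.
    """
    feats = []
    for taps in filterbank:
        filtered = convolve_valid(frame, taps)   # time-domain FIR filtering
        pooled = max(filtered)                   # assumed max pooling; drops position within the frame
        rectified = max(0.0, pooled)             # non-linear rectification
        feats.append(math.log(rectified + eps))  # stabilized log compression
    return feats

def split_frames(samples, frame_len, hop):
    """Consecutive time windows over the waveform samples."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]

# Made-up input: 16 samples of a sine wave and two toy 2-tap filters.
signal = [math.sin(2 * math.pi * k / 8) for k in range(16)]
filterbank = [[0.5, 0.5], [0.5, -0.5]]
features = [frame_features(f, filterbank)
            for f in split_frames(signal, frame_len=8, hop=4)]
```

Here `features` holds one vector per window, each with one entry per filter; these vectors would be the inputs passed to the acoustic model frame by frame.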
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Backpropagation, e.g. using gradient descent · CPC title
Hidden Markov Models [HMMs] · CPC title
using artificial neural networks · CPC title