Processing audio waveforms

US2016284347A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2016284347-A1
Application number	US-201615080927-A
Country	US
Kind code	A1
Filing date	Mar 25, 2016
Priority date	Mar 27, 2015
Publication date	Sep 29, 2016
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing audio waveforms. In some implementations, a time-frequency feature representation is generated based on audio data. The time-frequency feature representation is input to an acoustic model comprising a trained artificial neural network. The trained artificial neural network comprising a frequency convolution layer, a memory layer, and one or more hidden layers. An output that is based on output of the trained artificial neural network is received. A transcription is provided, where the transcription is determined based on the output of the acoustic model.
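
The data flow named in the abstract (frequency convolution layer, memory layer, hidden layers) can be illustrated with a minimal pure-Python sketch. It is not the patented implementation. The layer ordering follows the arrangement described in claim 4. The layer sizes, random untrained weights, omitted biases, max-pooling step, ReLU hidden layer, softmax output, and all function names are assumptions made only for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the illustrative random weights are repeatable

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def frequency_convolution(features, kernel, pool=2):
    """Slide one kernel along the frequency axis, then max-pool neighbours.
    (Cross-correlation form, as is usual for neural-network 'convolution'.)"""
    k = len(kernel)
    conv = [sum(kernel[j] * features[i + j] for j in range(k))
            for i in range(len(features) - k + 1)]
    return [max(conv[i:i + pool]) for i in range(0, len(conv) - pool + 1, pool)]

def lstm_step(x, h, c, weights):
    """One step of a standard LSTM cell (biases omitted for brevity)."""
    z = x + h  # concatenate current input with previous hidden state
    i = [sigmoid(a) for a in matvec(weights["i"], z)]    # input gate
    f = [sigmoid(a) for a in matvec(weights["f"], z)]    # forget gate
    o = [sigmoid(a) for a in matvec(weights["o"], z)]    # output gate
    g = [math.tanh(a) for a in matvec(weights["g"], z)]  # candidate cell state
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new

def hidden_layer(v, w):
    return [max(0.0, a) for a in matvec(w, v)]  # ReLU is an assumption

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

# Hypothetical sizes: 8 feature values per frame, 4 LSTM cells, 3 phonetic units.
N_FEATURES, HIDDEN, N_UNITS = 8, 4, 3
kernel = [0.5, 0.25, 0.25]
conv_len = (N_FEATURES - len(kernel) + 1) // 2  # length after convolution and pooling
lstm_w = {gate: rand_matrix(HIDDEN, conv_len + HIDDEN) for gate in "ifog"}
hidden_w = rand_matrix(HIDDEN, HIDDEN)
out_w = rand_matrix(N_UNITS, HIDDEN)

# Stand-in feature frames; a real system would compute these from audio data.
frames = [[random.random() for _ in range(N_FEATURES)] for _ in range(5)]
h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
posteriors = []
for frame in frames:
    x = frequency_convolution(frame, kernel)
    h, c = lstm_step(x, h, c, lstm_w)  # memory carries context across frames
    posteriors.append(softmax(matvec(out_w, hidden_layer(h, hidden_w))))
```

Each entry of `posteriors` is a per-frame score distribution over the hypothetical phonetic units. With untrained weights the values are meaningless; only the shapes and the order of operations are the point.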

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising: one or more computers and one or more data storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: generating a time-frequency feature representation based on audio data; inputting the time-frequency feature representation to an acoustic model comprising a trained artificial neural network, the trained artificial neural network comprising a frequency convolution layer, a memory layer, and one or more hidden layers; receiving, from the acoustic model, an output that is based on output of the trained artificial neural network and that is indicative of a likelihood that the audio data corresponds to a phonetic unit; and providing a transcription for the audio data that is determined based on the output of the acoustic model.
2. The system of claim 1, wherein generating the time-frequency feature representation based on audio data comprises generating feature values by convolving samples of audio waveform data with one or more filters in the time domain; and wherein the memory layer comprises a long short-term memory layer.
3. The system of claim 2, wherein the acoustic model comprises multiple long short-term memory layers, and wherein the trained artificial neural network is configured such that output of at least one of the long short-term memory layers is input to another of the long short-term memory layers.
4. The system of claim 1, wherein the artificial neural network is an artificial neural network in which: a first long short-term memory layer receives input from the frequency convolution layer, the first long short-term memory layer provides output to a series of one or more other long short-term memory layers, and the output from the series of one or more other long short-term memory layers is provided to a series of multiple hidden neural network layers.
5. The system of claim 1, wherein the operations further comprise receiving the audio data from a client device over a network; wherein providing the transcription for the audio data comprises providing the transcription to the client device over the network, for display at the client device.
6. The system of claim 1, wherein generating the time-frequency feature representation comprises: convolving time-domain features of audio waveform samples with each of a plurality of finite impulse response filters; and time averaging the results of the convolution over a particular time window.
7. The system of claim 1, wherein generating the time-frequency feature representation comprises: generating the time-frequency feature representation using a set of multiple learned filters that were trained jointly with the artificial neural network of the acoustic model.
8. The system of claim 1, wherein the operations further comprise: obtaining audio data that includes a plurality of audio waveform samples; and identifying a particular set of the audio waveform samples that occur within a time window; wherein generating the time-frequency representation comprises generating the time-frequency representation based on the particular set of audio waveform samples.
9. The system of claim 8, wherein identifying the particular set of the audio waveform samples that occur within the time window comprises identifying the audio waveform samples corresponding to a frame; and wherein generating the time-frequency feature representation based on the particular set of audio waveform samples comprises: convolving the audio waveform samples corresponding to the frame with each filter in a set of multiple finite impulse response filters in a filterbank; collapsing outputs of the filterbank using a pooling function to discard short-term phase information and generate an output for each of the filters with respect to the frame; applying a non-linear rectifying function to the collapsed filterbank outputs; applying a stabilized logarithm compression function to the rectified outputs; and determining, as the time-frequency feature representation, a frame-level feature vector comprising the outputs of the stabilized logarithm compression function.
10. The system of claim 8, wherein the operations further comprise: determining log-mel features based on the audio waveform samples that occur within the time window; and providing data indicating the log-mel features to the acoustic model; wherein receiving an output from the trained artificial neural network of the acoustic model comprises receiving an output from the trained artificial neural network that is based on (i) the time-frequency feature representation and (ii) the log-mel features.
11. The system of claim 1, wherein the output of the acoustic model indicates a likelihood that a portion of the utterance corresponding to the identified features represents a particular context-dependent state.
12. The system of claim 11, wherein the context-dependent state is a context-dependent hidden Markov model state corresponding to a phoneme or a portion of a phoneme.
13. The system of claim 1, wherein the artificial neural network has been trained using sequence training, cross-entropy training, or truncated backpropagation through time.
14. The system of claim 1, wherein the operations further comprise identifying, in the audio data, multiple different sets of audio waveform samples that occur in different consecutive time windows; and repeating the generating, inputting, and receiving steps for each of the multiple different sets of audio waveform samples to obtain an output of the artificial neural network for each of the different consecutive time windows; wherein determining the transcription for the utterance is comprises determining the transcription for the utterance based on the outputs of the trained artificial neural network for each of the different consecutive time windows.
15. The system of claim 1, wherein obtaining audio data corresponding to an utterance comprises receiving, over a computer network and from a client device, audio data representing an utterance detected by a microphone of the client device; and wherein providing the transcription comprises providing, over the computer network and to the client device, data indicating the transcription for display at a screen of the client device.
16. The system of claim 1, wherein the time-frequency feature representation is not a log-mel feature.
17. A method performed by data processing apparatus, the method comprising: generating a time-frequency feature representation based on audio data; inputting the time-frequency feature representation to an acoustic model comprising a trained artificial neural network, the trained artificial neural network comprising a frequency convolution layer, a memory layer, and one or more hidden layers; receiving, from the acoustic model, an output that is based on output of the trained artificial neural network and that is indicative of a likelihood that the audio data corresponds to a phonetic unit; and providing a transcription for the audio data that is determined based on the output of the acoustic model.
18. The method of claim 17, wherein the trained artificial neural network comprises multiple long short-term memory layers, and wherein the output of at least one of the long short-term memory layers is input to another of the long short-term memory layers.
19. A computer-readable storage de
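
The frame-level feature steps recited in claim 9 (FIR filterbank convolution, pooling, rectification, stabilized log compression), applied to the consecutive time windows of claim 14, can be sketched in pure Python as follows. This is a rough illustration under stated assumptions, not the patented implementation: max pooling stands in for the unspecified "pooling function", ReLU for the "non-linear rectifying function", `log(x + eps)` with `eps = 0.01` for the "stabilized logarithm", and the three short filters are hand-picked rather than learned. All function names are hypothetical.

```python
import math

def split_frames(samples, frame_len):
    """Cut the waveform into consecutive, non-overlapping frames (tail dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def fir_convolve(samples, taps):
    """Valid-mode time-domain convolution of samples with one FIR filter."""
    n, k = len(samples), len(taps)
    return [sum(taps[j] * samples[i + k - 1 - j] for j in range(k))
            for i in range(n - k + 1)]

def frame_features(frame, filterbank, eps=0.01):
    """Return one feature value per filter for a single frame."""
    features = []
    for taps in filterbank:
        filtered = fir_convolve(frame, taps)
        pooled = max(filtered)        # collapse the time axis (max pooling assumed)
        rectified = max(0.0, pooled)  # rectification (ReLU assumed)
        features.append(math.log(rectified + eps))  # stabilized log compression
    return features

# Hypothetical inputs: a synthetic sinusoid and three hand-picked FIR filters.
waveform = [math.sin(2 * math.pi * t / 8) for t in range(80)]
filterbank = [[0.5, 0.5], [0.5, -0.5], [0.25, 0.5, 0.25]]

# One frame-level feature vector per consecutive time window.
feature_vectors = [frame_features(frame, filterbank)
                   for frame in split_frames(waveform, 40)]
```

Here `feature_vectors` holds one vector per frame with one entry per filter. Claim 7 describes filters trained jointly with the acoustic model, so the fixed filter taps above are placeholders only.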

Assignees

Google Inc

Inventors

Classifications

G06N3/045: Combinations of networks
G06N3/044: Recurrent networks, e.g. Hopfield networks
G06N3/084: Backpropagation, e.g. using gradient descent
G10L15/142: Hidden Markov Models [HMMs]
G10L15/16 (Primary): using artificial neural networks

Patent family

Related publications grouped by family.

View patent family 56974272

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016284347A1 cover?: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing audio waveforms. In some implementations, a time-frequency feature representation is generated based on audio data. The time-frequency feature representation is input to an acoustic model comprising a trained artificial neural network. The trained artificial neural network comprisin…
Who is the assignee on this patent?: Google Inc
What technology area does this patent fall under?: Primary CPC classification G10L15/16. Mapped technology areas include Physics.
When was this patent published?: Publication date Sep 29, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Adaptive audio enhancement for multichannel speech recognition

Complex linear projection for acoustic modeling

Automatic speech recognition using multi-dimensional models

Enhanced multi-channel acoustic models

Adaptive audio enhancement for multichannel speech recognition

Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition
