Multichannel speech recognition using neural networks

US11062725B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11062725-B2
Application number  US-201916278830-A
Country             US
Kind code           B2
Filing date         Feb 19, 2019
Priority date       Sep 7, 2016
Publication date    Jul 13, 2021
Grant date          Jul 13, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
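The spatial filtering step in the abstract can be pictured as a filter-and-sum operation over the two time-aligned channels, with one filter pair per spatial "look direction". The following is a minimal sketch under stated assumptions, not the patented implementation: the function name `spatial_filter`, the toy signals, and the hand-picked filter taps are all hypothetical (a real network would learn the taps during training), and the `stride` argument only mirrors the "integer greater than one" stride recited in claim 1.

```python
def spatial_filter(x1, x2, filters, stride=2):
    """Filter-and-sum two time-aligned channels (illustrative only).

    filters: list of (h1, h2) pairs, one pair per look direction. h1 and h2
    are equal-length lists of FIR taps (hypothetical values here).
    Returns one single-channel output sequence per look direction.
    """
    if len(x1) != len(x2):
        raise ValueError("both channels must cover the same period of time")
    outputs = []
    for h1, h2 in filters:
        taps = len(h1)
        y = []
        # Slide the filter window over time, hopping `stride` samples per step
        # (cross-correlation, as in typical convolutional layers).
        for start in range(0, len(x1) - taps + 1, stride):
            acc = 0.0
            for n in range(taps):
                acc += h1[n] * x1[start + n] + h2[n] * x2[start + n]
            y.append(acc)
        outputs.append(y)
    return outputs


# Toy input: the second channel is the first channel delayed by one sample.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
filters = [
    ([0.5, 0.0], [0.0, 0.5]),  # direction A: compensates the one-sample delay
    ([0.5, 0.0], [0.5, 0.0]),  # direction B: no delay compensation
]
out = spatial_filter(x1, x2, filters, stride=2)
print(out)  # [[1.0, 3.0, 5.0], [0.5, 2.5, 4.5]]
```

Direction A reproduces the source samples because its taps align the two channels before summing, which is the intuition behind steering a filter toward a direction of arrival. With a stride of 2, each direction yields one output sample for every two input samples.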

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, at data processing hardware, a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time; obtaining, by the data processing hardware, a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal; generating, by the data processing hardware, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one; converting, by the data processing hardware, the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data; and processing, by the data processing hardware, using one or more additional neural network layers of the neural network, the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions to predict speech content encoded in the first audio signal and the second audio signal.

2. The method of claim 1, wherein converting the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data comprises computing a discrete Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions.

3. The method of claim 2, wherein computing the discrete Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions comprises computing a fast Fourier transform for the corresponding spatial filtered output generated for each of the multiple spatial directions.

4. The method of claim 1, wherein the neural network is part of a speech recognition model.

5. The method of claim 1, wherein the neural network is part of an acoustic model configured to indicate probabilities of sub-word units.

6. The method of claim 1, wherein the one or more additional neural network layers comprise one or more deep neural network layers that provide output to one or more long short-term memory layers.

7. The method of claim 1, wherein the corresponding spatial filtered output generated for each of the multiple spatial directions comprises a single channel of time-domain data.

8. The method of claim 1, wherein at least one additional neural network layer of the one or more additional neural network layers is configured to perform feature extraction.

9. The method of claim 8, wherein the at least one additional neural network layer of the one or more additional neural network layers that is configured to perform feature extraction is also configured to apply a transformation to the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions.

10. The method of claim 9, wherein the transformation is a linear transformation.

11. The method of claim 9, wherein the transformation is a projection.

12. The method of claim 9, wherein the transformation is a complex linear projection.

13. The method of claim 9, wherein the transformation is a linear projection of energy.

14. The method of claim 1, wherein the neural network comprises: the spatial filtering convolutional layer; at least one feature extraction neural network layer configured to determine frequency-based characteristics of the corresponding frequency-domain data converted from the spatially filtered output generated for each of the multiple spatial directions; and one or more neural network layers configured to receive output of the at least one feature extraction neural network layer and determine speech content using one or more recurrent neural network layers and one or more deep neural network layers.

15. The method of claim 1, further comprising: detecting, by the data processing hardware, the first audio signal and the second audio signal using multiple microphones of a computing device, wherein the data processing hardware resides on the computing device.

16. The method of claim 1, further comprising: detecting, by the data processing hardware, the first audio signal and the second audio signal using multiple microphones of a computing device; wherein the neural network is stored or implemented on the computing device.

17. The method of claim 1, wherein processing the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions comprises identifying a voice command indicated by the first audio signal and the second audio signal.

18. The method of claim 1, wherein the spatial filtering convolutional layer and the one or more additional layers have been jointly trained during training of the neural network.

19. A system comprising: one or more computing devices; and one or more computer-readable media storing instructions that, when executed by the one or more computing devices, cause the one or more computing devices to perform operations comprising: receiving a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time; obtaining a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal; generating, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one; converting the corresponding spatial filtered output generated for each of the multiple spatial directions into corresponding frequency-domain data; and processing, using one or more additional neural network layers of the neural network, the corresponding frequency-domain data converted from the spatial filtered output generated for each of the multiple spatial directions to predict speech content encoded in the first audio signal and the second audio signal.

20. One or more non-transitory computer-readable media storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations comprising: receiving a multi-channel audio input comprising a first audio signal and a second audio signal occurring during a same period of time; obtaining a first time-domain representation of the first audio signal and a second time-domain representation of the second audio signal; generating, using a spatial filtering convolutional layer of a neural network configured to perform spatial filtering, a corresponding spatial filtered output for each of multiple spatial directions by processing the first time-domain representation of the first audio signal and the second time-domain representation of the second audio signal, the spatial filtering convolutional layer using a stride parameter being set to an integer greater than one; co
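Claims 2 and 13 describe converting each direction's spatially filtered output to the frequency domain with a discrete Fourier transform and then applying a linear projection of energy. A minimal sketch of those two steps follows, assuming "energy" means the squared magnitude of each frequency bin and that the projection is a plain weight matrix. The helper names `dft` and `energy_projection`, the toy frame, and the weights are hypothetical, and the direct O(N²) transform stands in for the fast Fourier transform named in claim 3 only to keep the example dependency-free.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def energy_projection(spectrum, weights):
    """Project per-bin energy |X[k]|^2 through a weight matrix.

    weights: one row per output feature, one column per frequency bin
    (hypothetical values; a trained layer would learn them).
    """
    power = [abs(c) ** 2 for c in spectrum]
    return [sum(w * p for w, p in zip(row, power)) for row in weights]


# One toy frame per look direction (assumed outputs of a spatial filter).
direction_outputs = [
    [1.0, 0.0, -1.0, 0.0],  # one cosine cycle: energy in bins 1 and 3
    [1.0, 1.0, 1.0, 1.0],   # constant signal: energy in bin 0 only
]
weights = [
    [1.0, 0.0, 0.0, 0.0],   # feature 0: DC energy
    [0.0, 1.0, 0.0, 1.0],   # feature 1: energy at the first harmonic pair
]
features = [energy_projection(dft(y), weights) for y in direction_outputs]
for row in features:
    print([round(v, 6) for v in row])
```

Each look direction produces its own feature vector, which is the kind of per-direction frequency-domain data that the later layers in claim 1 would consume. A complex linear projection (claim 12) would instead apply complex weights to the spectrum itself before taking magnitudes.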

Assignees

Google LLC

Inventors

Classifications

  • G10L21/028 (Primary)

    using properties of sound source · CPC title

  • Details of processing therefor · CPC title

  • Microphone arrays; Beamforming · CPC title

  • the noise being separate speech, e.g. cocktail party · CPC title

  • Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11062725B2 cover?
This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes …
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L21/028. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 13, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).