US11816577B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11816577-B2 |
| Application number | US-202117487548-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 28, 2021 |
| Priority date | May 18, 2018 |
| Publication date | Nov 14, 2023 |
| Grant date | Nov 14, 2023 |
Official abstract text for this publication.
Generally, the present disclosure is directed to systems and methods that generate augmented training data for machine-learned models via application of one or more augmentation techniques to audiographic images that visually represent audio signals. In particular, the present disclosure provides a number of novel augmentation operations which can be performed directly upon the audiographic image (e.g., as opposed to the raw audio data) to generate augmented training data that results in improved model performance. As an example, the audiographic images can be or include one or more spectrograms or filter bank sequences.
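The masking operations named in the claims can be illustrated with a minimal sketch that works directly on the image rather than on raw audio. It assumes the audiographic image is a list of time steps, each holding one value per frequency bin, that the mask width is drawn uniformly, and that masked pixels are set to the image mean (one of the options the claims mention). The names `frequency_mask`, `time_mask`, `max_width`, and `max_ratio` are hypothetical and do not come from the patent.

```python
import random


def mean_value(image):
    """Mean over all pixels of a 2-D image stored as a list of rows."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return total / count


def _mask_width(max_width, size, max_ratio, rng):
    # Width is drawn from [0, max_width], capped so the masked share of the
    # axis never exceeds max_ratio (an assumed way to enforce an upper bound).
    limit = min(max_width, int(max_ratio * size))
    return rng.randint(0, limit)


def frequency_mask(image, max_width, max_ratio=1.0, rng=random):
    """Set a band of consecutive frequency bins to the image mean."""
    num_bins = len(image[0])
    width = _mask_width(max_width, num_bins, max_ratio, rng)
    start = rng.randint(0, num_bins - width)  # first masked bin
    fill = mean_value(image)
    return [
        [fill if start <= f < start + width else v for f, v in enumerate(row)]
        for row in image
    ]


def time_mask(image, max_width, max_ratio=1.0, rng=random):
    """Set a run of consecutive time steps to the image mean."""
    num_steps = len(image)
    num_bins = len(image[0])
    width = _mask_width(max_width, num_steps, max_ratio, rng)
    start = rng.randint(0, num_steps - width)  # first masked time step
    fill = mean_value(image)
    return [
        [fill] * num_bins if start <= t < start + width else list(row)
        for t, row in enumerate(image)
    ]
```

Both functions return a new image and leave the input untouched, so the same source image can be augmented several times with different random masks.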
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method to generate augmented training data, the method comprising: obtaining, by one or more computing devices, a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal; generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; inputting, by the one or more computing devices, the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions; evaluating, by the one or more computing devices, an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
2. The computer-implemented method of claim 1, wherein performing the time warping operation comprises fixing spatial dimensions of the audiographic image and warping the image content of the audiographic image to shift a point within the image content a distance along the axis representative of time.
3. The computer-implemented method of claim 2, wherein the distance comprises a user-specified hyperparameter or a learned value.
4. The computer-implemented method of claim 2, wherein the point within the image content is randomly selected.
5. The computer-implemented method of claim 1, wherein: the certain subset of frequencies extends from a first frequency to a second frequency that is spaced a distance from the first frequency; the distance is selected from a distribution extending from zero to a frequency mask parameter.
6. The computer-implemented method of claim 1, wherein changing the pixel values for the image content associated with the certain subset of frequencies comprises changing the pixel values for the image content to equal a mean value associated with the audiographic image.
7. The computer-implemented method of claim 1, wherein performing the frequency masking operation comprises enforcing an upper bound on a ratio of the certain subset of frequencies to all frequencies.
8. The computer-implemented method of claim 1, wherein: the certain subset of time steps extends from a first time step to a second time step that is spaced a distance from the first time step; the distance is selected from a distribution extending from zero to a time mask parameter.
9. The computer-implemented method of claim 1, wherein changing the pixel values for the image content associated with the certain subset of time steps comprises changing the pixel values for the image content to equal a mean value associated with the audiographic image.
10. The computer-implemented method of claim 1, wherein performing the time masking operation comprises enforcing an upper bound on a ratio of the certain subset of time steps to all time steps.
11. The computer-implemented method of claim 1, wherein each of the audiographic images comprises: one or more filter bank sequences.
12. The computer-implemented method of claim 1, wherein: the audio signal encodes one or more human speech utterances; and the one or more predictions generated by the machine-learned audio processing model comprise: one or more textual transcriptions of the one or more human speech utterances.
13. The computer-implemented method of claim 1, wherein the machine-learned audio processing model comprises one or both of: a hybrid hidden Markov model and deep neural network.
14. The computer-implemented method of claim 1, wherein the machine-learned audio processing model comprises: a convolutional neural network.
15. The computer-implemented method of claim 1, wherein each audiographic image comprises a combination of τ time steps of the plurality of times of the audio signal.
16. The computer-implemented method of claim 1, wherein the plurality of augmented images is generated using at least two of the one or more augmentation operations.
17. The computer-implemented method of claim 1, wherein the plurality of augmented images is generated using at least three of the one or more augmentation operations.
18. The computer-implemented method of claim 1, wherein the machine-learned audio processing model performs automatic speech recognition on the audio signal as represented by the plurality of audiographic images.
19. The computer-implemented method of claim 1, wherein the machine-learned audio processing model performs speech-to-speech translation.
20. The computer-implemented method of claim 1, wherein the machine-learned audio processing model performs voice conversion on the audio signal.
21. One or more non-transitory computer-readable media that collective store instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising: obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal; generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; inputting the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions; evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function.
22. A computing system comprising: one or more processors; a controller model; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: accessing a training dataset that comprises a plurality of training images, wherein each training image comprises an audiographic image that visually represents an audio signal, and wherein the plurality of training images correspond to a plurality of times of the audio signal; and for each of a plurality of iterations: selecting, by the controller model, a series of one or more augmentation o
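The time warping operation of claims 1 to 4 can be sketched in a similar hedged way. The version below is a simplification and not the patent's stated implementation: it keeps the image size fixed and resamples the rows with piecewise-linear interpolation so that the content at one time index moves a given number of steps along the time axis, while the first and last time steps stay in place. The names `time_warp`, `random_time_warp`, `point`, `distance`, and `max_distance` are hypothetical, and drawing the distance at random is an assumption (claim 3 only says it may be user-specified or learned).

```python
import random


def time_warp(image, point, distance):
    """Move the content at time index `point` by `distance` steps.

    image: list of time steps (rows), each a list of frequency-bin values.
    The output has the same shape; boundary rows are left where they are.
    """
    num_steps = len(image)
    last = num_steps - 1
    target = point + distance
    if not (0 < point < last and 0 < target < last):
        raise ValueError("point and point + distance must lie strictly inside the image")
    warped = []
    for t in range(num_steps):
        # Map output index t back to a fractional source index.
        if t <= target:
            src = t * point / target
        else:
            src = point + (t - target) * (last - point) / (last - target)
        lo = int(src)
        hi = min(lo + 1, last)
        frac = src - lo
        # Linear interpolation between the two neighbouring source rows.
        warped.append(
            [(1.0 - frac) * a + frac * b for a, b in zip(image[lo], image[hi])]
        )
    return warped


def random_time_warp(image, max_distance, rng=random):
    """Pick the warp point and the shift at random (both choices are assumptions)."""
    last = len(image) - 1
    if last - max_distance - 1 < max_distance + 1:
        raise ValueError("image has too few time steps for this max_distance")
    point = rng.randint(max_distance + 1, last - max_distance - 1)
    distance = rng.randint(-max_distance, max_distance)
    return time_warp(image, point, distance)
```

A call such as `random_time_warp(image, max_distance=2)` needs at least `2 * max_distance + 3` time steps, so that the shifted point always stays strictly inside the image.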
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title