Augmentation of audiographic images for improved machine learning

US11816577B2 · US · B2

Patent metadata
Publication number: US-11816577-B2
Application number: US-202117487548-A
Country: US
Kind code: B2
Filing date: Sep 28, 2021
Priority date: May 18, 2018
Publication date: Nov 14, 2023
Grant date: Nov 14, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Generally, the present disclosure is directed to systems and methods that generate augmented training data for machine-learned models via application of one or more augmentation techniques to audiographic images that visually represent audio signals. In particular, the present disclosure provides a number of novel augmentation operations which can be performed directly upon the audiographic image (e.g., as opposed to the raw audio data) to generate augmented training data that results in improved model performance. As an example, the audiographic images can be or include one or more spectrograms or filter bank sequences.
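To make the term "audiographic image" concrete, the following is a minimal sketch of turning a raw audio signal into a log-magnitude spectrogram with NumPy. It is not taken from the patent: the function name `log_spectrogram`, the frame length, the hop size, and the Hann window are all illustrative assumptions.

```python
import numpy as np

def log_spectrogram(audio, frame_len=400, hop=160, eps=1e-10):
    """Return a (frequency, time) log-magnitude spectrogram of a 1-D signal.

    frame_len, hop, and eps are illustrative defaults, not values from the patent.
    """
    audio = np.asarray(audio, dtype=np.float64)
    n_frames = 1 + (len(audio) - frame_len) // hop
    # Slice the signal into overlapping frames, one row per time step.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(frame_len)
    # Magnitude of the one-sided FFT of each windowed frame.
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Transpose so rows are frequency bins and columns are time steps.
    return np.log(mag + eps).T

# Usage: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
image = log_spectrogram(tone)
print(image.shape)
```

The resulting 2-D array is the kind of image on which the augmentation operations described in the abstract would act, rather than on the waveform itself.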

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method to generate augmented training data, the method comprising: obtaining, by one or more computing devices, a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal; generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; inputting, by the one or more computing devices, the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions; evaluating, by the one or more computing devices, an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying, by the one or more computing devices, respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

2. The computer-implemented method of claim 1, wherein performing the time warping operation comprises fixing spatial dimensions of the audiographic image and warping the image content of the audiographic image to shift a point within the image content a distance along the axis representative of time.

3. The computer-implemented method of claim 2, wherein the distance comprises a user-specified hyperparameter or a learned value.

4. The computer-implemented method of claim 2, wherein the point within the image content is randomly selected.

5. The computer-implemented method of claim 1, wherein: the certain subset of frequencies extends from a first frequency to a second frequency that is spaced a distance from the first frequency; the distance is selected from a distribution extending from zero to a frequency mask parameter.

6. The computer-implemented method of claim 1, wherein changing the pixel values for the image content associated with the certain subset of frequencies comprises changing the pixel values for the image content to equal a mean value associated with the audiographic image.

7. The computer-implemented method of claim 1, wherein performing the frequency masking operation comprises enforcing an upper bound on a ratio of the certain subset of frequencies to all frequencies.

8. The computer-implemented method of claim 1, wherein: the certain subset of time steps extends from a first time step to a second time step that is spaced a distance from the first time step; the distance is selected from a distribution extending from zero to a time mask parameter.

9. The computer-implemented method of claim 1, wherein changing the pixel values for the image content associated with the certain subset of time steps comprises changing the pixel values for the image content to equal a mean value associated with the audiographic image.

10. The computer-implemented method of claim 1, wherein performing the time masking operation comprises enforcing an upper bound on a ratio of the certain subset of time steps to all time steps.

11. The computer-implemented method of claim 1, wherein each of the audiographic images comprises: one or more filter bank sequences.

12. The computer-implemented method of claim 1, wherein: the audio signal encodes one or more human speech utterances; and the one or more predictions generated by the machine-learned audio processing model comprise: one or more textual transcriptions of the one or more human speech utterances.

13. The computer-implemented method of claim 1, wherein the machine-learned audio processing model comprises one or both of: a hybrid hidden Markov model and deep neural network.

14. The computer-implemented method of claim 1, wherein the machine-learned audio processing model comprises: a convolutional neural network.

15. The computer-implemented method of claim 1, wherein each audiographic image comprises a combination of τ time steps of the plurality of times of the audio signal.

16. The computer-implemented method of claim 1, wherein the plurality of augmented images is generated using at least two of the one or more augmentation operations.

17. The computer-implemented method of claim 1, wherein the plurality of augmented images is generated using at least three of the one or more augmentation operations.

18. The computer-implemented method of claim 1, wherein the machine-learned audio processing model performs automatic speech recognition on the audio signal as represented by the plurality of audiographic images.

19. The computer-implemented method of claim 1, wherein the machine-learned audio processing model performs speech-to-speech translation.

20. The computer-implemented method of claim 1, wherein the machine-learned audio processing model performs voice conversion on the audio signal.

21. One or more non-transitory computer-readable media that collective store instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations comprising: obtaining a plurality of audiographic images that visually represent an audio signal, wherein the plurality of audiographic images correspond to a plurality of times of the audio signal; generating, using one or more augmentation operations, a plurality of augmented images based on the plurality of audiographic images, wherein the one or more augmentation operations includes: a time warping operation comprising warping image content of the audiographic image along an axis representative of time, a frequency masking operation comprising changing pixel values for image content associated with a certain subset of frequencies represented by the audiographic image, or a time masking operation comprising changing pixel values for image content associated with a certain subset of a time steps represented by the audiographic image; inputting the plurality of augmented images into a machine-learned audio processing model to generate one or more predictions; evaluating an objective function that scores the one or more predictions generated by the machine-learned audio processing model; and modifying respective values of one or more parameters of the machine-learned audio processing model based on the objective function.

22. A computing system comprising: one or more processors; a controller model; and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: accessing a training dataset that comprises a plurality of training images, wherein each training image comprises an audiographic image that visually represents an audio signal, and wherein the plurality of training images correspond to a plurality of times of the audio signal; and for each of a plurality of iterations: selecting, by the controller model, a series of one or more augmentation o
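The three augmentation operations named in the claims can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the parameter names `F`, `T`, `W`, and `max_ratio`, and the uniform sampling are hypothetical choices, and the time warp is a simplified piecewise-linear resampling along the time axis rather than a full image warp. Images are assumed to be arrays with rows as frequency bins and columns as time steps.

```python
import numpy as np

def frequency_mask(image, F, rng, max_ratio=1.0):
    """Set a band of consecutive frequency rows to the image mean."""
    num_freq = image.shape[0]
    # Width drawn uniformly from [0, F], then capped so the masked share
    # of rows never exceeds max_ratio (an assumed form of the upper bound).
    f = min(int(rng.integers(0, F + 1)), int(max_ratio * num_freq))
    f0 = int(rng.integers(0, num_freq - f + 1))
    out = image.copy()
    out[f0:f0 + f, :] = image.mean()
    return out

def time_mask(image, T, rng, max_ratio=1.0):
    """Set a run of consecutive time columns to the image mean."""
    num_time = image.shape[1]
    t = min(int(rng.integers(0, T + 1)), int(max_ratio * num_time))
    t0 = int(rng.integers(0, num_time - t + 1))
    out = image.copy()
    out[:, t0:t0 + t] = image.mean()
    return out

def time_warp(image, W, rng):
    """Shift a random time column by up to W steps, keeping the image size.

    Simplification: content is resampled with piecewise-linear
    interpolation along the time axis; both edges stay fixed.
    """
    num_time = image.shape[1]
    if num_time <= 2 * W + 2:
        return image.copy()  # too short to warp safely
    c = int(rng.integers(W + 1, num_time - W - 1))  # randomly selected point
    w = int(rng.integers(-W, W + 1))                # shift distance
    cols = np.arange(num_time)
    # Output column c + w reads from source column c.
    src = np.interp(cols, [0, c + w, num_time - 1], [0, c, num_time - 1])
    return np.stack([np.interp(src, cols, row) for row in image])

# Usage: chain all three operations on a random stand-in image.
rng = np.random.default_rng(0)
image = rng.normal(size=(80, 100))
augmented = time_mask(frequency_mask(time_warp(image, 5, rng), 27, rng), 20, rng)
print(augmented.shape)
```

Filling with the image mean follows the option described in claims 6 and 9; other fill values would work the same way. The augmented array keeps the original shape, so it can be passed to a model in place of the unmodified image.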

Assignees

Google LLC

Inventors

Classifications

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Convolutional networks [CNN, ConvNet]

  • Auto-encoder networks; Encoder-decoder networks

  • Supervised learning

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 14, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).