Audio-visual separation of on-screen sounds based on machine learning models

US11756570B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11756570-B2
Application number  US-202117214186-A
Country             US
Kind code           B2
Filing date         Mar 26, 2021
Priority date       Mar 26, 2021
Publication date    Sep 12, 2023
Grant date          Sep 12, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources. The method includes determining, based on the audio embeddings and a video embedding, whether one or more audio sources of the one or more estimated audio sources correspond to objects in the plurality of video frames. The method includes predicting, by the neural network and based on the one or more audio embeddings and the video embedding, a version of the audio waveform comprising audio sources that correspond to objects in the plurality of video frames.
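The last two steps of the abstract (deciding which estimated sources are on-screen, then producing a waveform made of those sources) can be illustrated with a minimal NumPy sketch. Everything named here is a hypothetical stand-in, not the patented implementation: the logistic classifier over concatenated embeddings, the parameters `w` and `b`, and the probability-weighted remix are all assumptions chosen for illustration, and the separation and embedding networks are assumed to have already produced `sources`, `audio_emb`, and `video_emb`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def on_screen_probabilities(audio_emb, video_emb, w, b):
    """Score each estimated source against the video embedding.

    audio_emb: (M, D) one embedding per estimated source (assumed given)
    video_emb: (D,)   one embedding for the video frames (assumed given)
    w, b:      hypothetical classifier weights (2D,) and bias
    Returns an (M,) vector of on-screen probabilities.
    """
    m = audio_emb.shape[0]
    # Pair every audio embedding with the same video embedding.
    joint = np.concatenate([audio_emb, np.tile(video_emb, (m, 1))], axis=1)
    return sigmoid(joint @ w + b)

def remix_on_screen(sources, probs):
    """Probability-weighted sum of sources: (M,) @ (M, T) -> (T,)."""
    return probs @ sources

# Two estimated sources of four samples each; only the first is on-screen.
sources = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.5, 0.5, 0.5]])
on_screen = remix_on_screen(sources, np.array([1.0, 0.0]))
```

With hard probabilities of 1 and 0, the remix simply keeps the first source and drops the second; soft probabilities would blend them instead.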

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: receiving, by a computing device, an audio waveform associated with a plurality of video frames; estimating, by a neural network and from the audio waveform, one or more audio sources associated with the plurality of video frames; generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources; generating, by the neural network and for each audio embedding corresponding to the one or more estimated audio sources and based on a video embedding of the plurality of video frames, a spatio-temporal audio-visual embedding based on an attention operation that aligns the one or more estimated audio sources with spatio-temporal positions of on-screen objects in the plurality of video frames; determining, by the neural network and based on the spatio-temporal audio-visual embedding, whether one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and predicting, by the neural network, a version of the audio waveform comprising audio sources that correspond to the on-screen objects in the plurality of video frames.

2. The computer-implemented method of claim 1, further comprising: responsive to determining that a particular audio source of the one or more audio sources corresponds to a particular object in the plurality of video frames, modifying an audio content associated with the particular audio source to produce a version of the audio waveform based on the modified audio content.

3. The computer-implemented method of claim 2, further comprising: providing the version of the audio waveform using the computing device.

4. The computer-implemented method of claim 1, wherein the neural network comprises a classifier, wherein a first attention pooling is applied to generate the one or more audio embeddings, a second attention pooling is applied to generate the video embedding, and wherein the determining of whether the one or more audio sources correspond to the on-screen objects in the plurality of video frames comprises applying the classifier based on the one or more audio embeddings and the video embedding.

5. The computer-implemented method of claim 1, wherein the neural network comprises a classifier, and attentional pooling is applied to the one or more audio embeddings and the video embedding, to produce a representation, wherein the determining of whether the one or more audio sources correspond to the on-screen objects in the plurality of video frames comprises applying the classifier based on the representation.

6. The computer-implemented method of claim 1, further comprising: determining, by the computing device, a request to identify on-screen audio sources in the plurality of video frames; sending the request to identify the on-screen audio sources from the computing device to a second computing device, the second computing device comprising a trained version of the neural network; after sending the request, the computing device receiving, from the second computing device, the determining of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and outputting the version of the waveform comprising the identified on-screen audio sources based on the received determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames.

7. The computer-implemented method of claim 1, wherein the neural network comprises: an audio separation network to perform the generating of the one or more estimated audio sources; and an audio embedding network to generate the one or more audio embeddings based on the one or more estimated audio sources, wherein the one or more audio embeddings comprise a representation of audio features.

8. The computer-implemented method of claim 7, wherein the neural network comprises a video embedding network to generate a global video embedding comprising a global representation of video features in the plurality of video frames.

9. The computer-implemented method of claim 8, further comprising: generating, based on the one or more audio embeddings and the global video embedding, an audio-visual embedding, and wherein the determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames is based on the audio-visual embedding.

10. The computer-implemented method of claim 7, wherein the neural network comprises a video embedding network to generate a local video embedding comprising, for each video frame of the plurality of video frames, a temporal representation of video features in the plurality of video frames.

11. The computer-implemented method of claim 10, further comprising: generating, based on the one or more audio embeddings and the local video embeddings, an audio-visual embedding, and wherein the determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames is based on the audio-visual embedding.

12. The computer-implemented method of claim 1, further comprising: training the neural network to receive a particular audio waveform associated with a particular plurality of video frames and predict a version of the particular audio waveform comprising particular audio sources that correspond to particular objects in the particular plurality of video frames.

13. The computer-implemented method of claim 12, wherein the training of the neural network comprises training a classifier based on active combinations cross entropy.

14. The computer-implemented method of claim 12, wherein the training of the neural network is performed at the computing device.

15. The computer-implemented method of claim 12, wherein the training of the neural network comprises unsupervised mixture invariant training.

16. The computer-implemented method of claim 12, wherein a training dataset for the training of the neural network comprises in-the-wild videos.

17. The computer-implemented method of claim 1, wherein the computing device comprises a camera, and the method further comprising: generating video content using the camera; and receiving, at the computing device, the generated video content from the camera.

18. The computer-implemented method of claim 1, further comprising: obtaining a trained neural network at the computing device, and wherein the predicting of the version of the audio waveform comprising the audio sources that correspond to the on-screen objects in the plurality of video frames comprises predicting by the computing device using the trained neural network.

19. The computer-implemented method of claim 1, further comprising: identifying a portion of an image in video content; determining that a particular audio source of the one or more estimated audio sources corresponds to a particular object in the identified portion of the video content; and modifying an audio content corresponding to the particular audio source.

20. A computing device, comprising: one or more processors; and data storage, wherein the data storage has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the computing device to ca
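Claim 1 recites an attention operation that aligns each estimated audio source with spatio-temporal positions in the video frames. The claim does not say which attention variant is used, so the sketch below assumes plain scaled dot-product attention, with each audio embedding as a query and the video embedding at every (frame, row, column) position as both key and value. The function name, tensor shapes, and the absence of learned projections are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def audio_visual_attention(audio_emb, video_emb):
    """Attend from each audio source over all video positions.

    audio_emb: (M, D)       one query per estimated source (assumed given)
    video_emb: (F, H, W, D) per-frame, per-location video features (assumed given)
    Returns:
      av_emb:  (M, D)       attention-weighted video features per source
      weights: (M, F, H, W) attention over spatio-temporal positions
    """
    m, d = audio_emb.shape
    keys = video_emb.reshape(-1, d)             # (F*H*W, D)
    scores = audio_emb @ keys.T / np.sqrt(d)    # (M, F*H*W)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    av_emb = weights @ keys                     # (M, D)
    return av_emb, weights.reshape(m, *video_emb.shape[:3])

rng = np.random.default_rng(0)
audio = rng.normal(size=(3, 8))        # 3 estimated sources, 8-dim embeddings
video = rng.normal(size=(4, 2, 2, 8))  # 4 frames on a 2x2 spatial grid
av_emb, weights = audio_visual_attention(audio, video)
```

The returned `weights` can be read as a per-source map over frames and image locations, which is one way to picture the "alignment" the claim describes; a classifier applied to `av_emb` would then make the on-screen decision.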

Assignees

Inventors

Classifications

  • G10L25/30 · Primary

    using neural networks · CPC title

  • G10L25/57 · Primary

    for processing of video signals · CPC title

  • characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques · CPC title

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • using neural networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11756570B2 cover?
Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to th…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L25/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 12, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).