Audio-visual separation of on-screen sounds based on machine learning models

US11756570B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11756570-B2
Application number	US-202117214186-A
Country	US
Kind code	B2
Filing date	Mar 26, 2021
Priority date	Mar 26, 2021
Publication date	Sep 12, 2023
Grant date	Sep 12, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources. The method includes determining, based on the audio embeddings and a video embedding, whether one or more audio sources of the one or more estimated audio sources correspond to objects in the plurality of video frames. The method includes predicting, by the neural network and based on the one or more audio embeddings and the video embedding, a version of the audio waveform comprising audio sources that correspond to objects in the plurality of video frames.
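As a rough illustration of the last step in the abstract (producing a version of the waveform that keeps only sources matching on-screen objects), the sketch below remixes already-separated sources using a per-source on-screen probability. It is a minimal sketch under stated assumptions, not the patented implementation: the function name `remix_on_screen`, the `threshold` parameter, and the sample values are hypothetical, and the separation and classification networks are assumed to have run already.

```python
from typing import List, Optional, Sequence


def remix_on_screen(
    sources: Sequence[Sequence[float]],
    on_screen_prob: Sequence[float],
    threshold: Optional[float] = None,
) -> List[float]:
    """Combine estimated sources into an on-screen-only waveform.

    With threshold=None each source is scaled by its probability (soft mix);
    otherwise a source is kept at full level only if its probability meets
    the threshold (hard selection). Both modes are illustrative assumptions.
    """
    if len(sources) != len(on_screen_prob):
        raise ValueError("need one probability per estimated source")
    if not sources:
        return []
    n = len(sources[0])
    if any(len(s) != n for s in sources):
        raise ValueError("all sources must have the same number of samples")

    out = [0.0] * n
    for src, p in zip(sources, on_screen_prob):
        if threshold is None:
            weight = p
        else:
            weight = 1.0 if p >= threshold else 0.0
        for t, sample in enumerate(src):
            out[t] += weight * sample
    return out


# Two hypothetical separated sources: the first judged on-screen, the second not.
speech = [0.5, -0.5, 0.25]
traffic = [0.1, 0.1, 0.1]
hard = remix_on_screen([speech, traffic], [0.9, 0.2], threshold=0.5)
soft = remix_on_screen([speech, traffic], [1.0, 0.0])
```

In this toy case both calls return the `speech` samples unchanged, because the off-screen source receives zero weight.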

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method, comprising: receiving, by a computing device, an audio waveform associated with a plurality of video frames; estimating, by a neural network and from the audio waveform, one or more audio sources associated with the plurality of video frames; generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources; generating, by the neural network and for each audio embedding corresponding to the one or more estimated audio sources and based on a video embedding of the plurality of video frames, a spatio-temporal audio-visual embedding based on an attention operation that aligns the one or more estimated audio sources with spatio-temporal positions of on-screen objects in the plurality of video frames; determining, by the neural network and based on the spatio-temporal audio-visual embedding, whether one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and predicting, by the neural network, a version of the audio waveform comprising audio sources that correspond to the on-screen objects in the plurality of video frames.
2. The computer-implemented method of claim 1, further comprising: responsive to determining that a particular audio source of the one or more audio sources corresponds to a particular object in the plurality of video frames, modifying an audio content associated with the particular audio source to produce a version of the audio waveform based on the modified audio content.
3. The computer-implemented method of claim 2, further comprising: providing the version of the audio waveform using the computing device.
4. The computer-implemented method of claim 1, wherein the neural network comprises a classifier, wherein a first attention pooling is applied to generate the one or more audio embeddings, a second attention pooling is applied to generate the video embedding, and wherein the determining of whether the one or more audio sources correspond to the on-screen objects in the plurality of video frames comprises applying the classifier based on the one or more audio embeddings and the video embedding.
5. The computer-implemented method of claim 1, wherein the neural network comprises a classifier, and attentional pooling is applied to the one or more audio embeddings and the video embedding, to produce a representation, wherein the determining of whether the one or more audio sources correspond to the on-screen objects in the plurality of video frames comprises applying the classifier based on the representation.
6. The computer-implemented method of claim 1, further comprising: determining, by the computing device, a request to identify on-screen audio sources in the plurality of video frames; sending the request to identify the on-screen audio sources from the computing device to a second computing device, the second computing device comprising a trained version of the neural network; after sending the request, the computing device receiving, from the second computing device, the determining of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and outputting the version of the waveform comprising the identified on-screen audio sources based on the received determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames.
7. The computer-implemented method of claim 1, wherein the neural network comprises: an audio separation network to perform the generating of the one or more estimated audio sources; and an audio embedding network to generate the one or more audio embeddings based on the one or more estimated audio sources, wherein the one or more audio embeddings comprise a representation of audio features.
8. The computer-implemented method of claim 7, wherein the neural network comprises a video embedding network to generate a global video embedding comprising a global representation of video features in the plurality of video frames.
9. The computer-implemented method of claim 8, further comprising: generating, based on the one or more audio embeddings and the global video embedding, an audio-visual embedding, and wherein the determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames is based on the audio-visual embedding.
10. The computer-implemented method of claim 7, wherein the neural network comprises a video embedding network to generate a local video embedding comprising, for each video frame of the plurality of video frames, a temporal representation of video features in the plurality of video frames.
11. The computer-implemented method of claim 10, further comprising: generating, based on the one or more audio embeddings and the local video embeddings, an audio-visual embedding, and wherein the determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames is based on the audio-visual embedding.
12. The computer-implemented method of claim 1, further comprising: training the neural network to receive a particular audio waveform associated with a particular plurality of video frames and predict a version of the particular audio waveform comprising particular audio sources that correspond to particular objects in the particular plurality of video frames.
13. The computer-implemented method of claim 12, wherein the training of the neural network comprises training a classifier based on active combinations cross entropy.
14. The computer-implemented method of claim 12, wherein the training of the neural network is performed at the computing device.
15. The computer-implemented method of claim 12, wherein the training of the neural network comprises unsupervised mixture invariant training.
16. The computer-implemented method of claim 12, wherein a training dataset for the training of the neural network comprises in-the-wild videos.
17. The computer-implemented method of claim 1, wherein the computing device comprises a camera, and the method further comprising: generating video content using the camera; and receiving, at the computing device, the generated video content from the camera.
18. The computer-implemented method of claim 1, further comprising: obtaining a trained neural network at the computing device, and wherein the predicting of the version of the audio waveform comprising the audio sources that correspond to the on-screen objects in the plurality of video frames comprises predicting by the computing device using the trained neural network.
19. The computer-implemented method of claim 1, further comprising: identifying a portion of an image in video content; determining that a particular audio source of the one or more estimated audio sources corresponds to a particular object in the identified portion of the video content; and modifying an audio content corresponding to the particular audio source.
20. A computing device, comprising: one or more processors; and data storage, wherein the data storage has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the computing device to ca…
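Claim 1 recites an attention operation that aligns each estimated audio source with spatio-temporal positions in the video. The claim text shown here does not specify the form of that attention, so the sketch below assumes plain scaled dot-product attention, with one audio embedding as the query and one video embedding per spatio-temporal position as both key and value. The names `softmax` and `attend` and the toy embeddings are hypothetical and only illustrate the general idea.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    audio_emb: Sequence[float],
    video_embs: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, attended video feature) for one audio source.

    Assumes scaled dot-product attention; the patent may use a different form.
    """
    d = len(audio_emb)
    if not video_embs or any(len(pos) != d for pos in video_embs):
        raise ValueError("video embeddings must be non-empty and match audio dim")
    scale = math.sqrt(d)
    # One similarity score per spatio-temporal position.
    scores = [
        sum(a * v for a, v in zip(audio_emb, pos)) / scale for pos in video_embs
    ]
    weights = softmax(scores)
    # Weighted average of the video embeddings, one value per dimension.
    attended = [
        sum(w * pos[k] for w, pos in zip(weights, video_embs)) for k in range(d)
    ]
    return weights, attended


# Three hypothetical spatio-temporal positions with 2-D embeddings.
video = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio = [2.0, 0.0]
weights, av_emb = attend(audio, video)
```

Here the first position receives the largest weight because its embedding points in the same direction as the audio embedding. In a full system, the attended feature would feed a classifier that decides whether the source is on-screen.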

Assignees

Google LLC

Inventors

Classifications

G10L25/30 · Primary
using neural networks · CPC title
G10L25/57 · Primary
for processing of video signals · CPC title
G10L21/0308
characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques · CPC title
G06V10/774
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
G06V10/82
using neural networks · CPC title
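The symbols above follow the standard CPC layout: a section letter, a two-digit class, a subclass letter, then a main group and subgroup separated by a slash. Section G is titled "Physics" in the CPC scheme, which is presumably where this page's "Physics" technology-area mapping comes from. The sketch below splits a symbol into those parts; the helper name `parse_cpc` and the exact regular expression are illustrative assumptions, not part of any official tool.

```python
import re
from typing import Dict

# Section letter, 2-digit class, subclass letter, main group "/" subgroup.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> Dict[str, str]:
    """Split a CPC symbol such as 'G10L25/30' into its hierarchical parts."""
    match = CPC_PATTERN.match(symbol.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognisable CPC symbol: {symbol!r}")
    section, cls, subclass, main_group, subgroup = match.groups()
    return {
        "section": section,
        "class": section + cls,
        "subclass": section + cls + subclass,
        "main_group": main_group,
        "subgroup": subgroup,
    }


primary = parse_cpc("G10L25/30")
training = parse_cpc("G06V10/774")
```

For `G10L25/30` this yields section `G`, class `G10`, subclass `G10L`, main group `25`, and subgroup `30`.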

Patent family

Related publications grouped by family.

View patent family 83363629

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11756570B2 cover? Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to th…
Who is the assignee on this patent? Google LLC
What technology area does this patent fall under? Primary CPC classification G10L25/30. Mapped technology areas include Physics.
When was this patent published? Publication date Sep 12, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).