Methods and systems for augmenting audio content
US11756570B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11756570-B2 |
| Application number | US-202117214186-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 26, 2021 |
| Priority date | Mar 26, 2021 |
| Publication date | Sep 12, 2023 |
| Grant date | Sep 12, 2023 |
Official abstract text for this publication.
Apparatus and methods related to separation of audio sources are provided. The method includes receiving an audio waveform associated with a plurality of video frames. The method includes estimating, by a neural network, one or more audio sources associated with the plurality of video frames. The method includes generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources. The method includes determining, based on the audio embeddings and a video embedding, whether one or more audio sources of the one or more estimated audio sources correspond to objects in the plurality of video frames. The method includes predicting, by the neural network and based on the one or more audio embeddings and the video embedding, a version of the audio waveform comprising audio sources that correspond to objects in the plurality of video frames.
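The last step of the abstract, producing a version of the waveform that contains only the sources matching on-screen objects, can be illustrated with a minimal sketch. It assumes the separation stage has already produced one waveform per estimated source and that the classifier has already produced one on-screen probability per source. The function name `remix_on_screen`, the soft weighting by probability, and the toy values are illustrative assumptions and are not taken from the patent text.

```python
from typing import List, Sequence


def remix_on_screen(sources: Sequence[Sequence[float]],
                    on_screen_probs: Sequence[float]) -> List[float]:
    """Sum estimated sources, each weighted by its on-screen probability.

    Hypothetical helper: the patent only states that a version of the
    waveform comprising on-screen sources is predicted. Weighting each
    source by a probability in [0, 1] is one possible way to do that.
    """
    if len(sources) != len(on_screen_probs):
        raise ValueError("need exactly one probability per estimated source")
    if not sources:
        return []
    n_samples = len(sources[0])
    if any(len(source) != n_samples for source in sources):
        raise ValueError("all estimated sources must have the same length")

    mix = [0.0] * n_samples
    for source, prob in zip(sources, on_screen_probs):
        for t, sample in enumerate(source):
            mix[t] += prob * sample
    return mix


# Toy input: source 0 is judged on-screen, source 1 off-screen.
sources = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
probs = [1.0, 0.0]
on_screen_mix = remix_on_screen(sources, probs)
```

With hard 0/1 probabilities the result is simply the sum of the sources judged on-screen. Intermediate values attenuate uncertain sources instead of dropping them.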
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: receiving, by a computing device, an audio waveform associated with a plurality of video frames; estimating, by a neural network and from the audio waveform, one or more audio sources associated with the plurality of video frames; generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources; generating, by the neural network and for each audio embedding corresponding to the one or more estimated audio sources and based on a video embedding of the plurality of video frames, a spatio-temporal audio-visual embedding based on an attention operation that aligns the one or more estimated audio sources with spatio-temporal positions of on-screen objects in the plurality of video frames; determining, by the neural network and based on the spatio-temporal audio-visual embedding, whether one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and predicting, by the neural network, a version of the audio waveform comprising audio sources that correspond to the on-screen objects in the plurality of video frames.
2. The computer-implemented method of claim 1, further comprising: responsive to determining that a particular audio source of the one or more audio sources corresponds to a particular object in the plurality of video frames, modifying an audio content associated with the particular audio source to produce a version of the audio waveform based on the modified audio content.
3. The computer-implemented method of claim 2, further comprising: providing the version of the audio waveform using the computing device.
4. The computer-implemented method of claim 1, wherein the neural network comprises a classifier, wherein a first attention pooling is applied to generate the one or more audio embeddings, a second attention pooling is applied to generate the video embedding, and wherein the determining of whether the one or more audio sources correspond to the on-screen objects in the plurality of video frames comprises applying the classifier based on the one or more audio embeddings and the video embedding.
5. The computer-implemented method of claim 1, wherein the neural network comprises a classifier, and attentional pooling is applied to the one or more audio embeddings and the video embedding, to produce a representation, wherein the determining of whether the one or more audio sources correspond to the on-screen objects in the plurality of video frames comprises applying the classifier based on the representation.
6. The computer-implemented method of claim 1, further comprising: determining, by the computing device, a request to identify on-screen audio sources in the plurality of video frames; sending the request to identify the on-screen audio sources from the computing device to a second computing device, the second computing device comprising a trained version of the neural network; after sending the request, the computing device receiving, from the second computing device, the determining of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and outputting the version of the waveform comprising the identified on-screen audio sources based on the received determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames.
7. The computer-implemented method of claim 1, wherein the neural network comprises: an audio separation network to perform the generating of the one or more estimated audio sources; and an audio embedding network to generate the one or more audio embeddings based on the one or more estimated audio sources, wherein the one or more audio embeddings comprise a representation of audio features.
8. The computer-implemented method of claim 7, wherein the neural network comprises a video embedding network to generate a global video embedding comprising a global representation of video features in the plurality of video frames.
9. The computer-implemented method of claim 8, further comprising: generating, based on the one or more audio embeddings and the global video embedding, an audio-visual embedding, and wherein the determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames is based on the audio-visual embedding.
10. The computer-implemented method of claim 7, wherein the neural network comprises a video embedding network to generate a local video embedding comprising, for each video frame of the plurality of video frames, a temporal representation of video features in the plurality of video frames.
11. The computer-implemented method of claim 10, further comprising: generating, based on the one or more audio embeddings and the local video embeddings, an audio-visual embedding, and wherein the determination of whether the one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames is based on the audio-visual embedding.
12. The computer-implemented method of claim 1, further comprising: training the neural network to receive a particular audio waveform associated with a particular plurality of video frames and predict a version of the particular audio waveform comprising particular audio sources that correspond to particular objects in the particular plurality of video frames.
13. The computer-implemented method of claim 12, wherein the training of the neural network comprises training a classifier based on active combinations cross entropy.
14. The computer-implemented method of claim 12, wherein the training of the neural network is performed at the computing device.
15. The computer-implemented method of claim 12, wherein the training of the neural network comprises unsupervised mixture invariant training.
16. The computer-implemented method of claim 12, wherein a training dataset for the training of the neural network comprises in-the-wild videos.
17. The computer-implemented method of claim 1, wherein the computing device comprises a camera, and the method further comprising: generating video content using the camera; and receiving, at the computing device, the generated video content from the camera.
18. The computer-implemented method of claim 1, further comprising: obtaining a trained neural network at the computing device, and wherein the predicting of the version of the audio waveform comprising the audio sources that correspond to the on-screen objects in the plurality of video frames comprises predicting by the computing device using the trained neural network.
19. The computer-implemented method of claim 1, further comprising: identifying a portion of an image in video content; determining that a particular audio source of the one or more estimated audio sources corresponds to a particular object in the identified portion of the video content; and modifying an audio content corresponding to the particular audio source.
20. A computing device, comprising: one or more processors; and data storage, wherein the data storage has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the computing device to ca
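Claim 1 relies on an attention operation that aligns each estimated audio source with spatio-temporal positions in the video. The claim text does not specify the form of that operation, so the sketch below substitutes ordinary scaled dot-product attention as a stand-in. It assumes one audio embedding acts as the query, that the video embedding is given as one vector per spatio-temporal position, and that both share the same dimension. The function name `attend` and all values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attend(audio_embedding: Sequence[float],
           video_embeddings: Sequence[Sequence[float]]
           ) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention of one audio source over video positions.

    Assumption: this is a generic attention form, not the patented one.
    Returns (fused, weights), where weights[i] says how strongly the audio
    source aligns with spatio-temporal position i, and fused is the
    attention-weighted average of the video position embeddings.
    """
    dim = len(audio_embedding)
    if dim == 0 or not video_embeddings:
        raise ValueError("embeddings must be non-empty")
    if any(len(pos) != dim for pos in video_embeddings):
        raise ValueError("audio and video embeddings must share a dimension")

    # Similarity between the audio query and every video position.
    scores = [
        sum(a * v for a, v in zip(audio_embedding, pos)) / math.sqrt(dim)
        for pos in video_embeddings
    ]
    # Numerically stable softmax over positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the video embeddings, one value per dimension.
    fused = [
        sum(w * pos[d] for w, pos in zip(weights, video_embeddings))
        for d in range(dim)
    ]
    return fused, weights


# Toy input: three spatio-temporal positions with 2-dimensional embeddings.
fused, weights = attend([1.0, 0.0], [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]])
```

In this toy input the first position points in the same direction as the audio embedding, so it receives the largest weight. A downstream classifier, as in claims 4 and 5, could then consume `fused` to decide whether the source is on-screen.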
using neural networks · CPC title
for processing of video signals · CPC title
characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
using neural networks · CPC title