Self-supervised AI-assisted sound effect recommendation for silent video
US-2021319321-A1 · Oct 14, 2021 · US
US11735197B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11735197-B2 |
| Application number | US-202016922543-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 7, 2020 |
| Priority date | Jul 7, 2020 |
| Publication date | Aug 22, 2023 |
| Grant date | Aug 22, 2023 |
Official abstract text for this publication.
Systems and methods of the present disclosure are directed toward digital signal processing using machine-learned differentiable digital signal processors. For example, embodiments of the present disclosure may include differentiable digital signal processors within the training loop of a machine-learned model (e.g., for gradient-based training). Advantageously, systems and methods of the present disclosure provide high quality signal processing using smaller models than prior systems, thereby reducing energy costs (e.g., storage and/or processing costs) associated with performing digital signal processing.
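The abstract's idea of placing a differentiable signal processor inside a gradient-based training loop can be illustrated with a toy case. The sketch below is not the patent's implementation. It assumes a single sinusoidal oscillator whose only control input is its amplitude, a mean-squared-error loss on the waveform, and a hand-derived gradient in NumPy. It also optimizes the control input directly, where the disclosure has a machine-learned model produce it. The names `oscillator` and `fit_amplitude`, the sample rate, and the learning rate are all hypothetical.

```python
import numpy as np

def oscillator(amplitude, freq_hz, n_samples, sample_rate=16000):
    """Sinusoidal oscillator: a DSP element whose output is differentiable in `amplitude`."""
    t = np.arange(n_samples) / sample_rate
    return amplitude * np.sin(2.0 * np.pi * freq_hz * t)

def fit_amplitude(target, freq_hz, steps=200, lr=0.5, sample_rate=16000):
    """Recover the oscillator amplitude by gradient descent on a waveform MSE loss."""
    n = len(target)
    # For this oscillator, d(output)/d(amplitude) is the unit-amplitude sinusoid.
    basis = oscillator(1.0, freq_hz, n, sample_rate)
    amplitude = 0.0  # assumed starting guess
    for _ in range(steps):
        output = amplitude * basis
        # loss = mean((output - target)^2); chain rule through the oscillator.
        grad = np.mean(2.0 * (output - target) * basis)
        amplitude -= lr * grad
    return amplitude

# Target: one second of a 440 Hz tone at amplitude 0.7 (illustrative values).
target = oscillator(0.7, 440.0, 16000)
estimated = fit_amplitude(target, 440.0)
```

Because the loss gradient flows through the oscillator to its control input, the same update rule could in principle be chained further back into the weights of a model that predicts that control input.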
Opening claim text (preview).
What is claimed is:
1. A computing system for the synthesis of an output audio waveform based on an input audio waveform, comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store: one or more digital signal processors for processing the input audio waveform; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining one or more control inputs for controlling the one or more digital signal processors, wherein the one or more control inputs are generated from one or more latent representations of acoustic features of a reference audio source, the one or more latent representations having been generated by a machine-learned model trained by backpropagation of a loss determined by comparing a recording of the reference audio source and a synthesized recording thereof; inputting the one or more control inputs and the input audio waveform into the one or more digital signal processors; and synthesizing the output audio waveform with the one or more digital signal processors.
2. The computing system of claim 1, wherein the recording of the reference audio source is different from the input audio waveform.
3. The computing system of claim 1, wherein the one or more digital signal processors comprises one or more of a linear time-varying filter, a linear time-invariant filter, a finite impulse response filter, an infinite impulse response filter, an oscillator, a short-time Fourier transform, a parametric equalization processor, an effects processor, an additive synthesizer, a subtractive synthesizer, or a wavetable synthesizer.
4. The computing system of claim 1, wherein the one or more digital signal processors comprises an additive synthesizer and a subtractive synthesizer for generating the output audio waveform.
5. The computing system of claim 4, wherein the additive synthesizer comprises an oscillator and the subtractive synthesizer comprises a linear time-varying filter applied to a noise source.
6. The computing system of claim 4, wherein the control inputs comprise reverberation control inputs obtained by recreating a reverberation effect of the reference audio source using a reverberation digital signal processor.
7. The computing system of claim 1, wherein the output audio waveform comprises a speech waveform.
8. The computing system of claim 1, wherein the machine-learned model comprises an encoder for processing the model input and a decoder for outputting the one or more control inputs.
9. The computing system of claim 1, wherein the loss comprises a spectral loss.
10. The computing system of claim 9, wherein the spectral loss is a multi-scale spectral loss.
11. One or more non-transitory computer-readable media that collectively store: one or more digital signal processors for processing an input audio waveform; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining one or more control inputs for controlling the one or more digital signal processors, wherein the one or more control inputs are generated from one or more latent representations of acoustic features of a reference audio source, the one or more latent representations having been generated by a machine-learned model trained by backpropagation of a loss determined by comparing a recording of the reference audio source and a synthesized recording thereof; inputting the one or more control inputs and the input audio waveform into the one or more digital signal processors; and synthesizing an output audio waveform with the one or more digital signal processors.
12. The one or more non-transitory computer-readable media of claim 11, wherein the recording of the reference audio source is different from the input audio waveform.
13. The one or more non-transitory computer-readable media of claim 11, wherein the one or more digital signal processors comprises one or more of a linear time-varying filter, a linear time-invariant filter, a finite impulse response filter, an infinite impulse response filter, an oscillator, a short-time Fourier transform, a parametric equalization processor, an effects processor, an additive synthesizer, a subtractive synthesizer, or a wavetable synthesizer.
14. The one or more non-transitory computer-readable media of claim 11, wherein the one or more digital signal processors comprises an additive synthesizer and a subtractive synthesizer for generating the output audio waveform.
15. The one or more non-transitory computer-readable media of claim 14, wherein the additive synthesizer comprises an oscillator and the subtractive synthesizer comprises a linear time-varying filter applied to a noise source.
16. The one or more non-transitory computer-readable media of claim 14, wherein the control inputs comprise reverberation control inputs obtained by recreating a reverberation effect of the reference audio source using a reverberation digital signal processor.
17. The one or more non-transitory computer-readable media of claim 11, wherein the output audio waveform comprises a speech waveform.
18. The one or more non-transitory computer-readable media of claim 11, wherein the machine-learned model comprises an encoder for processing the model input and a decoder for outputting the one or more control inputs.
19. The one or more non-transitory computer-readable media of claim 11, wherein the loss is a multi-scale spectral loss.
20. A method for the synthesis of an output audio waveform based on an input audio waveform, comprising: obtaining, by a computing system comprising one or more processors, one or more control inputs for controlling one or more digital signal processors, wherein the one or more control inputs are generated from one or more latent representations of acoustic features of a reference audio source, the one or more latent representations having been generated by a machine-learned model trained by backpropagation of a loss determined by comparing a recording of the reference audio source and a synthesized recording thereof; inputting, by the computing system, the one or more control inputs and the input audio waveform into the one or more digital signal processors; and synthesizing, by the computing system, the output audio waveform with the one or more digital signal processors.
21. A computing system that combines machine learning with digital signal processors, the computing system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store: one or more differentiable digital signal processors configured to receive one or more control inputs and to process the one or more control inputs to generate a digital signal output, wherein each of the one or more differentiable digital signal processors is differentiable from the digital signal output to the one or more control inputs; a machine-learned model configured to receive a model input and to process the model input to generate the one or more control inputs for the one or more differentiable digital signal processors, wherein the machine-learned model has been trained by backpropagating a loss through the one or more differentiable digital signal processors; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: rec
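Claims 4 and 5 pair an additive synthesizer (an oscillator) with a subtractive synthesizer (a linear time-varying filter applied to a noise source), and claims 9, 10, and 19 recite a multi-scale spectral loss. The NumPy sketch below shows one way those pieces could look. It is illustrative only and is not taken from the patent text: the function names, the 16 kHz sample rate, the frame-wise FIR overlap-add scheme, the FFT sizes, and the particular loss form (L1 distance on magnitude plus L1 distance on log magnitude) are all assumptions.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate

def additive_synth(f0_hz, harmonic_amps, n_samples):
    """Additive synthesis: a bank of harmonically related sinusoidal oscillators."""
    harmonic_amps = np.asarray(harmonic_amps, dtype=float)
    t = np.arange(n_samples) / SAMPLE_RATE
    freqs = f0_hz * np.arange(1, len(harmonic_amps) + 1)
    # Silence any harmonic at or above the Nyquist frequency to avoid aliasing.
    amps = np.where(freqs < SAMPLE_RATE / 2, harmonic_amps, 0.0)
    return (amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t[None, :])).sum(axis=0)

def filtered_noise(frame_filters, frame_size, rng):
    """Subtractive synthesis: white noise shaped by one FIR filter per frame (overlap-add)."""
    n_frames, n_taps = frame_filters.shape
    out = np.zeros(n_frames * frame_size + n_taps - 1)
    for i, h in enumerate(frame_filters):
        noise = rng.uniform(-1.0, 1.0, frame_size)
        start = i * frame_size
        out[start:start + frame_size + n_taps - 1] += np.convolve(noise, h)
    return out[:n_frames * frame_size]

def stft_mag(x, fft_size):
    """Magnitude spectrogram with a Hann window and 75% overlap."""
    hop = fft_size // 4
    window = np.hanning(fft_size)
    n_frames = 1 + (len(x) - fft_size) // hop
    frames = np.stack([x[i * hop:i * hop + fft_size] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multi_scale_spectral_loss(x, y, fft_sizes=(1024, 512, 256, 128, 64), eps=1e-7):
    """Sum of linear and log magnitude L1 distances over several FFT sizes."""
    loss = 0.0
    for n in fft_sizes:
        sx, sy = stft_mag(x, n), stft_mag(y, n)
        loss += np.mean(np.abs(sx - sy))
        loss += np.mean(np.abs(np.log(sx + eps) - np.log(sy + eps)))
    return loss

rng = np.random.default_rng(0)
n = 4000
harmonic = additive_synth(220.0, [1.0, 0.5, 0.25], n)
# Ten frames of a 3-tap filter whose gain ramps up over time (time-varying).
filters = np.array([0.5, 0.3, 0.2])[None, :] * np.linspace(0.0, 0.1, n // 400)[:, None]
output = harmonic + filtered_noise(filters, 400, rng)
loss_same = multi_scale_spectral_loss(output, output)
loss_diff = multi_scale_spectral_loss(output, 0.5 * harmonic)
```

Comparing spectra at several FFT sizes trades time resolution against frequency resolution across scales, which is the usual motivation for a multi-scale form over a single spectrogram distance.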
Convolutional networks [CNN, ConvNet] · CPC title
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Generative networks · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title