US12198673B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12198673-B2 |
| Application number | US-202117525814-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 12, 2021 |
| Priority date | Nov 12, 2021 |
| Publication date | Jan 14, 2025 |
| Grant date | Jan 14, 2025 |
Official abstract text for this publication.
The present disclosure describes techniques for a differentiable wavetable synthesizer. The techniques comprise: extracting features from a dataset of sounds, wherein the features comprise at least a timbre embedding; inputting the features to a first machine learning model, wherein the first machine learning model is configured to extract a set of N×L learnable parameters, where N represents the number of wavetables and L represents the wavetable length; and outputting a plurality of wavetables, wherein each of the plurality of wavetables comprises a waveform associated with a unique timbre, the plurality of wavetables form a dictionary, and the plurality of wavetables are portable for performing audio-related tasks. The wavetables are then used to initialize another machine learning model, which helps reduce the computational complexity of the audio synthesis produced from that model's output.
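To make the abstract's vocabulary concrete, the following is a minimal sketch of how a dictionary of N wavetables of length L could be read back and mixed into audio. It is an illustration under stated assumptions, not the patented implementation: the tables here are fixed sine harmonics standing in for learned parameters, and the names `make_wavetables`, `read_table`, and `synthesize`, along with the weight normalization, are hypothetical choices made for this example.

```python
import math


def make_wavetables(n_tables, table_len):
    """Hypothetical stand-in for the N x L learned parameters.

    Table k holds one cycle of harmonic k + 1. A real system would
    learn these values; fixed sines are used here only for illustration.
    """
    return [
        [math.sin(2 * math.pi * (k + 1) * i / table_len) for i in range(table_len)]
        for k in range(n_tables)
    ]


def read_table(table, phase):
    """Read a table at a fractional phase in [0, 1) with linear interpolation."""
    pos = phase * len(table)
    i0 = int(pos) % len(table)
    i1 = (i0 + 1) % len(table)
    frac = pos - int(pos)
    return (1.0 - frac) * table[i0] + frac * table[i1]


def synthesize(tables, weights, f0_hz, amplitude, sample_rate, n_samples):
    """Mix the tables with normalized weights while a phase accumulator sets pitch."""
    total = sum(weights)
    norm = [w / total for w in weights]  # assumption: weights sum to one
    out = []
    phase = 0.0
    for _ in range(n_samples):
        sample = sum(w * read_table(t, phase) for w, t in zip(norm, tables))
        out.append(amplitude * sample)
        phase = (phase + f0_hz / sample_rate) % 1.0
    return out


tables = make_wavetables(n_tables=4, table_len=512)
audio = synthesize(tables, [0.7, 0.2, 0.1, 0.0], 220.0, 0.5, 16000, 1600)
```

In this sketch the weights are constant for the whole note; making them vary per sample or per frame would correspond to the time-varying attention weights mentioned in the claims.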
Opening claim text (preview).
What is claimed is:

1. A method, comprising: extracting features from a dataset of sounds, wherein the features comprise at least timbre embedding; inputting the features to a first machine learning model; generating a plurality of wavetables by the first machine learning model based on the features, wherein each of plurality of wavetables comprises a waveform associated with a unique timbre; initializing another machine learning model with at least one subset of the plurality of wavetables, wherein the another machine learning model is configured to reduce a computational complexity of audio synthesis; and generating an audio item based on data output from the another machine learning model.
2. The method of claim 1, further comprising: producing an audio item based at least in part on at least one subset of the plurality of wavetables.
3. The method of claim 2, wherein the another machine learning model comprises a third machine learning model, and wherein the method further comprises: training the third machine learning model on a short piece of new audio item, wherein the third machine learning model is initialized with the plurality of wavetables.
4. The method of claim 3, further comprising: producing the audio item using the third machine learning model, wherein the third machine learning model outputs only time-varying attention weights associated with the at least one subset of the plurality of wavetables.
5. The method of claim 2, further comprising: specifying a time-varying timbre vector; and producing the audio item based on the specified time-varying timbre vector and the at least one subset of the plurality of wavetables.
6. The method of claim 2, wherein the another machine learning model comprises a second machine learning model, and wherein the method further comprises: producing the audio item using the second machine learning model, wherein the second machine learning model is initialized with the at least one subset of the plurality of wavetables, and wherein the second machine learning model outputs only data indicative of a linear combination of the at least one subset of the plurality of wavetables.
7. The method of claim 1, wherein the first machine learning model outputs the plurality of wavetables, linear attentions and amplitudes of the plurality of wavetables.
8. The method of claim 1, wherein the plurality of wavetables enable to reduce a number of control dimensions of audio synthesis.
9. A method, comprising: obtaining at least one subset of a plurality of wavetables, wherein each of the plurality of wavetables comprises a waveform associated with a unique timbre, wherein the plurality of wavetables are generated by a first machine learning model based on input features, and wherein the input features comprise at least timbre embedding extracted from a dataset of sounds; initializing another machine learning model with the at least one subset of the plurality of wavetables, wherein the another machine learning model is configured to reduce a computational complexity of audio synthesis; and producing an audio item based on data output from the another machine learning model.
10. A system, comprising: at least one processor; and at least one memory communicatively coupled to the at least one processor and storing instructions that upon execution by the at least one processor cause the system to perform operations, the operations comprising: extracting features from a dataset of sounds, wherein the features comprise at least timbre embedding; input the features to a first machine learning model; generating a plurality of wavetables by the first machine learning model based on the features, wherein each of plurality of wavetables comprises a waveform associated with a unique timbre; initializing another machine learning model with at least one subset of the plurality of wavetables, wherein the another machine learning model is configured to reduce a computational complexity of audio synthesis; and generating an audio item based on data output from the another machine learning model.
11. The system of claim 10, the operations further comprising: producing an audio item based at least in part on at least one subset of the plurality of wavetables.
12. The system of claim 11, wherein the another machine learning model comprises a third machine learning model, and wherein the operations further comprise: training the third machine learning model on a short piece of new audio item, wherein the third machine learning model is initialized with the plurality of wavetables.
13. The system of claim 12, the operations further comprising: producing the audio item using the third machine learning model, wherein the third machine learning model outputs only time-varying attention weights associated with the at least one subset of the plurality of wavetables.
14. The system of claim 11, the operations further comprising: specifying a time-varying timbre vector; and producing the audio item based on the specified a time-varying timbre vector and the at least one subset of the plurality of wavetables.
15. The system of claim 11, wherein the another machine learning model comprises a second machine learning model, and wherein the operations further comprise: producing the audio item using the second machine learning model, wherein the second machine learning model is initialized with the at least one subset of the plurality of wavetables, and wherein outputs only data indicative of a linear combination of the at least one subset of the plurality of wavetables.
16. The system of claim 11, wherein the plurality of wavetables enable to reduce a number of control dimensions of audio synthesis.
17. A non-transitory computer-readable storage medium, storing computer-readable instructions that upon execution by a processor cause the processor to implement operations, the operation comprising: extracting features from a dataset of sounds, wherein the features comprise at least timbre embedding; input the features to a first machine learning model; generating a plurality of wavetables by the first machine learning model based on the features, wherein each of plurality of wavetables comprises a waveform associated with a unique timbre; initializing another machine learning model with at least one subset of the plurality of wavetables, wherein the another machine learning model is configured to reduce a computational complexity of audio synthesis; and generating an audio item based on data output from the another machine learning model.
18. The non-transitory computer-readable storage medium of claim 17, the operations further comprising: producing an audio item based at least in part on at least one subset of the plurality of wavetables.
19. The non-transitory computer-readable storage medium of claim 18, wherein the another machine learning model comprises a second machine learning model, wherein the operations further comprise: producing the audio item using the second machine learning model, wherein the second machine learning model is initialized with the at least one subset of the plurality of wavetables, and wherein the second machine learning model outputs only data indicative of a linear combination of the at least one subset of the plurality of wavetables.
20. The non-transitory computer-readable storage medium of claim 18, wherein the another machine learning model comprises a third machine learning model, wherein the operations further co
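Several claims describe a downstream model that is initialized with the wavetables and outputs only a linear combination of them. The sketch below illustrates that idea in the simplest possible form, and every part of it is an assumption made for illustration: the dictionary is kept frozen, a plain least-squares fit by gradient descent stands in for the machine learning model, and the target is a single waveform cycle rather than a real audio clip. The names `sine_dictionary`, `mix`, and `fit_weights` are hypothetical.

```python
import math


def sine_dictionary(n_tables, table_len):
    """Hypothetical frozen dictionary: table k is one cycle of harmonic k + 1."""
    return [
        [math.sin(2 * math.pi * (k + 1) * i / table_len) for i in range(table_len)]
        for k in range(n_tables)
    ]


def mix(tables, weights):
    """Linear combination of the tables, sample by sample."""
    length = len(tables[0])
    return [sum(w * t[i] for w, t in zip(weights, tables)) for i in range(length)]


def fit_weights(tables, target, steps=100, lr=0.5):
    """Fit only the mixing weights to a target cycle; the tables never change.

    Minimizes mean squared error with plain gradient descent. This is a
    stand-in for a trained model, not the method described in the patent.
    """
    length = len(target)
    weights = [0.0] * len(tables)
    for _ in range(steps):
        estimate = mix(tables, weights)
        err = [e - y for e, y in zip(estimate, target)]
        grads = [
            2.0 / length * sum(err[i] * t[i] for i in range(length))
            for t in tables
        ]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


dictionary = sine_dictionary(n_tables=3, table_len=256)
true_weights = [0.6, 0.0, 0.3]
target_cycle = mix(dictionary, true_weights)
fitted = fit_weights(dictionary, target_cycle)
```

With the dictionary frozen, only 3 numbers are fitted here instead of 3 × 256 table samples, which is one way to read the claims' reference to fewer control dimensions and lower computational complexity.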
Segmentation; Word boundary detection · CPC title
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
using Fourier coefficients · CPC title
Pre-filtering or post-filtering · CPC title
Architecture of speech synthesisers · CPC title