Audio synthesizing method, storage medium and computer equipment
US-2020372896-A1 · Nov 26, 2020 · US
US12424197B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12424197-B2 |
| Application number | US-202118252186-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 23, 2021 |
| Priority date | Jan 20, 2021 |
| Publication date | Sep 23, 2025 |
| Grant date | Sep 23, 2025 |
Official abstract text for this publication.
A custom tone and vocal synthesis method and apparatus, an electronic device, and a storage medium. The synthesis method comprises: training a first neural network by means of a speaker record sample to obtain a speaker recognition model, the output training result of the first neural network being a speaker vector sample (S102); training a second neural network by means of an unaccompanied vocal singing sample and the speaker vector sample to obtain an unaccompanied singing synthesis model (S104); inputting a speaker record to be synthesized into the speaker recognition model to obtain speaker information output by the intermediate hidden layer of the speaker recognition model (S106); and inputting unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain a synthesized custom tone and vocal (S108).
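Step S106 takes the speaker information from an intermediate hidden layer of the speaker recognition model rather than from its final output. The following is a minimal sketch of that idea only, assuming a toy two-layer feed-forward network. The class `TinySpeakerNet`, its weights, the layer sizes, and the `tanh` activation are all hypothetical; the abstract does not specify the architecture.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class TinySpeakerNet:
    """Hypothetical stand-in for a trained speaker recognition model."""

    def __init__(self, w_hidden: Matrix, w_out: Matrix) -> None:
        self.w_hidden = w_hidden
        self.w_out = w_out

    def forward(self, features: Vector) -> Tuple[Vector, Vector]:
        """Return (hidden activations, final output) for one input."""
        hidden = [math.tanh(v) for v in matvec(self.w_hidden, features)]
        output = matvec(self.w_out, hidden)
        return hidden, output

    def speaker_information(self, features: Vector) -> Vector:
        """Keep the hidden-layer activations and discard the final output."""
        hidden, _ = self.forward(features)
        return hidden


# Made-up weights: 3 input features -> 2 hidden units -> 1 output.
net = TinySpeakerNet(
    w_hidden=[[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]],
    w_out=[[1.0, -1.0]],
)
speaker_info = net.speaker_information([0.2, 0.4, 0.6])
```

In this sketch `speaker_info` is the two-element hidden vector; it is what would be passed, together with the music information, to the synthesis model in S108.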
Opening claim text (preview).
What is claimed is:
1. A method for synthesizing a customized timbre vocal, wherein the method comprises: training a first neural network by means of a speaker record sample to obtain a speaker recognition model, a training result output by the first neural network being a speaker vector sample; training a second neural network by means of an unaccompanied singing vocal sample and the speaker vector sample to obtain an unaccompanied singing synthesis model; inputting a speaker record to be synthesized into the speaker recognition model, and acquiring speaker information output by an intermediate hidden layer of the speaker recognition model; and inputting unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain a synthesized customized timbre vocal.
2. The method for synthesizing a customized timbre vocal according to claim 1, wherein the training the first neural network by means of the speaker record sample to obtain the speaker recognition model, comprises: dividing the speaker record sample into a test record sample and a registered record sample, and inputting the test record sample and the registered record sample into the first neural network; outputting a registered record feature through the first neural network based on the registered record sample, and performing a mean-pooling on the registered record feature to obtain a registered record vector; outputting a test record vector through the first neural network based on the test record sample; performing a cosine similarity calculation on the registered record vector and the test record vector to obtain a cosine similarity result; performing a parameter optimization on the first neural network through the cosine similarity result and a regression function until a loss value of the regression function is minimum; and determining the first neural network after the parameter optimization as the speaker recognition model.
3. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the training the second neural network by means of the unaccompanied singing vocal sample and the speaker vector sample to obtain the unaccompanied singing synthesis model, comprises: analyzing a music score sample, a lyric sample and a phoneme duration sample in the unaccompanied singing vocal sample; and training the duration model by means of the speaker vector sample, the music score sample, the lyrics sample and the phoneme duration sample, an output result of the duration model being a duration prediction sample.
4. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the training the second neural network by means of the unaccompanied singing vocal sample and the speaker vector sample to obtain the unaccompanied singing synthesis model, comprises: analyzing a music score sample, a lyric sample and a phoneme duration sample in the unaccompanied singing vocal sample; extracting a Mel spectrogram sample according to a song in the unaccompanied singing vocal sample; and training the acoustic model by means of the speaker vector sample, the phoneme duration sample, the music score sample, the lyrics sample and the Mel spectrogram sample, an output result of the acoustic model being a Mel spectrogram prediction sample.
5. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the training the second neural network by means of the unaccompanied singing vocal sample and the speaker vector sample to obtain the unaccompanied singing synthesis model, comprises: extracting a Mel spectrogram sample according to a song in the unaccompanied singing vocal sample; and training the vocoder model by means of the Mel spectrogram sample, an output result of the vocoder model being an audio prediction sample.
6. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the inputting the unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain the synthesized customized timbre vocal, comprises: analyzing a music score to be synthesized and a lyric to be synthesized in the unaccompanied singing music information; inputting the speaker information, the music score to be synthesized and the lyric to be synthesized into the duration model, an output result of the duration model being a duration prediction result to be synthesized; inputting the duration prediction result, the speaker information, the music score to be synthesized and the lyric to be synthesized into the acoustic model, an output result of the acoustic model being a Mel spectrogram prediction result to be synthesized; and inputting the Mel spectrogram prediction result into the vocoder model, an output result of the vocoder model being the synthesized customized timbre vocal.
7. The method for synthesizing a customized timbre vocal according to claim 6, wherein the analyzing the music score to be synthesized and the lyric to be synthesized in the unaccompanied singing music information, comprises: performing a text analysis and a feature extraction on a music score and a lyric in the unaccompanied singing music information to acquire the music score to be synthesized and the lyric to be synthesized.
8. The method for synthesizing a customized timbre vocal according to claim 6, wherein the inputting the duration prediction result, the speaker information, the music score to be synthesized and the lyric to be synthesized into the acoustic model, the output result of the acoustic model being a Mel spectrogram prediction result to be synthesized, comprises: performing a frame-level extension on the duration prediction result, the music score to be synthesized and the lyric to be synthesized; and inputting a result of the frame-level extension and the speaker information into the acoustic model, the output result of the acoustic model being the Mel spectrogram prediction result to be synthesized.
9. An electronic device, comprising: a processor; and a memory, configured to store executable instructions of the processor; wherein by executing the executable instructions, the processor is configured to: train a first neural network by means of a speaker record sample to obtain a speaker recognition model, a training result output by the first neural network being a speaker vector sample; train a second neural network by means of an unaccompanied singing vocal sample and the speaker vector sample to obtain an unaccompanied singing synthesis model; input a speaker record to be synthesized into the speaker recognition model, and acquire speaker information output by an intermediate hidden layer of the speaker recognition model; and input unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain a synthesized customized timbre vocal.
10. The electronic device according to claim 9, wherein the processor is further configured to: divide the speaker record sample into a test record sample and a registered record sample, and input the test record sample and the registered record sample into the first n
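Two of the claimed steps are concrete enough to illustrate numerically: the mean-pooling and cosine similarity calculation of claim 2, and the frame-level extension of claim 8. The sketch below is illustrative only and is not taken from the patent. The function names are hypothetical, the vectors are made-up toy data, and reading "frame-level extension" as repeating each phoneme-level item for its predicted number of frames is an assumption based on common practice in singing and speech synthesis.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mean_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Average a set of per-utterance feature vectors into one vector."""
    if not features:
        raise ValueError("need at least one feature vector")
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def frame_level_extension(items: Sequence[T], frames: Sequence[int]) -> List[T]:
    """Repeat each phoneme-level item for its predicted frame count.

    Assumed interpretation of claim 8; the claim does not define the step.
    """
    if len(items) != len(frames):
        raise ValueError("items and frames must have the same length")
    extended: List[T] = []
    for item, n_frames in zip(items, frames):
        extended.extend([item] * n_frames)
    return extended


# Toy registered-record features (two utterances) and one test-record vector.
registered_vector = mean_pool([[1.0, 0.0], [0.0, 1.0]])
score = cosine_similarity(registered_vector, [1.0, 1.0])

# Two phonemes with predicted durations of 2 and 3 frames.
frames = frame_level_extension(["n", "i"], [2, 3])
```

Here `registered_vector` is `[0.5, 0.5]`, which points in the same direction as the test vector, so `score` is 1.0 up to floating-point rounding, and `frames` holds five entries. In training as claimed, a score of this kind would feed the regression function whose loss is minimized; that optimization loop is omitted from the sketch.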
using artificial neural networks · CPC title
Training · CPC title
Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title
Voice editing, e.g. manipulating the voice of the synthesiser · CPC title
Engine management systems · CPC title