Custom tone and vocal synthesis method and apparatus, electronic device, and storage medium

US12424197B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12424197-B2
Application number  US-202118252186-A
Country             US
Kind code           B2
Filing date         Dec 23, 2021
Priority date       Jan 20, 2021
Publication date    Sep 23, 2025
Grant date          Sep 23, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A custom tone and vocal synthesis method and apparatus, an electronic device, and a storage medium. The synthesis method comprises: training a first neural network by means of a speaker record sample to obtain a speaker recognition model, the output training result of the first neural network being a speaker vector sample (S102); training a second neural network by means of an unaccompanied vocal singing sample and the speaker vector sample to obtain an unaccompanied singing synthesis model (S104); inputting a speaker record to be synthesized into the speaker recognition model to obtain speaker information output by the intermediate hidden layer of the speaker recognition model (S106); and inputting unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain a synthesized custom tone and vocal (S108).
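
Step S106 takes the speaker information from an intermediate hidden layer of the recognition model rather than from its final output. The sketch below illustrates that idea with a toy two-layer network in plain Python. The layer sizes, the tanh activation, the weight values, and the names `forward` and `speaker_information` are all assumptions made for illustration; the patent does not specify them.

```python
import math

def forward(features, w_hidden, w_out):
    """Run a toy two-layer network and return (hidden, output).

    Hypothetical stand-in for a trained speaker recognition model.
    """
    # Hidden layer: weighted sum of the input features, then tanh.
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features)))
              for row in w_hidden]
    # Output layer: linear combination of the hidden activations.
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return hidden, output

def speaker_information(features, w_hidden, w_out):
    """Return the hidden-layer activations and discard the final output."""
    hidden, _ = forward(features, w_hidden, w_out)
    return hidden

# Illustrative fixed weights; a real model would learn these in training.
W_HIDDEN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_OUT = [[1.0, -1.0]]

record_features = [0.2, 0.4, 0.6]  # made-up features of one speaker record
info = speaker_information(record_features, W_HIDDEN, W_OUT)
print(info)
```

In this sketch the returned list has one value per hidden unit and would be passed to the synthesis model alongside the music information.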

First claim

Opening claim text (preview).

What is claimed is:

1. A method for synthesizing a customized timbre vocal, wherein the method comprises: training a first neural network by means of a speaker record sample to obtain a speaker recognition model, a training result output by the first neural network being a speaker vector sample; training a second neural network by means of an unaccompanied singing vocal sample and the speaker vector sample to obtain an unaccompanied singing synthesis model; inputting a speaker record to be synthesized into the speaker recognition model, and acquiring speaker information output by an intermediate hidden layer of the speaker recognition model; and inputting unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain a synthesized customized timbre vocal.

2. The method for synthesizing a customized timbre vocal according to claim 1, wherein the training the first neural network by means of the speaker record sample to obtain the speaker recognition model, comprises: dividing the speaker record sample into a test record sample and a registered record sample, and inputting the test record sample and the registered record sample into the first neural network; outputting a registered record feature through the first neural network based on the registered record sample, and performing a mean-pooling on the registered record feature to obtain a registered record vector; outputting a test record vector through the first neural network based on the test record sample; performing a cosine similarity calculation on the registered record vector and the test record vector to obtain a cosine similarity result; performing a parameter optimization on the first neural network through the cosine similarity result and a regression function until a loss value of the regression function is minimum; and determining the first neural network after the parameter optimization as the speaker recognition model.

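
The mean-pooling and cosine similarity steps recited in claim 2 can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the feature values and the names `mean_pool` and `cosine_similarity` are assumptions, and the regression function and parameter optimization are left out because the claim does not define them.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up network outputs for three registered records of one speaker.
registered_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
registered_vector = mean_pool(registered_features)

# Made-up network output for one test record.
test_vector = [2.0, 2.0]

score = cosine_similarity(registered_vector, test_vector)
print(registered_vector, score)
```

A score near 1 means the test vector points in the same direction as the pooled registered vector; training would adjust the network so that same-speaker pairs score high and different-speaker pairs score low.
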
3. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the training the second neural network by means of the unaccompanied singing vocal sample and the speaker vector sample to obtain the unaccompanied singing synthesis model, comprises: analyzing a music score sample, a lyric sample and a phoneme duration sample in the unaccompanied singing vocal sample; and training the duration model by means of the speaker vector sample, the music score sample, the lyrics sample and the phoneme duration sample, an output result of the duration model being a duration prediction sample.

4. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the training the second neural network by means of the unaccompanied singing vocal sample and the speaker vector sample to obtain the unaccompanied singing synthesis model, comprises: analyzing a music score sample, a lyric sample and a phoneme duration sample in the unaccompanied singing vocal sample; extracting a Mel spectrogram sample according to a song in the unaccompanied singing vocal sample; and training the acoustic model by means of the speaker vector sample, the phoneme duration sample, the music score sample, the lyrics sample and the Mel spectrogram sample, an output result of the acoustic model being a Mel spectrogram prediction sample.

5. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the training the second neural network by means of the unaccompanied singing vocal sample and the speaker vector sample to obtain the unaccompanied singing synthesis model, comprises: extracting a Mel spectrogram sample according to a song in the unaccompanied singing vocal sample; and training the vocoder model by means of the Mel spectrogram sample, an output result of the vocoder model being an audio prediction sample.

6. The method for synthesizing a customized timbre vocal according to claim 1, wherein the unaccompanied singing synthesis model comprises a duration model, an acoustic model and a vocoder model, and the inputting the unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain the synthesized customized timbre vocal, comprises: analyzing a music score to be synthesized and a lyric to be synthesized in the unaccompanied singing music information; inputting the speaker information, the music score to be synthesized and the lyric to be synthesized into the duration model, an output result of the duration model being a duration prediction result to be synthesized; inputting the duration prediction result, the speaker information, the music score to be synthesized and the lyric to be synthesized into the acoustic model, an output result of the acoustic model being a Mel spectrogram prediction result to be synthesized; and inputting the Mel spectrogram prediction result into the vocoder model, an output result of the vocoder model being the synthesized customized timbre vocal.

7. The method for synthesizing a customized timbre vocal according to claim 6, wherein the analyzing the music score to be synthesized and the lyric to be synthesized in the unaccompanied singing music information, comprises: performing a text analysis and a feature extraction on a music score and a lyric in the unaccompanied singing music information to acquire the music score to be synthesized and the lyric to be synthesized.

8. The method for synthesizing a customized timbre vocal according to claim 6, wherein the inputting the duration prediction result, the speaker information, the music score to be synthesized and the lyric to be synthesized into the acoustic model, the output result of the acoustic model being a Mel spectrogram prediction result to be synthesized, comprises: performing a frame-level extension on the duration prediction result, the music score to be synthesized and the lyric to be synthesized; and inputting a result of the frame-level extension and the speaker information into the acoustic model, the output result of the acoustic model being the Mel spectrogram prediction result to be synthesized.

9. An electronic device, comprising: a processor; and a memory, configured to store executable instructions of the processor; wherein by executing the executable instructions, the processor is configured to: train a first neural network by means of a speaker record sample to obtain a speaker recognition model, a training result output by the first neural network being a speaker vector sample; train a second neural network by means of an unaccompanied singing vocal sample and the speaker vector sample to obtain an unaccompanied singing synthesis model; input a speaker record to be synthesized into the speaker recognition model, and acquire speaker information output by an intermediate hidden layer of the speaker recognition model; and input unaccompanied singing music information to be synthesized and the speaker information into the unaccompanied singing synthesis model to obtain a synthesized customized timbre vocal.

10. The electronic device according to claim 9, wherein the processor is further configured to: divide the speaker record sample into a test record sample and a registered record sample, and input the test record sample and the registered record sample into the first n
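
The frame-level extension named in claim 8 is not detailed in this preview. One common reading is that each phoneme's score and lyric features are repeated once for every frame the duration model predicts for it. The sketch below works under that assumption; the function name `frame_level_extension`, the use of MIDI note numbers for pitch, and the integer frame counts are all hypothetical.

```python
def frame_level_extension(phonemes, pitches, frame_counts):
    """Repeat each phoneme-level entry once per predicted frame.

    Returns a list of (phoneme, pitch, position_in_phoneme) tuples,
    one per output frame.
    """
    if not (len(phonemes) == len(pitches) == len(frame_counts)):
        raise ValueError("inputs must have one entry per phoneme")
    frames = []
    for phoneme, pitch, count in zip(phonemes, pitches, frame_counts):
        for position in range(count):
            frames.append((phoneme, pitch, position))
    return frames

# Made-up example: two phonemes lasting 2 and 3 frames.
frames = frame_level_extension(["l", "a"], [60, 62], [2, 3])
for frame in frames:
    print(frame)
```

The total number of frames equals the sum of the predicted durations, so the extended sequence lines up frame by frame with the Mel spectrogram that the acoustic model is expected to produce.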

Assignees

  • Beijing Wodong Tianjun Information Technology Co Ltd

  • Beijing Jingdong Century Trading Co Ltd

Inventors

Classifications

  • using artificial neural networks · CPC title

  • Training · CPC title

  • Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination · CPC title

  • Voice editing, e.g. manipulating the voice of the synthesiser · CPC title

  • Engine management systems · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12424197B2 cover?
A custom tone and vocal synthesis method and apparatus, an electronic device, and a storage medium. The synthesis method comprises: training a first neural network by means of a speaker record sample to obtain a speaker recognition model, the output training result of the first neural network being a speaker vector sample (S102); training a second neural network by means of an unaccompanied v…
Who is the assignee on this patent?
Beijing Wodong Tianjun Information Technology Co Ltd, Beijing Jingdong Century Trading Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L13/027. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 23, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).