Voice synthesis method, model training method, device and computer device

US12014720B2 · US · B2

Patent metadata
Publication number: US-12014720-B2
Application number: US-202016999989-A
Country: US
Kind code: B2
Filing date: Aug 21, 2020
Priority date: Jul 25, 2018
Publication date: Jun 18, 2024
Grant date: Jun 18, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This application relates to a speech synthesis method and apparatus, a model training method and apparatus, and a computer device. The method includes: obtaining to-be-processed linguistic data; encoding the linguistic data, to obtain encoded linguistic data; obtaining an embedded vector for speech feature conversion, the embedded vector being generated according to a residual between synthesized reference speech data and reference speech data that correspond to the same reference linguistic data; and decoding the encoded linguistic data according to the embedded vector, to obtain target synthesized speech data on which the speech feature conversion is performed. The solution provided in this application can prevent quality of a synthesized speech from being affected by a semantic feature in a mel-frequency cepstrum.
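The data flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names are hypothetical, speech and encoded text are stood in for by small lists of frames, the averaging in `embed` is only a placeholder for the trained residual model, and the splicing step follows the wording of claim 8.

```python
from typing import List

Frames = List[List[float]]  # time steps x feature dimensions


def residual(target: Frames, synthesized: Frames) -> Frames:
    """Frame-wise difference between recorded and synthesized reference speech."""
    # Assumption: both utterances are already aligned to the same frame count.
    if len(target) != len(synthesized):
        raise ValueError("both utterances must have the same number of frames")
    return [[t - s for t, s in zip(tf, sf)] for tf, sf in zip(target, synthesized)]


def embed(res: Frames) -> List[float]:
    """Placeholder for the residual model: average the residual over time."""
    n = len(res)
    return [sum(column) / n for column in zip(*res)]


def splice(encoded: Frames, embedding: List[float]) -> Frames:
    """Append the same embedded vector to every encoded frame (claim 8)."""
    return [frame + embedding for frame in encoded]


# Reference utterance: recorded speech and speech synthesized from the same text.
target_ref = [[1.0, 2.0], [3.0, 4.0]]
synth_ref = [[0.5, 2.0], [2.5, 3.0]]
# Encoded form of a different text that should be spoken in the reference style.
encoded_text = [[0.1], [0.2], [0.3]]

style = embed(residual(target_ref, synth_ref))
decoder_input = splice(encoded_text, style)
print(style)          # [0.5, 0.5]
print(decoder_input)  # [[0.1, 0.5, 0.5], [0.2, 0.5, 0.5], [0.3, 0.5, 0.5]]
```

The embedded vector is computed once from the reference pair and reused for every frame of the new text, which is why the reference text and the text being synthesized can differ in length.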

First claim

Opening claim text (preview).

What is claimed is:

1. A speech synthesis method performed at a computer device having one or more processors and memory storing one or more programs to be executed by the one or more processors, the method comprising: obtaining linguistic data; encoding the linguistic data, to obtain encoded linguistic data; obtaining reference linguistic data and corresponding target reference speech data; encoding the reference linguistic data, to obtain encoded reference linguistic data; decoding the encoded reference linguistic data, to obtain synthesized reference speech data; determining a residual between the target reference speech data and the synthesized reference speech data; obtaining an embedded vector for speech feature conversion, the embedded vector representing a speaking style feature of a target user and being generated according to the residual between the synthesized reference speech data synthesized from the reference linguistic data different from the linguistic data and the target reference speech data that correspond to the same reference linguistic data; and decoding the encoded linguistic data by performing the speech feature conversion on the encoded linguistic data according to the embedded vector, to obtain target synthesized speech data corresponding to the linguistic data.

2. The method according to claim 1, further comprising: processing the residual by using a residual model; and generating the embedded vector for speech feature conversion according to a result of a forward operation and a result of a backward operation of the residual model.

3. The method according to claim 2, wherein the generating the embedded vector for speech feature conversion according to a result of a forward operation and a result of a backward operation of the residual model comprises: obtaining a first vector outputted in the last time step during the forward operation performed by a forward gated recurrent unit (GRU) layer of the residual model; obtaining a second vector outputted in the first time step during the backward operation performed by a backward GRU layer of the residual model; and superposing the first vector and the second vector, to obtain the embedded vector for speech feature conversion.

4. The method according to claim 2, wherein the processing the residual by using a residual model comprises: processing the residual by using a dense layer, a forward gated recurrent unit (GRU) layer, and a backward GRU layer of the residual model.

5. The method according to claim 2, wherein the encoded linguistic data is encoded by using a first encoder, the target synthesized speech data is decoded by using a first decoder, the encoded reference linguistic data is encoded by using a second encoder, the synthesized reference speech data is decoded by using a second decoder, and the embedded vector is obtained by using the residual model.

6. The method according to claim 5, further comprising: obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data by using the second encoder, to obtain second encoded training linguistic data; decoding the second encoded training linguistic data by using the second decoder, to obtain synthesized training speech data; generating a training embedded vector according to a residual between the synthesized training speech data and the training speech data by using the residual model; decoding first encoded training linguistic data according to the training embedded vector, to obtain predicted target synthesized speech data on which the speech feature conversion is performed; and adjusting the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to a difference between the predicted target synthesized speech data and the training speech data, and continuing to perform training until a training stop condition is satisfied.

7. The method according to claim 1, wherein the encoded linguistic data is encoded by using a first encoder, the target synthesized speech data is decoded by using a first decoder, and the method further comprises: obtaining training linguistic data and corresponding target training speech data; encoding the training linguistic data by using the first encoder, to obtain first encoded training linguistic data; obtaining a training embedded vector for speech feature conversion, the training embedded vector being generated according to a residual between training speech data synthesized from the training linguistic data and the target training speech data that correspond to the same training linguistic data; decoding the first encoded training linguistic data according to the training embedded vector by using the first decoder, to obtain predicted target synthesized speech data on which the speech feature conversion is performed; and adjusting the first encoder and the first decoder according to a difference between the predicted target synthesized speech data and the target training speech data until a training stop condition is satisfied.

8. The method according to claim 1, wherein the decoding the encoded linguistic data by performing the speech feature conversion on the encoded linguistic data according to the embedded vector, to obtain target synthesized speech data comprises: splicing the encoded linguistic data and the embedded vector to obtain a spliced vector; and decoding the spliced vector to obtain the target synthesized speech data on which the speech feature conversion is performed.

9. The method according to claim 1, further comprising: determining a speech amplitude spectrum corresponding to the target synthesized speech data; converting the speech amplitude spectrum into a speech waveform signal in a time domain; and generating a speech according to the speech waveform signal.

10. A computer device, comprising a memory and a processor, the memory storing a plurality of computer programs, the computer programs, when executed by the processor, causing the computer device to perform a plurality of operations including: obtaining linguistic data; encoding the linguistic data, to obtain encoded linguistic data; obtaining reference linguistic data and corresponding target reference speech data; encoding the reference linguistic data, to obtain encoded reference linguistic data; decoding the encoded reference linguistic data, to obtain synthesized reference speech data; determining a residual between the target reference speech data and the synthesized reference speech data; obtaining an embedded vector for speech feature conversion, the embedded vector representing a speaking style feature of a target user and being generated according to the residual between the synthesized reference speech data synthesized from reference linguistic data different from the linguistic data and the target reference speech data that correspond to the same reference linguistic data; and decoding the encoded linguistic data by performing the speech feature conversion on the encoded linguistic data according to the embedded vector, to obtain target synthesized speech data corresponding to the linguistic data.

11. The computer device according to claim 10, wherein the plurality of operations further comprise: processing the residual by using a residual model; and generating the embedded vector for speech feature conversion according to a result of a forward operation and a result of a backward operation of the residual model.

12. The computer device according to claim 11, wherein the generating the embedded vector for speech feature conversion according to a result of a forwar
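Claims 2 to 4 describe a residual model made of a dense layer, a forward GRU layer, and a backward GRU layer, with the embedded vector obtained by superposing the forward output at the last time step and the backward output at the first time step. The sketch below illustrates that structure under several assumptions that are not in the claims: weights are random and untrained, biases are omitted, the dense layer uses a tanh activation, "superposing" is read as element-wise addition, and all names and dimensions are hypothetical.

```python
import math
import random
from typing import List

Vec = List[float]
Mat = List[List[float]]


def matvec(m: Mat, v: Vec) -> Vec:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Bias-free GRU cell with random, untrained weights (illustration only)."""

    def __init__(self, in_dim: int, hid_dim: int, rng: random.Random) -> None:
        def mat(rows: int, cols: int) -> Mat:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hid_dim = hid_dim
        # Update gate, reset gate, and candidate-state weights.
        self.wz, self.wr, self.wh = (mat(hid_dim, in_dim + hid_dim) for _ in range(3))

    def step(self, x: Vec, h: Vec) -> Vec:
        z = [sigmoid(a) for a in matvec(self.wz, x + h)]
        r = [sigmoid(a) for a in matvec(self.wr, x + h)]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(self.wh, x + gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, seq: List[Vec]) -> List[Vec]:
        """Return the hidden state after each step, in processing order."""
        h = [0.0] * self.hid_dim
        outputs = []
        for x in seq:
            h = self.step(x, h)
            outputs.append(h)
        return outputs


def residual_embedding(res: List[Vec], dense: Mat, fwd: GRUCell, bwd: GRUCell) -> Vec:
    projected = [[math.tanh(a) for a in matvec(dense, frame)] for frame in res]
    forward_out = fwd.run(projected)
    # Run on the reversed sequence, then flip back to the original time order.
    backward_out = bwd.run(projected[::-1])[::-1]
    first = forward_out[-1]    # forward layer, last time step
    second = backward_out[0]   # backward layer, first time step
    return [a + b for a, b in zip(first, second)]


rng = random.Random(0)
feat_dim, dense_dim, emb_dim = 4, 3, 2
dense = [[rng.uniform(-0.5, 0.5) for _ in range(feat_dim)] for _ in range(dense_dim)]
fwd, bwd = GRUCell(dense_dim, emb_dim, rng), GRUCell(dense_dim, emb_dim, rng)
residual = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(5)]

embedding = residual_embedding(residual, dense, fwd, bwd)
print(len(embedding))  # 2
```

Both selected vectors are the states reached after their layer has read the entire residual sequence, so the result has a fixed size regardless of how many frames the reference utterance contains.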

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Supervised learning

  • Auto-encoder networks; Encoder-decoder networks

  • Convolutional networks [CNN, ConvNet]

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • using spectral analysis, e.g. transform vocoders or subband vocoders

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12014720B2 cover?
This application relates to a speech synthesis method and apparatus, a model training method and apparatus, and a computer device. The method includes: obtaining to-be-processed linguistic data; encoding the linguistic data, to obtain encoded linguistic data; obtaining an embedded vector for speech feature conversion, the embedded vector being generated according to a residual between synthesiz…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L13/04. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 18, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).