Method, apparatus and system for hybrid speech synthesis
US-2022059107-A1 · Feb 24, 2022 · US
US12223972B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12223972-B2 |
| Application number | US-202217714485-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 6, 2022 |
| Priority date | May 15, 2020 |
| Publication date | Feb 11, 2025 |
| Grant date | Feb 11, 2025 |
Official abstract text for this publication.
A voice processing method includes: determining a historical voice frame corresponding to a target voice frame; acquiring a frequency-domain characteristic of the historical voice frame and a time-domain parameter of the historical voice frame; obtaining a parameter set of the target voice frame according to a correlation between the frequency-domain characteristic of the historical voice frame and the time-domain parameter of the historical voice frame, the parameter set including at least two parameters; and reconstructing the target voice frame according to the parameter set.
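As a rough illustration of the "frequency-domain characteristic" step, the sketch below computes an amplitude spectrum for each historical frame, which is the form claim 8 names (an STFT followed by amplitude extraction). The function names `amplitude_spectrum` and `history_features`, the Hann window, and the frame length are assumptions made for this example, not details taken from the patent. The code uses only the standard library so that it runs without extra dependencies.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Return the magnitude of the windowed DFT of one frame (one STFT column)."""
    n = len(frame)
    # Hann window: an assumption, the source does not name a window function.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(abs(coeff))  # keep the amplitude, discard the phase
    return spectrum


def history_features(frames):
    """Amplitude spectra for historical frames n-t .. n-1, oldest first."""
    return [amplitude_spectrum(frame) for frame in frames]


# Hypothetical 64-sample frame holding a pure tone centred on bin 8.
tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
features = history_features([tone, tone])
```

In this sketch each historical frame yields one row of features. How the rows are fed to the first neural network is not specified here and would depend on the model actually used.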
Opening claim text (preview).
What is claimed is:

1. A voice processing method, comprising: determining a historical voice frame corresponding to a target voice frame; acquiring a frequency-domain characteristic of the historical voice frame and a time-domain parameter of the historical voice frame; obtaining a parameter set of the target voice frame according to a correlation between the frequency-domain characteristic of the historical voice frame and the time-domain parameter of the historical voice frame, the parameter set including at least two parameters; and reconstructing the target voice frame according to the parameter set, wherein the network model includes a first neural network and at least two second neural networks, the first neural network is in a cascade relationship with each of the second neural networks, and one of the second neural networks corresponding to a parameter in the parameter set, and obtaining the parameter set of the target voice frame comprises: invoking the first neural network based on the frequency-domain characteristic of the historical voice frame, to obtain a virtual frequency-domain characteristic of the target voice frame; and inputting the virtual frequency-domain characteristic of the target voice frame and the time-domain parameter of the historical voice frame to the at least two second neural networks respectively as input information, to obtain the at least two parameters in the parameter set of the target voice frame.

2. The method according to claim 1, wherein reconstructing the target voice frame comprises: establishing a reconstruction filter according to the parameter set; acquiring an excitation signal of the target voice frame; and filtering the excitation signal of the target voice frame by using the reconstruction filter to obtain the target voice frame.

3. The method according to claim 2, wherein acquiring the excitation signal of the target voice frame comprises: acquiring an excitation signal of the historical voice frame; and determining the excitation signal of the target voice frame according to the excitation signal of the historical voice frame.

4. The method according to claim 3, wherein the target voice frame is an nth voice frame in a voice signal transmitted by a Voice over Internet Protocol (VoIP) system, and the historical voice frame includes an (n−t)th voice frame to an (n−1)th voice frame in the voice signal transmitted by the VoIP system, n and t being both positive integers.

5. The method according to claim 4, wherein the excitation signal of the historical voice frame includes an excitation signal of the (n−1)th voice frame; and determining the excitation signal of the target voice frame comprises: determining the excitation signal of the (n−1)th voice frame as the excitation signal of the target voice frame.

6. The method according to claim 4, wherein the excitation signal of the historical voice frame includes excitation signals of voice frames from the (n−t)th voice frame to the (n−1)th voice frame; and determining the excitation signal of the target voice frame comprises: averaging the excitation signals of the (n−t)th voice frame to the (n−1)th voice frame to obtain the excitation signal of the target voice frame.

7. The method according to claim 4, wherein the excitation signal of the historical voice frame includes excitation signals of voice frames from the (n−t)th voice frame to the (n−1)th voice frame; and determining the excitation signal of the target voice frame comprises: performing weighted summation on the excitation signals of the (n−t)th voice frame to the (n−1)th voice frame to obtain the excitation signal of the target voice frame.

8. The method according to claim 1, wherein acquiring the frequency-domain characteristic comprises: performing short-term Fourier transform (STFT) on the historical voice frame to obtain a frequency-domain coefficient corresponding to the historical voice frame; and extracting an amplitude spectrum from the frequency-domain coefficient corresponding to the historical voice frame as the frequency-domain characteristic of the historical voice frame.

9. The method according to claim 2, wherein in response to determining that the target voice frame is an unvoiced frame, the time-domain parameter of the historical voice frame includes a short-term correlation parameter of the historical voice frame, and the parameter set includes a short-term correlation parameter of the target voice frame; and the reconstruction filter includes a linear predictive coding (LPC) filter; and the target voice frame includes k daughter frames, and the short-term correlation parameter of the target voice frame includes a line spectral frequency (LSF) and an interpolation factor of a kth daughter frame of the target voice frame, k being an integer greater than 1.

10. The method according to claim 2, wherein in response to determining that the target voice frame is a voiced frame, the time-domain parameter of the historical voice frame includes a short-term correlation parameter and a long-term correlation parameter of the historical voice frame, and the parameter set includes a short-term correlation parameter of the target voice frame and a long-term correlation parameter of the target voice frame; the reconstruction filter includes a long-term prediction (LTP) filter and an LPC filter; the target voice frame includes k daughter frames, and the short-term correlation parameter of the target voice frame includes an LSF and an interpolation factor of a kth daughter frame of the target voice frame, k being an integer greater than 1; and the target voice frame includes m subframes, and the long-term correlation parameter of the target voice frame includes a pitch lag and an LTP coefficient of each subframe of the target voice frame, m being a positive integer.

11. The method according to claim 1, wherein the network model further includes a third neural network, a network formed by the first neural network, each of the second neural networks, and the third neural network being a parallel network; and the time-domain parameter of the historical voice frame includes an energy parameter of the historical voice frame; and obtaining the parameter set of the target voice frame comprises: inputting the virtual frequency-domain characteristic of the target voice frame and the energy parameter of the historical voice frame to the at least two second neural networks respectively as input information, to obtain the at least two parameters of the target voice frame; invoking the third neural network based on the energy parameter of the historical voice frame, to obtain an energy parameter of the target voice frame; and forming the parameter set of the target voice frame by the at least two parameters of the target voice frame and the energy parameter of the target voice frame, the target voice frame including m subframes, and the energy parameter of the target voice frame including a gain value of each of the subframes of the target voice frame.

12. A voice processing apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform: determining a historical voice frame corresponding to a target voice frame; acquiring a frequency-domain characteristic of the historical voice frame and a time-domain parameter of the historical voice frame; obtaining a parameter set of the target voice frame according to a correlation between the frequency-domain characteristic of the historical voice frame and the time-domain parameter of the historical voic
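
The reconstruction path in claims 2 and 5 to 7 can be sketched as follows: derive an excitation signal for the target frame from historical excitations (reuse the last one, average them, or take a weighted sum), then pass it through an all-pole LPC synthesis filter. This is a minimal sketch under several assumptions. All function names are hypothetical. The LPC coefficients are taken as already available, so the LSF-to-LPC conversion and the sub-frame interpolation mentioned in claims 9 and 10 are not shown. The LTP filter used for voiced frames is omitted. The filter sign convention A(z) = 1 + Σ a[j] z^(−j) is one common choice, not something the source states.

```python
def last_frame_excitation(history):
    """Claim 5 variant: reuse the excitation of frame n-1 (the newest one)."""
    return list(history[-1])


def averaged_excitation(history):
    """Claim 6 variant: sample-wise mean over frames n-t .. n-1."""
    t = len(history)
    return [sum(samples) / t for samples in zip(*history)]


def weighted_excitation(history, weights):
    """Claim 7 variant: sample-wise weighted sum over frames n-t .. n-1."""
    if len(weights) != len(history):
        raise ValueError("one weight per historical frame is required")
    return [
        sum(w * s for w, s in zip(weights, samples))
        for samples in zip(*history)
    ]


def lpc_synthesis(excitation, lpc_coeffs, state=None):
    """All-pole filter: y[i] = e[i] - sum_j a[j] * y[i-j], for j = 1..p.

    lpc_coeffs holds a[1..p]; state holds earlier outputs, newest first.
    """
    p = len(lpc_coeffs)
    past = list(state) if state is not None else [0.0] * p
    out = []
    for e in excitation:
        y = e - sum(a * y_prev for a, y_prev in zip(lpc_coeffs, past))
        out.append(y)
        past = [y] + past[:-1]  # shift the filter memory by one sample
    return out


# Hypothetical data: two historical frames of two samples each, oldest first.
history = [[1.0, 2.0], [3.0, 4.0]]
excitation = weighted_excitation(history, weights=[0.25, 0.75])
frame = lpc_synthesis(excitation, lpc_coeffs=[-0.5])
```

The weights in the usage lines are illustrative values only. The source does not say how weights are chosen, and a real decoder would also carry the filter state across frames through the `state` argument.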
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Feedforward networks · CPC title
Supervised learning · CPC title
Combinations of networks · CPC title
Responding to QoS · CPC title