Apparatus, method or computer program for generating a bandwidth-enhanced audio signal using a neural network processor
US-2020243102-A1 · Jul 30, 2020 · US
US2022215848A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022215848-A1 |
| Application number | US-202217703713-A |
| Country | US |
| Kind code | A1 |
| Filing date | Mar 24, 2022 |
| Priority date | May 15, 2020 |
| Publication date | Jul 7, 2022 |
| Grant date | — |
Official abstract text for this publication.
A voice processing method includes: determining a historical voice frame corresponding to a target voice frame; determining a frequency-domain characteristic of the historical voice frame; invoking a network model to predict the frequency-domain characteristic of the historical voice frame, to obtain a parameter set of the target voice frame, the parameter set including a plurality of types of parameters, the network model including a plurality of neural networks (NNs), and a number of the types of the parameters in the parameter set being determined according to a number of the NNs; and reconstructing the target voice frame according to the parameter set.
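The "frequency-domain characteristic" step can be illustrated with a short sketch. Claims 2 and 3 name a short-term Fourier transform (STFT) and either the raw STFT coefficients or an amplitude coefficient sequence as the feature. The function name `stft_amplitude`, the Hann window, the 320-sample window, the 160-sample hop, and the `keep_bins` option are assumptions made for illustration; the publication text shown here does not specify them.

```python
import numpy as np

def stft_amplitude(signal, win_len=320, hop=160, keep_bins=None):
    """Return (STFT coefficients, amplitude coefficient sequence).

    win_len, hop and the Hann window are illustrative assumptions.
    keep_bins optionally keeps only the first N bins of each spectrum,
    mirroring "at least some of the STFT coefficients" in claim 3.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hanning(win_len)
    n_sets = 1 + (len(signal) - win_len) // hop
    # One set of STFT coefficients per windowed segment.
    coeffs = np.stack([
        np.fft.rfft(window * signal[i * hop:i * hop + win_len])
        for i in range(n_sets)
    ])
    amplitude = np.abs(coeffs)
    if keep_bins is not None:
        amplitude = amplitude[:, :keep_bins]
    # Flatten the per-set amplitude spectra into a single sequence.
    return coeffs, amplitude.reshape(-1)

# Usage: 640 samples of a 500 Hz tone at an assumed 16 kHz sampling rate.
fs = 16000
t = np.arange(640) / fs
coeffs, feature = stft_amplitude(np.sin(2 * np.pi * 500 * t))
```

Either `coeffs` or `feature` could then serve as the network input, matching the two alternatives listed in claim 3.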
Opening claim text (preview).
What is claimed is:

1. A voice processing method, comprising: determining a historical voice frame corresponding to a target voice frame; determining a frequency-domain characteristic of the historical voice frame; invoking a network model to predict the frequency-domain characteristic of the historical voice frame, to obtain a parameter set of the target voice frame, the parameter set including a plurality of types of parameters, the network model including a plurality of neural networks (NNs), and a number of the types of the parameters in the parameter set being determined according to a number of the NNs; and reconstructing the target voice frame according to the parameter set.

2. The method according to claim 1, wherein determining the frequency-domain characteristic of the historical voice frame comprises: performing time-frequency transform on the historical voice frame to obtain a frequency-domain coefficient corresponding to the historical voice frame; and using the frequency-domain coefficient or an amplitude spectrum extracted from the frequency-domain coefficient as the frequency-domain characteristic of the historical voice frame.

3. The method according to claim 2, wherein performing the time-frequency transform comprises: performing short-term Fourier transform (STFT) on the historical voice frame, to obtain a plurality of sets of STFT coefficients corresponding to the historical voice frame; and using the frequency-domain coefficient or an amplitude spectrum extracted from the frequency-domain coefficient as the frequency-domain characteristic of the historical voice frame comprises: performing any one of: using the plurality of sets of STFT coefficients as the frequency-domain characteristic of the historical voice frame; and forming an amplitude coefficient sequence according to amplitude spectra corresponding to at least some of the STFT coefficients in each set of STFT coefficients, and using the amplitude coefficient sequence as the frequency-domain characteristic of the historical voice frame.

4. The method according to claim 1, wherein the network model includes a first NN and a plurality of second NNs; and invoking the network model comprises: invoking the first NN to predict the frequency-domain characteristic of the historical voice frame, to obtain a virtual frequency-domain characteristic of the target voice frame; invoking the second NNs to predict the virtual frequency-domain characteristic of the target voice frame, to obtain parameters corresponding to the second NNs; and establishing the parameter set of the target voice frame according to the parameters respectively corresponding to the plurality of second NNs.

5. The method according to claim 4, wherein the network model includes a third NN; and establishing the parameter set of the target voice frame according to the parameters respectively corresponding to the plurality of second NNs comprises: acquiring an energy parameter of the historical voice frame; invoking the third NN to predict the energy parameter of the historical voice frame, to obtain an energy parameter of the target voice frame; and establishing the parameter set of the target voice frame according to the parameters respectively corresponding to the plurality of second NNs and the energy parameter of the target voice frame, the target voice frame including m subframes, the energy parameter of the target voice frame including a gain value of each of the subframes of the target voice frame, and m being a positive integer.

6. The method according to claim 1, wherein reconstructing the target voice frame comprises: establishing a reconstruction filter according to the parameter set; acquiring an excitation signal of the historical voice frame; determining an excitation signal of the target voice frame according to the excitation signal of the historical voice frame; and filtering the excitation signal of the target voice frame according to the reconstruction filter, to obtain a reconstructed target voice frame.

7. The method according to claim 6, wherein the target voice frame is an n-th voice frame in a voice signal transmitted by a voice over Internet protocol (VoIP) system, the historical voice frame includes an (n−t)-th voice frame to an (n−1)-th voice frame in the voice signal transmitted by the VoIP system, n and t being both positive integers, and the excitation signal of the historical voice frame includes an excitation signal of the (n−1)-th voice frame; and determining the excitation signal of the target voice frame comprises determining the excitation signal of the (n−1)-th voice frame as the excitation signal of the target voice frame.

8. The method according to claim 6, wherein the target voice frame is an n-th voice frame in a voice signal transmitted by a VoIP system, the historical voice frame includes an (n−t)-th voice frame to an (n−1)-th voice frame in the voice signal transmitted by the VoIP system, n and t being both positive integers, and the excitation signal of the historical voice frame includes an excitation signal of each voice frame in the (n−t)-th voice frame to the (n−1)-th voice frame; and determining the excitation signal of the target voice frame comprises: averaging the excitation signals of the voice frames in the (n−t)-th voice frame to the (n−1)-th voice frame to obtain the excitation signal of the target voice frame; or performing weighted summation on the excitation signals of the voice frames in the (n−t)-th voice frame to the (n−1)-th voice frame to obtain the excitation signal of the target voice frame.

9. The method according to claim 6, wherein in response to determining that the target voice frame is an unvoiced frame, the parameter set includes a short-term correlation parameter of the target voice frame, and the reconstruction filter includes a linear predictive coding (LPC) filter; the target voice frame including k daughter frames, the short-term correlation parameter of the target voice frame including a line spectral frequency (LSF) of a k-th daughter frame of the target voice frame and an interpolation factor of the target voice frame, and k being an integer greater than 1.

10. The method according to claim 9, wherein filtering the excitation signal of the target voice frame comprises: performing interpolation according to the LSF of the k-th daughter frame and the interpolation factor of the target voice frame, to obtain an LSF of a daughter frame different from the k-th daughter frame; determining an LPC coefficient of any one daughter frame according to an LSF of the any one daughter frame; performing LPC filtering according to the excitation signal of the target voice frame and the LPC coefficient of the any one daughter frame, to obtain any one reconstructed daughter frame; and synthesizing the k reconstructed daughter frames to obtain the reconstructed target voice frame.

11. The method according to claim 10, wherein the parameter set includes energy parameters respectively corresponding to the k daughter frames of the target voice frame; and the method further comprises: performing signal amplification on the any one reconstructed daughter frame according to the energy parameter of the any one daughter frame.

12. The method according to claim 6, wherein in response to determining that the target voice frame is a voiced frame, the parameter set includes a short-term correlation parameter of the target voice frame and a long-term correlation parameter of the target voice frame, and the reconstruction filter includes a long-term predictive (LTP) filter and an LPC
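
The unvoiced-frame reconstruction path in claims 6 to 11 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation. All function names are hypothetical, and the frame is assumed to have k = 2 daughter frames. The claims do not say what the interpolation blends, so the sketch assumes a linear blend between the previous frame's last LSF vector (`lsf_prev`) and the predicted LSF of the k-th daughter frame. The LSF-to-LPC conversion is the standard sum and difference polynomial construction for an even filter order. The energy parameter is treated as a plain multiplicative gain.

```python
import numpy as np

def predict_excitation(history_exc, weights=None):
    """Claim 8 style: average, or weighted sum, of historical excitations."""
    history_exc = np.asarray(history_exc, dtype=float)  # shape (t, frame_len)
    if weights is None:
        return history_exc.mean(axis=0)
    return np.asarray(weights, dtype=float) @ history_exc

def lsf_to_lpc(lsf):
    """Convert an even number of ascending LSFs (radians) to [1, a1, ..., ap]."""
    p_poly = np.array([1.0, 1.0])    # symmetric polynomial, root at z = -1
    q_poly = np.array([1.0, -1.0])   # antisymmetric polynomial, root at z = +1
    for i, w in enumerate(lsf):
        factor = np.array([1.0, -2.0 * np.cos(w), 1.0])
        if i % 2 == 0:
            p_poly = np.convolve(p_poly, factor)
        else:
            q_poly = np.convolve(q_poly, factor)
    return (0.5 * (p_poly + q_poly))[:-1]  # trailing coefficient cancels to 0

def lpc_synthesis(excitation, a, state):
    """All-pole filter y[n] = e[n] - sum_k a[k] * y[n-k]; state = past outputs."""
    order = len(a) - 1
    past_out = np.asarray(state, dtype=float)[-order:]
    buf = np.concatenate([past_out, np.zeros(len(excitation))])
    for n in range(len(excitation)):
        past = buf[n:n + order][::-1]  # y[n-1], ..., y[n-order]
        buf[order + n] = excitation[n] - np.dot(a[1:], past)
    return buf[order:]

def reconstruct_frame(excitation, lsf_last, lsf_prev, alpha, gains, state):
    """Rebuild a frame made of two daughter frames (assumed k = 2)."""
    lsf_last = np.asarray(lsf_last, dtype=float)
    # Assumed interpolation rule: linear blend with the previous frame's LSF.
    lsf_first = (1.0 - alpha) * np.asarray(lsf_prev, dtype=float) + alpha * lsf_last
    half = len(excitation) // 2
    out = []
    for i, lsf in enumerate((lsf_first, lsf_last)):
        a = lsf_to_lpc(lsf)
        y = lpc_synthesis(excitation[i * half:(i + 1) * half], a, state)
        state = np.concatenate([state, y])[-(len(a) - 1):]  # carry filter memory
        out.append(gains[i] * y)  # claim 11: per-daughter-frame amplification
    return np.concatenate(out)

# Usage with made-up numbers: two historical excitations, a 2nd-order filter.
exc = predict_excitation([[1.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]])
lsf = np.array([np.arccos(0.85), np.arccos(0.05)])
frame = reconstruct_frame(exc, lsf, lsf, alpha=0.5, gains=[1.0, 1.0], state=np.zeros(2))
```

Replacing the average with `history_exc[-1]` would give the simpler rule of claim 7, where the excitation of the (n−1)-th frame is reused directly.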
Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters · CPC title
Correction of errors induced by the transmission channel, if related to the coding algorithm · CPC title
Line spectrum pair [LSP] vocoders · CPC title
the extracted parameters being spectral information of each sub-band · CPC title
using neural networks · CPC title