Voice processing method and apparatus, electronic device, and computer-readable storage medium

US12223972B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12223972-B2
Application number  US-202217714485-A
Country             US
Kind code           B2
Filing date         Apr 6, 2022
Priority date       May 15, 2020
Publication date    Feb 11, 2025
Grant date          Feb 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A voice processing method includes: determining a historical voice frame corresponding to a target voice frame; acquiring a frequency-domain characteristic of the historical voice frame and a time-domain parameter of the historical voice frame; obtaining a parameter set of the target voice frame according to a correlation between the frequency-domain characteristic of the historical voice frame and the time-domain parameter of the historical voice frame, the parameter set including at least two parameters; and reconstructing the target voice frame according to the parameter set.
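The "frequency-domain characteristic" in the abstract is narrowed in claim 8 to an amplitude spectrum taken from a short-term Fourier transform of the historical voice frame. The sketch below illustrates that one step in plain Python. The Hann window, the frame length, the hop size, and the function names `amplitude_spectrum` and `stft_amplitudes` are assumptions made for illustration; the patent text shown here does not specify them.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Return the one-sided DFT magnitudes of a Hann-windowed frame."""
    n = len(frame)
    # Periodic Hann window (an assumption; the claim does not name a window).
    windowed = [
        x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        # Direct DFT sum for bin k; O(n^2), fine for a short illustration.
        acc = sum(
            w * cmath.exp(-2j * math.pi * k * i / n)
            for i, w in enumerate(windowed)
        )
        spectrum.append(abs(acc))
    return spectrum


def stft_amplitudes(signal, frame_len, hop):
    """Collect one amplitude spectrum per frame of the signal."""
    return [
        amplitude_spectrum(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
```

A production system would normally use an FFT routine instead of the direct sum; the direct form is used here only to keep the example free of third-party dependencies.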

First claim

Opening claim text (preview).

What is claimed is:

1. A voice processing method, comprising: determining a historical voice frame corresponding to a target voice frame; acquiring a frequency-domain characteristic of the historical voice frame and a time-domain parameter of the historical voice frame; obtaining a parameter set of the target voice frame according to a correlation between the frequency-domain characteristic of the historical voice frame and the time-domain parameter of the historical voice frame, the parameter set including at least two parameters; and reconstructing the target voice frame according to the parameter set, wherein the network model includes a first neural network and at least two second neural networks, the first neural network is in a cascade relationship with each of the second neural networks, and one of the second neural networks corresponding to a parameter in the parameter set, and obtaining the parameter set of the target voice frame comprises: invoking the first neural network based on the frequency-domain characteristic of the historical voice frame, to obtain a virtual frequency-domain characteristic of the target voice frame; and inputting the virtual frequency-domain characteristic of the target voice frame and the time-domain parameter of the historical voice frame to the at least two second neural networks respectively as input information, to obtain the at least two parameters in the parameter set of the target voice frame.

2. The method according to claim 1, wherein reconstructing the target voice frame comprises: establishing a reconstruction filter according to the parameter set; acquiring an excitation signal of the target voice frame; and filtering the excitation signal of the target voice frame by using the reconstruction filter to obtain the target voice frame.

3. The method according to claim 2, wherein acquiring the excitation signal of the target voice frame comprises: acquiring an excitation signal of the historical voice frame; and determining the excitation signal of the target voice frame according to the excitation signal of the historical voice frame.

4. The method according to claim 3, wherein the target voice frame is an nth voice frame in a voice signal transmitted by a Voice over Internet Protocol (VOIP) system, and the historical voice frame includes an (n−t)th voice frame to an (n−1)th voice frame in the voice signal transmitted by the VoIP system, n and t being both positive integers.

5. The method according to claim 4, wherein the excitation signal of the historical voice frame includes an excitation signal of the (n−1)th voice frame; and determining the excitation signal of the target voice frame comprises: determining the excitation signal of the (n−1)th voice frame as the excitation signal of the target voice frame.

6. The method according to claim 4, wherein the excitation signal of the historical voice frame includes excitation signals of voice frames from the (n−t)th voice frame to the (n−1)th voice frame; and determining the excitation signal of the target voice frame comprises: averaging the excitation signals of the (n−t)th voice frame to the (n−1)th voice frame to obtain the excitation signal of the target voice frame.

7. The method according to claim 4, wherein the excitation signal of the historical voice frame includes excitation signals of voice frames from the (n−t)th voice frame to the (n−1)th voice frame; and determining the excitation signal of the target voice frame comprises: performing weighted summation on the excitation signals of the (n−t)th voice frame to the (n−1)th voice frame to obtain the excitation signal of the target voice frame.

8. The method according to claim 1, wherein acquiring the frequency-domain characteristic comprises: performing short-term Fourier transform (STFT) on the historical voice frame to obtain a frequency-domain coefficient corresponding to the historical voice frame; and extracting an amplitude spectrum from the frequency-domain coefficient corresponding to the historical voice frame as the frequency-domain characteristic of the historical voice frame.

9. The method according to claim 2, wherein in response to determining that the target voice frame is an unvoiced frame, the time-domain parameter of the historical voice frame includes a short-term correlation parameter of the historical voice frame, and the parameter set includes a short-term correlation parameter of the target voice frame; and the reconstruction filter includes a linear predictive coding (LPC) filter; and the target voice frame includes k daughter frames, and the short-term correlation parameter of the target voice frame includes a line spectral frequency (LSF) and an interpolation factor of a kth daughter frame of the target voice frame, k being an integer greater than 1.

10. The method according to claim 2, wherein in response to determining that the target voice frame is a voiced frame, the time-domain parameter of the historical voice frame includes a short-term correlation parameter and a long-term correlation parameter of the historical voice frame, and the parameter set includes a short-term correlation parameter of the target voice frame and a long-term correlation parameter of the target voice frame; the reconstruction filter includes a long-term prediction (LTP) filter and an LPC filter; the target voice frame includes k daughter frames, and the short-term correlation parameter of the target voice frame includes an LSF and an interpolation factor of a kth daughter frame of the target voice frame, k being an integer greater than 1; and the target voice frame includes m subframes, and the long-term correlation parameter of the target voice frame includes a pitch lag and an LTP coefficient of each subframe of the target voice frame, m being a positive integer.

11. The method according to claim 1, wherein the network model further includes a third neural network, a network formed by the first neural network, each of the second neural networks, and the third neural network being a parallel network; and the time-domain parameter of the historical voice frame includes an energy parameter of the historical voice frame; and obtaining the parameter set of the target voice frame comprises: inputting the virtual frequency-domain characteristic of the target voice frame and the energy parameter of the historical voice frame to the at least two second neural networks respectively as input information, to obtain the at least two parameters of the target voice frame; invoking the third neural network based on the energy parameter of the historical voice frame, to obtain an energy parameter of the target voice frame; and forming the parameter set of the target voice frame by the at least two parameters of the target voice frame and the energy parameter of the target voice frame, the target voice frame including m subframes, and the energy parameter of the target voice frame including a gain value of each of the subframes of the target voice frame.

12. A voice processing apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions and perform: determining a historical voice frame corresponding to a target voice frame; acquiring a frequency-domain characteristic of the historical voice frame and a time-domain parameter of the historical voice frame; obtaining a parameter set of the target voice frame according to a correlation between the frequency-domain characteristic of the historical voice frame and the time-domain parameter of the historical voic
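Claims 2 and 5 to 7 describe the reconstruction step: an excitation signal for the target frame is derived from historical excitation signals (copied from frame n−1, averaged, or combined by weighted summation) and then passed through a reconstruction filter. The sketch below shows one way this could look for the LPC-only case. It is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the LPC coefficients are assumed to be already available (the conversion from predicted LSFs is not shown), and the filter uses the common convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p with synthesis filter 1/A(z).

```python
def last_frame_excitation(history):
    """Claim 5 style: reuse the excitation of frame n-1 (the last entry)."""
    return list(history[-1])


def combined_excitation(history, weights=None):
    """Claim 6/7 style: average (weights=None) or weighted sum of the
    excitations of frames n-t .. n-1, sample by sample."""
    if weights is None:
        weights = [1.0 / len(history)] * len(history)
    if len(weights) != len(history):
        raise ValueError("need one weight per historical frame")
    length = len(history[0])
    return [
        sum(w * frame[i] for w, frame in zip(weights, history))
        for i in range(length)
    ]


def lpc_synthesis(excitation, lpc_coeffs, state=None):
    """All-pole filter: s[i] = e[i] - sum_j a_j * s[i-j], j = 1..p.

    `state` holds the p most recent past outputs, newest first
    (assumed zero when omitted).
    """
    p = len(lpc_coeffs)
    past = list(state) if state is not None else [0.0] * p
    out = []
    for e in excitation:
        s = e - sum(a * y for a, y in zip(lpc_coeffs, past))
        out.append(s)
        past = ([s] + past)[:p]  # shift the new sample into the filter memory
    return out
```

For example, `lpc_synthesis(combined_excitation(history), coeffs)` would produce a reconstructed frame from averaged historical excitations. The voiced-frame case in claim 10 additionally involves an LTP filter, which is not sketched here.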

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Feedforward networks · CPC title

  • Supervised learning · CPC title

  • Combinations of networks · CPC title

  • Responding to QoS · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12223972B2 cover?
A voice processing method includes: determining a historical voice frame corresponding to a target voice frame; acquiring a frequency-domain characteristic of the historical voice frame and a time-domain parameter of the historical voice frame; obtaining a parameter set of the target voice frame according to a correlation between the frequency-domain characteristic of the historical voice frame…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L21/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 11, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).