Audio signal processing method and apparatus, electronic device, and storage medium

US12039995B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12039995-B2
Application number: US-202217667370-A
Country: US
Kind code: B2
Filing date: Feb 8, 2022
Priority date: Jan 2, 2020
Publication date: Jul 16, 2024
Grant date: Jul 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This application discloses an audio signal processing method performed by an electronic device. According to this application, embedding processing is performed on a mixed audio signal by mapping the mixed audio signal to an embedding space, to obtain an embedding feature of the mixed audio signal in the embedding space; and generalized feature extraction is performed on the embedding feature, so that a generalized feature of a target component in the mixed audio signal can be obtained through extraction. The generalized feature of the target component has good generalization capability and expression capability, and can be used for different scenarios. Audio signal processing is performed on the mixed audio signal based on the generalized feature of the target component to obtain information of the audio signal of the target object, thereby improving the robustness and generalization of an audio signal processing process, and improving the accuracy of audio signal processing.
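The two-stage flow described above (embedding, then generalized feature extraction) can be pictured with a toy, shape-level sketch. Everything in it is an assumption made for illustration: the names `encode` and `abstract_features`, the fixed weight matrices, and the `decay` value stand in for trained networks that this page does not specify. The running weighted state loosely mirrors the "recursive weighting processing" of claim 2, and the final projection mirrors claim 1's requirement that the generalized feature has a lower dimension than the embedding.

```python
# Toy sketch only: fixed, hand-picked weights stand in for trained networks.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(frames, enc_weights):
    """Hypothetical encoder: map each low-dimensional frame to a higher-dimensional embedding."""
    return [matvec(enc_weights, frame) for frame in frames]

def abstract_features(embeddings, abs_weights, decay=0.5):
    """Hypothetical abstractor: recursively weight embeddings over time,
    then project the running state to a lower-dimensional feature."""
    state = [0.0] * len(embeddings[0])
    features = []
    for emb in embeddings:
        # state_t = decay * state_{t-1} + (1 - decay) * embedding_t
        state = [decay * s + (1.0 - decay) * e for s, e in zip(state, emb)]
        features.append(matvec(abs_weights, state))
    return features

# Three 2-dimensional frames of a made-up "mixed signal".
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ENC_W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # 2 -> 4 dimensions
ABS_W = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]                              # 4 -> 3 dimensions

embeddings = encode(frames, ENC_W)
features = abstract_features(embeddings, ABS_W)
print(len(frames[0]), len(embeddings[0]), len(features[0]))  # 2 4 3
```

The sketch only shows how the dimensions change from stage to stage; it performs no actual source separation.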

First claim

Opening claim text (preview).

What is claimed is:

1. An audio signal processing method performed by an electronic device, the method comprising: performing embedding processing on a mixed audio signal by mapping the mixed audio signal from a low-dimensional space to a high-dimensional embedding space using an encoder network, to obtain an embedding feature of the mixed audio signal in the embedding space; performing generalized feature extraction on the embedding feature using an abstractor network, to obtain a generalized feature of a target component in the mixed audio signal, the target component corresponding to an audio signal of a target object in the mixed audio signal, wherein a dimension of the generalized feature is lower than a dimension of embedding feature of the mixed audio signal; and performing audio signal processing on the mixed audio signal based on the generalized feature of the target component to obtain information of the audio signal of the target object in the mixed audio signal used for separating the audio signal of the target object from the mixed audio signal, wherein the encoder network and the abstractor network are obtained by collaboratively training on a teacher model and a student model through unsupervised machine learning using unlabeled sample mixed signals in multiple iterations, wherein the student model comprises a first encoder network and a first abstractor network, the teacher model comprises a second encoder network and a second abstractor network, an output of the first encoder network is used as an input of the first abstractor network, and an output of the second encoder network is used as an input of the second abstractor network, and the teacher model in each iteration process is obtained by weighting the teacher model in a previous iteration process and the student model in the current iteration process.

2. The method according to claim 1, wherein the performing generalized feature extraction on the embedding feature, to obtain a generalized feature of a target component in the mixed audio signal comprises: performing recursive weighting processing on the embedding feature, to obtain the generalized feature of the target component.

3. The method according to claim 1, wherein the abstractor network is an autoregressive model, and the inputting the embedding feature into an abstractor network, and performing generalized feature extraction on the embedding feature by using the abstractor network, to obtain the generalized feature of the target component in the mixed audio signal comprises: inputting the embedding feature into the autoregressive model, and performing recursive weighting processing on the embedding feature by using the autoregressive model, to obtain the generalized feature of the target component.

4. The method according to claim 1, wherein the collaboratively training on a teacher model and a student model based on an unlabeled sample mixed signal, to obtain the encoder network and the abstractor network comprises: obtaining, in any iteration process, the teacher model in the current iteration process based on the student model in the current iteration process and the teacher model in a previous iteration process; respectively inputting the unlabeled sample mixed signal into the teacher model and the student model in the current iteration process, and respectively outputting a teacher generalized feature and a student generalized feature of a target component in the sample mixed signal; obtaining a loss function value of the current iteration process based on at least one of the sample mixed signal, the teacher generalized feature, or the student generalized feature; adjusting, when the loss function value does not meet a training end condition, a parameter of the student model to obtain the student model in a next iteration process, and performing the next iteration process based on the student model in the next iteration process; and obtaining the encoder network and the abstractor network based on the student model or the teacher model in the current iteration process when the loss function value meets the training end condition.

5. The method according to claim 4, wherein the obtaining a loss function value of the current iteration process based on at least one of the sample mixed signal, the teacher generalized feature, or the student generalized feature comprises: obtaining a mean squared error (MSE) between the teacher generalized feature and the student generalized feature; obtaining a mutual information (MI) value between the sample mixed signal and the student generalized feature; and determining at least one of the MSE or the MI value as the loss function value of the current iteration process.

6. The method according to claim 5, wherein the training end condition is that the MSE does not decrease in a first target quantity of consecutive iteration processes; or the training end condition is that the MSE is less than or equal to a first target threshold and the MI value is greater than or equal to a second target threshold; or the training end condition is that a quantity of iterations reaches a second target quantity.

7. The method according to claim 4, wherein the obtaining the teacher model in the current iteration process based on the student model in the current iteration process and the teacher model in a previous iteration process comprises: multiplying a parameter set of the teacher model in the previous iteration process by a first smoothing coefficient, to obtain a first parameter set; multiplying the student model in the current iteration process by a second smoothing coefficient, to obtain a second parameter set, a value obtained by adding the first smoothing coefficient and the second smoothing coefficient being 1; determining a sum of the first parameter set and the second parameter set as a parameter set of the teacher model in the current iteration process; and performing parameter update on the teacher model in the previous iteration process based on the parameter set of the teacher model in the current iteration process, to obtain the teacher model in the current iteration process.

8. The method according to claim 4, wherein the obtaining the encoder network and the abstractor network based on the student model or the teacher model in the current iteration process comprises: respectively determining the first encoder network and the first abstractor network in the student model in the current iteration process as the encoder network and the abstractor network; or respectively determining the second encoder network and the second abstractor network in the teacher model in the current iteration process as the encoder network and the abstractor network.

9. The method according to claim 1, wherein the performing audio signal processing on the mixed audio signal based on the generalized feature of the target component to obtain information of the audio signal of the target object in the mixed audio signal comprises: performing speech-to-text conversion on the audio signal of the target object based on the generalized feature of the target component, and outputting text information corresponding to the audio signal of the target object; or performing voiceprint recognition on the audio signal of the target object based on the generalized feature of the target component, and outputting a voiceprint recognition result corresponding to the audio signal of the target object; or generating a response speech corresponding to the audio signal of the target object based on the generalized feature of the target component, and outputting the response speech.

10. An electronic device, comprising one or more processors and one or more memorie
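The teacher update in claim 7, the MSE term in claim 5, and two of the end conditions in claim 6 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: parameter sets are modelled as plain dictionaries of float lists, the helper names (`update_teacher`, `mse`, `should_stop`), the coefficient 0.75, and the `patience` and `max_iters` values are all hypothetical, and the mutual information term and the threshold-based end condition are left out.

```python
def update_teacher(teacher_prev, student, alpha):
    """Claim 7 as a weighted sum: teacher_t = alpha * teacher_{t-1} + beta * student_t."""
    beta = 1.0 - alpha  # second smoothing coefficient, so alpha + beta == 1
    return {
        name: [alpha * t + beta * s for t, s in zip(teacher_prev[name], student[name])]
        for name in teacher_prev
    }

def mse(teacher_feat, student_feat):
    """Mean squared error between teacher and student generalized features (claim 5)."""
    return sum((t - s) ** 2 for t, s in zip(teacher_feat, student_feat)) / len(teacher_feat)

def should_stop(mse_history, patience, max_iters):
    """Literal reading of two end conditions in claim 6 (an assumption):
    stop when the iteration budget is used up, or when the MSE failed to
    decrease in each of the last `patience` consecutive iterations."""
    if len(mse_history) >= max_iters:
        return True
    if len(mse_history) <= patience:
        return False
    recent = mse_history[-(patience + 1):]
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))

# Made-up parameter sets for a teacher and a student model.
teacher = {"encoder": [1.0, 2.0], "abstractor": [0.0]}
student = {"encoder": [3.0, 4.0], "abstractor": [4.0]}

new_teacher = update_teacher(teacher, student, alpha=0.75)
print(new_teacher)                       # {'encoder': [1.5, 2.5], 'abstractor': [1.0]}
print(mse([1.0, 2.0], [3.0, 4.0]))       # 4.0
print(should_stop([0.9, 0.5, 0.5, 0.6], patience=2, max_iters=100))  # True
```

Because only the student is adjusted from the loss while the teacher is a smoothed average of past students, the teacher changes more slowly from one iteration to the next, which is the usual motivation for this kind of weighting.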

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Auto-encoder networks; Encoder-decoder networks

  • the noise being separate speech, e.g. cocktail party

  • Processing in the frequency domain

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12039995B2 cover?
This application discloses an audio signal processing method performed by an electronic device. According to this application, embedding processing is performed on a mixed audio signal by mapping the mixed audio signal to an embedding space, to obtain an embedding feature of the mixed audio signal in the embedding space; and generalized feature extraction is performed on the embedding feature, …
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L21/0272. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 16, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).