Speech separation model training method and apparatus, storage medium and computer device

US11908455B2 · US · B2

Patent metadata
Publication number: US-11908455-B2
Application number: US-202217672565-A
Country: US
Kind code: B2
Filing date: Feb 15, 2022
Priority date: Jan 7, 2020
Publication date: Feb 20, 2024
Grant date: Feb 20, 2024

Abstract

Official abstract text for this publication.

A speech separation model training method and apparatus, a computer-readable storage medium, and a computer device are provided, the method including: obtaining first audio and second audio, the first audio including target audio and having corresponding labeled audio, and the second audio including noise audio; obtaining an encoding model, an extraction model, and an initial estimation model; performing unsupervised training on the encoding model, the extraction model, and the estimation model according to the second audio, and adjusting model parameters of the extraction model and the estimation model; performing supervised training on the encoding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting a model parameter of the encoding model; continuously performing the unsupervised training and the supervised training, so that the unsupervised training and the supervised training overlap, and the training is not finished until a training stop condition is met.
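
The alternating schedule in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: each "model" is reduced to a single scalar parameter, both loss functions are made-up quadratics, and the names `step` and `train` are hypothetical. Only the pattern of which parameters are frozen and which are updated in each phase follows the abstract; it is not the patent's implementation.

```python
# Toy stand-ins: each "model" is one scalar parameter and both losses are
# invented quadratics. Only the freeze/update pattern mirrors the abstract.

def unsupervised_loss_grads(p):
    """Hypothetical loss on the unlabeled second audio, with its gradients."""
    a = p["extraction"] - p["encoding"]
    b = p["estimation"] - p["extraction"]
    loss = a * a + b * b
    grads = {"encoding": -2 * a, "extraction": 2 * a - 2 * b, "estimation": 2 * b}
    return loss, grads

def supervised_loss_grads(p):
    """Hypothetical loss on the labeled first audio, with its gradients."""
    r = p["encoding"] + p["extraction"] - 1.0
    return r * r, {"encoding": 2 * r, "extraction": 2 * r, "estimation": 0.0}

def step(p, grads, trainable, lr=0.1):
    """Gradient step that updates only the parameters named in `trainable`."""
    return {k: (v - lr * grads[k] if k in trainable else v) for k, v in p.items()}

def train(p, max_rounds=200, tol=1e-6):
    """Alternate the two phases until a stop condition is met."""
    for _ in range(max_rounds):
        # Unsupervised phase: encoding model frozen.
        u_loss, u_grads = unsupervised_loss_grads(p)
        p = step(p, u_grads, trainable={"extraction", "estimation"})
        # Supervised phase: only the encoding model is adjusted.
        s_loss, s_grads = supervised_loss_grads(p)
        p = step(p, s_grads, trainable={"encoding"})
        # Assumed stop condition: both toy losses are small.
        if u_loss < tol and s_loss < tol:
            break
    return p
```

Calling `train({"encoding": 0.0, "extraction": 0.5, "estimation": -0.5})` runs the alternation on the toy problem. In a real system the stop condition could equally be a fixed iteration budget or a validation metric; the source does not say which.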

First claim

Opening claim text (preview).

What is claimed is:

1. A speech separation model training method performed by a computer device, the method comprising: obtaining first audio and second audio, the first audio comprising target audio and having corresponding labeled audio, and the second audio comprising noise audio; obtaining an encoding model, an extraction model, and an initial estimation model, an output of the encoding model being an input of the extraction model, the output of the encoding model and an output of the extraction model being jointly inputs of the estimation model; performing unsupervised training on the encoding model, the extraction model, and the estimation model according to the second audio, and adjusting model parameters of the extraction model and the estimation model; performing supervised training on the encoding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting a model parameter of the encoding model; and continuously performing the unsupervised training and the supervised training in an alternating manner until a training stop condition is met.

2. The method according to claim 1, further comprising: performing Fourier Transform (FT) on the first audio, to obtain an audio feature of the first audio; encoding the audio feature through the encoding model, to obtain an embedding feature of the first audio; extracting the embedding feature through the extraction model, to obtain an abstract feature of the target audio comprised in the first audio; and constructing a supervised training loss function to pre-train the encoding model and the extraction model according to the labeled audio corresponding to the first audio, the embedding feature of the first audio, and the abstract feature of the target audio comprised in the first audio.

3. The method according to claim 2, wherein the performing FT on the first audio, to obtain an audio feature of the first audio comprises: performing short-time Fourier transform (STFT) on the first audio, to obtain time-frequency points of the first audio; and obtaining a time-frequency feature formed by the time-frequency points as the audio feature of the first audio.

4. The method according to claim 3, wherein the extracting the embedding feature through the extraction model, to obtain an abstract feature of the target audio comprised in the first audio comprises: processing the embedding feature through a first hidden layer of the extraction model, to obtain a prediction probability that the time-frequency points of the first audio are time-frequency points of the target audio; and operating embedding features of the time-frequency points and prediction probabilities of the time-frequency points through a second hidden layer of the extraction model, to construct a time-varying abstract feature of the target audio comprised in the first audio.

5. The method according to claim 2, wherein the constructing a supervised training loss function to pre-train the encoding model and the extraction model according to the labeled audio of the first audio, the embedding feature of the first audio, and the abstract feature of the target audio comprised in the first audio comprises: determining a spectrum mask of the target audio comprised in the first audio according to the embedding feature of the first audio and the abstract feature of the target audio comprised in the first audio; reconstructing the target audio based on the spectrum mask; and constructing the supervised training loss function to pre-train the encoding model and the extraction model according to a difference between the reconstructed target audio and the labeled audio of the first audio.

6. The method according to claim 1, wherein the performing unsupervised training on the encoding model, the extraction model, and the estimation model according to the second audio, and adjusting model parameters of the extraction model and the estimation model comprises: encoding an audio feature of the second audio through the encoding model, to obtain an embedding feature of the second audio; extracting the embedding feature of the second audio through the extraction model, to obtain an abstract feature of the target audio comprised in the second audio; processing the embedding feature of the second audio and the abstract feature of the target audio comprised in the second audio through the estimation model, to obtain a mutual information (MI) estimation feature between the second audio and the abstract feature of the target audio comprised in the second audio; constructing an unsupervised training loss function according to the MI estimation feature; and fixing the model parameter of the encoding model, and adjusting the model parameters of the extraction model and the estimation model according to a direction of minimizing the unsupervised training loss function.

7. The method according to claim 6, wherein the extracting the embedding feature of the second audio through the extraction model, to obtain an abstract feature of the target audio comprised in the second audio comprises: processing the embedding feature of the second audio through the first hidden layer of the extraction model, to obtain a prediction probability that time-frequency points of the second audio are the time-frequency points of the target audio; and chronologically operating embedding features of the time-frequency points and prediction probabilities of the time-frequency points through the second hidden layer of the extraction model, to construct a global abstract feature of the target audio comprised in the second audio.

8. The method according to claim 7, wherein the constructing an unsupervised training loss function according to the MI estimation feature comprises: dividing first time-frequency points predicted to be positive samples according to the prediction probabilities of the time-frequency points; obtaining second time-frequency points used as negative samples, the second time-frequency points being taken from a noise proposal distribution obeyed by time-frequency points of pure noise audio; and constructing the unsupervised training loss function according to an MI estimation feature corresponding to the first time-frequency points and an MI estimation feature corresponding to the second time-frequency points.

9. The method according to claim 1, wherein the performing supervised training on the encoding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting a model parameter of the encoding model comprises: encoding an audio feature of the first audio through the encoding model, to obtain an embedding feature of the first audio; extracting the embedding feature of the first audio through the extraction model, to obtain an abstract feature of the target audio comprised in the first audio; constructing a supervised training loss function according to the labeled audio of the first audio, the embedding feature of the first audio, and the abstract feature of the target audio comprised in the first audio; and fixing the model parameter of the extraction model, and adjusting the model parameter of the encoding model and the estimation model according to a direction of minimizing the supervised training loss function.

10. The method according to claim 1, further comprising: obtaining mixed audio on which speech separation is to be performed; processing an audio feature of the mixed audio through the encoding model obtained after finishing the unsupervised training and the supervised training, to obtain an embedding feature of the mixed audio; processing the embedding
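
The supervised path described in claims 2 to 5 (STFT, embedding, prediction probability, time-varying abstract feature, spectrum mask, reconstruction loss) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented models: the encoder is replaced by a hypothetical per-frequency `tanh` projection (`w_enc`), the first hidden layer by a single sigmoid projection (`w_prob`), the second hidden layer by a probability-weighted average over frequency, the mask by a sigmoid of an inner product, and the loss by a mean squared error on magnitude spectra. All function and parameter names are made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude STFT; returns time-frequency points with shape (T, F)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def supervised_loss(mixture, labeled, w_enc, w_prob):
    """Toy forward pass; w_enc has shape (F, D), w_prob has shape (D,)."""
    mix_tf = stft_magnitude(mixture)                    # (T, F)
    # Stand-in encoder (assumption): one D-dim embedding per time-frequency point.
    emb = np.tanh(mix_tf[..., None] * w_enc)            # (T, F, D)
    # Stand-in first hidden layer: probability that a point belongs to the target.
    prob = sigmoid(emb @ w_prob)                        # (T, F)
    # Stand-in second hidden layer: probability-weighted average of the
    # embeddings in each frame, giving a time-varying abstract feature.
    abstract = (prob[..., None] * emb).sum(axis=1) / prob.sum(axis=1, keepdims=True)
    # Spectrum mask from embedding/abstract-feature similarity (assumed form).
    mask = sigmoid(np.einsum("tfd,td->tf", emb, abstract))  # (T, F)
    # Reconstruct the target spectrum and compare with the labeled audio.
    recon = mask * mix_tf
    loss = np.mean((recon - stft_magnitude(labeled)) ** 2)
    return loss, mask, abstract
```

With a 1024-sample signal and the default framing, `mix_tf` has 31 frames and 33 frequency bins, so `w_enc` would need shape `(33, D)`. In training, the gradient of `loss` with respect to the encoder and extractor weights would drive the update; the sketch stops at the forward pass.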

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • G10L15/063 · Primary

    Training · CPC title

  • Word boundary detection · CPC title

  • using artificial neural networks · CPC title

  • characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques · CPC title

  • Engine management systems · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11908455B2 cover?
A speech separation model training method and apparatus, a computer-readable storage medium, and a computer device are provided, the method including: obtaining first audio and second audio, the first audio including target audio and having corresponding labeled audio, and the second audio including noise audio; obtaining an encoding model, an extraction model, and an initial estimation model; …
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/063. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 20, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).