Multi-person speech separation method and apparatus using a generative adversarial network model

US11450337B2 · US · B2

Patent metadata
Publication number: US-11450337-B2
Application number: US-202017023829-A
Country: US
Kind code: B2
Filing date: Sep 17, 2020
Priority date: Aug 9, 2018
Publication date: Sep 20, 2022
Grant date: Sep 20, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A multi-person speech separation method is provided for a terminal. The method includes extracting a hybrid speech feature from a hybrid speech signal requiring separation, N human voices being mixed in the hybrid speech signal, N being a positive integer greater than or equal to 2; extracting a masking coefficient of the hybrid speech feature by using a generative adversarial network (GAN) model, to obtain a masking matrix corresponding to the N human voices, wherein the GAN model comprises a generative network model and an adversarial network model; and performing a speech separation on the masking matrix corresponding to the N human voices and the hybrid speech signal by using the GAN model, and outputting N separated speech signals corresponding to the N human voices.
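
To make the masking step in the abstract more concrete, the sketch below applies one mask per speaker to a mixture feature. It is an illustration only: the abstract does not say how the masking matrix is combined with the hybrid speech signal, so element-wise multiplication over a frames-by-frequency-bins feature is an assumption, the name `apply_masks` is hypothetical, and the mask values are hand-written rather than produced by a GAN.

```python
from typing import List

Matrix = List[List[float]]  # assumed layout: [frames][frequency bins]


def apply_masks(mixture: Matrix, masks: List[Matrix]) -> List[Matrix]:
    """Return one separated feature per mask (assumed element-wise product)."""
    separated = []
    for mask in masks:
        same_rows = len(mask) == len(mixture)
        if not same_rows or any(len(m) != len(x) for m, x in zip(mask, mixture)):
            raise ValueError("mask and mixture shapes differ")
        separated.append(
            [[m * x for m, x in zip(m_row, x_row)]
             for m_row, x_row in zip(mask, mixture)]
        )
    return separated


# Toy mixture with N = 2 voices; the masks are made up for illustration.
mixture = [[4.0, 2.0], [1.0, 3.0]]
masks = [
    [[0.75, 0.5], [0.0, 1.0]],  # mask for the first voice
    [[0.25, 0.5], [1.0, 0.0]],  # mask for the second voice
]
speaker_a, speaker_b = apply_masks(mixture, masks)
print(speaker_a)  # [[3.0, 1.0], [0.0, 3.0]]
print(speaker_b)  # [[1.0, 1.0], [1.0, 0.0]]
```

In this toy case the two masks sum to 1 in every cell, so the separated features add back up to the mixture. That is a property of the example, not something the abstract requires.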

First claim

Opening claim text (preview).

What is claimed is:

1. A multi-person speech separation method for a terminal by using a generative adversarial network (GAN) model, the GAN model including a generative network model and a discriminative network model, the method comprising: obtaining a hybrid speech sample and a clean speech sample from a sample database; extracting a hybrid speech sample feature from the hybrid speech sample; extracting a masking coefficient of the hybrid speech sample feature by using the generative network model, to obtain a sample masking matrix; performing a speech separation on the sample masking matrix and the hybrid speech sample by using the generative network model, and outputting a separated speech sample; performing alternate training on the generative network model and the discriminative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample, wherein the alternate training is performed by: fixing the generative network model during a current time of training of the discriminative network model; obtaining a loss function of the discriminative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample; optimizing the discriminative network model by minimizing the loss function of the discriminative network model; fixing the discriminative network model during a next time of training of the generative network model; obtaining a loss function of the generative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample; and optimizing the generative network model by minimizing the loss function of the generative network model.

2. The method according to claim 1, wherein the obtaining a loss function of the discriminative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample comprises: determining a first signal sample combination according to the separated speech sample and the hybrid speech sample, and determining a second signal sample combination according to the clean speech sample and the hybrid speech sample; performing discriminative output on the first signal sample combination by using the discriminative network model to obtain a first discriminative output result, and obtaining a first distortion metric between the first discriminative output result and a first target output of the discriminative network model; performing discriminative output on the second signal sample combination by using the discriminative network model to obtain a second discriminative output result, and obtaining a second distortion metric between the second discriminative output result and a second target output of the discriminative network model; and obtaining the loss function of the discriminative network model according to the first distortion metric and the second distortion metric.

3. The method according to claim 1, wherein the extracting a hybrid speech feature from a hybrid speech signal requiring separation comprises: extracting a time domain feature or a frequency domain feature of a single-channel speech signal from the hybrid speech signal; extracting a time domain feature or a frequency domain feature of a multi-channel speech signal from the hybrid speech signal; extracting a single-channel speech feature from the hybrid speech signal; or extracting a correlated feature among a plurality of channels from the hybrid speech signal.

4. The method according to claim 1, wherein the obtaining a loss function of the generative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample comprises: determining a first signal sample combination according to the separated speech sample and the hybrid speech sample; performing discriminative output on the first signal sample combination by using the discriminative network model to obtain a first discriminative output result, and obtaining a third distortion metric between the first discriminative output result and a second target output of the discriminative network model; obtaining a fourth distortion metric between the separated speech sample and clean speech; and obtaining the loss function of the generative network model according to the third distortion metric and the fourth distortion metric.

5. The method according to claim 4, wherein the obtaining a fourth distortion metric between the separated speech sample and clean speech comprises: performing a permutation invariant calculation on the separated speech sample and the clean speech sample to obtain a correspondence result between the separated speech sample and the clean speech sample; and obtaining the fourth distortion metric according to the correspondence result between the separated speech sample and the clean speech sample.

6. The multi-person speech separation method according to claim 1, further comprising: performing a speech separation on a hybrid speech signal by using the GAN model, the hybrid speech signal including N human voices, N being a positive integer greater than or equal to 2.

7. The multi-person speech separation method according to claim 6, wherein performing the speech separation on the hybrid speech signal by using the GAN model comprises: extracting a hybrid speech feature from the hybrid speech signal; extracting a masking coefficient of the hybrid speech feature to obtain a masking matrix corresponding to the N human voices; and performing the speech separation on the masking matrix and outputting N separated speech signals corresponding to the N human voices.

8. A multi-person speech separation apparatus, comprising: a memory storing computer program instructions; and a processor coupled to the memory and, when executing the computer program instructions, configured to perform a multi-person speech separation method by using a generative adversarial network (GAN) model, the GAN model including a generative network model and a discriminative network model, the method comprising: obtaining a hybrid speech sample and a clean speech sample from a sample database; extracting a hybrid speech sample feature from the hybrid speech sample; extracting a masking coefficient of the hybrid speech sample feature by using the generative network model, to obtain a sample masking matrix; performing a speech separation on the sample masking matrix and the hybrid speech sample by using the generative network model, and outputting a separated speech sample; performing alternate training on the generative network model and the discriminative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample, wherein the alternate training is performed by: fixing the generative network model during a current time of training of the discriminative network model; obtaining a loss function of the discriminative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample; optimizing the discriminative network model by minimizing the loss function of the discriminative network model; fixing the discriminative network model during a next time of training of the generative network model; obtaining a loss function of the generative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample; and optimizing the generative network model by minimizing the loss function of the generative network model.

9. The apparatus according to claim 8, wherein the obtaining a loss function of the discriminative network model by using the separated speech sample, the hybrid speech sample, and the clean speech sample comprises:
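
As a rough illustration of the loss terms recited in claims 2, 4, and 5, the sketch below combines two distortion metrics into a discriminator loss and pairs an adversarial term with a permutation invariant distortion for the generator loss. The claims leave several choices open, so everything specific here is an assumption: squared error as the distortion metric, 0.0 and 1.0 as the first and second target outputs, plain summation of the metrics, and the function names. The discriminator itself is not modeled; its outputs are passed in as plain numbers.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Signal = Sequence[float]


def mse(a: Signal, b: Signal) -> float:
    """Mean squared error, used here as the assumed distortion metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_distortion(separated: List[Signal],
                   clean: List[Signal]) -> Tuple[float, Tuple[int, ...]]:
    """Try every speaker ordering and keep the one with the lowest distortion."""
    best_loss, best_perm = float("inf"), ()
    for perm in permutations(range(len(clean))):
        loss = sum(mse(separated[i], clean[p])
                   for i, p in enumerate(perm)) / len(clean)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def discriminator_loss(d_separated: float, d_clean: float,
                       first_target: float = 0.0,
                       second_target: float = 1.0) -> float:
    """Claim 2 sketch: first distortion metric plus second distortion metric."""
    return (d_separated - first_target) ** 2 + (d_clean - second_target) ** 2


def generator_loss(d_separated: float, separated: List[Signal],
                   clean: List[Signal], second_target: float = 1.0) -> float:
    """Claim 4 sketch: third (adversarial) plus fourth (PIT) distortion metric."""
    third = (d_separated - second_target) ** 2
    fourth, _ = pit_distortion(separated, clean)
    return third + fourth


# Toy data: the separated outputs match the clean signals in swapped order.
separated = [[1.0, 0.0], [0.0, 2.0]]
clean = [[0.0, 2.0], [1.0, 0.0]]
print(pit_distortion(separated, clean))          # (0.0, (1, 0))
print(discriminator_loss(0.25, 0.75))            # 0.125
print(generator_loss(0.5, separated, clean))     # 0.25
```

The exhaustive search over orderings grows factorially with the number of voices, which is acceptable for small N such as 2 or 3. In the alternate training of claim 1, a loss like `discriminator_loss` would be minimized while the generator is held fixed, and a loss like `generator_loss` while the discriminator is held fixed.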

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Combinations of networks

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Adversarial learning

  • Generative networks

  • Convolutional networks [CNN, ConvNet]

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11450337B2 cover?
A multi-person speech separation method is provided for a terminal. The method includes extracting a hybrid speech feature from a hybrid speech signal requiring separation, N human voices being mixed in the hybrid speech signal, N being a positive integer greater than or equal to 2; extracting a masking coefficient of the hybrid speech feature by using a generative adversarial network (GAN) mod…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L21/0272. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 20, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).