Speech separation method, electronic device, chip, and computer-readable storage medium

US12334092B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12334092-B2
Application number  US-202118026960-A
Country             US
Kind code           B2
Filing date         Aug 24, 2021
Priority date       Sep 25, 2020
Publication date    Jun 17, 2025
Grant date          Jun 17, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A speech separation method is provided, and relates to the field of speech. The method includes: obtaining, in a speaking process of a user, audio information including a user speech and video information including a user face; coding the audio information to obtain a mixed acoustic feature; extracting a visual semantic feature of the user from the video information; inputting the mixed acoustic feature and the visual semantic feature into a preset visual speech separation network to obtain an acoustic feature of the user; and decoding the acoustic feature of the user to obtain a speech signal of the user. An electronic device, a chip, and a computer-readable storage medium are provided.
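
The code, mask, and decode flow summarized in the abstract can be illustrated with a minimal sketch. It assumes the short-time Fourier transform option that claims 4 and 5 mention for coding and decoding. The function names (`stft`, `istft`, `separate_speech`, `toy_mask`), the frame length, the hop size, and the test tones are all hypothetical choices made for illustration, and the hand-made low-pass mask is only a stand-in for the trained visual speech separation network, which is not reproduced here.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Code a 1-D signal into a complex spectrogram of shape (frames, bins)."""
    window = np.hanning(frame_len + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len=256, hop=128):
    """Decode a spectrogram back to a time-domain signal by weighted overlap-add."""
    window = np.hanning(frame_len + 1)[:-1]
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(frame_len + hop * (spec.shape[0] - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        start = i * hop
        out[start:start + frame_len] += frame * window
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def separate_speech(mixture, predict_mask):
    """Code the mixture, apply a predicted mask, and decode the result."""
    mixed_feature = stft(mixture)                  # mixed acoustic feature
    mask = predict_mask(np.abs(mixed_feature))     # assumption: mask from magnitudes
    return istft(mixed_feature * mask)             # speech signal estimate

def toy_mask(magnitude, sample_rate=16000, frame_len=256, cutoff_hz=1000):
    """Hypothetical stand-in for the network: keep bins below cutoff_hz."""
    bin_hz = sample_rate / frame_len
    keep = (np.arange(magnitude.shape[1]) * bin_hz) < cutoff_hz
    return np.broadcast_to(keep.astype(float), magnitude.shape)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
target = np.sin(2 * np.pi * 440 * t)         # stand-in for the user speech
noise = 0.5 * np.sin(2 * np.pi * 3000 * t)   # stand-in for environmental noise
mixture = target + noise
estimate = separate_speech(mixture, toy_mask)
```

In this toy setup the mask is fixed by hand. In the method described above, the mask would instead be predicted from both the mixed acoustic feature and the visual semantic feature.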

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: obtaining, by an electronic device, audio information and video information, the audio information including a user speech, the video information including a user face of a user, wherein the audio information and the video information correspond to a speaking process of the user; coding, by the electronic device, the audio information to obtain a mixed acoustic feature; extracting, by the electronic device, a visual semantic feature of the user from the video information, the visual semantic feature comprising a feature of a facial motion of the user in the speaking process; inputting, by the electronic device, the mixed acoustic feature and the visual semantic feature into a preset visual speech separation network to obtain an acoustic feature of the user, wherein obtaining the acoustic feature of the user comprises: performing regularization and one-dimensional convolutional layer processing on the mixed acoustic feature to obtain a deep mixed acoustic feature; upsampling the visual semantic feature to obtain a deep visual semantic feature that is time-synchronized with the deep mixed acoustic feature; connecting the deep mixed acoustic feature and the deep visual semantic feature in a channel dimension; performing dimension transformation to obtain a fused visual and auditory feature; predicting a mask value of the user speech based on the fused visual and auditory feature; performing mapping and output processing on the mask value to obtain a mask output; and performing a matrix dot product calculation on the mask output and the mixed acoustic feature to obtain the acoustic feature of the user; and decoding, by the electronic device, the acoustic feature of the user to obtain a speech signal of the user.

2. The method according to claim 1, wherein the audio information is mixed speech information comprising the user speech and an environmental noise; and wherein coding the audio information comprises: constructing a time-domain audio coder based on a convolutional neural network; and performing time-domain coding on the audio information by using the time-domain audio coder.

3. The method according to claim 2, wherein decoding the acoustic feature of the user to obtain the speech signal of the user comprises: constructing a time-domain audio decoder based on the convolutional neural network; and decoding the acoustic feature of the user by using the time-domain audio decoder to obtain a time-domain speech signal of the user.

4. The method according to claim 1, wherein the audio information is mixed speech information comprising the user speech and an environmental noise; and wherein coding the audio information comprises: performing time-domain coding on the audio information by using a preset short-time Fourier transform algorithm.

5. The method according to claim 4, wherein decoding the acoustic feature of the user to obtain the speech signal of the user comprises: decoding the acoustic feature of the user by using a preset inverse short-time Fourier transform algorithm to obtain a time-domain speech signal of the user.

6. The method according to claim 1, wherein extracting the visual semantic feature of the user from the video information comprises: converting the video information into image frames arranged in a frame play sequence; processing each of the image frames to obtain a plurality of face thumbnails that have a preset size and comprise the user face; and inputting the plurality of face thumbnails into a preset decoupling network to extract the visual semantic feature of the user.

7. The method according to claim 6, wherein processing each of the image frames to obtain the plurality of face thumbnails that have the preset size and comprise the user face comprises: locating a corresponding image area comprising the user face in each of the image frames; and zooming in or out the corresponding image area to obtain a corresponding face thumbnail of the plurality of face thumbnails that has the preset size and comprises the user face.

8. The method of claim 6, wherein the preset decoupling network includes a visual coder, a speech coder, a classifier, a binary classification discriminator, and an identity discriminator.

9. The method of claim 8, wherein the weight of the visual coder is frozen to train the identity discriminator, and the weight of the identity discriminator is frozen to train the visual coder.

10. The method of claim 6, wherein the preset decoupling network is trained using N video samples and N audio samples, where N is a positive integer greater than 1.

11. The method of claim 6, wherein training the preset decoupling network includes: obtaining an m-th audio sample from a plurality of audio samples and an n-th video samples from a plurality of video samples, wherein the m-th audio sample does not match the n-th video sample; obtaining, from the m-th audio sample, a speech representation including a sound feature; obtaining, from the n-th video sample, a visual representation including a facial identity feature and a visual semantic feature; and shortening a distance between the speech representation and the visual representation.

12. The method of claim 11, wherein shortening the distance comprises: (a) performing a word-level audio-visual speech recognition task, and recording a loss; (b) performing adversarial training by using a binary classification discriminator to recognize whether an input characterization is a visual characterization or an audio characterization; or (c) minimizing the distance based on a loss.

13. The method according to claim 1, wherein performing the mapping and output processing on the mask value uses a sigmoid function or a Tanh function.

14. An electronic device, comprising: at least one processor; and a memory storing processor-executable instructions; wherein the at least one processor is configured to execute the processor-executable instructions to facilitate the following being performed by the electronic device: obtaining audio information and video information, the audio information including a user speech, the video information including a user face of a user, wherein the audio information and the video information correspond to a speaking process of the user; coding the audio information to obtain a mixed acoustic feature; extracting a visual semantic feature of the user from the video information, the visual semantic feature comprising a feature of a facial motion of the user in the speaking process; inputting the mixed acoustic feature and the visual semantic feature into a preset visual speech separation network to obtain an acoustic feature of the user, wherein obtaining the acoustic feature of the user comprises: performing regularization and one-dimensional convolutional layer processing on the mixed acoustic feature to obtain a deep mixed acoustic feature; upsampling the visual semantic feature to obtain a deep visual semantic feature that is time-synchronized with the deep mixed acoustic feature; connecting the deep mixed acoustic feature and the deep visual semantic feature in a channel dimension; performing dimension transformation to obtain a fused visual and auditory feature; predicting a mask value of the user speech based on the fused visual and auditory feature; performing mapping and output processing on the mask value to obtain a mask output; and performing a matrix dot product calculation on the mask output and the mixed acoustic feature to obtain the acoustic feature of the user; and decoding the acoustic featur
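
The fusion and masking steps recited in claim 1 can be sketched in a few lines of NumPy. This is an illustrative reading under stated assumptions, not the patented network: the regularization is taken to be a per-frame normalization over channels, the one-dimensional convolutions use kernel size 1 (so each reduces to a matrix product over the channel axis), the upsampling is nearest-neighbor repetition, and the "matrix dot product" is interpreted as an element-wise product of the mask output and the mixed acoustic feature. All function names, tensor shapes, and the random weights are hypothetical placeholders for trained parameters.

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Assumed regularization: zero mean, unit variance over channels per frame."""
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + eps)

def pointwise_conv1d(x, weight):
    """1-D convolution with kernel size 1: a matrix product over channels."""
    return weight @ x

def upsample_nearest(x, target_len):
    """Repeat visual frames so the time axis matches the audio feature."""
    idx = np.arange(target_len) * x.shape[1] // target_len
    return x[:, idx]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_visual_mask(mixed, visual, w_audio, w_fuse, w_mask):
    """mixed: (C_a, T_a) mixed acoustic feature; visual: (C_v, T_v) visual feature."""
    deep_audio = pointwise_conv1d(normalize(mixed), w_audio)    # deep mixed acoustic feature
    deep_visual = upsample_nearest(visual, deep_audio.shape[1])  # time-synchronized visual feature
    joined = np.concatenate([deep_audio, deep_visual], axis=0)   # connect in channel dimension
    fused = pointwise_conv1d(joined, w_fuse)                     # dimension transformation
    mask_value = pointwise_conv1d(np.maximum(fused, 0.0), w_mask)  # toy mask predictor
    mask_out = sigmoid(mask_value)                               # mapping and output processing
    return mask_out * mixed, mask_out                            # masked acoustic feature

rng = np.random.default_rng(0)
mixed = rng.standard_normal((8, 100))    # 8 audio channels, 100 audio frames
visual = rng.standard_normal((4, 25))    # 4 visual channels, 25 video frames
w_audio = 0.1 * rng.standard_normal((16, 8))
w_fuse = 0.1 * rng.standard_normal((16, 16 + 4))
w_mask = 0.1 * rng.standard_normal((8, 16))
user_feature, mask_out = apply_visual_mask(mixed, visual, w_audio, w_fuse, w_mask)
```

Claim 13 names a sigmoid or Tanh function for the mapping step; the sigmoid used here bounds the mask to the range 0 to 1, so each masked value is an attenuated copy of the corresponding mixed value.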

Assignees

Huawei Tech Co Ltd, Inst Automation Cas

Inventors

Classifications

  • for synchronising with other signals, e.g. video signals

  • Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

  • for processing of video signals

  • Voice signal separating

  • using neural networks

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12334092B2 cover?
A speech separation method is provided, and relates to the field of speech. The method includes: obtaining, in a speaking process of a user, audio information including a user speech and video information including a user face; coding the audio information to obtain a mixed acoustic feature; extracting a visual semantic feature of the user from the video information; inputting the mixed acousti…
Who is the assignee on this patent?
Huawei Tech Co Ltd, Inst Automation Cas
What technology area does this patent fall under?
Primary CPC classification G10L21/0208. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 17, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).