Mixed speech recognition method and apparatus, and computer-readable storage medium

US11996091B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11996091-B2
Application number: US-202016989844-A
Country: US
Kind code: B2
Filing date: Aug 10, 2020
Priority date: May 24, 2018
Publication date: May 28, 2024
Grant date: May 28, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A mixed speech recognition method, a mixed speech recognition apparatus, and a computer-readable storage medium are provided. The mixed speech recognition method includes: monitoring an input of speech input and detecting an enrollment speech and a mixed speech; acquiring speech features of a target speaker based on the enrollment speech; and determining speech belonging to the target speaker in the mixed speech based on the speech features of the target speaker. The enrollment speech includes preset speech information, and the mixed speech is non-enrollment speech inputted after the enrollment speech.
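The three steps in the abstract (enroll, derive the target speaker's features, pick out that speaker's speech in the mixture) can be sketched in a few lines. This is a minimal illustration, not the patented implementation. It assumes that each frame has already been mapped to a small embedding vector by some network, that the plain average of the enrollment vectors serves as the speaker's "speech extractor" (as claim 5 describes), and that the distance measure is a sigmoid of the inner product. The abstract and claim preview do not specify the distance function, so that choice, the function names, and the 0.5 cut-off are all hypothetical.

```python
import math

def mean_vector(frames):
    """Average the per-frame embedding vectors dimension by dimension."""
    k = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(k)]

def frame_masks(mixed_frames, extractor):
    """One mask value in (0, 1) per mixed-speech frame.

    Assumption: similarity is a sigmoid of the inner product between the
    frame vector and the extractor; the source text only says "distance".
    """
    masks = []
    for vec in mixed_frames:
        score = sum(a * b for a, b in zip(vec, extractor))
        masks.append(1.0 / (1.0 + math.exp(-score)))
    return masks

def target_frames(mixed_frames, extractor, keep_above=0.5):
    """Indices of frames attributed to the target speaker.

    keep_above is a hypothetical cut-off chosen for this illustration.
    """
    masks = frame_masks(mixed_frames, extractor)
    return [i for i, m in enumerate(masks) if m > keep_above]

# Toy 2-dimensional embeddings (made-up numbers, not real speech features).
enrollment = [[1.0, 0.0], [3.0, 2.0]]
mixture = [[1.0, 1.0], [-2.0, -1.0], [0.5, 2.0]]

extractor = mean_vector(enrollment)       # [2.0, 1.0]
print(target_frames(mixture, extractor))  # prints [0, 2]
```

In this toy run the second mixed frame points away from the enrollment average, so its mask is close to 0 and it is treated as belonging to another speaker.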

First claim

Opening claim text (preview).

What is claimed is:

1. A mixed speech recognition method, applied to a computer device, the method comprising:
    monitoring speech input and detecting an enrollment speech of a target speaker and a mixed speech from the speech input, the enrollment speech comprising preset speech information, and the mixed speech being non-enrollment speech inputted after the enrollment speech;
    separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space by using a deep neural network of a recognition network to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, K being not less than 1, and the recognition network being trained by:
        obtaining an estimated speech extractor of each frame of an enrollment speech training sample according to a vector of each frame of the enrollment speech training sample in each vector dimension of the K-dimensional vector space and a supervised labeling value of each frame of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold;
        obtaining an estimated mask of the target speaker by measuring a distance between a vector of each frame of a mixed speech training sample and the estimated speech extractor in each vector dimension of the K-dimensional vector space;
        recovering a speech of the target speaker using the estimated mask and the spectrum of the mixed speech training sample; and
        training the recognition network by minimizing the objective function that describes a spectral error between the recovered speech of the target speaker and a reference speech of the target speaker, the spectral error being a reconstruction error of L2 based on a spectrum of the reference speech of the target speaker and a spectrum of the recovered speech;
    calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension;
    determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech, wherein the speech extractor of the target speaker in each vector dimension is a centroid of the estimated speech extractor of each frame of the enrollment speech training sample of the target speaker in each vector dimension obtained during training of the recognition network, and the speech extractor of the target speaker is not re-estimated after the training of the recognition network is complete; and
    determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech.

2. The mixed speech recognition method according to claim 1, wherein the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension comprises: calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension, the effective frame of the enrollment speech being a frame in the enrollment speech with a spectrum amplitude greater than a spectrum amplitude comparison value, and the spectrum amplitude comparison value being equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold.

3. The mixed speech recognition method according to claim 2, wherein the calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension comprises: summing, after the vector of each frame of the enrollment speech in the corresponding vector dimension is multiplied by a supervised labeling value of the corresponding frame, vector dimensions to obtain a total vector of the effective frame of the enrollment speech in the corresponding vector dimension; and separately dividing the total vector of the effective frame of the enrollment speech in each vector dimension by the sum of the supervised labeling values of the frames of the enrollment speech to obtain the average vector of the enrollment speech in each vector dimension; the supervised labeling value of a frame in the enrollment speech being 1 when a spectrum amplitude of the frame is greater than the spectrum amplitude comparison value; and being 0 when the spectrum amplitude of the frame is not greater than the spectrum amplitude comparison value.

4. The mixed speech recognition method according to claim 1, further comprising: after calculating the average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, inputting the average vector of the enrollment speech in each vector dimension and the vector of each frame of the mixed speech in each vector dimension to a pre-trained feedforward neural network to obtain a normalized vector of each frame in each vector dimension; wherein the speech belonging to the target speaker in the mixed speech is determined based on the mask of each frame of the mixed speech.

5. The mixed speech recognition method according to claim 1, wherein the average vector of the enrollment speech in each vector dimension is used as the speech extractor of the target speaker in each vector dimension.

6. The mixed speech recognition method according to claim 1, wherein the mixed speech includes speeches of multiple speakers, and after the separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space, the method further comprises: processing the vector of each frame of the mixed speech in each vector dimension based on a clustering algorithm to determine, for each of the multiple speakers in the mixed speech, a centroid vector corresponding to the speaker in each vector dimension; and using a target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target speaker in the corresponding vector dimension, the target centroid vector being a centroid vector with the smallest distance from the average vector of the enrollment speech in the same vector dimension.

7. The mixed speech recognition method according to claim 1, wherein after the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, the method further comprises: separately comparing a distance between M preset speech extractors and the average vector of the enrollment speech in each vector dimension, M being greater than 1; and using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension in the M preset speech extractors as the speech extractor of the target speaker in the corresponding vector dimension.

8. The mixed speech recognition method according to claim 1, wherein the deep neural network is composed of four layers of bidirectional long short-term memory networks, each layer of the bidirectional long short-term memory network has 600 nodes; and a value of K is 40.

9. The method according to claim 1, wherein obtaining an estimated mask of the target speaker by measuring a
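Claims 2, 3 and 7 describe concrete arithmetic, and a short sketch may make it easier to follow. The code below is an illustration under stated assumptions, not the patented implementation. It labels a frame 1 when its spectrum amplitude exceeds the largest amplitude minus a threshold (claim 3), averages only those "effective" frames by weighting each vector with its label (claims 2 and 3), and then picks the closest of M preset extractors (claim 7). The function names and toy numbers are hypothetical, and whole-vector Euclidean distance is assumed because the claim text does not name a distance measure.

```python
import math

def supervised_labels(amplitudes, spectrum_threshold):
    """Label 1 for frames louder than (largest amplitude - threshold), else 0."""
    comparison_value = max(amplitudes) - spectrum_threshold
    return [1 if a > comparison_value else 0 for a in amplitudes]

def effective_average(frame_vectors, labels):
    """Label-weighted average: sum(label * vector) / sum(labels), per dimension."""
    total = sum(labels)
    if total == 0:
        raise ValueError("no effective frames; check the threshold")
    k = len(frame_vectors[0])
    return [
        sum(y * v[d] for v, y in zip(frame_vectors, labels)) / total
        for d in range(k)
    ]

def nearest_extractor(preset_extractors, average):
    """Pick the preset extractor closest to the enrollment average.

    Assumption: Euclidean distance over the whole vector.
    """
    return min(preset_extractors, key=lambda e: math.dist(e, average))

# Toy data: four enrollment frames, 2-dimensional embeddings (made-up values).
amplitudes = [40.0, 10.0, 38.0, 5.0]
vectors = [[1.0, 0.0], [9.0, 9.0], [3.0, 2.0], [-9.0, -9.0]]

labels = supervised_labels(amplitudes, spectrum_threshold=5.0)
average = effective_average(vectors, labels)
presets = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]]

print(labels)                               # prints [1, 0, 1, 0]
print(average)                              # prints [2.0, 1.0]
print(nearest_extractor(presets, average))  # prints [2.0, 2.0]
```

The two quiet frames receive label 0, so their vectors do not influence the average; this is the practical effect of restricting the calculation to effective frames.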

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • G10L15/20 · Primary

    Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title

  • Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • using artificial neural networks · CPC title

  • G10L15/22 · Primary

    Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

  • G10L17/06 · Primary

    Decision making techniques; Pattern matching strategies · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11996091B2 cover?
A mixed speech recognition method, a mixed speech recognition apparatus, and a computer-readable storage medium are provided. The mixed speech recognition method includes: monitoring an input of speech input and detecting an enrollment speech and a mixed speech; acquiring speech features of a target speaker based on the enrollment speech; and determining speech belonging to the target speaker i…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G10L15/20. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 28, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).