What technology area does this patent fall under?

Primary CPC classification G10L15/20. Mapped technology areas include Physics.

When was this patent published?

Publication date: May 28, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Mixed speech recognition method and apparatus, and computer-readable storage medium

US11996091B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11996091-B2
Application number	US-202016989844-A
Country	US
Kind code	B2
Filing date	Aug 10, 2020
Priority date	May 24, 2018
Publication date	May 28, 2024
Grant date	May 28, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A mixed speech recognition method, a mixed speech recognition apparatus, and a computer-readable storage medium are provided. The mixed speech recognition method includes: monitoring an input of speech input and detecting an enrollment speech and a mixed speech; acquiring speech features of a target speaker based on the enrollment speech; and determining speech belonging to the target speaker in the mixed speech based on the speech features of the target speaker. The enrollment speech includes preset speech information, and the mixed speech is non-enrollment speech inputted after the enrollment speech.
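The two core steps of the abstract (acquiring the target speaker's features from the enrollment speech, then picking out that speaker's speech in the mixture) can be pictured with a small sketch. It borrows the effective-frame rule and the label-weighted average from claims 2 and 3 on this page. Everything else is an assumption made for illustration: the function names, the toy numbers, the use of one K-dimensional vector per frame, and in particular the sigmoid of an inner product as the "distance" that yields the mask, which the text shown here does not specify.

```python
import math

def supervised_labels(amplitudes, threshold):
    """Label a frame 1.0 if its amplitude exceeds (largest amplitude - threshold), else 0.0."""
    comparison_value = max(amplitudes) - threshold
    return [1.0 if a > comparison_value else 0.0 for a in amplitudes]

def average_vector(frame_vectors, labels):
    """Label-weighted mean of the frame vectors, computed per vector dimension."""
    k = len(frame_vectors[0])
    label_sum = sum(labels)
    return [
        sum(label * vec[d] for vec, label in zip(frame_vectors, labels)) / label_sum
        for d in range(k)
    ]

def frame_masks(mixed_vectors, extractor):
    """One mask value per mixed-speech frame.

    ASSUMPTION: the distance measure is a sigmoid of the inner product
    between the frame vector and the speech extractor.
    """
    masks = []
    for vec in mixed_vectors:
        score = sum(v * e for v, e in zip(vec, extractor))
        masks.append(1.0 / (1.0 + math.exp(-score)))
    return masks

# Toy enrollment speech: 4 frames, K = 2 (hypothetical values).
enroll_amplitudes = [10.0, 55.0, 60.0, 15.0]
enroll_vectors = [[9.0, 9.0], [1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]

labels = supervised_labels(enroll_amplitudes, threshold=40.0)
extractor = average_vector(enroll_vectors, labels)  # used directly as the extractor, as in claim 5

# Toy mixed speech: 3 frames in the same vector space.
mixed_vectors = [[2.0, 1.0], [-2.0, -1.0], [0.0, 0.0]]
masks = frame_masks(mixed_vectors, extractor)
```

In this toy run only the two loud enrollment frames count as effective, so the quiet frames do not pull the average toward noise. Mixed frames whose vectors align with the extractor get a mask near 1, and frames pointing away from it get a mask near 0.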

First claim

Opening claim text (preview).

What is claimed is:

1. A mixed speech recognition method, applied to a computer device, the method comprising: monitoring speech input and detecting an enrollment speech of a target speaker and a mixed speech from the speech input, the enrollment speech comprising preset speech information, and the mixed speech being non-enrollment speech inputted after the enrollment speech; separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space by using a deep neural network of a recognition network to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, K being not less than 1, and the recognition network being trained by: obtaining an estimated speech extractor of each frame of an enrollment speech training sample according to a vector of each frame of the enrollment speech training sample in each vector dimension of the K-dimensional vector space and a supervised labeling value of each frame of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold; obtaining an estimated mask of the target speaker by measuring a distance between a vector of each frame of a mixed speech training sample and the estimated speech extractor in each vector dimension of the K-dimensional vector space; recovering a speech of the target speaker using the estimated mask and the spectrum of the mixed speech training sample; and training the recognition network by minimizing the objective function that describes a spectral error between the recovered speech of the target speaker and a reference speech of the target speaker, the spectral error being a reconstruction error of L2 based on a spectrum of the reference speech of the target speaker and a spectrum of the recovered speech; calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension; determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech, wherein the speech extractor of the target speaker in each vector dimension is a centroid of the estimated speech extractor of each frame of the enrollment speech training sample of the target speaker in each vector dimension obtained during training of the recognition network, and the speech extractor of the target speaker is not re-estimated after the training of the recognition network is complete; and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech.

2. The mixed speech recognition method according to claim 1, wherein the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension comprises: calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension, the effective frame of the enrollment speech being a frame in the enrollment speech with a spectrum amplitude greater than a spectrum amplitude comparison value, and the spectrum amplitude comparison value being equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold.

3. The mixed speech recognition method according to claim 2, wherein the calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension comprises: summing, after the vector of each frame of the enrollment speech in the corresponding vector dimension is multiplied by a supervised labeling value of the corresponding frame, vector dimensions to obtain a total vector of the effective frame of the enrollment speech in the corresponding vector dimension; and separately dividing the total vector of the effective frame of the enrollment speech in each vector dimension by the sum of the supervised labeling values of the frames of the enrollment speech to obtain the average vector of the enrollment speech in each vector dimension; the supervised labeling value of a frame in the enrollment speech being 1 when a spectrum amplitude of the frame is greater than the spectrum amplitude comparison value; and being 0 when the spectrum amplitude of the frame is not greater than the spectrum amplitude comparison value.

4. The mixed speech recognition method according to claim 1, further comprising: after calculating the average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, inputting the average vector of the enrollment speech in each vector dimension and the vector of each frame of the mixed speech in each vector dimension to a pre-trained feedforward neural network to obtain a normalized vector of each frame in each vector dimension; wherein the speech belonging to the target speaker in the mixed speech is determined based on the mask of each frame of the mixed speech.

5. The mixed speech recognition method according to claim 1, wherein the average vector of the enrollment speech in each vector dimension is used as the speech extractor of the target speaker in each vector dimension.

6. The mixed speech recognition method according to claim 1, wherein the mixed speech includes speeches of multiple speakers, and after the separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space, the method further comprises: processing the vector of each frame of the mixed speech in each vector dimension based on a clustering algorithm to determine, for each of the multiple speakers in the mixed speech, a centroid vector corresponding to the speaker in each vector dimension; and using a target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target speaker in the corresponding vector dimension, the target centroid vector being a centroid vector with the smallest distance from the average vector of the enrollment speech in the same vector dimension.

7. The mixed speech recognition method according to claim 1, wherein after the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, the method further comprises: separately comparing a distance between M preset speech extractors and the average vector of the enrollment speech in each vector dimension, M being greater than 1; and using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension in the M preset speech extractors as the speech extractor of the target speaker in the corresponding vector dimension.

8. The mixed speech recognition method according to claim 1, wherein the deep neural network is composed of four layers of bidirectional long short-term memory networks, each layer of the bidirectional long short-term memory network has 600 nodes; and a value of K is 40.

9. The method according to claim 1, wherein obtaining an estimated mask of the target speaker by measuring a
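Claims 6 and 7 share one selection step: among several candidate vectors (cluster centroids of the mixed speech in claim 6, or M preset speech extractors in claim 7), the one closest to the enrollment average becomes the target speaker's speech extractor. The sketch below shows that step only. The function name, the toy candidates, and the choice of Euclidean distance over the whole K-dimensional vector are assumptions. The claims do not name a distance metric, and the clustering that would produce the centroids is not shown.

```python
import math

def nearest_candidate(candidates, enrollment_average):
    """Return (index, vector) of the candidate closest to the enrollment average.

    ASSUMPTION: "smallest distance" is taken as Euclidean distance.
    """
    distances = [math.dist(c, enrollment_average) for c in candidates]
    index = min(range(len(candidates)), key=distances.__getitem__)
    return index, candidates[index]

# Hypothetical candidates: e.g. one centroid per speaker found in the mixture,
# or M preset speech extractors.
candidates = [[0.0, 0.0], [2.0, 1.5], [5.0, 5.0]]
enrollment_average = [2.0, 1.0]

chosen_index, speech_extractor = nearest_candidate(candidates, enrollment_average)
```

Here the second candidate lies 0.5 away from the enrollment average, closer than the other two, so it is selected as the speech extractor.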

Assignees

Tencent Tech Shenzhen Co Ltd

Classifications

G10L15/20 · Primary · Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence)
G10L15/02 · Feature extraction for speech recognition; Selection of recognition unit
G10L15/16 · using artificial neural networks
G10L15/22 · Primary · Procedures used during a speech recognition process, e.g. man-machine dialogue
G10L17/06 · Primary · Decision making techniques; Pattern matching strategies

Patent family

Related publications grouped by family.

Patent family: 64499498

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11996091B2 cover?: A mixed speech recognition method, a mixed speech recognition apparatus, and a computer-readable storage medium are provided. The mixed speech recognition method includes: monitoring an input of speech input and detecting an enrollment speech and a mixed speech; acquiring speech features of a target speaker based on the enrollment speech; and determining speech belonging to the target speaker i…
Who is the assignee on this patent?: Tencent Tech Shenzhen Co Ltd