US11996091B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11996091-B2 |
| Application number | US-202016989844-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 10, 2020 |
| Priority date | May 24, 2018 |
| Publication date | May 28, 2024 |
| Grant date | May 28, 2024 |
Official abstract text for this publication.
A mixed speech recognition method, a mixed speech recognition apparatus, and a computer-readable storage medium are provided. The mixed speech recognition method includes: monitoring an input of speech input and detecting an enrollment speech and a mixed speech; acquiring speech features of a target speaker based on the enrollment speech; and determining speech belonging to the target speaker in the mixed speech based on the speech features of the target speaker. The enrollment speech includes preset speech information, and the mixed speech is non-enrollment speech inputted after the enrollment speech.
Opening claim text (preview).
What is claimed is:

1. A mixed speech recognition method, applied to a computer device, the method comprising: monitoring speech input and detecting an enrollment speech of a target speaker and a mixed speech from the speech input, the enrollment speech comprising preset speech information, and the mixed speech being non-enrollment speech inputted after the enrollment speech; separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space by using a deep neural network of a recognition network to obtain a vector of each frame of the enrollment speech in each vector dimension and a vector of each frame of the mixed speech in each vector dimension, K being not less than 1, and the recognition network being trained by: obtaining an estimated speech extractor of each frame of an enrollment speech training sample according to a vector of each frame of the enrollment speech training sample in each vector dimension of the K-dimensional vector space and a supervised labeling value of each frame of the enrollment speech training sample, the supervised labeling value being set by separately comparing a spectrum amplitude of each frame of the enrollment speech sample with a difference between a largest spectrum amplitude of the enrollment speech sample and a spectrum threshold; obtaining an estimated mask of the target speaker by measuring a distance between a vector of each frame of a mixed speech training sample and the estimated speech extractor in each vector dimension of the K-dimensional vector space; recovering a speech of the target speaker using the estimated mask and the spectrum of the mixed speech training sample; and training the recognition network by minimizing the objective function that describes a spectral error between the recovered speech of the target speaker and a reference speech of the target speaker, the spectral error being a reconstruction error of L2 based on a spectrum of the reference speech of the target speaker and a spectrum of the recovered speech; calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension; determining a speech extractor of the target speaker in each vector dimension, and separately measuring a distance between the vector of each frame of the mixed speech in each vector dimension and the speech extractor of the corresponding vector dimension to obtain a mask of each frame in the mixed speech, wherein the speech extractor of the target speaker in each vector dimension is a centroid of the estimated speech extractor of each frame of the enrollment speech training sample of the target speaker in each vector dimension obtained during training of the recognition network, and the speech extractor of the target speaker is not re-estimated after the training of the recognition network is complete; and determining speech belonging to the target speaker in the mixed speech based on the mask of each frame of the mixed speech.

2. The mixed speech recognition method according to claim 1, wherein the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension comprises: calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension, the effective frame of the enrollment speech being a frame in the enrollment speech with a spectrum amplitude greater than a spectrum amplitude comparison value, and the spectrum amplitude comparison value being equal to a difference between the largest spectrum amplitude of the enrollment speech and a preset spectrum threshold.

3. The mixed speech recognition method according to claim 2, wherein the calculating the average vector of the enrollment speech in each vector dimension based on the vector of an effective frame of the enrollment speech in each vector dimension comprises: summing, after the vector of each frame of the enrollment speech in the corresponding vector dimension is multiplied by a supervised labeling value of the corresponding frame, vector dimensions to obtain a total vector of the effective frame of the enrollment speech in the corresponding vector dimension; and separately dividing the total vector of the effective frame of the enrollment speech in each vector dimension by the sum of the supervised labeling values of the frames of the enrollment speech to obtain the average vector of the enrollment speech in each vector dimension; the supervised labeling value of a frame in the enrollment speech being 1 when a spectrum amplitude of the frame is greater than the spectrum amplitude comparison value; and being 0 when the spectrum amplitude of the frame is not greater than the spectrum amplitude comparison value.

4. The mixed speech recognition method according to claim 1, further comprising: after calculating the average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, inputting the average vector of the enrollment speech in each vector dimension and the vector of each frame of the mixed speech in each vector dimension to a pre-trained feedforward neural network to obtain a normalized vector of each frame in each vector dimension; wherein the speech belonging to the target speaker in the mixed speech is determined based on the mask of each frame of the mixed speech.

5. The mixed speech recognition method according to claim 1, wherein the average vector of the enrollment speech in each vector dimension is used as the speech extractor of the target speaker in each vector dimension.

6. The mixed speech recognition method according to claim 1, wherein the mixed speech includes speeches of multiple speakers, and after the separately mapping a spectrum of the enrollment speech and a spectrum of the mixed speech into a K-dimensional vector space, the method further comprises: processing the vector of each frame of the mixed speech in each vector dimension based on a clustering algorithm to determine, for each of the multiple speakers in the mixed speech, a centroid vector corresponding to the speaker in each vector dimension; and using a target centroid vector of the mixed speech in each vector dimension as the speech extractor of the target speaker in the corresponding vector dimension, the target centroid vector being a centroid vector with the smallest distance from the average vector of the enrollment speech in the same vector dimension.

7. The mixed speech recognition method according to claim 1, wherein after the calculating an average vector of the enrollment speech in each vector dimension based on the vector of each frame of the enrollment speech in each vector dimension, the method further comprises: separately comparing a distance between M preset speech extractors and the average vector of the enrollment speech in each vector dimension, M being greater than 1; and using a speech extractor with the smallest distance from the average vector of the enrollment speech in a vector dimension in the M preset speech extractors as the speech extractor of the target speaker in the corresponding vector dimension.

8. The mixed speech recognition method according to claim 1, wherein the deep neural network is composed of four layers of bidirectional long short-term memory networks, each layer of the bidirectional long short-term memory network has 600 nodes; and a value of K is 40.

9. The method according to claim 1, wherein obtaining an estimated mask of the target speaker by measuring a
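The enrollment-side arithmetic recited in claims 1 to 3, 6 and 7 can be illustrated with a minimal sketch. It is not the patented implementation and rests on several assumptions. The K-dimensional frame vectors are taken as given, since the deep neural network that produces them is not modeled. The function names are hypothetical. The threshold is assumed to be positive, so that at least one frame is effective. The claims only say a distance is "measured" to obtain the mask, so the sigmoid of an inner product used here is one possible choice. Finally, the nearest-extractor selection uses Euclidean distance over the whole vector as a simplification of the per-dimension comparison in the claims.

```python
import math


def supervised_labels(amplitudes, threshold):
    """Label a frame 1 if its spectrum amplitude exceeds
    (largest amplitude - threshold), otherwise 0 (claims 2 and 3)."""
    if threshold <= 0:
        raise ValueError("threshold is assumed positive in this sketch")
    comparison = max(amplitudes) - threshold
    return [1 if a > comparison else 0 for a in amplitudes]


def average_vector(vectors, labels):
    """Label-weighted sum of the frame vectors in each dimension,
    divided by the sum of the labels (claim 3)."""
    total = sum(labels)
    k = len(vectors[0])
    return [
        sum(y * v[d] for v, y in zip(vectors, labels)) / total
        for d in range(k)
    ]


def nearest_extractor(candidates, avg):
    """Pick the candidate closest to the enrollment average vector.
    Candidates may be M preset extractors (claim 7) or cluster
    centroids of the mixed speech (claim 6).
    ASSUMPTION: Euclidean distance over the whole vector."""
    return min(candidates, key=lambda c: math.dist(c, avg))


def frame_masks(mixed_vectors, extractor):
    """One mask value per mixed-speech frame.
    ASSUMPTION: the 'distance' is realized as sigmoid(inner product)."""
    masks = []
    for frame in mixed_vectors:
        dot = sum(a * v for a, v in zip(extractor, frame))
        masks.append(1.0 / (1.0 + math.exp(-dot)))
    return masks


if __name__ == "__main__":
    # Toy data with K = 2: three enrollment frames, two mixed frames.
    labels = supervised_labels([10.0, 50.0, 48.0], threshold=5.0)
    avg = average_vector([[9.0, 9.0], [1.0, 0.0], [3.0, 2.0]], labels)
    print(labels, avg)
    print(frame_masks([[2.0, 1.0], [-2.0, -1.0]], avg))
```

In this toy run the low-amplitude first frame is excluded from the average, and a mixed frame whose vector points the same way as the extractor receives a mask near 1 while an opposing one receives a mask near 0. Multiplying each mask by the corresponding mixed-speech spectrum value would then give the recovered target-speaker spectrum described in claim 1.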
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
using artificial neural networks · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title
Decision making techniques; Pattern matching strategies · CPC title