Computer-implemented systems and methods for speaker recognition using a neural network
US-10008209-B1 · Jun 26, 2018 · US
US10515627B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10515627-B2 |
| Application number | US-201815980208-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 15, 2018 |
| Priority date | May 19, 2017 |
| Publication date | Dec 24, 2019 |
| Grant date | Dec 24, 2019 |
Official abstract text for this publication.
A method and apparatus of building an acoustic feature extracting model, and an acoustic feature extracting method and apparatus. The method of building an acoustic feature extracting model comprises: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features. The acoustic feature extracting model according to the present disclosure can self-learn optimal acoustic features that achieve a training target. As compared with a conventional acoustic feature extracting manner with a preset feature type and transformation manner, the acoustic feature extracting manner of the present disclosure achieves better flexibility and higher accuracy.
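The training target in the abstract (high similarity for the same user, low similarity for different users) is commonly expressed as a triplet loss over an anchor, a same-user positive, and a different-user negative. The following is a minimal sketch of that idea, not the patent's implementation: the choice of cosine similarity, the `margin` value, and the function names are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss on similarities (margin is an assumed hyperparameter).

    anchor and positive are second acoustic features of the same user;
    negative is a second acoustic feature of a different user.
    The loss is zero once the same-user similarity exceeds the
    different-user similarity by at least `margin`.
    """
    s_ap = cosine_similarity(anchor, positive)  # same-user similarity
    s_an = cosine_similarity(anchor, negative)  # different-user similarity
    return max(0.0, s_an - s_ap + margin)
```

Minimizing this quantity over many triplets pushes same-user features together and different-user features apart, which matches the stated training target under the assumptions above.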
Opening claim text (preview).
What is claimed is:
1. A method of building an acoustic feature extracting model, wherein the method comprises: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; and using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features, wherein using the training data to train a deep neural network to obtain an acoustic feature extracting model comprises: using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data; using the second acoustic features of respective speech data to calculate triplet loss, and using the triplet loss to tune parameters of the deep neural network to minimize the triplet loss; wherein the triplet loss reflects a state of difference between similarity between the second acoustic features of different users and similarity between the second acoustic features of the same user, and wherein the using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data comprises: using a deep neural network to learn first acoustic features of respective speech data and outputting second acoustic features at a frame level; performing pooling and sentence standardization processing for frame-level second acoustic features, and outputting sentence-level second acoustic features; the second acoustic features of respective speech data used upon calculating the triplet loss are sentence-level second acoustic features of respective speech data.
2. The method according to claim 1, wherein the first acoustic features comprise FBank acoustic features.
3. The method according to claim 1, wherein the deep neural network comprises a convolutional neural network CNN, a residual convolutional neural network ResCNN or a Gated Recurrent Unit GRU.
4. The method according to claim 1, wherein the method further comprises: extracting first acoustic features of to-be-processed speech data; inputting the first acoustic features into the acoustic feature extracting model, to obtain second acoustic features of the to-be-processed speech data.
5. The method according to claim 4, wherein the method further comprises: using the second acoustic features of the to-be-processed speech data to register a voiceprint model of a user identifier corresponding to the to-be-processed speech data; or matching the second acoustic features of the to-be-processed speech data with already-registered voiceprint models of user identifiers, to determine the user identifier corresponding to the to-be-processed speech data.
6. A device, comprising: a memory including one or more programs, one or more processors coupled to the memory and executing said one or more programs, to implement the following operation: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; and using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features, wherein using the training data to train a deep neural network to obtain an acoustic feature extracting model comprises: using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data; using the second acoustic features of respective speech data to calculate triplet loss, and using the triplet loss to tune parameters of the deep neural network to minimize the triplet loss; wherein the triplet loss reflects a state of difference between similarity between the second acoustic features of different users and similarity between the second acoustic features of the same user, and wherein the using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data comprises: using a deep neural network to learn first acoustic features of respective speech data and outputting second acoustic features at a frame level; performing pooling and sentence standardization processing for frame-level second acoustic features, and outputting sentence-level second acoustic features; the second acoustic features of respective speech data used upon calculating the triplet loss are sentence-level second acoustic features of respective speech data.
7. The device according to claim 6, wherein the first acoustic features comprise FBank acoustic features.
8. The device according to claim 6, wherein the deep neural network comprises a convolutional neural network CNN, a residual convolutional neural network ResCNN or a Gated Recurrent Unit GRU.
9. The device according to claim 6, wherein the operation further comprises: extracting first acoustic features of to-be-processed speech data; inputting the first acoustic features into the acoustic feature extracting model, to obtain second acoustic features of the to-be-processed speech data.
10. The device according to claim 9, wherein the operation further comprises: using the second acoustic features of the to-be-processed speech data to register a voiceprint model of a user identifier corresponding to the to-be-processed speech data; or matching the second acoustic features of the to-be-processed speech data with already-registered voiceprint models of user identifiers, to determine the user identifier corresponding to the to-be-processed speech data.
11. A non-transitory computer storage medium encoded with a computer program, the computer program, when executed by one or more computers, enabling said one or more computers to implement the following operation: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; and using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features, wherein using the training data to train a deep neural network to obtain an acoustic feature extracting model comprises: using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data; using the second acoustic features of respective speech data to calculate triplet loss, and using the triplet loss to tune parameters of the deep neural network to minimize the triplet loss; wherein the triplet loss reflects a state of difference between similarity between the second acoustic features of different users and similarity between the second acoustic features of the same user, and wherein the using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data comprises: using a deep neural network to learn first acoustic features of respective speech data and outputting second acoustic features at a frame level; performing pooling and sentence standardization processing for frame-level second acoustic feature
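The claims describe turning frame-level second acoustic features into a sentence-level feature through pooling and sentence standardization, then registering or matching voiceprints with the result. The sketch below illustrates one plausible reading and is not taken from the patent text: it assumes mean pooling over frames, L2 length normalization as the standardization step, a dot-product score on the normalized vectors, and a hypothetical acceptance `threshold`. All function names are illustrative.

```python
import math


def mean_pool(frames):
    """Average a non-empty list of frame-level vectors into one vector (assumed pooling)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length (assumed form of sentence standardization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sentence_embedding(frames):
    """Frame-level features -> one sentence-level feature."""
    return l2_normalize(mean_pool(frames))


def identify(embedding, voiceprints, threshold=0.7):
    """Match a sentence-level feature against registered voiceprints.

    voiceprints maps user identifier -> unit-length registered feature.
    Returns the best-scoring user identifier, or None when the best
    score falls below the (hypothetical) threshold.
    """
    best_user, best_score = None, -1.0
    for user_id, enrolled in voiceprints.items():
        # Dot product equals cosine similarity for unit-length vectors.
        score = sum(a * b for a, b in zip(embedding, enrolled))
        if score > best_score:
            best_user, best_score = user_id, score
    return best_user if best_score >= threshold else None


# Registration: store one sentence-level feature per user identifier.
voiceprints = {
    "alice": sentence_embedding([[3.0, 0.0], [3.0, 8.0]]),
    "bob": sentence_embedding([[1.0, 0.0]]),
}
```

In this sketch, registration stores one normalized vector per user identifier, and matching is a nearest-neighbor lookup with a rejection threshold. A production system would typically average several enrollment utterances and tune the threshold on held-out data.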
Training · CPC title
Decision making techniques; Pattern matching strategies · CPC title
using artificial neural networks · CPC title
Feature extraction for speech recognition; Selection of recognition unit · CPC title
Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title