Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus

US10515627B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10515627-B2
Application number  US-201815980208-A
Country             US
Kind code           B2
Filing date         May 15, 2018
Priority date       May 19, 2017
Publication date    Dec 24, 2019
Grant date          Dec 24, 2019

Abstract

Official abstract text for this publication.

A method and apparatus of building an acoustic feature extracting model, and an acoustic feature extracting method and apparatus. The method of building an acoustic feature extracting model comprises: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features. The acoustic feature extracting model according to the present disclosure can self-learn optimal acoustic features that achieve a training target. As compared with a conventional acoustic feature extracting manner with a preset feature type and transformation manner, the acoustic feature extracting manner of the present disclosure achieves better flexibility and higher accuracy.
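
The training target described in the abstract (high similarity for the same user, low similarity for different users) can be illustrated with a small triplet loss sketch. This is not the patent's own formula, which this page does not show: the use of cosine similarity, the hinge form, the `margin` value, and the function names are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss on cosine similarity (illustrative assumption).

    anchor and positive are features of the same user; negative is a feature
    of a different user. The loss reaches zero once the same-user similarity
    exceeds the different-user similarity by at least `margin`.
    """
    s_same = cosine_similarity(anchor, positive)
    s_diff = cosine_similarity(anchor, negative)
    return max(0.0, s_diff - s_same + margin)

# Toy 2-D features: the first triplet is already well separated,
# the second has the same-user and different-user features swapped.
a = np.array([1.0, 0.0])
same = np.array([1.0, 0.0])
other = np.array([0.0, 1.0])
print(triplet_loss(a, same, other))
print(triplet_loss(a, other, same))
```

Minimizing a loss of this shape over many triplets pushes same-user features together and different-user features apart, which is the behavior the abstract asks of the trained network.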

First claim

Opening claim text (preview).

What is claimed is:

1. A method of building an acoustic feature extracting model, wherein the method comprises: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; and using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features, wherein using the training data to train a deep neural network to obtain an acoustic feature extracting model comprises: using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data; using the second acoustic features of respective speech data to calculate triplet loss, and using the triplet loss to tune parameters of the deep neural network to minimize the triplet loss; wherein the triplet loss reflects a state of difference between similarity between the second acoustic features of different users and similarity between the second acoustic features of the same user, and wherein the using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data comprises: using a deep neural network to learn first acoustic features of respective speech data and outputting second acoustic features at a frame level; performing pooling and sentence standardization processing for frame-level second acoustic features, and outputting sentence-level second acoustic features; the second acoustic features of respective speech data used upon calculating the triplet loss are sentence-level second acoustic features of respective speech data.

2. The method according to claim 1, wherein the first acoustic features comprise FBank acoustic features.

3. The method according to claim 1, wherein the deep neural network comprises a convolutional neural network CNN, a residual convolutional neural network ResCNN or a Gated Recurrent Unit GRU.

4. The method according to claim 1, wherein the method further comprises: extracting first acoustic features of to-be-processed speech data; inputting the first acoustic features into the acoustic feature extracting model, to obtain second acoustic features of the to-be-processed speech data.

5. The method according to claim 4, wherein the method further comprises: using the second acoustic features of the to-be-processed speech data to register a voiceprint model of a user identifier corresponding to the to-be-processed speech data; or matching the second acoustic features of the to-be-processed speech data with already-registered voiceprint models of user identifiers, to determine the user identifier corresponding to the to-be-processed speech data.

6. A device, comprising: a memory including one or more programs, one or more processors coupled to the memory and executing said one or more programs, to implement the following operation: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; and using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features, wherein using the training data to train a deep neural network to obtain an acoustic feature extracting model comprises: using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data; using the second acoustic features of respective speech data to calculate triplet loss, and using the triplet loss to tune parameters of the deep neural network to minimize the triplet loss; wherein the triplet loss reflects a state of difference between similarity between the second acoustic features of different users and similarity between the second acoustic features of the same user, and wherein the using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data comprises: using a deep neural network to learn first acoustic features of respective speech data and outputting second acoustic features at a frame level; performing pooling and sentence standardization processing for frame-level second acoustic features, and outputting sentence-level second acoustic features; the second acoustic features of respective speech data used upon calculating the triplet loss are sentence-level second acoustic features of respective speech data.

7. The device according to claim 6, wherein the first acoustic features comprise FBank acoustic features.

8. The device according to claim 6, wherein the deep neural network comprises a convolutional neural network CNN, a residual convolutional neural network ResCNN or a Gated Recurrent Unit GRU.

9. The device according to claim 6, wherein the operation further comprises: extracting first acoustic features of to-be-processed speech data; inputting the first acoustic features into the acoustic feature extracting model, to obtain second acoustic features of the to-be-processed speech data.

10. The device according to claim 9, wherein the operation further comprises: using the second acoustic features of the to-be-processed speech data to register a voiceprint model of a user identifier corresponding to the to-be-processed speech data; or matching the second acoustic features of the to-be-processed speech data with already-registered voiceprint models of user identifiers, to determine the user identifier corresponding to the to-be-processed speech data.

11. A non-transitory computer storage medium encoded with a computer program, the computer program, when executed by one or more computers, enabling said one or more computers to implement the following operation: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; and using the training data to train a deep neural network to obtain an acoustic feature extracting model; wherein a target of training the deep neural network is to maximize similarity between the same user's second acoustic features and minimize similarity between different users' second acoustic features, wherein using the training data to train a deep neural network to obtain an acoustic feature extracting model comprises: using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data; using the second acoustic features of respective speech data to calculate triplet loss, and using the triplet loss to tune parameters of the deep neural network to minimize the triplet loss; wherein the triplet loss reflects a state of difference between similarity between the second acoustic features of different users and similarity between the second acoustic features of the same user, and wherein the using a deep neural network to learn first acoustic features of respective speech data, and outputting second acoustic features of respective speech data comprises: using a deep neural network to learn first acoustic features of respective speech data and outputting second acoustic features at a frame level; performing pooling and sentence standardization processing for frame-level second acoustic feature …
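
The frame-to-sentence step in claim 1 and the registration and matching steps in claim 5 can be sketched roughly as follows. The claims do not say which pooling or standardization is used, so mean pooling over frames and scaling to unit length are assumptions here, as are the dictionary of registered voiceprints and the names `sentence_embedding` and `identify`.

```python
import numpy as np

def sentence_embedding(frame_features):
    """Turn frame-level features (frames x dims) into one sentence-level vector.

    Assumptions: pooling is a mean over frames, and "sentence standardization"
    is scaling to unit length. The pooled vector is assumed to be non-zero.
    """
    pooled = np.mean(frame_features, axis=0)
    return pooled / np.linalg.norm(pooled)

def identify(embedding, registered):
    """Return (user_id, score) for the most similar registered voiceprint.

    `registered` maps hypothetical user identifiers to unit-length vectors,
    so the dot product equals cosine similarity.
    """
    scores = {uid: float(np.dot(embedding, vp)) for uid, vp in registered.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Registration: store one sentence-level vector per user identifier.
registered = {
    "user_a": sentence_embedding(np.array([[1.0, 0.0], [3.0, 0.0]])),
    "user_b": sentence_embedding(np.array([[0.0, 2.0]])),
}

# Matching: compare features of new speech data against the stored voiceprints.
query = sentence_embedding(np.array([[2.0, 0.1], [2.0, -0.1]]))
print(identify(query, registered))
```

In practice the frame-level features would come from the trained network rather than hand-written arrays, and a real system would likely apply a decision threshold to the best score before accepting a match.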

Assignees

  • Baidu Online Network Technology (Beijing) Co., Ltd.

Classifications

  • Training · CPC title

  • Decision making techniques; Pattern matching strategies · CPC title

  • G10L15/16 (Primary)

    using artificial neural networks · CPC title

  • G10L15/02 (Primary)

    Feature extraction for speech recognition; Selection of recognition unit · CPC title

  • Procedures used during a speech recognition process, e.g. man-machine dialogue · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10515627B2 cover?
A method and apparatus of building an acoustic feature extracting model, and an acoustic feature extracting method and apparatus. The method of building an acoustic feature extracting model comprises: considering first acoustic features extracted respectively from speech data corresponding to user identifiers as training data; using the training data to train a deep neural network to obtain an …

Who is the assignee on this patent?
Baidu Online Network Technology (Beijing) Co., Ltd.

What technology area does this patent fall under?
Primary CPC classification G10L15/16. Mapped technology areas include Physics.

When was this patent published?
Publication date Dec 24, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).