Acoustic model training method, speech recognition method, apparatus, device and medium
US-2021125603-A1 · Apr 29, 2021 · US
US11508381B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11508381-B2 |
| Application number | US-202017085609-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 30, 2020 |
| Priority date | Oct 10, 2018 |
| Publication date | Nov 22, 2022 |
| Grant date | Nov 22, 2022 |
Official abstract text for this publication.
Embodiments of this application disclose a voiceprint recognition method performed by a computer. After obtaining a to-be-recognized target voice message, the computer obtains target feature information of the target voice message by using a voice recognition model, the voice recognition model being obtained through training according to a first loss function and a second loss function. Next, the computer determines a voiceprint recognition result according to the target feature information and registration feature information, the registration feature information being obtained from a voice message of a to-be-recognized object using the voiceprint recognition model. The normalized exponential function and the centralization function are used for jointly optimizing the voice recognition model, and can reduce an intra-class variation between depth features from the same speaker. The two functions are used for simultaneously supervising and learning the voice recognition model, and enable the depth feature to have better discrimination, thereby improving recognition performance.
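As an illustration of the joint supervision described in the abstract, the following is a minimal sketch of how a normalized exponential (softmax) loss and a centralization (center) loss might be combined into one training objective. The squared-distance form of the center loss, the weighting factor `lam`, and all function and parameter names are assumptions made for illustration; they are not taken from the patent text.

```python
import math


def softmax_loss(features, labels, weights, biases):
    """Sum over samples of the negative log softmax probability of the true speaker.

    weights[j] is the weight vector for speaker class j, biases[j] its bias.
    """
    total = 0.0
    for x, y in zip(features, labels):
        logits = [sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        peak = max(logits)  # subtract the max logit for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[y]
    return total


def center_loss(features, labels, centers):
    """Half the summed squared distance from each feature to its speaker's center.

    Assumed form: pulls deep features of the same speaker toward a shared mean.
    """
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((x_k - c_k) ** 2 for x_k, c_k in zip(x, centers[y]))
    return total


def joint_loss(features, labels, weights, biases, centers, lam=0.01):
    """Softmax loss plus lam times center loss (lam is a hypothetical weight)."""
    return (softmax_loss(features, labels, weights, biases)
            + lam * center_loss(features, labels, centers))
```

In this sketch the softmax term separates features of different speakers, while the center term shrinks the spread of features belonging to the same speaker; `lam` trades one effect off against the other.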
Opening claim text (preview).
What is claimed is:

1. A voiceprint recognition method, comprising: obtaining a target voice message; obtaining text-independent target feature information of the target voice message by using a voiceprint recognition model, the voiceprint recognition model obtained through training according to a first loss function and a second loss function, the first loss function being a normalized exponential function that discriminates between deep features associated with different objects, and the second loss function being a centralization function that reduces variations in the deep features associated with the same object, and the voiceprint recognition model is obtained through training a convolutional neural network (CNN) by: obtaining a voice message set comprising voice messages corresponding to multiple training objects; capturing, from the voice messages, voice segments; inputting the captured voice segments to the CNN to obtain a deep feature of a sentence level for each of the voice messages; training the CNN as the voiceprint recognition model by joint supervision of the deep features of the voice messages corresponding to the training objects with the first loss function discriminating the deep features corresponding to different training objects in the voice messages and the second loss function reducing variations in the deep features of the same training object; and determining a voiceprint recognition result by comparing the target feature information and registration feature information, the registration feature information obtained from a voice message of an object using the voiceprint recognition model.

2. The method according to claim 1, wherein determining the voiceprint recognition result comprises: calculating a cosine similarity according to the target feature information and the registration feature information; determining that the target voice message is a voice message of the object in accordance with a determination that the cosine similarity reaches a first similarity threshold; and determining that the target voice message is not a voice message of the object in accordance with a determination that the cosine similarity does not reach the first similarity threshold.

3. The method according to claim 1, wherein determining the voiceprint recognition result comprises: calculating a log-likelihood ratio between the target feature information and the registration feature information using a PLDA classifier; determining that the target voice message is a voice message of the object in accordance with a determination that the log-likelihood ratio reaches a second similarity threshold; and determining that the target voice message is not a voice message of the object in accordance with a determination that the log-likelihood ratio does not reach the second similarity threshold.

4. The method according to claim 1, wherein training the CNN further comprises: determining, for each of the voice messages, a deep feature corresponding to the voice message using the CNN; obtaining a fully connected layer weight matrix according to the voice messages; and determining the first loss function according to the deep feature of each of the voice messages and the fully connected layer weight matrix.

5. The method according to claim 4, wherein determining the first loss function according to the deep feature of each of the voice messages and the fully connected layer weight matrix comprises: determining the first loss function according to:

$$L_s = -\sum_{i=1}^{M} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{N} e^{W_j^{T} x_i + b_j}},$$

wherein $L_s$ represents the first loss function, $x_i$ represents the $i$-th deep feature from the $y_i$-th object, $W_j$ represents the $j$-th column in the fully connected layer weight matrix, $b_j$ represents a bias of the $j$-th class, each class corresponding to an object, $M$ represents a group size of a training set corresponding to the voice message set, and $N$ represents a quantity of objects corresponding to the voice message set.

6. The method according to claim 1, wherein training the CNN further comprises: determining, for each of the voice messages, a deep feature corresponding to the voice message using the CNN; calculating a deep feature gradient according to the deep feature of each of the voice messages; calculating a second voice mean according to the deep feature gradient and a first voice mean; and determining the second loss function according to the deep feature of each of the voice messages and the second voice mean.

7. The method according to claim 6, wherein calculating the deep feature gradient according to the deep feature of each of the voice messages comprises: calculating the deep feature gradient according to: Δ μ j = ∑ i =
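The cosine-similarity decision recited in claim 2 can be sketched as follows. This is a minimal illustration only: the function names and the default threshold of 0.7 are hypothetical, and "reaches" the threshold is interpreted here as greater than or equal to it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_same_speaker(target_feature, registration_feature, threshold=0.7):
    """Accept the target voice as the registered speaker if the score reaches the threshold.

    The threshold value is a hypothetical placeholder, not a value from the patent.
    """
    return cosine_similarity(target_feature, registration_feature) >= threshold
```

For example, `is_same_speaker([1.0, 2.0], [2.0, 4.0])` accepts because the two vectors point in the same direction, while orthogonal vectors such as `[1.0, 0.0]` and `[0.0, 1.0]` score 0 and are rejected at any positive threshold.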