Adaptive permutation invariant training with auxiliary information for monaural multi-talker speech recognition
US-2019304438-A1 · Oct 3, 2019 · US
US11900917B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11900917-B2 |
| Application number | US-202117230515-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 14, 2021 |
| Priority date | Jan 29, 2019 |
| Publication date | Feb 13, 2024 |
| Grant date | Feb 13, 2024 |
Official abstract text for this publication.
A neural network training method is provided. The method includes obtaining an audio data stream, performing, for different audio data of each time frame in the audio data stream, feature extraction in each layer of a neural network, to obtain a depth feature outputted by a corresponding time frame, fusing, for a given label in labeling data, an inter-class confusion measurement index and an intra-class distance penalty value relative to the given label in a set loss function for the audio data stream through the depth feature, and updating a parameter in the neural network by using a loss function value obtained through fusion.
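The fusion described in the abstract can be illustrated for a single time frame. The sketch below is a minimal example, assuming (as the claims later state) that the inter-class confusion measurement index is a cross-entropy loss and the intra-class distance penalty value is a center loss. The function name `fused_frame_loss`, the argument names, the 0.5 scaling of the squared distance, and the default `weight` are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def fused_frame_loss(depth_feature, logits, label, centers, weight=0.01):
    """Cross-entropy plus weighted center loss for one time frame.

    depth_feature: (D,) feature vector from the network for this frame.
    logits:        (K,) unnormalized class scores for this frame.
    label:         integer class index of the given label.
    centers:       (K, D) array with one center vector per class.
    weight:        weighting factor for the center loss (assumed value).
    """
    # Inter-class term: softmax cross-entropy, computed stably.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cross_entropy = -log_probs[label]

    # Intra-class term: squared distance to the label's center vector.
    diff = depth_feature - centers[label]
    center_loss = 0.5 * np.dot(diff, diff)

    # Weighted sum of the two terms gives the fused loss value.
    return cross_entropy + weight * center_loss, cross_entropy, center_loss
```

Calling `fused_frame_loss(feature, logits, label, centers, weight=0.5)` returns the fused value together with its two components, which makes it easy to see how the weighting factor trades class separation against compactness around each center.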
Opening claim text (preview).
What is claimed is:
1. A neural network training method for implementing audio recognition, applicable to an audio recognition terminal, the method comprising: obtaining an audio data stream for neural network training of audio recognition, the audio data stream including audio data respectively corresponding to a plurality of time frames; performing, for different audio data of each time frame in the audio data stream, feature extraction in each layer of a trained neural network, to obtain a depth feature outputted by a corresponding time frame; fusing, for a given label in labeling data, an inter-class confusion measurement index and an intra-class distance penalty value relative to the given label in a set loss function for the audio data stream through the depth feature; and obtaining, through fusion, a loss function value relative to a series of given labels in the labeling data, to update a parameter in the neural network.
2. The method according to claim 1, wherein the obtaining an audio data stream for neural network training of audio recognition comprises: obtaining a noisy and continuous audio data stream and training data with the neural network as labeling data.
3. The method according to claim 1, wherein the fusing the inter-class confusion measurement index and the intra-class distance penalty value comprises: obtaining, for the given label in the labeling data, a center vector corresponding to a category to which the given label belongs, the center vector being used for describing centers of all depth features in the category; and fusing, according to the depth feature and the center vector, an inter-class confusion measurement index and an intra-class distance penalty value relative to the given label in a set loss function for audio data of the time frame, to obtain a loss function value of the audio data relative to the given label.
4. The method according to claim 3, wherein the fusing the inter-class confusion measurement index and the intra-class distance penalty value comprises: calculating a center loss of the given label by using the depth feature and the center vector, to obtain an intra-class distance penalty value of the audio data of the time frame relative to the given label.
5. The method according to claim 4, wherein the fusing the inter-class confusion measurement index and the intra-class distance penalty value further comprises: calculating, according to the depth feature, an inter-class confusion measurement index of the audio data of the time frame relative to the given label by using a cross-entropy loss function.
6. The method according to claim 4, wherein the fusing the inter-class confusion measurement index and the intra-class distance penalty value further comprises: performing weighting calculation on the intra-class distance penalty value and the inter-class confusion measurement index of the audio data relative to the given label in the set loss function according to a specified weighting factor, to obtain the loss function value of the audio data relative to the given label.
7. The method according to claim 6, wherein audio data of different time frames in the audio data stream is labeled through the given label in the labeling data.
8. The method according to claim 1, wherein a blank label is added to the labeling data, and the fusing the inter-class confusion measurement index and the intra-class distance penalty value comprises: obtaining center vectors corresponding to categories to which the given label in the labeling data and the added blank label belong; and calculating, for a depth feature sequence formed by the audio data stream for the depth feature in a time sequence, a probability that the audio data stream is mapped to a given sequence label and distances of the given sequence label respectively relative to the center vectors, to obtain an intra-class distance penalty value of the audio data stream relative to the given sequence label, the given sequence label comprising the added blank label and the given label.
9. The method according to claim 8, wherein the fusing the inter-class confusion measurement index and the intra-class distance penalty value further comprises: calculating a probability distribution of the audio data stream relative to the given sequence label according to the depth feature, and calculating a log-likelihood cost of the audio data stream through the probability distribution as an inter-class confusion measurement index of the audio data stream relative to the given sequence label.
10. The method according to claim 8, wherein the fusing the inter-class confusion measurement index and the intra-class distance penalty value further comprises: performing weighting calculation on the inter-class confusion measurement index and the intra-class distance penalty value of the audio data stream relative to the given sequence label in the set loss function according to a specified weighting factor, to obtain a loss function value of the audio data stream relative to the given sequence label.
11. The method according to claim 10, wherein the labeling data of the audio data stream is an unaligned discrete label string, a blank label is added to the discrete label string, and the added blank label and the given label in the labeling data respectively correspond to audio data of different time frames in the audio data stream.
12. The method according to claim 1, wherein the obtaining the loss function value comprises: obtaining, through fusion, a loss function value relative to a series of given labels in the labeling data, to perform iterative training of updated parameters in each layer of the neural network, until a minimum loss function value is obtained; and updating a parameter corresponding to the minimum loss function value to each layer of the neural network.
13. A neural network training system for implementing audio recognition, the audio recognition system comprising: a memory storing computer program instructions; and a processor coupled to the memory and configured to execute the computer program instructions to perform: obtaining an audio data stream for neural network training of audio recognition, the audio data stream including audio data respectively corresponding to a plurality of time frames; performing, for different audio data of each time frame in the audio data stream, feature extraction in each layer of a trained neural network, to obtain a depth feature outputted by a corresponding time frame; fusing, for a given label in labeling data, an inter-class confusion measurement index and an intra-class distance penalty value relative to the given label in a set loss function for the audio data stream through the depth feature; and obtaining, through fusion, a loss function value relative to a series of given labels in the labeling data, to update a parameter in the neural network.
14. The neural network training system according to claim 13, wherein the fusing the inter-class confusion measurement index and the intra-class distance penalty value comprises: obtaining, for the given label in the labeling data, a center vector corresponding to a category to which the given label belongs, the center vector being used for describing centers of all depth features in the category; and fusing, according to the depth feature and the center vector, an inter-class confusion measurement index and an intra-class distance penalty value relative to the given label in a set loss function for audio data of the time frame, to obtain a loss function value of the audio data relative to the given label.
15. The
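The sequence-level claims (a blank label added to an unaligned label string, and a log-likelihood cost computed from per-frame probability distributions) resemble connectionist temporal classification (CTC). The claim text shown here does not name CTC, so the sketch below is only an assumption about how such a cost could be computed: it uses the standard CTC forward recursion, interleaves the blank label between the given labels, and covers only the log-likelihood term, not the sequence-level center loss. The names `extend_with_blanks` and `ctc_negative_log_likelihood` are hypothetical, and the code works in plain probabilities, so it is suited to short sequences only.

```python
import numpy as np

def extend_with_blanks(labels, blank=0):
    """Interleave the blank label around every given label (CTC convention)."""
    extended = [blank]
    for label in labels:
        extended += [label, blank]
    return extended

def ctc_negative_log_likelihood(probs, labels, blank=0):
    """Negative log-likelihood of an unaligned label string.

    probs:  (T, K) array of per-frame class probabilities.
    labels: list of integer labels without blanks.
    """
    ext = extend_with_blanks(labels, blank)
    num_frames, num_states = probs.shape[0], len(ext)
    alpha = np.zeros((num_frames, num_states))

    # A valid alignment starts in the leading blank or the first label.
    alpha[0, 0] = probs[0, ext[0]]
    if num_states > 1:
        alpha[0, 1] = probs[0, ext[1]]

    for t in range(1, num_frames):
        for s in range(num_states):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]

    # A valid alignment ends in the last label or the trailing blank.
    likelihood = alpha[-1, -1]
    if num_states > 1:
        likelihood += alpha[-1, -2]
    return -np.log(likelihood)
```

A production implementation would normally run this recursion in the log domain to avoid underflow on long audio streams; the probability-domain form is kept here because it is easier to check by hand.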
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Training · CPC title
Architecture, e.g. interconnection topology · CPC title