Semi-supervised speaker diarization
US-10133538-B2 · Nov 20, 2018 · US
US11031017B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11031017-B2 |
| Application number | US-201916242541-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 8, 2019 |
| Priority date | Jan 8, 2019 |
| Publication date | Jun 8, 2021 |
| Grant date | Jun 8, 2021 |
Official abstract text for this publication.
A method includes receiving an utterance of speech and segmenting the utterance of speech into a plurality of segments. For each segment of the utterance of speech, the method also includes extracting a speaker-discriminative embedding from the segment and predicting a probability distribution over possible speakers for the segment using a probabilistic generative model configured to receive the extracted speaker-discriminative embedding as a feature input. The probabilistic generative model is trained on a corpus of training speech utterances, each segmented into a plurality of training segments. Each training segment includes a corresponding speaker-discriminative embedding and a corresponding speaker label. The method also includes assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
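The per-segment flow in the abstract (segment, embed, predict a distribution, assign a label) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: `toy_embedding` is a hypothetical stand-in for a trained speaker-discriminative embedding extractor, and `speaker_distribution` replaces the trained probabilistic generative model with a plain softmax over distances to hand-supplied centroids. It shows the control flow only, not the patented model.

```python
import math


def segment_fixed(samples, segment_len):
    """Split a sample sequence into consecutive fixed-length segments.

    The final segment may be shorter when the length is not a multiple.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]


def toy_embedding(segment):
    """Hypothetical stand-in for an embedding: (mean, mean absolute value)."""
    n = len(segment)
    mean = sum(segment) / n
    mean_abs = sum(abs(x) for x in segment) / n
    return (mean, mean_abs)


def speaker_distribution(embedding, centroids):
    """Toy scoring (an assumption): softmax over negative squared distance."""
    scores = {
        spk: -sum((a - b) ** 2 for a, b in zip(embedding, centroid))
        for spk, centroid in centroids.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {spk: math.exp(s - top) for spk, s in scores.items()}
    total = sum(exps.values())
    return {spk: e / total for spk, e in exps.items()}


def diarize(samples, segment_len, centroids):
    """Assign the most probable speaker label to each segment."""
    labels = []
    for segment in segment_fixed(samples, segment_len):
        dist = speaker_distribution(toy_embedding(segment), centroids)
        labels.append(max(dist, key=dist.get))
    return labels
```

For example, `diarize([0.1] * 4 + [0.9] * 4, 4, {"A": (0.1, 0.1), "B": (0.9, 0.9)})` labels the first segment `"A"` and the second `"B"`, because each segment's toy embedding sits closest to the matching centroid.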
Opening claim text (preview).
What is claimed is:
1. A method comprising: receiving, at data processing hardware, an utterance of speech; segmenting, by the data processing hardware, the utterance of speech into a plurality of segments; for each segment of the utterance of speech: extracting, by the data processing hardware, a speaker-discriminative embedding from the segment; and predicting, by the data processing hardware, a probability distribution over possible speakers for the segment using a model configured to receive the extracted speaker-discriminative embedding as a feature input, the model trained on a corpus of training speech utterances, each training speech utterance segmented into a plurality of training segments, each training segment comprising a corresponding speaker-discriminative embedding and a corresponding speaker label; and assigning, by the data processing hardware, a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
2. The method of claim 1, wherein the model predicts the probability distribution over possible speakers for each segment by applying a distance-dependent Chinese restaurant process.
3. The method of claim 1, wherein predicting the probability distribution over possible speakers for the segment comprises, when the segment occurs after an initial segment of the plurality of segments: predicting a probability that a current speaker associated with the speaker label assigned to a previous adjacent segment will not change for the segment; for each existing speaker associated with a corresponding speaker label previously assigned to one or more previous segments, predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker; and predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to a new speaker.
4. The method of claim 3, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker is proportional to a number of instances the corresponding speaker label associated with the existing speaker was previously assigned.
5. The method of claim 3, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the new speaker is proportional to a speaker assignment probability parameter α.
6. The method of claim 1, wherein the model is further configured to, for each segment occurring after an initial segment of the plurality of segments, receive the speaker-discriminative embedding extracted from a previous adjacent segment and the speaker label assigned to the previous adjacent segment as feature inputs for predicting a probability that a speaker will not change for the corresponding segment.
7. The method of claim 1, wherein assigning the speaker label to each segment of the utterance of speech comprises assigning the speaker label to each segment of the utterance of speech by executing a greedy search on the probability distribution over possible speaker for the corresponding segment.
8. The method of claim 1, wherein extracting the speaker-discriminative embedding from the segment comprises extracting a d-vector from the segment.
9. The method of claim 1, wherein extracting the speaker-discriminative embedding from the segment comprises extracting an i-vector from the segment.
10. The method of claim 1, wherein the model comprises a recurrent neural network (RNN).
11. The method of claim 10, wherein the RNN comprises: a hidden layer with N gated recurrent unit (GRU) cells, each GRU cell configured to apply hyperbolic tangent (tanh) activation; and two fully-connected layers each having N nodes and configured to apply a rectified linear unit (ReLU) activation of the hidden layer.
12. The method of claim 1, further comprising: transcribing, by the data processing hardware, the utterance of speech into corresponding text; and annotating, by the data processing hardware, the text based on the speaker label assigned to each segment of the utterance of speech.
13. The method of claim 1, wherein segmenting the utterance of speech into a plurality of segments comprises segmenting the utterance of speech into a plurality of fixed-length segments.
14. The method of claim 1, wherein segmenting the utterance of speech into a plurality of segments comprises segmenting the utterance of speech into a plurality of variable-length segments.
15. A system comprising: data processing hardware; memory hardware in communication with the data processing hardware and storing instructions, that when executed by the data processing hardware, cause the data processing hardware to perform operations comprising: receiving an utterance of speech; segmenting the utterance of speech into a plurality of segments; for each segment of the utterance of speech: extracting a speaker-discriminative embedding from the segment; and predicting a probability distribution over possible speakers for the segment using a model configured to receive the extracted speaker-discriminative embedding as a feature input, the model trained on a corpus of training speech utterances, each training speech utterance segmented into a plurality of training segments, each training segment comprising a corresponding speaker-discriminative embedding and a corresponding speaker label; and assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
16. The system of claim 15, wherein the model predicts the probability distribution over possible speakers for each segment by applying a distance-dependent Chinese restaurant process.
17. The system of claim 15, wherein predicting the probability distribution over possible speakers for the segment comprises, when the segment occurs after an initial segment of the plurality of segments: predicting a probability that a current speaker associated with the speaker label assigned to a previous adjacent segment will not change for the segment; for each existing speaker associated with a corresponding speaker label previously assigned to one or more previous segments, predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker; and predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to a new speaker.
18. The system of claim 17, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker is proportional to a number of instances the corresponding speaker label associated with the existing speaker was previously assigned.
19. The system of claim 17, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the new speaker is proportional to a speaker assignment probability parameter α.
20. The system of claim 15, wherein the model is further configured to, for each segment occurring after an initial segment of the plurality of segments, receive the speaker-discriminative embedding ext
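The speaker-change probabilities described in claims 3 to 5, together with the greedy assignment of claim 7, can be sketched in Python as follows. This is a hedged illustration, not the claimed model: the fixed `p_change` parameter is hypothetical (the claims shown here do not say how the no-change probability is obtained; claim 6 suggests the model predicts it), speaker labels are assumed to be integers, and the counts are simple tallies of prior segment assignments. The sketch covers only the label prior and ignores the embeddings.

```python
from collections import Counter


def speaker_prior(previous_labels, alpha=1.0, p_change=0.2):
    """Prior over the next segment's speaker, given the labels assigned so far.

    - The current speaker keeps speaking with probability 1 - p_change.
    - A switch to an existing speaker is proportional to how often that
      speaker's label was previously assigned.
    - A switch to a new speaker is proportional to alpha.
    """
    if not previous_labels:
        raise ValueError("need at least one previously labelled segment")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    current = previous_labels[-1]
    counts = Counter(previous_labels)
    # Unnormalised switch weights: existing speakers other than the current one.
    weights = {spk: n for spk, n in counts.items() if spk != current}
    new_label = max(counts) + 1  # next unused integer label (assumption)
    weights[new_label] = alpha
    total = sum(weights.values())
    prior = {current: 1.0 - p_change}
    for spk, w in weights.items():
        prior[spk] = p_change * w / total
    return prior


def greedy_label(distribution):
    """Greedy choice: pick the single most probable speaker for the segment."""
    return max(distribution, key=distribution.get)
```

With `previous_labels = [0, 0, 1, 0, 2, 2]`, `alpha = 1.0`, and `p_change = 0.2`, the current speaker is 2, the switch weights are 3 for speaker 0, 1 for speaker 1, and 1 for a new speaker 3, so the prior is roughly 0.8 for speaker 2, 0.12 for speaker 0, and 0.04 each for speakers 1 and 3. A full system would combine such a prior with a likelihood computed from the segment's embedding before the greedy choice.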
using neural networks · CPC title
Detection of discrete points within a voice signal · CPC title
Artificial neural networks; Connectionist approaches · CPC title
Training, enrolment or model building · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title