Fully supervised speaker diarization

US11031017B2 · US · B2

Patent metadata
Publication number: US-11031017-B2
Application number: US-201916242541-A
Country: US
Kind code: B2
Filing date: Jan 8, 2019
Priority date: Jan 8, 2019
Publication date: Jun 8, 2021
Grant date: Jun 8, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving an utterance of speech and segmenting the utterance of speech into a plurality of segments. For each segment of the utterance of speech, the method also includes extracting a speaker-discriminative embedding from the segment and predicting a probability distribution over possible speakers for the segment using a probabilistic generative model configured to receive the extracted speaker-discriminative embedding as a feature input. The probabilistic generative model trained on a corpus of training speech utterances each segmented into a plurality of training segments. Each training segment including a corresponding speaker-discriminative embedding and a corresponding speaker label. The method also includes assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
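The steps in the abstract (segment, embed, predict a distribution, assign a label) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the patented model: `segment_fixed`, `diarize`, `toy_embed`, and `toy_predict` are hypothetical names, and the two toy functions only stand in for a trained embedding extractor and a trained model.

```python
from typing import Callable, Dict, List, Sequence

Embedding = List[float]


def segment_fixed(samples: Sequence[float], seg_len: int) -> List[Sequence[float]]:
    """Split samples into consecutive fixed-length segments (the last may be shorter)."""
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


def diarize(
    samples: Sequence[float],
    seg_len: int,
    embed: Callable[[Sequence[float]], Embedding],
    predict: Callable[[Embedding, List[int]], Dict[int, float]],
) -> List[int]:
    """Assign one speaker label per segment from a per-segment distribution."""
    labels: List[int] = []
    for segment in segment_fixed(samples, seg_len):
        embedding = embed(segment)              # speaker-discriminative embedding
        dist = predict(embedding, labels)       # speaker -> probability
        labels.append(max(dist, key=dist.get))  # take the most probable speaker
    return labels


# Toy stand-ins (assumptions for illustration only).
def toy_embed(segment: Sequence[float]) -> Embedding:
    """One-dimensional 'embedding': the mean sample value of the segment."""
    return [sum(segment) / len(segment)]


def toy_predict(embedding: Embedding, previous_labels: List[int]) -> Dict[int, float]:
    """Two-speaker toy model: positive mean favors speaker 0, otherwise speaker 1."""
    p0 = 0.9 if embedding[0] > 0 else 0.1
    return {0: p0, 1: 1.0 - p0}


labels = diarize([1, 1, 1, 1, -1, -1, -1, -1, 1, 1], 2, toy_embed, toy_predict)
```

In a real system the embedding would come from a trained extractor (the claims mention d-vectors and i-vectors) and the distribution from the trained model, which may also condition on the labels assigned so far.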

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, at data processing hardware, an utterance of speech; segmenting, by the data processing hardware, the utterance of speech into a plurality of segments; for each segment of the utterance of speech: extracting, by the data processing hardware, a speaker-discriminative embedding from the segment; and predicting, by the data processing hardware, a probability distribution over possible speakers for the segment using a model configured to receive the extracted speaker-discriminative embedding as a feature input, the model trained on a corpus of training speech utterances, each training speech utterance segmented into a plurality of training segments, each training segment comprising a corresponding speaker-discriminative embedding and a corresponding speaker label; and assigning, by the data processing hardware, a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.

2. The method of claim 1, wherein the model predicts the probability distribution over possible speakers for each segment by applying a distance-dependent Chinese restaurant process.

3. The method of claim 1, wherein predicting the probability distribution over possible speakers for the segment comprises, when the segment occurs after an initial segment of the plurality of segments: predicting a probability that a current speaker associated with the speaker label assigned to a previous adjacent segment will not change for the segment; for each existing speaker associated with a corresponding speaker label previously assigned to one or more previous segments, predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker; and predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to a new speaker.

4. The method of claim 3, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker is proportional to a number of instances the corresponding speaker label associated with the existing speaker was previously assigned.

5. The method of claim 3, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the new speaker is proportional to a speaker assignment probability parameter α.

6. The method of claim 1, wherein the model is further configured to, for each segment occurring after an initial segment of the plurality of segments, receive the speaker-discriminative embedding extracted from a previous adjacent segment and the speaker label assigned to the previous adjacent segment as feature inputs for predicting a probability that a speaker will not change for the corresponding segment.

7. The method of claim 1, wherein assigning the speaker label to each segment of the utterance of speech comprises assigning the speaker label to each segment of the utterance of speech by executing a greedy search on the probability distribution over possible speaker for the corresponding segment.

8. The method of claim 1, wherein extracting the speaker-discriminative embedding from the segment comprises extracting a d-vector from the segment.

9. The method of claim 1, wherein extracting the speaker-discriminative embedding from the segment comprises extracting an i-vector from the segment.

10. The method of claim 1, wherein the model comprises a recurrent neural network (RNN).

11. The method of claim 10, wherein the RNN comprises: a hidden layer with N gated recurrent unit (GRU) cells, each GRU cell configured to apply hyperbolic tangent (tanh) activation; and two fully-connected layers each having N nodes and configured to apply a rectified linear unit (ReLU) activation of the hidden layer.

12. The method of claim 1, further comprising: transcribing, by the data processing hardware, the utterance of speech into corresponding text; and annotating, by the data processing hardware, the text based on the speaker label assigned to each segment of the utterance of speech.

13. The method of claim 1, wherein segmenting the utterance of speech into a plurality of segments comprises segmenting the utterance of speech into a plurality of fixed-length segments.

14. The method of claim 1, wherein segmenting the utterance of speech into a plurality of segments comprises segmenting the utterance of speech into a plurality of variable-length segments.

15. A system comprising: data processing hardware; memory hardware in communication with the data processing hardware and storing instructions, that when executed by the data processing hardware, cause the data processing hardware to perform operations comprising: receiving an utterance of speech; segmenting the utterance of speech into a plurality of segments; for each segment of the utterance of speech: extracting a speaker-discriminative embedding from the segment; and predicting a probability distribution over possible speakers for the segment using a model configured to receive the extracted speaker-discriminative embedding as a feature input, the model trained on a corpus of training speech utterances, each training speech utterance segmented into a plurality of training segments, each training segment comprising a corresponding speaker-discriminative embedding and a corresponding speaker label; and assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.

16. The system of claim 15, wherein the model predicts the probability distribution over possible speakers for each segment by applying a distance-dependent Chinese restaurant process.

17. The system of claim 15, wherein predicting the probability distribution over possible speakers for the segment comprises, when the segment occurs after an initial segment of the plurality of segments: predicting a probability that a current speaker associated with the speaker label assigned to a previous adjacent segment will not change for the segment; for each existing speaker associated with a corresponding speaker label previously assigned to one or more previous segments, predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker; and predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to a new speaker.

18. The system of claim 17, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker is proportional to a number of instances the corresponding speaker label associated with the existing speaker was previously assigned.

19. The system of claim 17, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the new speaker is proportional to a speaker assignment probability parameter α.

20. The system of claim 15, wherein the model is further configured to, for each segment occurring after an initial segment of the plurality of segments, receive the speaker-discriminative embedding ext
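
The speaker-change probabilities described in claims 3 to 5 (and mirrored in claims 17 to 19), together with the greedy assignment of claim 7, can be illustrated with a small sketch. Several points here are assumptions rather than anything the claims state: the names `speaker_change_prior`, `greedy_label`, `p_stay`, and `NEW` are hypothetical, the probability of staying with the current speaker is passed in as a fixed number instead of being predicted by a model, the current speaker is excluded from the "change to an existing speaker" options, and the change probabilities are normalized so the whole distribution sums to one.

```python
from collections import Counter
from typing import Dict, Hashable, List

NEW = "new"  # hypothetical key standing for a previously unseen speaker


def speaker_change_prior(
    history: List[int], p_stay: float, alpha: float
) -> Dict[Hashable, float]:
    """Distribution over the next speaker given the labels assigned so far.

    - current speaker stays: p_stay
    - change to an existing speaker: proportional to how often its label was assigned
    - change to a new speaker: proportional to alpha
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not history:
        return {NEW: 1.0}  # the initial segment always opens a new speaker

    current = history[-1]
    counts = Counter(history)
    others = {spk: n for spk, n in counts.items() if spk != current}
    norm = sum(others.values()) + alpha  # normalizer over all 'change' options

    prior: Dict[Hashable, float] = {current: p_stay}
    for spk, n in others.items():
        prior[spk] = (1.0 - p_stay) * n / norm
    prior[NEW] = (1.0 - p_stay) * alpha / norm
    return prior


def greedy_label(dist: Dict[Hashable, float], history: List[int]) -> int:
    """Greedy choice: take the most probable option; NEW maps to the next unused id."""
    best = max(dist, key=dist.get)
    if best == NEW:
        return max(history, default=-1) + 1
    return best


history = [0, 0, 1, 0, 0, 1, 1]
prior = speaker_change_prior(history, p_stay=0.8, alpha=1.0)
next_label = greedy_label(prior, history)
```

With this history, speaker 1 is current, speaker 0 has four earlier assignments, and `alpha` is 1, so the change mass of 0.2 is split 4:1 between speaker 0 and a new speaker. A fuller implementation would combine this prior with how well each speaker explains the segment's embedding before taking the greedy choice.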

Assignees

Google LLC

Inventors

Classifications

  • using neural networks · CPC title

  • Detection of discrete points within a voice signal · CPC title

  • G10L17/18 (Primary)

    Artificial neural networks; Connectionist approaches · CPC title

  • G10L17/04 (Primary)

    Training, enrolment or model building · CPC title

  • Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11031017B2 cover?
A method includes receiving an utterance of speech and segmenting the utterance of speech into a plurality of segments. For each segment of the utterance of speech, the method also includes extracting a speaker-discriminative embedding from the segment and predicting a probability distribution over possible speakers for the segment using a probabilistic generative model configured to receive th…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L17/18. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 8, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).