What technology area does this patent fall under?

Primary CPC classification G10L17/18. Mapped technology areas include Physics.

When was this patent published?

Publication date Jun 8, 2021 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Fully supervised speaker diarization

US11031017B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11031017-B2
Application number	US-201916242541-A
Country	US
Kind code	B2
Filing date	Jan 8, 2019
Priority date	Jan 8, 2019
Publication date	Jun 8, 2021
Grant date	Jun 8, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving an utterance of speech and segmenting the utterance of speech into a plurality of segments. For each segment of the utterance of speech, the method also includes extracting a speaker-discriminative embedding from the segment and predicting a probability distribution over possible speakers for the segment using a probabilistic generative model configured to receive the extracted speaker-discriminative embedding as a feature input. The probabilistic generative model trained on a corpus of training speech utterances each segmented into a plurality of training segments. Each training segment including a corresponding speaker-discriminative embedding and a corresponding speaker label. The method also includes assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
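The flow in the abstract (segment, embed, predict a distribution, assign a label) can be sketched as below. This is only an illustrative outline, not the patented implementation: `split_fixed`, `toy_embed`, `toy_predict`, and `diarize` are hypothetical names, the one-number "embedding" stands in for a real speaker-discriminative embedding, and the two fixed centroids stand in for a trained probabilistic generative model. Unlike the model described in the abstract, the toy predictor uses no training corpus and no information from earlier segments.

```python
import math
from typing import Callable, Dict, List, Sequence


def split_fixed(samples: Sequence[float], seg_len: int) -> List[Sequence[float]]:
    """Split an utterance into fixed-length segments (the last one may be shorter)."""
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


def toy_embed(segment: Sequence[float]) -> float:
    """Hypothetical stand-in for an embedding extractor: the segment mean."""
    return sum(segment) / len(segment)


def toy_predict(embedding: float) -> Dict[int, float]:
    """Hypothetical stand-in for the trained model.

    Returns a probability distribution over two assumed speakers whose
    'centroids' are 0.0 and 1.0; closer centroids get higher probability.
    """
    centroids = {0: 0.0, 1: 1.0}
    weights = {spk: math.exp(-abs(embedding - c)) for spk, c in centroids.items()}
    total = sum(weights.values())
    return {spk: w / total for spk, w in weights.items()}


def diarize(
    samples: Sequence[float],
    seg_len: int,
    embed: Callable[[Sequence[float]], float] = toy_embed,
    predict: Callable[[float], Dict[int, float]] = toy_predict,
) -> List[int]:
    """Assign one speaker label per segment from the predicted distribution."""
    labels: List[int] = []
    for segment in split_fixed(samples, seg_len):
        dist = predict(embed(segment))
        labels.append(max(dist, key=dist.get))  # pick the most probable speaker
    return labels


# Toy "audio": speaker 0, then speaker 1, then speaker 0 again.
audio = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
labels = diarize(audio, seg_len=4)
```

In a real system the `embed` and `predict` callables would be replaced by a trained embedding extractor and a trained model; the loop structure is the only part this sketch is meant to illustrate.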

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: receiving, at data processing hardware, an utterance of speech; segmenting, by the data processing hardware, the utterance of speech into a plurality of segments; for each segment of the utterance of speech: extracting, by the data processing hardware, a speaker-discriminative embedding from the segment; and predicting, by the data processing hardware, a probability distribution over possible speakers for the segment using a model configured to receive the extracted speaker-discriminative embedding as a feature input, the model trained on a corpus of training speech utterances, each training speech utterance segmented into a plurality of training segments, each training segment comprising a corresponding speaker-discriminative embedding and a corresponding speaker label; and assigning, by the data processing hardware, a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
2. The method of claim 1, wherein the model predicts the probability distribution over possible speakers for each segment by applying a distance-dependent Chinese restaurant process.
3. The method of claim 1, wherein predicting the probability distribution over possible speakers for the segment comprises, when the segment occurs after an initial segment of the plurality of segments: predicting a probability that a current speaker associated with the speaker label assigned to a previous adjacent segment will not change for the segment; for each existing speaker associated with a corresponding speaker label previously assigned to one or more previous segments, predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker; and predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to a new speaker.
4. The method of claim 3, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker is proportional to a number of instances the corresponding speaker label associated with the existing speaker was previously assigned.
5. The method of claim 3, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the new speaker is proportional to a speaker assignment probability parameter α.
6. The method of claim 1, wherein the model is further configured to, for each segment occurring after an initial segment of the plurality of segments, receive the speaker-discriminative embedding extracted from a previous adjacent segment and the speaker label assigned to the previous adjacent segment as feature inputs for predicting a probability that a speaker will not change for the corresponding segment.
7. The method of claim 1, wherein assigning the speaker label to each segment of the utterance of speech comprises assigning the speaker label to each segment of the utterance of speech by executing a greedy search on the probability distribution over possible speaker for the corresponding segment.
8. The method of claim 1, wherein extracting the speaker-discriminative embedding from the segment comprises extracting a d-vector from the segment.
9. The method of claim 1, wherein extracting the speaker-discriminative embedding from the segment comprises extracting an i-vector from the segment.
10. The method of claim 1, wherein the model comprises a recurrent neural network (RNN).
11. The method of claim 10, wherein the RNN comprises: a hidden layer with N gated recurrent unit (GRU) cells, each GRU cell configured to apply hyperbolic tangent (tanh) activation; and two fully-connected layers each having N nodes and configured to apply a rectified linear unit (ReLU) activation of the hidden layer.
12. The method of claim 1, further comprising: transcribing, by the data processing hardware, the utterance of speech into corresponding text; and annotating, by the data processing hardware, the text based on the speaker label assigned to each segment of the utterance of speech.
13. The method of claim 1, wherein segmenting the utterance of speech into a plurality of segments comprises segmenting the utterance of speech into a plurality of fixed-length segments.
14. The method of claim 1, wherein segmenting the utterance of speech into a plurality of segments comprises segmenting the utterance of speech into a plurality of variable-length segments.
15. A system comprising: data processing hardware; memory hardware in communication with the data processing hardware and storing instructions, that when executed by the data processing hardware, cause the data processing hardware to perform operations comprising: receiving an utterance of speech; segmenting the utterance of speech into a plurality of segments; for each segment of the utterance of speech: extracting a speaker-discriminative embedding from the segment; and predicting a probability distribution over possible speakers for the segment using a model configured to receive the extracted speaker-discriminative embedding as a feature input, the model trained on a corpus of training speech utterances, each training speech utterance segmented into a plurality of training segments, each training segment comprising a corresponding speaker-discriminative embedding and a corresponding speaker label; and assigning a speaker label to each segment of the utterance of speech based on the probability distribution over possible speakers for the corresponding segment.
16. The system of claim 15, wherein the model predicts the probability distribution over possible speakers for each segment by applying a distance-dependent Chinese restaurant process.
17. The system of claim 15, wherein predicting the probability distribution over possible speakers for the segment comprises, when the segment occurs after an initial segment of the plurality of segments: predicting a probability that a current speaker associated with the speaker label assigned to a previous adjacent segment will not change for the segment; for each existing speaker associated with a corresponding speaker label previously assigned to one or more previous segments, predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker; and predicting a probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to a new speaker.
18. The system of claim 17, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the existing speaker is proportional to a number of instances the corresponding speaker label associated with the existing speaker was previously assigned.
19. The system of claim 17, wherein the probability that the current speaker associated with the speaker label assigned to the previous adjacent segment will change to the new speaker is proportional to a speaker assignment probability parameter α.
20. The system of claim 15, wherein the model is further configured to, for each segment occurring after an initial segment of the plurality of segments, receive the speaker-discriminative embedding ext
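The speaker-change probabilities in claims 3 to 5 (and 17 to 19) can be illustrated with a small sketch. It rests on several assumptions that are not in the claim text: the probability of no change is passed in as a fixed number `p_stay` (the claims describe it as a model prediction), the current speaker is excluded from the "change to an existing speaker" candidates, α is strictly positive, and the remaining probability mass is split in proportion to prior label counts and α. The names `next_speaker_distribution` and `NEW` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List

NEW = "new"  # hypothetical marker for "a speaker not seen so far"


def next_speaker_distribution(
    history: List[int], p_stay: float, alpha: float
) -> Dict[Hashable, float]:
    """Distribution over who speaks in the next segment.

    history: speaker labels already assigned to previous segments.
    p_stay:  assumed probability that the speaker does not change.
    alpha:   speaker assignment probability parameter (assumed > 0).
    """
    if not history:
        # Initial segment: no previous speaker, so it must be a new one.
        return {NEW: 1.0}

    current = history[-1]
    counts = Counter(history)
    # Candidates for a change: every existing speaker except the current one.
    others = {spk: n for spk, n in counts.items() if spk != current}
    total = sum(others.values()) + alpha

    dist: Dict[Hashable, float] = {current: p_stay}
    for spk, n in others.items():
        # Change to an existing speaker: proportional to prior assignments.
        dist[spk] = (1.0 - p_stay) * n / total
    # Change to a new speaker: proportional to alpha.
    dist[NEW] = (1.0 - p_stay) * alpha / total
    return dist


history = [0, 0, 1, 0, 2]
dist = next_speaker_distribution(history, p_stay=0.5, alpha=1.0)
# Greedy choice in the sense of claim 7: take the most probable outcome.
best = max(dist, key=dist.get)
```

With this history, speaker 0 has three prior segments and speaker 1 has one, so a change to speaker 0 is three times as likely as a change to speaker 1 or to a new speaker. A trained model would also weight each outcome by how well the segment's embedding fits that speaker, which this sketch leaves out.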

Assignees

Google LLC

Inventors

Classifications

G10L25/30
using neural networks · CPC title
G10L25/87
Detection of discrete points within a voice signal · CPC title
G10L17/18 · Primary
Artificial neural networks; Connectionist approaches · CPC title
G10L17/04 · Primary
Training, enrolment or model building · CPC title
G10L17/02
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title

Patent family

Related publications grouped by family.

View patent family 68841182

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11031017B2 cover?: A method includes receiving an utterance of speech and segmenting the utterance of speech into a plurality of segments. For each segment of the utterance of speech, the method also includes extracting a speaker-discriminative embedding from the segment and predicting a probability distribution over possible speakers for the segment using a probabilistic generative model configured to receive th…
Who is the assignee on this patent?: Google LLC