Fully Supervised Speaker Diarization
US-2020219517-A1 · Jul 9, 2020 · US
US11017783B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11017783-B2 |
| Application number | US-201916296733-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 8, 2019 |
| Priority date | Mar 8, 2019 |
| Publication date | May 25, 2021 |
| Grant date | May 25, 2021 |
Official abstract text for this publication.
A device includes a processor configured to determine a feature vector based on an utterance and to determine a first embedding vector by processing the feature vector using a trained embedding network. The processor is configured to determine a first distance metric based on distances between the first embedding vector and each embedding vector of a speaker template. The processor is configured to determine, based on the first distance metric, that the utterance is verified to be from a particular user. The processor is configured to, based on a comparison of a first particular distance metric associated with the first embedding vector to a second distance metric associated with a first test embedding vector of the speaker template, generate an updated speaker template by adding the first embedding vector as a second test embedding vector and removing the first test embedding vector from test embedding vectors of the speaker template.
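The verify-then-update flow in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the Euclidean distance, the use of a mean as the "distance metric", the "lower is better" threshold direction, and the names `SpeakerTemplate` and `verify_and_update` are all assumptions introduced here. The sketch also assumes each test vector's metric is its mean distance to the enrollment vectors, and that the test vector with the highest such metric is the one considered for replacement.

```python
import math
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance (assumed; the source does not fix a distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance(v: Vector, refs: List[Vector]) -> float:
    """Assumed form of a 'distance metric': mean distance from v to refs."""
    return sum(euclidean(v, r) for r in refs) / len(refs)


@dataclass
class SpeakerTemplate:
    """Hypothetical container: enrollment vectors plus later test vectors."""
    enrollment: List[Vector]
    test: List[Vector] = field(default_factory=list)


def verify_and_update(template: SpeakerTemplate, embedding: Vector,
                      verify_threshold: float) -> bool:
    """Verify an utterance embedding and, if verified, maybe update the template."""
    # First distance metric: against every embedding vector in the template.
    first_metric = mean_distance(embedding, template.enrollment + template.test)
    if first_metric > verify_threshold:  # assumed: lower metric means closer match
        return False
    if not template.test:
        return True  # nothing to replace in this sketch

    # Metric of the new embedding against the enrollment vectors only.
    candidate_metric = mean_distance(embedding, template.enrollment)

    # Find the stored test vector with the highest (worst) metric.
    worst_index = max(
        range(len(template.test)),
        key=lambda i: mean_distance(template.test[i], template.enrollment),
    )
    worst_metric = mean_distance(template.test[worst_index], template.enrollment)

    # Replace the worst test vector when the new embedding is closer.
    if candidate_metric < worst_metric:
        template.test.pop(worst_index)
        template.test.append(embedding)
    return True
```

A caller would construct a `SpeakerTemplate` from enrollment embeddings, then pass each new utterance embedding to `verify_and_update`. The function returns whether the utterance was verified and mutates the template in place only when the new embedding is closer to the enrollment vectors than the worst stored test vector.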
Opening claim text (preview).
What is claimed is:

1. A device comprising: a memory configured to store: a trained embedding network; and a speaker template associated with a first user, the speaker template including one or more enrollment embedding vectors based on initial user enrollment data and including one or more test embedding vectors; and a processor coupled to the memory, the processor configured to: determine a first feature vector based on a first utterance; determine a first embedding vector based on the first feature vector by processing the first feature vector using the trained embedding network; determine a first distance metric based on distances between the first embedding vector and each embedding vector of the speaker template; perform a speaker verification operation to determine, based on the first distance metric, whether the first utterance is verified to be from the first user; based on determining that the first utterance is verified to be from the first user, perform a comparison of a first particular distance metric associated with the first embedding vector to a second distance metric associated with a first test embedding vector of the speaker template; based on the comparison, generate an updated speaker template by adding the first embedding vector as a second test embedding vector and removing the first test embedding vector from the test embedding vectors of the speaker template; generate a set of triplets based on training embedding vectors associated with a second user, a particular triplet including a first training embedding vector associated with a first training utterance of the first user, a second training embedding vector associated with a second training utterance of the first user, and a third training embedding vector associated with a third training utterance of the second user; determine distance metrics corresponding to the set of triplets, a first distance metric of the particular triplet based on a difference between a first distance and a second distance, wherein the first distance is between the first training embedding vector and the second training embedding vector, and wherein the second distance is between the first training embedding vector and the third training embedding vector; select a first subset of the set of triplets based on the distance metrics, the particular triplet selected in the first subset based on determining that the first distance metric satisfies a tolerance threshold; and generate the trained embedding network by training the embedding network using the first subset of the set of triplets prior to training the embedding network using one or more remaining subsets of the set of triplets.

2. The device of claim 1, wherein the processor is configured to determine that the first utterance is verified to be from the first user based on determining that the first distance metric satisfies a speaker verification threshold.

3. The device of claim 1, wherein the processor is further configured to determine the first particular distance metric based on second distances between the first embedding vector and each of the enrollment embedding vectors of the speaker template.

4. The device of claim 1, wherein the processor is configured to update the speaker template based on determining that the first particular distance metric is less than the second distance metric.

5. The device of claim 1, wherein the second distance metric of the first test embedding vector is highest among distance metrics associated with the test embedding vectors.

6. The device of claim 1, further comprising a microphone coupled to the processor, the microphone configured to receive the first utterance.

7. The device of claim 1, wherein the processor is configured to: generate the initial user enrollment data during an initial enrollment period; and generate the test embedding vectors based on utterances received during a verification period that is subsequent to the initial enrollment period.

8. The device of claim 1, wherein the processor is configured to, based on determining that a count of the test embedding vectors fails to satisfy a count threshold, perform the comparison of the first particular distance metric to the second distance metric.

9. The device of claim 1, wherein the processor is configured to, subsequent to generating the updated speaker template and based on determining that a model check condition is satisfied, generate a third distance metric for the second test embedding vector, the third distance metric based on distances between the second test embedding vector and each of the enrollment embedding vectors.

10. The device of claim 9, wherein the processor is configured to, based on determining that the third distance metric fails to satisfy a trusted distance threshold, generate an alert requesting re-enrollment of the first user.

11. The device of claim 9, wherein the processor is configured to, based on determining that the third distance metric fails to satisfy a trusted distance threshold, modify the speaker template by removing the test embedding vectors from the speaker template.

12. The device of claim 9, wherein the processor is configured to determine that the model check condition is satisfied based on determining that a count of the test embedding vectors is greater than or equal to a first threshold, detecting expiration of a model check time period, determining that a count of processed utterances is greater than or equal to a second threshold, or a combination thereof.

13. A method of speaker verification, the method comprising: determining, at a device, a first feature vector based on a first utterance of a first user; determining, at the device, a first embedding vector based on the first feature vector by processing the first feature vector using a trained embedding network; determining, at the device, a first distance metric based on distances between the first embedding vector and each embedding vector of a speaker template associated with the first user, the speaker template including one or more enrollment embedding vectors based on initial user enrollment data and including one or more test embedding vectors; determining, at the device, that the first utterance is verified to be from the first user based on determining that the first distance metric satisfies a speaker verification threshold; based on determining that the first utterance is verified to be from the first user, performing a comparison of a first particular distance metric associated with the first embedding vector to a second distance metric associated with a first test embedding vector of the speaker template; based on the comparison, generating an updated speaker template by adding the first embedding vector as a second test embedding vector and removing the first test embedding vector from the test embedding vectors of the speaker template; generating a set of triplets based on training embedding vectors associated with a second user, a particular triplet including a first training embedding vector associated with a first training utterance of the first user, a second training embedding vector associated with a second training utterance of the first user, and a third training embedding vector associated with a third training utterance of the second user; determining distance metrics corresponding to the set of triplets, a first distance metric of the particular triplet based on a difference between a first distance and a second distance, wherein the first distance is between the first training embedding vector and the second training embedding vector, and wherein the second distance is between the firs
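The triplet selection step recited in the claims (a per-triplet metric equal to the difference between an anchor-to-positive distance and an anchor-to-negative distance, followed by selecting a first subset to train on before the remaining triplets) might be sketched as below. The Euclidean distance, the reading of "satisfies a tolerance threshold" as "metric is at most the threshold", and the helper names `make_triplets`, `triplet_metric`, and `split_for_curriculum` are assumptions made for illustration. The actual network training is omitted.

```python
import math
from itertools import permutations
from typing import Dict, List, Tuple

Vector = List[float]
Triplet = Tuple[Vector, Vector, Vector]  # (anchor, positive, negative)


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance (assumed; the claims do not fix a distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def make_triplets(by_speaker: Dict[str, List[Vector]]) -> List[Triplet]:
    """Pair two vectors from one speaker with one vector from another speaker."""
    triplets: List[Triplet] = []
    for speaker, vectors in by_speaker.items():
        negatives = [v for other, vs in by_speaker.items()
                     if other != speaker for v in vs]
        for anchor, positive in permutations(vectors, 2):
            for negative in negatives:
                triplets.append((anchor, positive, negative))
    return triplets


def triplet_metric(triplet: Triplet) -> float:
    """First distance (anchor-positive) minus second distance (anchor-negative)."""
    anchor, positive, negative = triplet
    return euclidean(anchor, positive) - euclidean(anchor, negative)


def split_for_curriculum(triplets: List[Triplet],
                         tolerance: float) -> Tuple[List[Triplet], List[Triplet]]:
    """Return (first_subset, remaining); the first subset would be trained on first.

    Assumption: a triplet 'satisfies' the tolerance when its metric <= tolerance.
    """
    first = [t for t in triplets if triplet_metric(t) <= tolerance]
    remaining = [t for t in triplets if triplet_metric(t) > tolerance]
    return first, remaining
```

Under these assumptions, a training loop would iterate over the first subset returned by `split_for_curriculum` before moving on to the remaining triplets, which matches the ordering the claim describes without committing to a particular loss function or optimizer.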
Decision making techniques; Pattern matching strategies · CPC title
Training, enrolment or model building · CPC title
Use of distortion metrics or a particular distance between probe pattern and reference templates · CPC title
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
Speaker identification or verification techniques · CPC title