What technology area does this patent fall under?

Primary CPC classification G10L17/04. Mapped technology areas include Physics.

When was this patent published?

Publication date: May 25, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Speaker template update with embedding vectors based on distance metric

US11017783B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11017783-B2
Application number	US-201916296733-A
Country	US
Kind code	B2
Filing date	Mar 8, 2019
Priority date	Mar 8, 2019
Publication date	May 25, 2021
Grant date	May 25, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A device includes a processor configured to determine a feature vector based on an utterance and to determine a first embedding vector by processing the feature vector using a trained embedding network. The processor is configured to determine a first distance metric based on distances between the first embedding vector and each embedding vector of a speaker template. The processor is configured to determine, based on the first distance metric, that the utterance is verified to be from a particular user. The processor is configured to, based on a comparison of a first particular distance metric associated with the first embedding vector to a second distance metric associated with a first test embedding vector of the speaker template, generate an updated speaker template by adding the first embedding vector as a second test embedding vector and removing the first test embedding vector from test embedding vectors of the speaker template.
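The verification and template-update flow summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the use of cosine distance, averaging as the way to aggregate distances, the `verify_threshold` and `max_test` values, and all names (`SpeakerTemplate`, `verify_and_update`, and so on) are assumptions that do not appear in the publication. The embedding network is left out; the function takes an embedding vector that has already been computed.

```python
import math
from dataclasses import dataclass, field

Vector = list[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Assumed distance function: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance(v: Vector, refs: list[Vector]) -> float:
    """Assumed aggregation: average distance from v to each reference."""
    return sum(cosine_distance(v, r) for r in refs) / len(refs)


@dataclass
class SpeakerTemplate:
    enrollment: list[Vector]                           # from initial enrollment
    test: list[Vector] = field(default_factory=list)   # from later verified utterances


def verify_and_update(template: SpeakerTemplate, embedding: Vector,
                      verify_threshold: float = 0.3, max_test: int = 3) -> bool:
    """Verify an embedding and, if verified, possibly update the template in place."""
    # First distance metric: against every vector in the template.
    first_metric = mean_distance(embedding, template.enrollment + template.test)
    if first_metric > verify_threshold:
        return False  # not verified, template untouched

    # Metric for the new embedding: against the enrollment vectors only.
    new_metric = mean_distance(embedding, template.enrollment)

    # Assumption: while there is room, simply keep the new vector.
    if len(template.test) < max_test:
        template.test.append(embedding)
        return True

    # Otherwise compare with the test vector farthest from the enrollment vectors.
    metrics = [mean_distance(t, template.enrollment) for t in template.test]
    worst = max(range(len(metrics)), key=metrics.__getitem__)
    if new_metric < metrics[worst]:
        template.test.pop(worst)          # remove the weakest test vector
        template.test.append(embedding)   # add the new one in its place
    return True
```

In this sketch the enrollment vectors are never replaced; only the test vectors rotate, so the template can adapt to gradual changes in a voice while staying anchored to the original enrollment.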

First claim

Opening claim text (preview).

What is claimed is:
1. A device comprising: a memory configured to store: a trained embedding network; and a speaker template associated with a first user, the speaker template including one or more enrollment embedding vectors based on initial user enrollment data and including one or more test embedding vectors; and a processor coupled to the memory, the processor configured to: determine a first feature vector based on a first utterance; determine a first embedding vector based on the first feature vector by processing the first feature vector using the trained embedding network; determine a first distance metric based on distances between the first embedding vector and each embedding vector of the speaker template; perform a speaker verification operation to determine, based on the first distance metric, whether the first utterance is verified to be from the first user; based on determining that the first utterance is verified to be from the first user, perform a comparison of a first particular distance metric associated with the first embedding vector to a second distance metric associated with a first test embedding vector of the speaker template; based on the comparison, generate an updated speaker template by adding the first embedding vector as a second test embedding vector and removing the first test embedding vector from the test embedding vectors of the speaker template; generate a set of triplets based on training embedding vectors associated with a second user, a particular triplet including a first training embedding vector associated with a first training utterance of the first user, a second training embedding vector associated with a second training utterance of the first user, and a third training embedding vector associated with a third training utterance of the second user; determine distance metrics corresponding to the set of triplets, a first distance metric of the particular triplet based on a difference between a first distance and a second distance, wherein the first distance is between the first training embedding vector and the second training embedding vector, and wherein the second distance is between the first training embedding vector and the third training embedding vector; select a first subset of the set of triplets based on the distance metrics, the particular triplet selected in the first subset based on determining that the first distance metric satisfies a tolerance threshold; and generate the trained embedding network by training the embedding network using the first subset of the set of triplets prior to training the embedding network using one or more remaining subsets of the set of triplets.
2. The device of claim 1, wherein the processor is configured to determine that the first utterance is verified to be from the first user based on determining that the first distance metric satisfies a speaker verification threshold.
3. The device of claim 1, wherein the processor is further configured to determine the first particular distance metric based on second distances between the first embedding vector and each of the enrollment embedding vectors of the speaker template.
4. The device of claim 1, wherein the processor is configured to update the speaker template based on determining that the first particular distance metric is less than the second distance metric.
5. The device of claim 1, wherein the second distance metric of the first test embedding vector is highest among distance metrics associated with the test embedding vectors.
6. The device of claim 1, further comprising a microphone coupled to the processor, the microphone configured to receive the first utterance.
7. The device of claim 1, wherein the processor is configured to: generate the initial user enrollment data during an initial enrollment period; and generate the test embedding vectors based on utterances received during a verification period that is subsequent to the initial enrollment period.
8. The device of claim 1, wherein the processor is configured to, based on determining that a count of the test embedding vectors fails to satisfy a count threshold, perform the comparison of the first particular distance metric to the second distance metric.
9. The device of claim 1, wherein the processor is configured to, subsequent to generating the updated speaker template and based on determining that a model check condition is satisfied, generate a third distance metric for the second test embedding vector, the third distance metric based on distances between the second test embedding vector and each of the enrollment embedding vectors.
10. The device of claim 9, wherein the processor is configured to, based on determining that the third distance metric fails to satisfy a trusted distance threshold, generate an alert requesting re-enrollment of the first user.
11. The device of claim 9, wherein the processor is configured to, based on determining that the third distance metric fails to satisfy a trusted distance threshold, modify the speaker template by removing the test embedding vectors from the speaker template.
12. The device of claim 9, wherein the processor is configured to determine that the model check condition is satisfied based on determining that a count of the test embedding vectors is greater than or equal to a first threshold, detecting expiration of a model check time period, determining that a count of processed utterances is greater than or equal to a second threshold, or a combination thereof.
13. A method of speaker verification, the method comprising: determining, at a device, a first feature vector based on a first utterance of a first user; determining, at the device, a first embedding vector based on the first feature vector by processing the first feature vector using a trained embedding network; determining, at the device, a first distance metric based on distances between the first embedding vector and each embedding vector of a speaker template associated with the first user, the speaker template including one or more enrollment embedding vectors based on initial user enrollment data and including one or more test embedding vectors; determining, at the device, that the first utterance is verified to be from the first user based on determining that the first distance metric satisfies a speaker verification threshold; based on determining that the first utterance is verified to be from the first user, performing a comparison of a first particular distance metric associated with the first embedding vector to a second distance metric associated with a first test embedding vector of the speaker template; based on the comparison, generating an updated speaker template by adding the first embedding vector as a second test embedding vector and removing the first test embedding vector from the test embedding vectors of the speaker template; generating a set of triplets based on training embedding vectors associated with a second user, a particular triplet including a first training embedding vector associated with a first training utterance of the first user, a second training embedding vector associated with a second training utterance of the first user, and a third training embedding vector associated with a third training utterance of the second user; determining distance metrics corresponding to the set of triplets, a first distance metric of the particular triplet based on a difference between a first distance and a second distance, wherein the first distance is between the first training embedding vector and the second training embedding vector, and wherein the second distance is between the firs
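The triplet-selection step recited in claim 1 (score each triplet by the difference between two distances, then train on a first subset before the remaining ones) might look something like the sketch below. It is a hedged illustration, not the claimed method itself: Euclidean distance, the reading of "satisfies a tolerance threshold" as "at or below the threshold", and the names `triplet_metric` and `split_by_tolerance` are all assumptions introduced here. Training the embedding network is not shown.

```python
import math

Vector = list[float]
Triplet = tuple[Vector, Vector, Vector]  # (anchor, same-speaker, other-speaker)


def triplet_metric(anchor: Vector, positive: Vector, negative: Vector) -> float:
    """Difference between the same-speaker and other-speaker distances."""
    first_distance = math.dist(anchor, positive)   # anchor to same-speaker vector
    second_distance = math.dist(anchor, negative)  # anchor to other-speaker vector
    return first_distance - second_distance


def split_by_tolerance(triplets: list[Triplet],
                       tolerance: float) -> tuple[list[Triplet], list[Triplet]]:
    """Split triplets into a first subset and the remainder.

    Assumption: a metric at or below `tolerance` counts as satisfying it.
    """
    first_subset, remaining = [], []
    for triplet in triplets:
        if triplet_metric(*triplet) <= tolerance:
            first_subset.append(triplet)
        else:
            remaining.append(triplet)
    return first_subset, remaining
```

Under these assumptions, a training loop would consume `first_subset` before `remaining`, which resembles a curriculum in which the ordering of training data is driven by the triplet metric.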

Assignees

Qualcomm Inc

Inventors

Classifications

G10L17/06
Decision making techniques; Pattern matching strategies · CPC title
G10L17/04 · Primary
Training, enrolment or model building · CPC title
G10L17/08
Use of distortion metrics or a particular distance between probe pattern and reference templates · CPC title
G10L17/02
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
G10L17/00
Speaker identification or verification techniques · CPC title

Patent family

Related publications grouped by family.

View patent family 72334997

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11017783B2 cover?: A device includes a processor configured to determine a feature vector based on an utterance and to determine a first embedding vector by processing the feature vector using a trained embedding network. The processor is configured to determine a first distance metric based on distances between the first embedding vector and each embedding vector of a speaker template. The processor is configure…
Who is the assignee on this patent?: Qualcomm Inc