Neural networks for speaker verification

US11961525B2 · US · B2

Patent metadata
Publication number: US-11961525-B2
Application number: US-202117444384-A
Country: US
Kind code: B2
Filing date: Aug 3, 2021
Priority date: Sep 4, 2015
Publication date: Apr 16, 2024
Grant date: Apr 16, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This document generally describes systems, methods, devices, and other techniques related to speaker verification, including (i) training a neural network for a speaker verification model, (ii) enrolling users at a client device, and (iii) verifying identities of users based on characteristics of the users' voices. Some implementations include a computer-implemented method. The method can include receiving, at a computing device, data that characterizes an utterance of a user of the computing device. A speaker representation can be generated, at the computing device, for the utterance using a neural network on the computing device. The neural network can be trained based on a plurality of training samples that each: (i) include data that characterizes a first utterance and data that characterizes one or more second utterances, and (ii) are labeled as a matching speakers sample or a non-matching speakers sample.
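The training-sample structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the names `TrainingSample` and `build_samples`, the use of plain float sequences as "data that characterizes an utterance", and the one-matching-plus-one-non-matching sampling policy are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical stand-in for "data that characterizes an utterance".
Utterance = Sequence[float]


@dataclass
class TrainingSample:
    first_utterance: Utterance           # data for the first utterance
    second_utterances: List[Utterance]   # data for one or more second utterances
    matching: bool                       # True: matching speakers, False: non-matching


def build_samples(utterances_by_speaker: Dict[str, List[Utterance]],
                  num_second: int,
                  rng: random.Random) -> List[TrainingSample]:
    """Build one matching and one non-matching sample per eligible speaker.

    The sampling policy is an assumption for illustration only.
    """
    samples: List[TrainingSample] = []
    speakers = sorted(utterances_by_speaker)
    for speaker in speakers:
        own = utterances_by_speaker[speaker]
        if len(own) < num_second + 1:
            continue  # need one first utterance plus num_second second utterances
        picked = rng.sample(own, num_second + 1)
        first, seconds = picked[0], picked[1:]
        samples.append(TrainingSample(first, seconds, matching=True))

        # Non-matching sample: the first utterance comes from another speaker.
        others = [s for s in speakers
                  if s != speaker and utterances_by_speaker[s]]
        if others:
            impostor = rng.choice(others)
            impostor_first = rng.choice(utterances_by_speaker[impostor])
            samples.append(TrainingSample(impostor_first, seconds, matching=False))
    return samples
```

A network trained on such samples would learn to score the first utterance against the second utterances; the label tells it whether the two sides share a speaker.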

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method when executed on data processing hardware of a user device causes the data processing hardware to perform operations comprising: prompting a user of the user device to speak a plurality of enrollment utterances; receiving audio signals that represent the user speaking the plurality of enrollment utterances, the audio signals representing the user speaking the plurality of enrollment utterances recorded by the user device; generating, using a trained neural network, a reference speaker model based on the audio signals that represent the user speaking the plurality of enrollment utterances, the reference speaker model associated with the user and characterizing distinctive features of a voice of the user, the trained neural network comprising: a long short-term memory (LSTM) layer configured to receive, as input, the audio signals representing the user speaking the plurality of enrollment utterances and generate an enrollment LSTM output; and a fully-connected linear layer configured to receive, as input, the enrollment LSTM output and generate, as output, the reference speaker model, wherein the fully-connected linear layer comprises a last layer of the trained neural network; storing the reference speaker model on memory hardware of the user device; obtaining a plurality of audio frames representing a text-independent utterance different than each of the plurality of enrollment utterances; generating, using the trained neural network, a speaker representation for the text-independent utterance, the speaker representation indicating distinctive features of a voice of a speaker of the text-independent utterance, wherein the speaker representation is generated as output from the fully-connected linear layer of the trained neural network; determining that a similarity score between the speaker representation for the text-independent utterance and the reference speaker model stored on the memory hardware of the user device satisfies a similarity score threshold; and based on determining that the similarity score satisfies the similarity score threshold, authenticating the speaker of the text-independent utterance as the user associated with the reference speaker model.

2. The computer-implemented method of claim 1, wherein: the LSTM layer is further configured to receive, as input, the plurality of audio frames representing the text-independent utterance and generate a verification LSTM output; and the fully-connected linear layer is further configured to receive, as input, the verification LSTM output and generate, as output, the speaker representation for the text-independent utterance.

3. The computer-implemented method of claim 1, wherein the operations further comprise, based on determining that the similarity score satisfies the similarity score threshold, updating the reference speaker model associated with the user of the user device based on the text-independent utterance.

4. The computer-implemented method of claim 1, wherein the operations further comprise, in response to authenticating the speaker of the text-independent utterance as the user associated with the reference speaker model, transitioning operation of the user device from a low-power state to a more fully-featured state.

5. The computer-implemented method of claim 1, wherein the operations further comprise, in response to authenticating the speaker of the text-independent utterance as the user associated with the reference speaker model: processing one or more terms in the text-independent utterance; and performing an action based on the one or more terms in the text-independent utterance.

6. The computer-implemented method of claim 1, wherein the text-independent utterance comprises a non-predefined phrase.

7. The computer-implemented method of claim 1, wherein the similarity score between the speaker representation and the reference speaker model comprises a cosine distance between a vector of values for the speaker representation and a vector of values for the reference speaker model.

8. The computer-implemented method of claim 1, wherein the trained neural network is stored on the memory hardware of the user device.

9. The computer-implemented method of claim 1, wherein obtaining the plurality of audio frames representing the text-independent utterance comprises: receiving a raw audio signal of the text-independent utterance; segmenting the raw audio signal of the text-independent utterance into a plurality of raw audio frames, each raw audio frame comprising a respective portion of the raw audio signal; and converting the respective portion of the raw audio signal of each raw audio raw audio frame into respective audio features characterizing a respective segment of the text-independent utterance.

10. The computer-implemented method of claim 1, wherein the operations further comprise, prior to generating the speaker representation for the text-independent utterance using the trained neural network, receiving the trained neural network over a network from a remote computing device.

11. A system comprising: data processing hardware of a user device; and memory hardware of the user device and in communication with the data processing hardware, the memory hardware storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations comprising: prompting a user of the user device to speak a plurality of enrollment utterances; receiving audio signals that represent the user speaking the plurality of enrollment utterances, the audio signals representing the user speaking the plurality of enrollment utterances recorded by the user device; generating, using a trained neural network, a reference speaker model based on the audio signals that represent the user speaking the plurality of enrollment utterances, the reference speaker model associated with the user and characterizing distinctive features of a voice of the user, the trained neural network comprising: a long short-term memory (LSTM) layer configured to receive, as input, the audio signals representing the user speaking the plurality of enrollment utterances and generate an enrollment LSTM output; and a fully-connected linear layer configured to receive, as input, the enrollment LSTM output and generate, as output, the reference speaker model, wherein the fully-connected linear layer comprises a last layer of the trained neural network; storing the reference speaker model on memory hardware of the user device; obtaining a plurality of audio frames representing a text-independent utterance comprising a different utterance from each of the plurality of enrollment utterances; generating, using the trained neural network, a speaker representation for the text-independent utterance, the speaker representation indicating distinctive features of a voice of a speaker of the text-independent utterance, wherein the speaker representation is generated as output from the fully-connected linear layer of the trained neural network; determining that a similarity score between the speaker representation for the text-independent utterance and the reference speaker model stored on the memory hardware of the user device satisfies a similarity score threshold; and based on determining that the similarity score satisfies the similarity score threshold, authenticating the speaker of the text-independent utterance as the user associated with the reference speaker model.

12. The system of claim 11, wherein: the LSTM layer is further configured to receive, as input, the plurality of audio frames representing the text-independ
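The frame segmentation recited in claim 9 and the similarity-threshold check recited in claims 1 and 7 can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the claimed implementation: the LSTM and fully-connected layers are left out entirely, so the vectors passed in stand for their outputs; averaging per-utterance vectors into a reference model is an assumed aggregation; treating "satisfies a similarity score threshold" as "greater than or equal to" is also an assumption. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def segment_frames(signal: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a raw audio signal into fixed-length frames.

    A trailing partial frame is dropped (an assumption for this sketch).
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [list(signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, hop)]


def average_vectors(vectors: Sequence[Vector]) -> List[float]:
    """Average per-utterance vectors into one reference vector (assumed aggregation)."""
    if not vectors:
        raise ValueError("at least one vector is required")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def verify(speaker_representation: Vector,
           reference_model: Vector,
           threshold: float) -> bool:
    """Accept the speaker when the similarity score reaches the threshold."""
    return cosine_similarity(speaker_representation, reference_model) >= threshold
```

In use, the enrollment vectors would be combined once with `average_vectors` and stored, and each later utterance would be scored against that stored vector with `verify`. The threshold value is a tuning choice that the claims do not specify.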

Assignees

Google LLC

Inventors

Classifications

  • G10L17/18 · Primary

    Artificial neural networks; Connectionist approaches · CPC title

  • G10L17/02 · Primary

    Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title

  • Training, enrolment or model building · CPC title

  • G07C9/37 · Primary

    using biometric data, e.g. fingerprints, iris scans or voice recognition · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11961525B2 cover?
This document generally describes systems, methods, devices, and other techniques related to speaker verification, including (i) training a neural network for a speaker verification model, (ii) enrolling users at a client device, and (iii) verifying identities of users based on characteristics of the users' voices. Some implementations include a computer-implemented method. The method can inclu…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L17/18. Mapped technology areas include Physics.
When was this patent published?
Published Apr 16, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).