Neural networks for speaker verification

US10325602B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10325602-B2
Application number  US-201715666806-A
Country             US
Kind code           B2
Filing date         Aug 2, 2017
Priority date       Aug 2, 2017
Publication date    Jun 18, 2019
Grant date          Jun 18, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, devices, and other techniques for training and using a speaker verification neural network. A computing device may receive data that characterizes a first utterance. The computing device provides the data that characterizes the utterance to a speaker verification neural network. Subsequently, the computing device obtains, from the speaker verification neural network, a speaker representation that indicates speaking characteristics of a speaker of the first utterance. The computing device determines whether the first utterance is classified as an utterance of a registered user of the computing device. In response to determining that the first utterance is classified as an utterance of the registered user of the computing device, the device may perform an action for the registered user of the computing device.
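The verification step in the abstract (obtain a speaker representation, decide whether it belongs to a registered user, then act) can be illustrated with a minimal sketch. It is not the patented implementation. Comparing against a stored per-user signature with a distance threshold follows the wording of claims 2 and 3 on this page. The cosine distance, the `threshold` value of 0.3, the function names, and the toy three-dimensional vectors are all assumptions made for illustration, and the hard-coded embedding stands in for the output of a real speaker verification neural network.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def verify(representation, signatures, threshold=0.3):
    """Return the registered user with the closest signature, or None.

    `signatures` maps a user name to a speaker signature vector.
    `threshold` is a hypothetical value, not taken from the patent.
    """
    best_user, best_dist = None, float("inf")
    for user, signature in signatures.items():
        dist = cosine_distance(representation, signature)
        if dist < best_dist:
            best_user, best_dist = user, dist
    return best_user if best_dist < threshold else None


# Hypothetical enrolled signatures and a stand-in network output.
signatures = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.2, 0.9]}
utterance_embedding = [0.8, 0.2, 0.1]

user = verify(utterance_embedding, signatures)
if user is not None:
    # Placeholder for the action, e.g. unlocking the device (claim 9).
    print(f"accepted as {user}")
else:
    print("rejected")
```

Returning `None` when no signature is within the threshold keeps the "perform an action" branch explicit, mirroring the conditional wording of the abstract.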

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising:
    receiving, by a computing device, data that characterizes a first utterance;
    providing, by the computing device, the data that characterizes the utterance to a speaker verification neural network, wherein the speaker verification neural network is trained on batches of training utterances using a respective training loss for each batch that is based on, for each of multiple training speakers represented in the batch: (i) differences among speaker representations generated by the speaker verification neural network from training utterances of the training speaker within the batch, and (ii) for each first speaker representation generated from a training utterance of the training speaker within the batch, a similarity between the first speaker representation and a second speaker representation for a particular different training speaker represented in the batch, the particular different training speaker selected from among the multiple training speakers represented in the batch based on a distance between the first speaker representation generated from the training utterance of the training speaker and the second speaker representation for the particular different training speaker, the second speaker representation determined based on multiple training utterances of the particular different training speaker;
    obtaining, by the computing device, a speaker representation that indicates speaking characteristics of a speaker of the first utterance, wherein the speaker representation was generated by processing the data that characterizes the first utterance with the speaker verification neural network;
    determining, by the computing device and based on the speaker representation, whether the first utterance is classified as an utterance of a registered user of the computing device; and
    in response to determining that the first utterance is classified as an utterance of the registered user of the computing device, performing, by the computing device, an action for the registered user of the computing device.

2. The computer-implemented method of claim 1, wherein determining whether the first utterance is classified as an utterance of the registered user of the computing device comprises comparing the speaker representation for the first utterance to a speaker signature for the registered user, wherein the speaker signature is based on one or more speaker representations derived from one or more enrollment utterances of the registered user.

3. The computer-implemented method of claim 1, wherein the registered user is a first registered user; the method comprising: comparing the speaker representation for the first utterance to respective speaker signatures for multiple registered users of the computing device including the first registered user to determine a respective distance between the speaker representation for the first utterance and the respective speaker signatures for the multiple registered users; and determining that the first utterance is classified as an utterance of the first registered user of the computing device based on the respective distance between the speaker representation for the first utterance and the respective speaker signature for the first registered user being less than a threshold distance from each other.

4. The computer-implemented method of claim 1, wherein: the speaker verification neural network is stored locally on the computing device; and obtaining the speaker representation comprises processing the data that characterizes a first utterance with the speaker verification neural network on the computing device.

5. The computer-implemented method of claim 1, wherein the second speaker representation is an averaged speaker representation generated by averaging speaker representations for the multiple training utterances of the particular different training speaker.

6. The computer-implemented method of claim 1, wherein for each training speaker of multiple training speakers represented in a batch, the differences among speaker representations generated by the speaker verification neural network from training utterances of the training speaker within the batch are determined based on distances of the speaker representations of the training speaker to an averaged speaker representation generated from two or more training utterances of the training speaker.

7. The computer-implemented method of claim 1, wherein the speaker verification neural network is a long short-term memory (LSTM) neural network.

8. The computer-implemented method of claim 1, wherein the data that characterizes the first utterance is feature data that characterizes acoustic features of the first utterance; and the method further comprises generating the feature data for the first utterance from audio data for the first utterance that characterizes an audio waveform of the first utterance.

9. The computer-implemented method of claim 1, wherein performing the action that is assigned to the registered user of the computing device comprises transitioning the computing device from a locked state to an unlocked state.

10. The computer-implemented method of claim 1, wherein performing the action that is assigned to the registered user of the computing device comprises accessing user data from a user account of the registered user of the computing device.

11. A computer-implemented method for training a speaker verification neural network, comprising:
    obtaining, by a computing system, a training batch that includes a plurality of groups of training samples, wherein: (i) each training sample in the training batch characterizes a respective training utterance for the training sample, and (ii) each of the plurality of groups of training samples corresponds to a different speaker such that each group consists of training samples that characterize training utterances of a same speaker that is different from the speakers of training utterances characterized by training samples in other ones of the plurality of groups of training samples;
    for each training sample in the training batch, processing the training sample with the speaker verification neural network in accordance with current values of internal parameters of the speaker verification neural network to generate a speaker representation for the training sample that indicates speaker characteristics of a speaker of the respective training utterance characterized by the training sample;
    for each group of training samples, averaging the speaker representations for training samples in the group to generate an averaged speaker representation for the group;
    for each training sample in the training batch, determining a loss component for the speaker representation for the training sample based on: (i) a distance between the speaker representation for the training sample and the averaged speaker representation for the group to which the training sample belongs, and (ii) a distance between the speaker representation for the training sample and a closest averaged speaker representation among the averaged speaker representations for the groups to which the training sample does not belong; and
    updating the current values of the internal parameters of the speaker verification neural network using the loss components for the speaker representations for at least some of the training samples in the training batch.

12. The computer-implemented method of claim 11, further comprising iteratively updating the current values of the internal parameters of the speaker verification neural network over a plurality of training iterations, wherein the computing system
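
The per-sample loss component recited in claim 11 can be sketched as follows. The claim states only that the component is "based on" two distances: one to the sample's own group average, and one to the closest average among the other groups. The Euclidean distance, the hinge combination, the `margin` value, and all function names here are assumptions for illustration, not the patent's formula. The toy vectors stand in for network outputs, and the parameter update step is omitted because it would require an automatic differentiation framework.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance (an assumed choice of distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_loss_components(groups, margin=0.5):
    """Compute one loss component per training sample in a batch.

    `groups` maps a speaker id to that speaker's representations.
    At least two speakers are required. The hinge form and `margin`
    are hypothetical; the claim does not specify how the two
    distances are combined.
    """
    centroids = {spk: mean_vector(reps) for spk, reps in groups.items()}
    components = []
    for spk, reps in groups.items():
        for rep in reps:
            # (i) distance to the averaged representation of the own group
            d_own = euclidean(rep, centroids[spk])
            # (ii) distance to the closest averaged representation of any other group
            d_other = min(
                euclidean(rep, c) for s, c in centroids.items() if s != spk
            )
            components.append(max(0.0, d_own - d_other + margin))
    return components


# Toy batch: two speakers, two utterance representations each.
batch = {
    "spk_a": [[1.0, 0.0], [0.8, 0.2]],
    "spk_b": [[0.0, 1.0], [0.2, 0.8]],
}
components = batch_loss_components(batch)
print(sum(components) / len(components))
```

Under these assumptions, a well-separated batch like the toy one yields zero loss, while samples that sit nearly as close to another speaker's average as to their own yield a positive component.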

Assignees

Google LLC

Inventors

Classifications

  • G10L17/18 · Primary

    Artificial neural networks; Connectionist approaches · CPC title

  • G10L17/22 · Primary

    Interactive procedures; Man-machine interfaces · CPC title

  • Training, enrolment or model building · CPC title

  • Decision making techniques; Pattern matching strategies · CPC title

  • Speaker identification or verification techniques · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10325602B2 cover?
It covers systems, methods, and devices for training and using a speaker verification neural network. A computing device obtains a speaker representation for an utterance from the network, determines whether the utterance is classified as an utterance of a registered user, and, if so, performs an action for that user.
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G10L17/18. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 18, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).