Neural Networks For Speaker Verification

US2024038245A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2024038245-A1
Application number	US-202318485069-A
Country	US
Kind code	A1
Filing date	Oct 11, 2023
Priority date	Sep 4, 2015
Publication date	Feb 1, 2024
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This document generally describes systems, methods, devices, and other techniques related to speaker verification, including (i) training a neural network for a speaker verification model, (ii) enrolling users at a client device, and (iii) verifying identities of users based on characteristics of the users' voices. Some implementations include a computer-implemented method. The method can include receiving, at a computing device, data that characterizes an utterance of a user of the computing device. A speaker representation can be generated, at the computing device, for the utterance using a neural network on the computing device. The neural network can be trained based on a plurality of training samples that each: (i) include data that characterizes a first utterance and data that characterizes one or more second utterances, and (ii) are labeled as a matching speakers sample or a non-matching speakers sample.
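The training-sample layout in the abstract (a first utterance, one or more second utterances, and a matching or non-matching label) could be represented roughly as follows. This is a minimal sketch, not the applicant's implementation: `TrainingSample`, `build_samples`, the string utterance IDs, and the alternating-label sampling scheme are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrainingSample:
    """One labeled sample (hypothetical layout, following the abstract)."""
    first_utterance: str           # data characterizing the first utterance
    second_utterances: List[str]   # data characterizing one or more second utterances
    matching: bool                 # True = matching speakers, False = non-matching


def build_samples(
    utterances_by_speaker: Dict[str, List[str]],
    n_second: int,
    n_samples: int,
    seed: int = 0,
) -> List[TrainingSample]:
    """Build labeled samples, alternating matching and non-matching labels.

    Assumes at least two speakers, each with at least n_second + 1 utterances.
    """
    rng = random.Random(seed)
    speakers = sorted(utterances_by_speaker)
    samples = []
    for i in range(n_samples):
        matching = i % 2 == 0
        second_speaker = rng.choice(speakers)
        if matching:
            # Draw distinct utterances from one speaker so the first
            # utterance never reappears among the second utterances.
            picked = rng.sample(utterances_by_speaker[second_speaker], n_second + 1)
            first, seconds = picked[0], picked[1:]
        else:
            # The first utterance comes from a different speaker.
            other = rng.choice([s for s in speakers if s != second_speaker])
            first = rng.choice(utterances_by_speaker[other])
            seconds = rng.sample(utterances_by_speaker[second_speaker], n_second)
        samples.append(TrainingSample(first, seconds, matching))
    return samples
```

In practice the utterance fields would hold audio features rather than string IDs; strings are used here only to keep the sketch self-contained.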

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving a plurality of training samples, each training sample comprising: a training verification utterance; a training enrollment utterance spoken by a corresponding speaker; and a first classification for the training sample that indicates whether a speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance; training a neural network on the plurality of training samples by: for each training sample: processing, using the neural network, audio signals characterizing the training verification utterance to generate a first training speaker representation for the training verification utterance; processing, using the neural network, the audio signals characterizing the training enrollment utterance; determining a second classification for the training sample based on the first training speaker representation and the second training speaker representation, the second classification for the training sample indicating whether the speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance; and adjusting parameters of the neural network based on a comparison of the first classification of the training sample and the second classification determined for the training sample; and transmitting the trained neural network over a network to a user device.
2. The computer-implemented method of claim 1, wherein the user device is configured to: receive audio signals that represent the user speaking the plurality of enrollment utterances, the audio signals representing the user speaking the plurality of enrollment utterances recorded by the user device; generate, using the trained neural configured to receive the audio signals representing the user speaking the plurality of enrollment utterances as input, a reference speaker model associated with the user that characterizes distinctive features of a voice of the user; and store the reference speaker model on memory hardware of the user device.
3. The computer-implemented method of claim 2, wherein the trained neural network comprises a trained neural network comprising a long short-term memory (LSTM) layer configured to receive the audio signals representing the user speaking the plurality of enrollment utterances as input.
4. The computer-implemented method of claim 3, wherein the user device is further configured to: obtain a plurality of audio frames representing a first utterance; generate, using the trained neural network, a speaker representation for the utterance, the speaker representation indicating distinctive features of a speaker of the first utterance; determine a similarity score between the speaker representation for the first utterance and the reference speaker model stored on the memory hardware of the user device satisfies a similarity score threshold; and authenticate the speaker of the first utterance as the user associated with the reference speaker model based on determining the similarity score satisfies the similarity score threshold.
5. The computer-implemented method of claim 4, wherein the trained neural network further comprises a fully-connected linear layer configured to: receive, as input, an output of the LSTM layer; and generate, as output, the speaker representation for the first utterance.
6. The computer-implemented method of claim 4, wherein the user device is further configured to, in response to authenticating the speaker of the first utterance as the user associated with the reference speaker model, transition operation of the user device from a low-power state to a more fully-featured state.
7. The computer-implemented method of claim 1, wherein the training verification utterance and the training enrollment utterance comprise a same pre-determined phrase.
8. The computer-implemented method of claim 1, wherein the comparison of the first classification of the training sample and the second classification determined for the training sample comprises a cosine distance between a vector of values for the first classification and a vector of values for the second classification.
9. The computer-implemented method of claim 1, wherein the user device comprises a smart phone.
10. The computer-implemented method of claim 1, wherein the data processing hardware resides on a remote computing device.
11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed by the data processing hardware cause the data processing hardware to perform operations comprising: receiving a plurality of training samples, each training sample comprising: a training verification utterance; a training enrollment utterance spoken by a corresponding speaker; and a first classification for the training sample that indicates whether a speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance; training a neural network on the plurality of training samples by: for each training sample: processing, using the neural network, audio signals characterizing the training verification utterance to generate a first training speaker representation for the training verification utterance; processing, using the neural network, the audio signals characterizing the training enrollment utterance; determining a second classification for the training sample based on the first training speaker representation and the second training speaker representation, the second classification for the training sample indicating whether the speaker of the training utterance is the same or different from the corresponding speaker that spoke the training enrollment utterance; and adjusting parameters of the neural network based on a comparison of the first classification of the training sample and the second classification determined for the training sample; and transmitting the trained neural network over a network to a user device.
12. The system of claim 11, wherein the user device is configured to: receive audio signals that represent the user speaking the plurality of enrollment utterances, the audio signals representing the user speaking the plurality of enrollment utterances recorded by the user device; generate, using the trained neural configured to receive the audio signals representing the user speaking the plurality of enrollment utterances as input, a reference speaker model associated with the user that characterizes distinctive features of a voice of the user; and store the reference speaker model on memory hardware of the user device.
13. The system of claim 12, wherein the trained neural network comprises a trained neural network comprising a long short-term memory (LSTM) layer configured to receive the audio signals representing the user speaking the plurality of enrollment utterances as input.
14. The system of claim 13, wherein the user device is further configured to: obtain a plurality of audio frames representing a first utterance; generate, using the trained neural network, a speaker representation for the utterance, the speaker representation indicating distinctive features of a speaker of the first utterance; determine a similarity score between the speaker representation for the first utterance and the refere…
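The enrollment and verification steps recited in claims 2 and 4, together with the label comparison in claim 1, can be illustrated with a small sketch. It makes several assumptions the claims do not state: speaker representations are plain lists of floats already produced by some network, the reference speaker model is the element-wise average of the enrollment representations, the similarity score is cosine similarity, and the training comparison is a logistic loss on that score. All function names, the `weight` and `bias` parameters, and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two representations (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reference_speaker_model(enrollment_reps: List[Vector]) -> List[float]:
    """Combine enrollment representations by averaging (an assumption)."""
    if not enrollment_reps:
        raise ValueError("at least one enrollment representation is required")
    n = len(enrollment_reps)
    return [sum(column) / n for column in zip(*enrollment_reps)]


def match_probability(similarity: float, weight: float = 10.0, bias: float = -5.0) -> float:
    """Map a similarity score to a predicted 'same speaker' probability."""
    return 1.0 / (1.0 + math.exp(-(weight * similarity + bias)))


def training_loss(similarity: float, is_match: bool) -> float:
    """Logistic loss comparing the predicted classification with the label."""
    p = match_probability(similarity)
    return -math.log(p) if is_match else -math.log(1.0 - p)


def authenticate(utterance_rep: Vector, reference: Vector, threshold: float = 0.8) -> bool:
    """Accept the speaker when the similarity score satisfies the threshold."""
    return cosine_similarity(utterance_rep, reference) >= threshold
```

During training, the loss would be back-propagated to adjust the network that produces the representations; on the device, only `reference_speaker_model` and `authenticate` would be needed. The threshold trades false accepts against false rejects and would be tuned on held-out data.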

Assignees

Google LLC

Inventors

Classifications

G10L17/18 · Primary
Artificial neural networks; Connectionist approaches · CPC title
G10L17/04
Training, enrolment or model building · CPC title
G10L17/02 · Primary
Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction · CPC title
G07C9/37 · Primary
using biometric data, e.g. fingerprints, iris scans or voice recognition · CPC title

Patent family

Related publications grouped by family.

View patent family 56853791

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024038245A1 cover?: This document generally describes systems, methods, devices, and other techniques related to speaker verification, including (i) training a neural network for a speaker verification model, (ii) enrolling users at a client device, and (iii) verifying identities of users based on characteristics of the users' voices. Some implementations include a computer-implemented method. The method can inclu…
Who is the assignee on this patent?: Google LLC
What technology area does this patent fall under?: Primary CPC classification G10L17/18. Mapped technology areas include Physics.
When was this patent published?: Publication date Feb 1, 2024 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).