Speaker recognition using neural networks

US2016293167A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2016293167-A1
Application number: US-201615179717-A
Country: US
Kind code: A1
Filing date: Jun 10, 2016
Priority date: Oct 10, 2013
Publication date: Oct 6, 2016
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing speaker verification. In one aspect, a method includes accessing a neural network having an input layer that provides inputs to a first hidden layer whose nodes are respectively connected to only a proper subset of the inputs from the input layer. Speech data that corresponds to a particular utterance may be provided as input to the input layer of the neural network. A representation of activations that occur in response to the speech data at a particular layer of the neural network that was configured as a hidden layer during training of the neural network may be generated. A determination of whether the particular utterance was likely spoken by a particular speaker may be made based at least on the generated representation. An indication of whether the particular utterance was likely spoken by the particular speaker may be provided.
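The phrase "connected to only a proper subset of the inputs" can be made concrete with a small sketch. The code below is an illustration only, not the patent's implementation: the class name `LocallyConnectedLayer`, the non-overlapping patch layout, the random weight initialisation, and the ReLU activation are all assumptions chosen for brevity. Each hidden node is wired to one time-frequency patch of the input and keeps its own weights, so it stores fewer weights than there are inputs.

```python
import random


def local_patches(n_frames, n_bands, patch_frames, patch_bands):
    """Return, for each hidden node, the (frame, band) input positions it is wired to.

    Assumption: patches tile the input without overlap.
    """
    patches = []
    for t0 in range(0, n_frames - patch_frames + 1, patch_frames):
        for f0 in range(0, n_bands - patch_bands + 1, patch_bands):
            patches.append([(t, f)
                            for t in range(t0, t0 + patch_frames)
                            for f in range(f0, f0 + patch_bands)])
    return patches


class LocallyConnectedLayer:
    """Hypothetical hidden layer whose nodes each see only a local input patch."""

    def __init__(self, n_frames, n_bands, patch_frames, patch_bands, seed=0):
        rng = random.Random(seed)
        self.patches = local_patches(n_frames, n_bands, patch_frames, patch_bands)
        # One weight per connection, not shared between nodes
        # (locally connected rather than convolutional).
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in patch]
                        for patch in self.patches]
        self.biases = [0.0] * len(self.patches)

    def forward(self, x):
        """x[t][f] holds the feature for frame t and frequency band f.

        Returns one ReLU activation per node (ReLU is an assumption).
        """
        activations = []
        for patch, w, b in zip(self.patches, self.weights, self.biases):
            s = b + sum(wi * x[t][f] for wi, (t, f) in zip(w, patch))
            activations.append(max(0.0, s))
        return activations


# 4 frames x 8 bands = 32 inputs; each of the 4 nodes sees a 2 x 4 patch (8 inputs).
layer = LocallyConnectedLayer(n_frames=4, n_bands=8, patch_frames=2, patch_bands=4)
features = [[0.5] * 8 for _ in range(4)]
hidden = layer.forward(features)
```

In this toy configuration every node is connected to 8 of 32 inputs, that is 25% of them. Sharing one weight list across all nodes instead of giving each node its own would turn the sketch into a convolutional layer.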

First claim

Opening claim text (preview).

1. A computer-implemented method comprising: accessing a neural network having an input layer and one or more hidden layers, wherein at least one hidden layer of the one or more hidden layers has nodes that are respectively connected to only a proper subset of the inputs from a previous layer that provides input to the at least one hidden layer; inputting, to the input layer of the neural network, speech data that corresponds to a particular utterance; generating a representation of activations that occur, in response to inputting the speech data that corresponds to the particular utterance to the input layer, at a particular layer of the neural network that was configured as a hidden layer during training of the neural network; determining, based at least on the generated representation, whether the particular utterance was likely spoken by a particular speaker; and providing an indication of whether the particular utterance was likely spoken by the particular speaker.

2. The method of claim 1, wherein the at least one hidden layer is a locally-connected layer configured such that nodes at the at least one hidden layer respectively receive input from different subsets of data from the previous layer.

3. The method of claim 1, wherein each of the nodes of the at least one hidden layer receives input from a localized region of the outputs of the previous layer.

4. The method of claim 3, wherein each of the nodes of the at least one hidden layer receives input from a proper subset of the outputs of the previous layer that is localized in time.

5. The method of claim 3, wherein each of the nodes of the at least one hidden layer receives input from a proper subset of the outputs of the previous layer that is localized in frequency.

6. The method of claim 1, wherein each of the nodes of the at least one hidden layer receives input from a respective subset of inputs from the previous layer, the respective subset being localized in time and in frequency.

7. The method of claim 6, wherein the inputs provided by the previous layer indicate characteristics of the utterance at a first range of frequencies during each time frame in a first range of time; wherein for each of at least some of the nodes of the at least one hidden layer, the node is only connected to inputs from the previous layer that indicate characteristics of the utterance for a second range of frequencies during each time frame in a second range of time, wherein the second range of frequencies is a proper subset of the first range of frequencies and the second range of time is a proper subset of the first range of time.

8. The method of claim 1, wherein the previous layer provides a number of inputs to the at least one hidden layer; wherein, for each of the nodes of the at least one hidden layer, the neural network comprises a number of stored weight values that is less than the number of inputs to the at least one hidden layer.

9. The method of claim 1, wherein the at least one hidden layer is a convolutional layer.

10. The method of claim 9, wherein at least a group of the nodes of the at least one hidden layer are associated with a same set of weight values, wherein the neural network applies the same set of weight values to different subsets of the input for different nodes in the group.

11. The method of claim 1, comprising: comparing the generated representation with a reference representation of activations occurring at the particular layer of the neural network in response to speech data that corresponds to a past utterance of the particular speaker; and wherein determining whether the particular utterance was likely spoken by the particular speaker based at least on the generated representation comprises: based on comparing the generated representation and the reference representation, determining whether the particular utterance was likely spoken by the particular speaker.

12. The method of claim 1, wherein determining whether the particular utterance was likely spoken by the particular speaker based at least on the generated representation comprises: determining a cosine distance between the generated representation and a reference representation corresponding to the particular speaker; determining that the cosine distance satisfies a threshold; and based on determining that the cosine distance satisfies the threshold, determining that the particular utterance was likely spoken by the particular speaker.

13. The method of claim 1, further comprising dividing the speech data corresponding to the particular utterance into frames; and wherein generating the representation of activations occurring at the particular layer of the neural network comprises: determining, for each of multiple different frames of the speech data, a corresponding set of activations occurring at the particular layer of the neural network; and generating the representation of the activations occurring at the particular layer by averaging the sets of activations that respectively correspond to the multiple different frames.

14. The method of claim 1, wherein accessing the neural network comprises accessing a trained neural network that is not trained using speech of the particular speaker.

15. The method of claim 14, wherein accessing the neural network comprises: accessing a neural network having nodes at the first hidden layer that are each connected to a different subset of the inputs from the input layer, wherein the neural network has been trained based on activations occurring at an output layer located downstream from the particular layer.

16. The method of claim 1, wherein accessing the neural network comprises accessing, by a user device, a neural network stored at the user device.

17. The method of claim 1, comprising detecting the particular utterance at a mobile device that stores the neural network; wherein determining whether the particular utterance was likely spoken by the particular speaker comprises determining that the particular utterance was likely spoken by the particular speaker; and wherein providing an indication of whether the particular utterance was likely spoken by the particular speaker comprises unlocking or waking up the mobile device in response to determining that the particular utterance was likely spoken by the particular speaker.

18. The method of claim 1, wherein each node of the at least one hidden layer is connected to between 5% and 50% of the inputs from the previous layer.

19. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: accessing a neural network having an input layer and one or more hidden layers, wherein at least one hidden layer of the one or more hidden layers has nodes that are respectively connected to only a proper subset of the inputs from a previous layer that provides input to the at least one hidden layer; inputting, to the input layer of the neural network, speech data that corresponds to a particular utterance; generating a representation of activations that occur, in response to inputting the speech data that corresponds to the particular utterance to the input layer, at a particular layer of the neural network that was configured as a hidden layer during training of the neural network; determining, based at least on the generated representation, whether the particular utterance was lik…
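The comparison step described in claims 12 and 13 (average the per-frame activations, then compare against a reference by cosine distance) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, cosine distance is taken as one minus cosine similarity, "satisfies a threshold" is read as "distance at or below the threshold", and the default threshold of 0.3 is an arbitrary placeholder rather than a value from the patent.

```python
import math


def average_activations(frame_activations):
    """Average per-frame activation vectors into one utterance-level representation."""
    n = len(frame_activations)
    dim = len(frame_activations[0])
    return [sum(frame[i] for frame in frame_activations) / n for i in range(dim)]


def cosine_distance(a, b):
    """Cosine distance, taken here as 1 - cosine similarity (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def is_likely_same_speaker(frame_activations, reference, threshold=0.3):
    """Accept when the distance to the enrolled reference is at or below the threshold.

    The threshold value is a placeholder; a real system would tune it on held-out data.
    """
    representation = average_activations(frame_activations)
    return cosine_distance(representation, reference) <= threshold


# Hypothetical activations for two frames of one utterance, plus an enrolled reference.
frames = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]]
enrolled = [0.9, 0.1, 0.5]
accepted = is_likely_same_speaker(frames, enrolled)
```

Per claim 11, the reference would itself be built from activations at the same layer for a past utterance of the speaker, so both vectors live in the same space and no retraining on the enrolled speaker is needed.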

Assignees

Google Inc

Inventors

Classifications

  • Combinations of networks · CPC title

  • involving management of server-side video buffer · CPC title

  • by partially encrypting, e.g. encrypting the ending portion of a movie · CPC title

  • Recoverability · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016293167A1 cover?
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing speaker verification. In one aspect, a method includes accessing a neural network having an input layer that provides inputs to a first hidden layer whose nodes are respectively connected to only a proper subset of the inputs from the input layer. Speech data that corresponds to a p…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification H04N21/23406. Mapped technology areas include Electricity.
When was this patent published?
Publication date Oct 6, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).