System and method for voice recognition
US-2017125020-A1 · May 4, 2017 · US
US2016293167A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016293167-A1 |
| Application number | US-201615179717-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 10, 2016 |
| Priority date | Oct 10, 2013 |
| Publication date | Oct 6, 2016 |
| Grant date | — |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing speaker verification. In one aspect, a method includes accessing a neural network having an input layer that provides inputs to a first hidden layer whose nodes are respectively connected to only a proper subset of the inputs from the input layer. Speech data that corresponds to a particular utterance may be provided as input to the input layer of the neural network. A representation of activations that occur in response to the speech data at a particular layer of the neural network that was configured as a hidden layer during training of the neural network may be generated. A determination of whether the particular utterance was likely spoken by a particular speaker may be made based at least on the generated representation. An indication of whether the particular utterance was likely spoken by the particular speaker may be provided.
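The abstract's central structural idea, a hidden layer whose nodes each see only a proper subset of the inputs, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and does not come from the publication: the 8 x 10 time-frequency input grid, the patch and stride sizes, the ReLU nonlinearity, the random weights, and the helper names `local_patches` and `locally_connected_forward`.

```python
import random

def local_patches(n_time, n_freq, patch_t, patch_f, stride_t, stride_f):
    """List, for each hidden node, the flat input indices it is connected to."""
    patches = []
    for t0 in range(0, n_time - patch_t + 1, stride_t):
        for f0 in range(0, n_freq - patch_f + 1, stride_f):
            # Each node covers a contiguous block of time frames and frequency bins.
            patches.append([t * n_freq + f
                            for t in range(t0, t0 + patch_t)
                            for f in range(f0, f0 + patch_f)])
    return patches

def locally_connected_forward(x, patches, weights, biases):
    """ReLU activation per node, computed only from that node's own patch."""
    out = []
    for idx, w, b in zip(patches, weights, biases):
        total = b + sum(wi * x[i] for wi, i in zip(w, idx))
        out.append(max(0.0, total))
    return out

random.seed(0)
N_TIME, N_FREQ = 8, 10  # hypothetical input grid: 8 frames x 10 frequency bins
patches = local_patches(N_TIME, N_FREQ, patch_t=4, patch_f=5, stride_t=2, stride_f=5)
# Unlike a convolutional layer, every node keeps its own weights (no sharing).
weights = [[random.gauss(0, 1) for _ in idx] for idx in patches]
biases = [0.0] * len(patches)
x = [random.gauss(0, 1) for _ in range(N_TIME * N_FREQ)]

hidden = locally_connected_forward(x, patches, weights, biases)
fraction_connected = len(patches[0]) / len(x)
```

With these illustrative sizes, each node stores 20 weights for an 80-value input, so a node is connected to a quarter of the inputs and holds fewer weights than a fully connected node would.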
Opening claim text (preview).
1. A computer-implemented method comprising: accessing a neural network having an input layer and one or more hidden layers, wherein at least one hidden layer of the one or more hidden layers has nodes that are respectively connected to only a proper subset of the inputs from a previous layer that provides input to the at least one hidden layer; inputting, to the input layer of the neural network, speech data that corresponds to a particular utterance; generating a representation of activations that occur, in response to inputting the speech data that corresponds to the particular utterance to the input layer, at a particular layer of the neural network that was configured as a hidden layer during training of the neural network; determining, based at least on the generated representation, whether the particular utterance was likely spoken by a particular speaker; and providing an indication of whether the particular utterance was likely spoken by the particular speaker.
2. The method of claim 1, wherein the at least one hidden layer is a locally-connected layer configured such that nodes at the at least one hidden layer respectively receive input from different subsets of data from the previous layer.
3. The method of claim 1, wherein each of the nodes of the at least one hidden layer receives input from a localized region of the outputs of the previous layer.
4. The method of claim 3, wherein each of the nodes of the at least one hidden layer receives input from a proper subset of the outputs of the previous layer that is localized in time.
5. The method of claim 3, wherein each of the nodes of the at least one hidden layer receives input from a proper subset of the outputs of the previous layer that is localized in frequency.
6. The method of claim 1, wherein each of the nodes of the at least one hidden layer receives input from a respective subset of inputs from the previous layer, the respective subset being localized in time and in frequency.
7. The method of claim 6, wherein the inputs provided by the previous layer indicate characteristics of the utterance at a first range of frequencies during each time frame in a first range of time; wherein for each of at least some of the nodes of the at least one hidden layer, the node is only connected to inputs from the previous layer that indicate characteristics of the utterance for a second range of frequencies during each time frame in a second range of time, wherein the second range of frequencies is a proper subset of the first range of frequencies and the second range of time is a proper subset of the first range of time.
8. The method of claim 1, wherein the previous layer provides a number of inputs to the at least one hidden layer; wherein, for each of the nodes of the at least one hidden layer, the neural network comprises a number of stored weight values that is less than the number of inputs to the at least one hidden layer.
9. The method of claim 1, wherein the at least one hidden layer is a convolutional layer.
10. The method of claim 9, wherein at least a group of the nodes of the at least one hidden layer are associated with a same set of weight values, wherein the neural network applies the same set of weight values to different subsets of the input for different nodes in the group.
11. The method of claim 1, comprising: comparing the generated representation with a reference representation of activations occurring at the particular layer of the neural network in response to speech data that corresponds to a past utterance of the particular speaker; and wherein determining whether the particular utterance was likely spoken by the particular speaker based at least on the generated representation comprises: based on comparing the generated representation and the reference representation, determining whether the particular utterance was likely spoken by the particular speaker.
12. The method of claim 1, wherein determining whether the particular utterance was likely spoken by the particular speaker based at least on the generated representation comprises: determining a cosine distance between the generated representation and a reference representation corresponding to the particular speaker; determining that the cosine distance satisfies a threshold; and based on determining that the cosine distance satisfies the threshold, determining that the particular utterance was likely spoken by the particular speaker.
13. The method of claim 1, further comprising dividing the speech data corresponding to the particular utterance into frames; and wherein generating the representation of activations occurring at the particular layer of the neural network comprises: determining, for each of multiple different frames of the speech data, a corresponding set of activations occurring at the particular layer of the neural network; and generating the representation of the activations occurring at the particular layer by averaging the sets of activations that respectively correspond to the multiple different frames.
14. The method of claim 1, wherein accessing the neural network comprises accessing a trained neural network that is not trained using speech of the particular speaker.
15. The method of claim 14, wherein accessing the neural network comprises: accessing a neural network having nodes at the first hidden layer that are each connected to a different subset of the inputs from the input layer, wherein the neural network has been trained based on activations occurring at an output layer located downstream from the particular layer.
16. The method of claim 1, wherein accessing the neural network comprises accessing, by a user device, a neural network stored at the user device.
17. The method of claim 1, comprising detecting the particular utterance at a mobile device that stores the neural network; wherein determining whether the particular utterance was likely spoken by the particular speaker comprises determining that the particular utterance was likely spoken by the particular speaker; and wherein providing an indication of whether the particular utterance was likely spoken by the particular speaker comprises unlocking or waking up the mobile device in response to determining that the particular utterance was likely spoken by the particular speaker.
18. The method of claim 1, wherein each node of the at least one hidden layer is connected to between 5% and 50% of the inputs from the previous layer.
19. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: accessing a neural network having an input layer and one or more hidden layers, wherein at least one hidden layer of the one or more hidden layers has nodes that are respectively connected to only a proper subset of the inputs from a previous layer that provides input to the at least one hidden layer; inputting, to the input layer of the neural network, speech data that corresponds to a particular utterance; generating a representation of activations that occur, in response to inputting the speech data that corresponds to the particular utterance to the input layer, at a particular layer of the neural network that was configured as a hidden layer during training of the neural network; determining, based at least on the generated representation, whether the particular utterance was lik
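The frame-averaging and cosine-distance steps recited in the claims can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the toy activation values, the convention that cosine distance is one minus cosine similarity, the reading of "satisfies a threshold" as "distance at or below a maximum", and the 0.2 default are all hypothetical.

```python
import math

def average_activations(frame_activations):
    """Average per-frame hidden-layer activations into one utterance-level vector."""
    n = len(frame_activations)
    return [sum(column) / n for column in zip(*frame_activations)]

def cosine_distance(a, b):
    """Return 1 - cosine similarity; treat a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def likely_same_speaker(representation, reference, max_distance=0.2):
    """Assumed decision rule: accept when the distance is at or below max_distance."""
    return cosine_distance(representation, reference) <= max_distance

# Toy activations for two frames of one utterance (hypothetical values).
frames = [[1.0, 0.0, 2.0],
          [3.0, 0.0, 2.0]]
representation = average_activations(frames)   # [2.0, 0.0, 2.0]

enrolled_reference = [1.0, 0.0, 1.0]   # same direction as the representation
other_reference = [0.0, 1.0, 0.0]      # orthogonal to the representation

accepted = likely_same_speaker(representation, enrolled_reference)
rejected = not likely_same_speaker(representation, other_reference)
```

Because cosine distance depends only on direction, the enrolled reference matches even though its magnitude differs from the averaged representation.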
Combinations of networks · CPC title
involving management of server-side video buffer · CPC title
by partially encrypting, e.g. encrypting the ending portion of a movie · CPC title
Recoverability · CPC title
Supervised learning · CPC title