Methods and apparatus for detecting a voice command
US-9112984-B2 · Aug 18, 2015 · US
US9401143B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9401143-B2 |
| Application number | US-201514663610-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 20, 2015 |
| Priority date | Mar 24, 2014 |
| Publication date | Jul 26, 2016 |
| Grant date | Jul 26, 2016 |
Official abstract text for this publication.
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving data representing acoustic characteristics of a user's voice; selecting a cluster for the data from among a plurality of clusters, where each cluster includes a plurality of vectors, and where each cluster is associated with a speech model trained by a neural network using at least one or more vectors of the plurality of vectors in the respective cluster; and in response to receiving one or more utterances of the user, providing the speech model associated with the cluster for transcribing the one or more utterances.
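The flow in the abstract can be read as a lookup: map the user's acoustic data to a cluster, then hand back the speech model stored for that cluster. The following is a minimal Python sketch under stated assumptions, not the patented implementation. The names `ClusterRegistry`, `select_cluster`, and `model_for` are hypothetical. The selection rule is supplied by the caller because the abstract does not fix one. Speech models are represented by opaque string identifiers rather than trained neural networks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


@dataclass
class ClusterRegistry:
    """Hypothetical container pairing each cluster id with one speech model."""

    # Rule that maps acoustic data to a cluster id (an assumption, not from the patent).
    select_cluster: Callable[[Vector], int]
    # Cluster id -> identifier of the speech model associated with that cluster.
    models: Dict[int, str] = field(default_factory=dict)

    def model_for(self, acoustic_data: Vector) -> str:
        """Select a cluster for the data and return that cluster's model."""
        cluster_id = self.select_cluster(acoustic_data)
        return self.models[cluster_id]


# Toy selection rule for illustration only: split on the sign of the first component.
registry = ClusterRegistry(
    select_cluster=lambda v: 0 if v[0] < 0 else 1,
    models={0: "model_cluster_0", 1: "model_cluster_1"},
)
chosen = registry.model_for([0.7, -0.2])
```

Keeping the selection rule separate from the model table mirrors the abstract's two steps (selecting a cluster, then providing its model), so either part can be swapped without touching the other.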
Opening claim text (preview).
What is claimed is:

1. A method comprising: receiving data representing acoustic characteristics of a user's voice; selecting a cluster for the data from among a plurality of clusters, wherein each cluster includes a plurality of vectors, and wherein each cluster is associated with a speech model trained by a neural network using at least one or more vectors of the plurality of vectors in the respective cluster; and in response to receiving one or more utterances of the user, providing the speech model associated with the cluster for transcribing the one or more utterances.
2. The method of claim 1, wherein the plurality of clusters are segmented based on vector distances to centroids of the clusters, and wherein selecting a cluster for the data comprises: determining a vector based on the data; determining that a vector distance between the vector and the cluster is a shortest distance compared to vector distances between the vector and other clusters of the plurality of clusters; and based on determining that the vector distance between the vector and the cluster is the shortest distance, selecting the cluster for the vector.
3. The method of claim 1, wherein selecting a cluster for the data further comprises: receiving data indicative of latent variables of multivariate factor analysis of an audio signal of the user; and selecting an updated cluster using the latent variables.
4. The method of claim 1, comprising: receiving a feature vector that models audio characteristics of a portion of an utterance of the user; and determining, using the feature vector as an input, a candidate transcription for the utterance based on an output of the neural network of the speech model.
5. The method of claim 1, wherein providing the speech model for transcribing the one or more utterances comprises providing the speech model to a computing device of the user.
6. The method of claim 1, wherein the acoustic characteristics of the user includes a gender of the user, an accent of the user, a pitch of an utterance of the user, background noises around the user, or age group of the user.
7. The method of claim 1, wherein the data is an i-vector, and wherein the neural network is trained using the i-vectors in the cluster and one or more i-vectors in one or more neighboring clusters.
8. The method of claim 1, wherein each cluster includes a distinct plurality of vectors, and wherein each cluster is associated with a distinct speech model.
9. A non-transitory computer-readable medium storing software having stored thereon instructions, which, when executed by one or more computers, cause the one or more computers to perform operations of: receiving data representing acoustic characteristics of a user's voice; selecting a cluster for the data from among a plurality of clusters, wherein each cluster includes a plurality of vectors, and wherein each cluster is associated with a speech model trained by a neural network using at least one or more vectors of the plurality of vectors in the respective cluster; and in response to receiving one or more utterances of the user, providing the speech model associated with the cluster for transcribing the one or more utterances.
10. The non-transitory computer-readable medium of claim 9, wherein the plurality of clusters are segmented based on vector distances to centroids of the clusters, and wherein selecting a cluster for the data comprises: determining a vector based on the data; determining that a vector distance between the vector and the cluster is a shortest distance compared to vector distances between the vector and other clusters of the plurality of clusters; and based on determining that the vector distance between the vector and the cluster is the shortest distance, selecting the cluster for the vector.
11. The non-transitory computer-readable medium of claim 9, wherein selecting a cluster for the data further comprises: receiving data indicative of latent variables of multivariate factor analysis of an audio signal of the user; and selecting an updated cluster using the latent variables.
12. The non-transitory computer-readable medium of claim 9, wherein the operations comprise: receiving a feature vector that models audio characteristics of a portion of an utterance of the user; and determining, using the feature vector as an input, a candidate transcription for the utterance based on an output of the neural network of the speech model.
13. The non-transitory computer-readable medium of claim 9, wherein providing the speech model for transcribing the one or more utterances comprises providing the speech model to a computing device of the user.
14. The non-transitory computer-readable medium of claim 9, wherein the data is an i-vector, and wherein the neural network is trained using the i-vectors in the cluster and one or more i-vectors in one or more neighboring clusters.
15. A system comprising: one or more processors and one or more computer storage media storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising: receiving data representing acoustic characteristics of a user's voice; selecting a cluster for the data from among a plurality of clusters, wherein each cluster includes a plurality of vectors, and wherein each cluster is associated with a speech model trained by a neural network using at least one or more vectors of the plurality of vectors in the respective cluster; and in response to receiving one or more utterances of the user, providing the speech model associated with the cluster for transcribing the one or more utterances.
16. The system of claim 15, wherein the plurality of clusters are segmented based on vector distances to centroids of the clusters, and wherein selecting a cluster for the data comprises: determining a vector based on the data; determining that a vector distance between the vector and the cluster is a shortest distance compared to vector distances between the vector and other clusters of the plurality of clusters; and based on determining that the vector distance between the vector and the cluster is the shortest distance, selecting the cluster for the vector.
17. The system of claim 15, wherein selecting a cluster for the data further comprises: receiving data indicative of latent variables of multivariate factor analysis of an audio signal of the user; and selecting an updated cluster using the latent variables.
18. The system of claim 15, wherein the operations comprise: receiving a feature vector that models audio characteristics of a portion of an utterance of the user; and determining, using the feature vector as an input, a candidate transcription for the utterance based on an output of the neural network of the speech model.
19. The system of claim 15, wherein providing the speech model for transcribing the one or more utterances comprises providing the speech model to a computing device of the user.
20. The system of claim 15, wherein the data is an i-vector, and wherein the neural network is trained using the i-vectors in the cluster and one or more i-vectors in one or more neighboring clusters.
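The cluster selection recited in claims 2, 10, and 16 (pick the cluster at the shortest vector distance) and the training-data pooling recited in claims 7, 14, and 20 (a cluster's i-vectors plus i-vectors from neighboring clusters) can be illustrated with a short Python sketch. Several things here are assumptions rather than statements from the claims: the distance is taken to be Euclidean distance to each cluster's centroid, "neighboring" is taken to mean the clusters with the closest centroids, all vectors of a neighbor are pooled, and the function names `nearest_cluster` and `training_vectors` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_cluster(vector: Vector, centroids: Dict[int, Vector]) -> int:
    """Return the id of the cluster whose centroid is closest to the vector.

    Euclidean distance is an assumption; the claims say only "vector distance".
    """
    return min(centroids, key=lambda cid: math.dist(vector, centroids[cid]))


def training_vectors(
    cluster_id: int,
    members: Dict[int, List[Vector]],
    centroids: Dict[int, Vector],
    num_neighbors: int = 1,
) -> List[Vector]:
    """Pool a cluster's vectors with those of its closest neighboring clusters.

    Treating "neighboring" as centroid proximity is an assumption.
    """
    # Rank the other clusters by distance between centroids.
    others = sorted(
        (cid for cid in centroids if cid != cluster_id),
        key=lambda cid: math.dist(centroids[cluster_id], centroids[cid]),
    )
    pooled = list(members[cluster_id])
    for cid in others[:num_neighbors]:
        pooled.extend(members[cid])
    return pooled


# Toy two-dimensional data; real i-vectors have far more dimensions.
centroids = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (5.0, 5.0)}
members = {0: [(0.1, 0.0), (-0.1, 0.1)], 1: [(0.9, 0.1)], 2: [(5.1, 4.9)]}

selected = nearest_cluster((0.8, 0.2), centroids)
pooled = training_vectors(selected, members, centroids)
```

With this toy data the query vector falls into cluster 1, and the pooled training set adds the two vectors of cluster 0, whose centroid is the closest to cluster 1's centroid.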
using context dependencies, e.g. language models · CPC title
Training · CPC title
Creating reference templates; Clustering · CPC title