Method and apparatus for activating application by speech input
US-2015302855-A1 · Oct 22, 2015 · US
US9697828B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9697828-B1 |
| Application number | US-201414311163-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 20, 2014 |
| Priority date | Jun 20, 2014 |
| Publication date | Jul 4, 2017 |
| Grant date | Jul 4, 2017 |
Official abstract text for this publication.
Features are disclosed for detecting words in audio using environmental information and/or contextual information in addition to acoustic features associated with the words to be detected. A detection model can be generated and used to determine whether a particular word, such as a keyword or “wake word,” has been uttered. The detection model can operate on features derived from an audio signal, contextual information associated with generation of the audio signal, and the like. In some embodiments, the detection model can be customized for particular users or groups of users based on usage patterns associated with the users.
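As one illustration of how a detection model might fold environmental and contextual information into a keyword decision, the following minimal sketch uses a logistic scorer. The feature names, weights, bias, and threshold are all hypothetical and are not taken from the patent, which does not prescribe this particular model.

```python
import math
from dataclasses import dataclass


@dataclass
class Features:
    # All three fields are hypothetical stand-ins for the kinds of
    # information the abstract describes.
    acoustic_score: float  # acoustic evidence for the wake word, 0..1
    noise_level: float     # environmental: 0 (quiet) .. 1 (loud)
    usual_hour: bool       # contextual: user often speaks at this hour


# Illustrative weights only; a real model would learn these from data.
WEIGHTS = {"acoustic_score": 6.0, "noise_level": -2.0, "usual_hour": 1.0}
BIAS = -3.0


def detection_score(f: Features) -> float:
    """Combine the features into a score in (0, 1) with a logistic function."""
    z = (
        BIAS
        + WEIGHTS["acoustic_score"] * f.acoustic_score
        + WEIGHTS["noise_level"] * f.noise_level
        + WEIGHTS["usual_hour"] * (1.0 if f.usual_hour else 0.0)
    )
    return 1.0 / (1.0 + math.exp(-z))


def is_wake_word(f: Features, threshold: float = 0.5) -> bool:
    """A score greater than the threshold is treated as a detection."""
    return detection_score(f) > threshold


# Same acoustic evidence, different environment and context.
noisy_unusual = Features(acoustic_score=0.5, noise_level=0.5, usual_hour=False)
quiet_usual = Features(acoustic_score=0.5, noise_level=0.0, usual_hour=True)
print(is_wake_word(noisy_unusual), is_wake_word(quiet_usual))
```

In this sketch the two inputs share the same acoustic score, yet only the quiet, typical-hour case crosses the threshold, which is the kind of effect the abstract attributes to using information beyond acoustic features.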
Opening claim text (preview).
What is claimed is:

1. A system comprising: a computer-readable memory storing executable instructions; and one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least: obtain from a client device: an audio signal, wherein a first portion of the audio signal comprises audio data likely corresponding to a wake word, and wherein a second portion of the audio signal does not comprise audio data likely corresponding to the wake word; contextual information associated with the audio signal; and information indicating the first portion of the audio signal comprises audio data likely corresponding to the wake word; obtain acoustic information and environmental information from the first portion of the audio signal, wherein the acoustic information reflects one or more characteristics of a voice in the audio signal, and wherein the environmental information reflects one or more characteristics of an environment in which sound in the audio signal was recorded; determine whether audio data corresponding to the wake word is present in the audio signal using a server-side detection model configured to generate a detection score using the contextual information, the environmental information, the acoustic information, and natural language understanding results generated based at least partly on at least one of the audio signal or a subsequent audio signal, wherein a detection score greater than a detection threshold indicates that audio data corresponding to the wake word is present in the audio signal; in response to determining that audio data corresponding to the wake word is present in the audio signal, perform an action corresponding to a request in the audio signal; and in response to determining that audio data corresponding to the wake word is not present in the audio signal, close an audio signal stream from the client device.

2. The system of claim 1, wherein the server-side detection model comprises a statistical classifier or a probabilistic logic network.

3. The system of claim 1, wherein the server-side detection model is further configured to generate the detection score using automatic speech recognition results generated based at least partly on at least one of the audio signal or the subsequent audio signal.

4. The system of claim 1, wherein the one or more processors are further programmed to: store information regarding determining whether audio data corresponding to the wake word is present in the audio signal; train a customized client-side detection model using training data based at least partly on the information regarding determining whether audio data corresponding to the wake word is present in the audio signal; and transmit the customized client-side detection model to the client device.

5. A computer-implemented method comprising: as implemented by one or more computing devices configured to execute specific instructions, obtaining from a client device: audio input comprising a plurality of portions of audio data, wherein less than all of the plurality of portions of audio data comprise audio data corresponding to a keyword detected by the client device; contextual information associated with the audio input; and information indicating a portion of audio data, of the plurality of portions of the audio data, that likely corresponds to the keyword; obtaining acoustic information and environmental information from the portion of audio data that likely corresponds to the keyword; determining that the portion of audio data corresponds to the keyword using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and performing an action corresponding to a request in the audio input.

6. The computer-implemented method of claim 5, wherein the detection model is trained using training data comprising contextual information, acoustic information, environmental information, and linguistic information.

7. The computer-implemented method of claim 5, wherein the acoustic information comprises natural language understanding results generated using the audio input.

8. The computer-implemented method of claim 5, wherein the acoustic information reflects one or more characteristics of a voice in the audio input.

9. The computer-implemented method of claim 5, wherein the environmental information reflects one or more characteristics of an environment in which sound in the audio input was captured.

10. The computer-implemented method of claim 5, wherein the contextual information reflects at least one of a time at which the audio input was generated, a geographic location of an audio input device, or a physical orientation of the audio input device with respect to a user.

11. The computer-implemented method of claim 5, wherein a detection score failing to satisfy the detection threshold indicates that the portion of the audio input does not correspond to the keyword.

12. The computer-implemented method of claim 5, wherein the detection model comprises a probabilistic logic network comprising a rule defined by one of a system administrator or a system user.

13. The computer-implemented method of claim 5, wherein the detection model comprises a probabilistic logic network comprising a rule automatically generated using a machine learning process.

14. The computer-implemented method of claim 5, further comprising: storing information regarding use of the detection model to determine whether the portion of the audio input corresponds to the keyword; and training a customized client-side detection model using training data based at least partly on the information regarding use of the detection model.

15. The computer-implemented method of claim 14, further comprising providing the customized client-side detection model to one or more client computing devices.

16. Non-transitory computer-readable storage comprising executable code that, when executed, causes one or more computing devices to perform a process comprising: obtaining from a client device: audio input, wherein less than all of the audio input comprises audio data likely corresponding to a keyword; contextual information associated with the audio input; and information indicating a portion of audio input, of a plurality of portions of the audio input, that likely corresponds to the keyword; obtaining acoustic information and environmental information from the portion of the audio input; determining that audio data corresponding to the keyword is present in the audio input using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and performing an action corresponding to a request in the audio input.

17. The non-transitory computer-readable storage of claim 16, wherein the detection model is trained using training data comprising contextual information, acoustic information, environmental information, and linguistic information.

18. The non-transitory computer-readable storage of claim 16, wherein the acoustic information comprises natural lan
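The server-side decision recited in claim 1 (score the client-flagged portion, then either act on the request or close the audio stream) can be sketched roughly as below. This is an illustrative outline only: `ClientUpload`, `AudioStream`, `verify_and_respond`, and the callback parameters are hypothetical names, and the scoring function is left as a pluggable placeholder rather than the detection model the patent describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AudioStream:
    """Hypothetical handle for the audio stream coming from the client."""
    is_open: bool = True

    def close(self) -> None:
        self.is_open = False


@dataclass
class ClientUpload:
    """Hypothetical bundle of what the client sends to the server."""
    audio: List[float]          # audio samples
    keyword_start: int          # start index of the portion the client flagged
    keyword_end: int            # end index (exclusive) of the flagged portion
    context: Dict[str, object] = field(default_factory=dict)


def verify_and_respond(
    upload: ClientUpload,
    stream: AudioStream,
    score_fn: Callable[[List[float], Dict[str, object]], float],
    threshold: float,
    perform_action: Callable[[ClientUpload], None],
) -> bool:
    """Re-check the client's detection and branch on the result.

    Returns True if the wake word is confirmed, False otherwise.
    """
    # Only the portion flagged by the client is scored in this sketch.
    portion = upload.audio[upload.keyword_start:upload.keyword_end]
    score = score_fn(portion, upload.context)

    if score > threshold:
        # Confirmed: act on the request carried in the audio.
        perform_action(upload)
        return True

    # Not confirmed: stop receiving audio from the client.
    stream.close()
    return False
```

A caller would supply its own `score_fn`, for example a trained classifier wrapped in a function, along with whatever `perform_action` handler suits the application.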
using natural language modelling · CPC title
Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title
Speech classification or search · CPC title
Word spotting · CPC title