Keyword detection modeling using contextual and environmental information

US9697828B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9697828-B1
Application number  US-201414311163-A
Country             US
Kind code           B1
Filing date         Jun 20, 2014
Priority date       Jun 20, 2014
Publication date    Jul 4, 2017
Grant date          Jul 4, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Features are disclosed for detecting words in audio using environmental information and/or contextual information in addition to acoustic features associated with the words to be detected. A detection model can be generated and used to determine whether a particular word, such as a keyword or “wake word,” has been uttered. The detection model can operate on features derived from an audio signal, contextual information associated with generation of the audio signal, and the like. In some embodiments, the detection model can be customized for particular users or groups of users based on usage patterns associated with the users.
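
To make the idea of a detection score concrete, here is a minimal sketch of one way acoustic, environmental, and contextual features could be fused into a single score and compared with a threshold. It assumes a simple logistic model; the feature names (`acoustic_score`, `noise_level_db`, `hour_of_day`), the weights, the bias, and the "usual hours" range are all hypothetical and are not taken from the patent.

```python
import math
from dataclasses import dataclass


@dataclass
class Features:
    # All three fields are hypothetical illustrations of the feature types
    # named in the abstract, not fields defined by the patent.
    acoustic_score: float   # acoustic: wake-word likelihood in [0, 1]
    noise_level_db: float   # environmental: estimated background noise
    hour_of_day: int        # contextual: local hour when audio was captured


# Illustrative weights only; a real model would learn these from data.
WEIGHTS = {"acoustic": 6.0, "noise": -0.05, "usual_hours": 1.0}
BIAS = -3.0


def detection_score(f: Features, usual_hours=range(7, 23)) -> float:
    """Combine the features with a logistic function into a score in (0, 1)."""
    in_usual_hours = 1.0 if f.hour_of_day in usual_hours else 0.0
    z = (
        BIAS
        + WEIGHTS["acoustic"] * f.acoustic_score
        + WEIGHTS["noise"] * f.noise_level_db
        + WEIGHTS["usual_hours"] * in_usual_hours
    )
    return 1.0 / (1.0 + math.exp(-z))


def wake_word_present(f: Features, threshold: float = 0.5) -> bool:
    """A score above the threshold is treated as a detection."""
    return detection_score(f) > threshold


quiet_morning = Features(acoustic_score=0.9, noise_level_db=30.0, hour_of_day=9)
noisy_night = Features(acoustic_score=0.4, noise_level_db=60.0, hour_of_day=3)
```

In this sketch the same acoustic evidence can lead to different decisions depending on noise and time of day, which is the kind of effect the abstract attributes to environmental and contextual information. Customizing for a user could amount to changing `usual_hours` or the weights, though the abstract does not say how customization is done.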

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a computer-readable memory storing executable instructions; and one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least: obtain from a client device: an audio signal, wherein a first portion of the audio signal comprises audio data likely corresponding to a wake word, and wherein a second portion of the audio signal does not comprise audio data likely corresponding to the wake word; contextual information associated with the audio signal; and information indicating the first portion of the audio signal comprises audio data likely corresponding to the wake word; obtain acoustic information and environmental information from the first portion of the audio signal, wherein the acoustic information reflects one or more characteristics of a voice in the audio signal, and wherein the environmental information reflects one or more characteristics of an environment in which sound in the audio signal was recorded; determine whether audio data corresponding to the wake word is present in the audio signal using a server-side detection model configured to generate a detection score using the contextual information, the environmental information, the acoustic information, and natural language understanding results generated based at least partly on at least one of the audio signal or a subsequent audio signal, wherein a detection score greater than a detection threshold indicates that audio data corresponding to the wake word is present in the audio signal; in response to determining that audio data corresponding to the wake word is present in the audio signal, perform an action corresponding to a request in the audio signal; and in response to determining that audio data corresponding to the wake word is not present in the audio signal, close an audio signal stream from the client device.

2. The system of claim 1, wherein the server-side detection model comprises a statistical classifier or a probabilistic logic network.

3. The system of claim 1, wherein the server-side detection model is further configured to generate the detection score using automatic speech recognition results generated based at least partly on at least one of the audio signal or the subsequent audio signal.

4. The system of claim 1, wherein the one or more processors are further programmed to: store information regarding determining whether audio data corresponding to the wake word is present in the audio signal; train a customized client-side detection model using training data based at least partly on the information regarding determining whether audio data corresponding to the wake word is present in the audio signal; and transmit the customized client-side detection model to the client device.

5. A computer-implemented method comprising: as implemented by one or more computing devices configured to execute specific instructions, obtaining from a client device: audio input comprising a plurality of portions of audio data, wherein less than all of the plurality of portions of audio data comprise audio data corresponding to a keyword detected by the client device; contextual information associated with the audio input; and information indicating a portion of audio data, of the plurality of portions of the audio data, that likely corresponds to the keyword; obtaining acoustic information and environmental information from the portion of audio data that likely corresponds to the keyword; determining that the portion of audio data corresponds to the keyword using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and performing an action corresponding to a request in the audio input.

6. The computer-implemented method of claim 5, wherein the detection model is trained using training data comprising contextual information, acoustic information, environmental information, and linguistic information.

7. The computer-implemented method of claim 5, wherein the acoustic information comprises natural language understanding results generated using the audio input.

8. The computer-implemented method of claim 5, wherein the acoustic information reflects one or more characteristics of a voice in the audio input.

9. The computer-implemented method of claim 5, wherein the environmental information reflects one or more characteristics of an environment in which sound in the audio input was captured.

10. The computer-implemented method of claim 5, wherein the contextual information reflects at least one of a time at which the audio input was generated, a geographic location of an audio input device, or a physical orientation of the audio input device with respect to a user.

11. The computer-implemented method of claim 5, wherein a detection score failing to satisfy the detection threshold indicates that the portion of the audio input does not correspond to the keyword.

12. The computer-implemented method of claim 5, wherein the detection model comprises a probabilistic logic network comprising a rule defined by one of a system administrator or a system user.

13. The computer-implemented method of claim 5, wherein the detection model comprises a probabilistic logic network comprising a rule automatically generated using a machine learning process.

14. The computer-implemented method of claim 5, further comprising: storing information regarding use of the detection model to determine whether the portion of the audio input corresponds to the keyword; and training a customized client-side detection model using training data based at least partly on the information regarding use of the detection model.

15. The computer-implemented method of claim 14, further comprising providing the customized client-side detection model to one or more client computing devices.

16. Non-transitory computer-readable storage comprising executable code that, when executed, causes one or more computing devices to perform a process comprising: obtaining from a client device: audio input, wherein less than all of the audio input comprises audio data likely corresponding to a keyword; contextual information associated with the audio input; and information indicating a portion of audio input, of a plurality of portions of the audio input, that likely corresponds to the keyword; obtaining acoustic information and environmental information from the portion of the audio input; determining that audio data corresponding to the keyword is present in the audio input using a detection model configured to generate a detection score using the audio input, the contextual information, the environmental information, and the acoustic information, wherein a detection score satisfying a detection threshold indicates audio data corresponding to the keyword is present in the audio input; and performing an action corresponding to a request in the audio input.

17. The non-transitory computer-readable storage of claim 16, wherein the detection model is trained using training data comprising contextual information, acoustic information, environmental information, and linguistic information.

18. The non-transitory computer-readable storage of claim 16, wherein the acoustic information comprises natural lan
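
The server-side flow recited in claims 1 and 4 can be sketched as follows: the client uploads audio along with the span it flagged as the likely wake word, the server rescores that span, and then either performs the requested action or closes the stream, logging each decision for later training. This is only an illustrative sketch. The class names, the `score_fn` callback, the `"perform_action"` and `"close_stream"` return values, and the toy amplitude-based scorer are all hypothetical stand-ins, not the patent's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ClientUpload:
    audio: List[float]           # audio samples streamed by the client
    wake_span: Tuple[int, int]   # (start, end) indices the client flagged
    context: Dict[str, object]   # contextual information, e.g. time, location


@dataclass
class Server:
    # score_fn stands in for the server-side detection model (hypothetical).
    score_fn: Callable[[List[float], Dict[str, object]], float]
    threshold: float = 0.5
    decision_log: list = field(default_factory=list)

    def handle(self, upload: ClientUpload) -> str:
        start, end = upload.wake_span
        segment = upload.audio[start:end]   # only the flagged portion
        score = self.score_fn(segment, upload.context)
        accepted = score > self.threshold
        # Stored decisions could later serve as training data for a
        # customized client-side model, as in claims 4 and 14.
        self.decision_log.append((upload.context, score, accepted))
        return "perform_action" if accepted else "close_stream"


def toy_score(segment: List[float], context: Dict[str, object]) -> float:
    """Toy stand-in: mean absolute amplitude, capped at 1.0."""
    if not segment:
        return 0.0
    return min(1.0, sum(abs(x) for x in segment) / len(segment))


server = Server(score_fn=toy_score)
loud = ClientUpload(audio=[0.0, 0.8, -0.8, 0.0], wake_span=(1, 3), context={"hour": 9})
faint = ClientUpload(audio=[0.0, 0.1, -0.1, 0.0], wake_span=(1, 3), context={"hour": 3})
loud_result = server.handle(loud)
faint_result = server.handle(faint)
```

Passing the scorer in as a callback keeps the accept-or-close logic separate from the model itself, so a statistical classifier or a rule-based model (claim 2 names both options) could be swapped in without changing the flow.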

Assignees

Amazon Tech Inc

Inventors

Classifications

  • G10L15/18 (Primary)

    using natural language modelling · CPC title

  • Distributed recognition, e.g. in client-server systems, for mobile phones or network applications · CPC title

  • Speech classification or search · CPC title

  • Word spotting · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9697828B1 cover?
Features are disclosed for detecting words in audio using environmental information and/or contextual information in addition to acoustic features associated with the words to be detected. A detection model can be generated and used to determine whether a particular word, such as a keyword or “wake word,” has been uttered. The detection model can operate on features derived from an audio signal…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G10L15/18. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 4, 2017 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).