US11551685B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11551685-B2 |
| Application number | US-202016822744-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 18, 2020 |
| Priority date | Mar 18, 2020 |
| Publication date | Jan 10, 2023 |
| Grant date | Jan 10, 2023 |
Official abstract text for this publication.
A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-directed speech) with low latency, enabling the device to quickly lower a volume of output audio and/or perform other actions in response to a potential voice command. In addition, the architecture includes a device directed classifier that processes an entire utterance and corresponding semantic information and detects device-directed speech with high accuracy. Using the device directed classifier, the device may reject the interrupt event and increase a volume of the output audio or may accept the interrupt event, causing the output audio to end and performing speech processing on the audio data.
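The two-stage behavior in the abstract (a fast interrupt detector that lowers the volume, followed by a slower device-directed classifier that accepts or rejects the interrupt) can be pictured as a small state machine. The following is a minimal sketch only, not the patented implementation: the `Player` class, its method names, and the ducking level of 0.3 are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    PLAYING = auto()   # output audio at normal volume
    DUCKED = auto()    # interrupt detected, volume lowered, decision pending
    STOPPED = auto()   # interrupt accepted, output audio ended


@dataclass
class Player:
    """Hypothetical playback controller; all names are illustrative."""
    normal_volume: float = 1.0
    ducked_volume: float = 0.3          # assumed ducking level
    volume: float = 1.0
    state: State = State.PLAYING
    sent_for_processing: list = field(default_factory=list)

    def on_interrupt_detected(self) -> None:
        """Fast stage: react to a potential voice command by ducking."""
        if self.state is State.PLAYING:
            self.volume = self.ducked_volume
            self.state = State.DUCKED

    def on_classifier_decision(self, device_directed: bool, utterance: bytes) -> None:
        """Slow stage: accept or reject the earlier interrupt event."""
        if self.state is not State.DUCKED:
            return  # no pending interrupt to resolve
        if device_directed:
            # Accept: end the output audio and forward the utterance.
            self.volume = 0.0
            self.state = State.STOPPED
            self.sent_for_processing.append(utterance)
        else:
            # Reject: restore the original volume and keep playing.
            self.volume = self.normal_volume
            self.state = State.PLAYING
```

In this sketch a rejected interrupt costs only a brief dip in volume, while an accepted one stops playback and queues the utterance for speech processing.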
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, the method comprising: receiving, at a microphone, first audio; generating first input audio data representing the first audio; processing the first input audio data using a wakeword detector of a device to determine that the first input audio data includes a representation of a wakeword; sending at least a portion of the first input audio data for speech processing; receiving first output audio data generated in response to the first input audio data; generating, using the first output audio data and a loudspeaker of the device, first output audio; receiving, at the microphone, second audio; generating second input audio data representing the second audio; processing the second input audio data using a voice activity detector of the device to determine that first speech is represented in the second input audio data; processing the second input audio data using a first classifier of the device to determine that the first speech is directed to the device, the first classifier configured to process a first quantity of audio data; detecting a first endpoint of the first speech represented in the second input audio data; determining, using the first endpoint, a portion of the second input audio data that represents the first speech; processing the portion of the second input audio data using a second classifier to determine that the first speech is directed to the device, the second classifier configured to process a second quantity of audio data that is larger than the first quantity; causing output of the first output audio to stop; and sending at least the portion of the second input audio data for speech processing.
2. The computer-implemented method of claim 1, further comprising: receiving second output audio data; generating, using the second output audio data and the loudspeaker, second output audio at a first volume level; receiving, at the microphone, third audio; generating third input audio data representing the third audio; processing the third input audio data using the voice activity detector to determine that second speech is represented in a portion of the third input audio data; processing the third input audio data using the first classifier to determine that the second speech is directed to the device; enabling a visual indicator that indicates that the device is listening; processing the third input audio data using the second classifier to determine that the second speech is not directed to the device; and disabling the visual indicator.
3. The computer-implemented method of claim 1, wherein causing the output of the first output audio to stop further comprises: determining a current position in the first output audio data; identifying a word boundary represented in the first output audio data after the current position, the word boundary indicated by first information included within the first output audio data; determining a first portion of the first output audio data ending at the word boundary; generating the first output audio using the loudspeaker and the first portion of the first output audio data; and causing the first output audio to stop after the first portion of the first output audio data.
4. The computer-implemented method of claim 1, wherein processing the second input audio data using the first classifier further comprises: determining a volume level of the second input audio data; determining first identification data associated with the first speech; determining emotion data corresponding to the first speech; determining a length of time between generating the first output audio and receiving the second audio; determining, using the first classifier, first model output data using the volume level, the first identification data, the emotion data, and the length of time; and determining that the first model output data satisfies a condition.
5. A computer-implemented method, comprising: generating first output audio using a loudspeaker associated with a device; receiving first audio data; processing the first audio data using a first component of the device to determine that a first portion of the first audio data represents first speech that is directed to the device; in response to determining that the first speech is represented in the first portion of the first audio data, performing a first action; determining that the first speech is represented in a second portion of the first audio data that includes the first portion of the first audio data; detecting, by a speech processing component of the device, an endpoint of the first speech represented in the first audio data; determining, by the speech processing component, semantic information associated with the first speech; processing, using a classifier of a second component of the device, the second portion of the first audio data and the semantic information to determine that the first speech corresponds to a first device-directed speech event; and causing speech processing to be performed on the second portion of the first audio data.
6. The computer-implemented method of claim 5, further comprising: generating second output audio using the loudspeaker; receiving second audio data; processing the second audio data using the first component to determine that speech is represented in a portion of the second audio data; performing the first action; processing the second audio data using the second component to determine a confidence value indicating a likelihood that the second audio data corresponds to a second device-directed speech event; determining that the confidence value satisfies a condition; and performing a second action.
7. The computer-implemented method of claim 5, further comprising: generating second output audio using the loudspeaker; receiving second audio data; processing the second audio data using the first component to determine that a first portion of the second audio data represents second speech that is directed to the device; performing the first action; determining that the second speech is represented in a second portion of the second audio data that includes the first portion of the second audio data; processing the second portion of the second audio data using the second component to determine that the second speech is not directed to the device; and performing a second action.
8. The computer-implemented method of claim 7, further comprising: generating training data corresponding to the second audio data, the training data indicating a rejected interrupt event; and training the first component using the training data.
9. The computer-implemented method of claim 5, further comprising: receiving output audio data; generating, using the loudspeaker and the output audio data, second output audio at a first volume level; receiving second audio data; processing the second audio data using the first component to determine that a portion of the second audio data represents second speech that is directed to the device; generating, using the loudspeaker and the output audio data, the second output audio at a second volume level that is lower than the first volume level; processing the second audio data using the second component to determine that the second audio data does not correspond to a second device-directed speech event; and generating, using the loudspeaker and the output audio data, the second output audio at the first volume level.
10. The computer-implemented method of claim 5, further comprising: generating second output audio using a loudspeaker; receiving second audio data; processing the second audio data using the f
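Claim 3 describes stopping playback at the next word boundary after the current position, so the output does not cut off mid-word. One way to picture that lookup is sketched below, under assumptions not stated in the claims: word boundaries are taken to be a sorted list of sample offsets, and the function name `stop_position` is hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def stop_position(current: int, word_boundaries: Sequence[int], total: int) -> int:
    """Return the sample offset at which playback should stop.

    Assumes `word_boundaries` is sorted ascending (an illustrative
    representation of the boundary information carried with the
    output audio data). Picks the first boundary strictly after
    `current`; if none remains, plays to the end of the audio.
    """
    i = bisect_right(word_boundaries, current)
    if i < len(word_boundaries):
        return word_boundaries[i]
    return total


# Playback is at sample 5; boundaries fall at samples 3, 8, and 12.
print(stop_position(5, [3, 8, 12], total=20))  # prints 8
```

The binary search keeps the lookup cheap even for long responses; falling back to the end of the audio is a design choice of this sketch, not something the claim specifies.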
of the speaker; Human-factor methodology · CPC title
Barge in, i.e. overridable guidance for interrupting prompts · CPC title
Speech to text systems (G10L15/08 takes precedence) · CPC title
Execution procedure of a spoken command · CPC title
of application context · CPC title