US10325591B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10325591-B1 |
| Application number | US-201414478923-A |
| Country | US |
| Kind code | B1 |
| Filing date | Sep 5, 2014 |
| Priority date | Sep 5, 2014 |
| Publication date | Jun 18, 2019 |
| Grant date | Jun 18, 2019 |
Official abstract text for this publication.
A speech interface device may capture user speech for analysis by automatic speech recognition (ASR) and natural language understanding (NLU) components. However, an audio signal representing the user speech may also contain interfering sound generated by a media player that is playing audio content such as music. Before performing ASR and NLU, a system attempts to identify the content being played by the media player, such as by querying the media player or by analyzing the audio signal. The system then obtains the same content from an available source and subtracts the audio represented by the content from the audio signal.
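The subtraction step described in the abstract can be illustrated with a minimal sketch. It assumes the reference content has already been obtained and time-aligned with the microphone signal, and it uses a normalized LMS (NLMS) adaptive filter as one common way to estimate the interfering sound before subtracting it. The text shown on this page does not name a specific filter algorithm, so NLMS is an assumption here, and the function names, tap count, and step size are all hypothetical.

```python
import math
import random


def nlms_suppress(mic, reference, taps=8, mu=0.2, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` from `mic`.

    mic:       samples containing user speech plus the interfering playback
    reference: samples of the same content from a reference source,
               assumed to be time-aligned with `mic` (an assumption of this sketch)
    Returns the residual, i.e. an interference-suppressed signal.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, reference):
        history = [x] + history[:-1]
        # Estimate of the interfering sound as heard at the microphone.
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # what remains after removing the estimate
        # Normalized LMS update: step scaled by recent reference power.
        power = sum(h * h for h in history) + eps
        step = mu * error / power
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def mean_power(samples):
    """Average squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4000
    # Stand-in for the reference content (white noise instead of real music).
    reference = [rng.uniform(-1, 1) for _ in range(n)]
    # Stand-in for user speech (a quiet tone).
    speech = [0.1 * math.sin(2 * math.pi * 0.01 * i) for i in range(n)]
    # Hypothetical room response: the playback reaches the microphone filtered.
    interference = [
        0.6 * reference[i]
        + (0.3 * reference[i - 1] if i >= 1 else 0.0)
        - (0.2 * reference[i - 3] if i >= 3 else 0.0)
        for i in range(n)
    ]
    mic = [s + v for s, v in zip(speech, interference)]

    cleaned = nlms_suppress(mic, reference)
    leftover = [c - s for c, s in zip(cleaned[n // 2:], speech[n // 2:])]
    print("interference power before:", mean_power(interference[n // 2:]))
    print("interference power after: ", mean_power(leftover))
```

The adaptive filter is used instead of a plain sample-by-sample subtraction because the sound that reaches the microphone has been altered by the loudspeaker and the room, so the raw reference samples would not cancel it exactly.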
Opening claim text (preview).
The invention claimed is:
1. A speech-based system, comprising: one or more microphones configured to produce: a first input audio signal containing user speech and an interfering sound from a media content item played by a media player, the media player and the user in proximity to the speech-based system and the user speech including at least one spoken command for the speech-based system; and a second input audio signal containing the user speech and the interfering sound from the media content item played by the media player; one or more processors; non-transitory computer-readable storage media maintaining instructions executable by the one or more processors to perform operations comprising: selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech; selecting the second input audio signal as a second directional audio signal corresponding to a direction other than the direction of the source of the user speech based at least in part on a directional audio signal corresponding in direction to a known position of the media player; analyzing the second input audio signal to determine at least one characteristic of content of the second input audio signal; requesting an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player; generating an audio signature representative of the interfering sound based at least in part on the at least one characteristic of the content of the second input audio signal; identifying a plurality of media content items that are currently accessible to the media player; selecting a particular media content item of the plurality of media content items based at least in part on the audio signature, the identity of the player content item, the temporal point, and a reference audio signature that corresponds to the particular media content item; receiving at least a portion of the particular media content item that corresponds to the interfering sound from a reference content source; and processing the first input audio signal to suppress the interfering sound based at least in part on the at least the portion of the particular media content item by subtracting the portion of the particular media content item from the first input audio in order to obtain an interference-suppressed speech; and sending the interference-suppressed speech to a remote service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the spoken command.
2. The speech-based system of claim 1, the operations further comprising: causing an adaptive filter to produce an interference signal that estimates the interfering sound in the first input audio signal based at least in part on the media content item; and subtracting the interference signal from the first input audio signal to produce the interference-suppressed audio signal.
3. The speech-based system of claim 1, further comprising a sensor configured to detect the direction of the source of the user speech.
4. The system of claim 1, wherein the audio signature is a spectrogram that represents frequency intensities of the content of the second audio signal over time.
5. The system of claim 1, wherein the audio signature is a spectrogram calculated over a portion of the content of the second audio signal.
6. The system of claim 5, further comprising identifying the particular media content item by determining a first portion of the spectrogram of the audio signature that corresponds to a second portion of a reference spectrogram of the particular media content item.
7. The system of claim 1, wherein the audio signature is a feature vector.
8. The system of claim 1, wherein the audio signature includes one or more features representing direction of energy changes within frequency bands of the content of the second audio input over time.
9. A method being performed at a speech interface device in communication with a media player and a reference content source, the method comprising: receiving an input audio signal from a user in proximity of the speech interface device, wherein the input audio signal comprises a first input audio signal and a second input audio signal, the first input audio signal having a higher presence of user speech than the second input audio signal and the second input audio signal having a higher presence of interfering sound produced by a media player outputting audible audio sound in proximity to the speech interface device than the first input audio signal, the user speech including spoken commands to the speech interface; selecting the first input audio signal as a first directional audio signal corresponding to a direction of a source of the user speech; selecting, based in part on a directional audio signal corresponding in direction to a known position of the media player, the second input audio signal as a second directional audio signal corresponding to a direction other than the direction of the source of the user speech; analyzing the second input audio signal to identify at least one characteristic of content of the second input audio signal; requesting, by the speech interface device from the media player, an identity of a player content item being currently played by the media player and a temporal point within the player content item that is currently being output by the media player; determining, based at least in part on the at least one characteristic of the content of the second input audio signal, the identity of the player content item, and the temporal point, an identified media content item that includes sound corresponding to the interfering sound; obtaining a matching media content item from the reference content source, wherein the matching media content item matches the identified media content item that includes the sound corresponding to the interfering sound; processing the first input audio signal to suppress the identified media content item identified as the interfering sound in the first input audio signal by subtracting the matching media content item from the input audio in order to obtain an interference-suppressed speech; and sending the interference-suppressed speech to a remote service for performing automatic speech recognition and natural language understanding on the interference-suppressed speech in order to determine an intent to perform or initiate functions or services expressed by the spoken command.
10. The method of claim 9, wherein processing the first input audio signal comprises removing a portion of the first input audio signal corresponding to the at least the portion of the media content item from the first input audio signal.
11. The method of claim 9, further comprising: identifying an audio signature of the interfering sound; and comparing the audio signature to a reference audio signature of the media content item.
12. The method of claim 9, further comprising: identifying a plurality of media content items that includes the media content item that are one or more of (a) currently accessible to the media player or (b) currently available to a user; and comparing the interfering sound from the second input audio signal with sound associated with the plurality of media content items.
13. The method of claim 9, further comprising: receiving, from the media player, an indication of a source of the interfering sound; and selecting the refe
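The signature-based identification in the claims above (a spectrogram-style signature, features encoding the direction of energy changes within frequency bands, and matching a portion of the captured signature against a reference signature) can be sketched roughly as follows. This is an illustrative sketch only, not the method of the patent: the frame size, hop, band count, bit encoding, and the brute-force sliding comparison are all assumptions, and every function name is hypothetical.

```python
import math
import random

FRAME = 64  # samples per analysis frame (illustrative value)
HOP = 32    # samples between frame starts
BANDS = 8   # frequency bands per frame


def band_energies(samples):
    """Return one list of BANDS energies per frame: a coarse spectrogram."""
    bins = FRAME // 2
    per_band = bins // BANDS
    # Precomputed DFT basis for bins 1..FRAME/2 (no windowing, for brevity).
    twiddle = [
        [(math.cos(2 * math.pi * k * n / FRAME),
          math.sin(2 * math.pi * k * n / FRAME)) for n in range(FRAME)]
        for k in range(1, bins + 1)
    ]
    frames = []
    for start in range(0, len(samples) - FRAME + 1, HOP):
        chunk = samples[start:start + FRAME]
        power = []
        for row in twiddle:
            re = sum(x * c for x, (c, _) in zip(chunk, row))
            im = sum(x * s for x, (_, s) in zip(chunk, row))
            power.append(re * re + im * im)
        frames.append(
            [sum(power[b * per_band:(b + 1) * per_band]) for b in range(BANDS)]
        )
    return frames


def signature(samples):
    """One bit per band and frame step: 1 if the band's energy rose, else 0."""
    energies = band_energies(samples)
    return [
        tuple(int(cur[b] > prev[b]) for b in range(BANDS))
        for prev, cur in zip(energies, energies[1:])
    ]


def best_offset(query_sig, reference_sig):
    """Slide the query over the reference; return (frame offset, mismatched bits)."""
    best = None
    for offset in range(len(reference_sig) - len(query_sig) + 1):
        distance = sum(
            q != r
            for q_row, r_row in zip(query_sig, reference_sig[offset:])
            for q, r in zip(q_row, r_row)
        )
        if best is None or distance < best[1]:
            best = (offset, distance)
    return best


if __name__ == "__main__":
    rng = random.Random(1)
    # Stand-in for a reference content item (noise instead of real audio).
    reference = [rng.uniform(-1, 1) for _ in range(4096)]
    start = 20 * HOP
    # Stand-in for the captured interfering sound: an excerpt plus mild noise.
    captured = [x + rng.uniform(-0.05, 0.05)
                for x in reference[start:start + 1024]]
    offset, distance = best_offset(signature(captured), signature(reference))
    print("estimated start:", offset * HOP, "samples; mismatched bits:", distance)
```

The frame offset with the fewest mismatched bits plays the role of the temporal point within the content item. A practical system would likely index signatures rather than scan every offset, and would compare against many candidate items rather than one.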
Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech (G10L21/02 takes precedence) · CPC title
Interactive procedures; Man-machine interfaces · CPC title
Noise filtering · CPC title
the noise being echo, reverberation of the speech · CPC title
for comparison or discrimination · CPC title