US9728188B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9728188-B1 |
| Application number | US-201615195587-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 28, 2016 |
| Priority date | Jun 28, 2016 |
| Publication date | Aug 8, 2017 |
| Grant date | Aug 8, 2017 |
Official abstract text for this publication.
Systems and methods for detecting similar audio being received by separate voice activated electronic devices, and ignoring those commands, are described herein. In some embodiments, a voice activated electronic device may be activated by a wakeword that is output by an additional electronic device, such as a television or radio, may capture audio of sound subsequently following the wakeword, and may send audio data representing the sound to a backend system. Upon receipt, the backend system may, in parallel to performing automated speech recognition processing on the audio data, generate a sound profile of the audio data, and may compare that sound profile to sound profiles of recently received audio data and/or flagged sound profiles. If the generated sound profile is determined to match another sound profile, then the automated speech recognition processing may be stopped, and the voice activated electronic device may be instructed to return to a keyword spotting mode. If the matching sound profile is not already stored in a database of known sound profiles, it can be stored for future comparisons.
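The screening flow in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: `ProfileStore`, `screen_profile`, and the `is_match` callback are hypothetical names, sound profiles are stood in for by `bytes`, and the "ignore" result stands for stopping speech recognition and returning the device to keyword spotting mode.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProfileStore:
    """Hypothetical in-memory stand-in for the backend's profile storage."""
    recent: List[bytes] = field(default_factory=list)   # recently received profiles
    flagged: List[bytes] = field(default_factory=list)  # known profiles to ignore


def screen_profile(
    profile: bytes,
    store: ProfileStore,
    is_match: Callable[[bytes, bytes], bool],
) -> str:
    """Return "ignore" if the profile matches a flagged or recent one, else "continue"."""
    # A match against an already flagged profile needs no further bookkeeping.
    if any(is_match(profile, known) for known in store.flagged):
        return "ignore"
    # A match against a recent profile is stored so future copies are caught directly.
    if any(is_match(profile, other) for other in store.recent):
        store.flagged.append(profile)
        return "ignore"
    # No match: remember the profile and let speech recognition proceed.
    store.recent.append(profile)
    return "continue"


if __name__ == "__main__":
    store = ProfileStore()
    same = lambda a, b: a == b  # placeholder comparison, an assumption of this sketch
    print(screen_profile(b"tv-ad", store, same))
    print(screen_profile(b"tv-ad", store, same))
```

Passing the comparison in as a callback keeps the flow separate from whichever similarity measure is used; the claims use a bit error rate between audio fingerprints for that step.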
Opening claim text (preview).
What is claimed is:
1. A method, comprising: receiving, at a backend system, first audio data; receiving a first timestamp indicating a first time that the first audio data was sent to the backend system by a first user device; receiving, at the backend system, second audio data; receiving a second timestamp indicating a second time that the second audio data was sent to the backend system by a second user device; determining that an amount of time between the first time and the second time is less than a predetermined period of time, which indicates that the first audio data and the second audio data were sent at a substantially same time; generating a first audio fingerprint of the first audio data by performing a first fast Fourier transform (“FFT”) on the first audio data, the first audio fingerprint comprising first data representing a first time-frequency profile of the first audio data; generating a second audio fingerprint of the second audio data by performing a second FFT on the second audio data, the second audio fingerprint comprising second data representing a second time-frequency profile of the second audio data; determining a bit error rate between the first audio fingerprint and the second audio fingerprint by determining a number of different bits between the first audio fingerprint and the second audio fingerprint, and then dividing the number by a total number of bits; determining that the bit error rate is less than a predefined bit error rate threshold value indicating that the first audio data and the second audio data both represent a same sound; and storing the first audio fingerprint as a flagged audio fingerprint in memory on the backend system such that receipt of additional audio data that has a matching audio fingerprint is ignored by the backend system.
2. The method of claim 1, further comprising: receiving, at the backend system, third audio data; generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data; determining an additional bit error rate between the third audio fingerprint and the flagged audio fingerprint; determining that the additional bit error rate is less than the predefined bit error rate threshold value indicating that the third audio data also represents the same sound; and causing the backend system to ignore the third audio data such that a response is not generated to respond to the third audio data.
3. The method of claim 1, further comprising: receiving, at the backend system, third audio data; generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data; determining a new bit error rate between the third audio fingerprint and the flagged audio fingerprint; determining that the new bit error rate is greater than the predefined bit error rate threshold value indicating that third audio data does not represent the same sound; and generating text data representing the third audio data by executing speech-to-text functionality on the third audio data.
4. The method of claim 1, further comprising: determining a first user identifier associated with the first user device; determining a second user identifier associated with the second user device; determining that the first user identifier is different than the second user identifier; generating a first instruction for the first user device that causes the first user device to return to a keyword spotting mode where the first user device will monitor sound signals received by a microphone for a subsequent utterance of a wakeword by continuously running the sound signals through a wakeword engine; generating a second instruction for the second user device that causes the second user device to return to the keyword spotting mode; sending the first instruction to the first user device; and sending the second instruction to the second user device.
5. The method of claim 1, further comprising: causing automated speech recognition processing to stop being performed to the first audio data; and causing the automated speech recognition processing to stop being performed to the second audio data.
6. The method of claim 1, further comprising: receiving, at the backend system, third audio data; receiving a third timestamp indicating a third time that the third audio data was sent to the backend system by a third user device; determining that an additional amount of time between the first time and the third time is greater than the predetermined period of time, which indicates that the first audio data and the third audio data were sent at a different time; generating a third audio fingerprint of the third audio data by performing a third FFT on the third audio data, the third audio fingerprint comprising third data representing a third time-frequency profile of the third audio data; determining a new bit error rate between the flagged audio fingerprint and the third audio fingerprint; determining that the new bit error rate is greater than the predefined bit error rate threshold value indicating that third audio data does not represent the same sound; receiving a first plurality of audio fingerprints corresponding to a second plurality of audio data that were received during the additional amount of time; determining a third plurality of bit error rates between the third audio fingerprint and each of the first plurality of audio fingerprints; determining that each of the third plurality of bit error rates are greater than the predefined bit error rate threshold value, indicating that each of the second plurality of audio data represent a different sound than the third audio data; and causing automated speech recognition processing to continue to be performed to the third audio data.
7. The method of claim 6, further comprising: determining a new amount of time between the third time and a fourth time, the fourth time corresponding to a fourth audio fingerprint of fourth audio data received prior to the first audio data, the second audio data, and the third audio data; determining that the new amount of time is greater than the amount of time; determining that the new amount of time is greater than the additional amount of time; determining that the fourth audio fingerprint correspond to an oldest audio fingerprint of the plurality of audio fingerprints; causing the fourth audio fingerprint to be deleted; determining an updated first plurality of audio fingerprints comprising the first plurality of audio fingerprints minus the fourth audio fingerprint; and generating a fourth plurality of audio fingerprints comprising the updated first plurality of audio fingerprints and the third audio fingerprint.
8. The method of claim 1, further comprising: receiving a third audio fingerprint of third audio data, wherein the first audio fingerprint is generated at a first speech processing component, and the third audio fingerprint is generated at a second speech processing component; causing the third audio fingerprint to be stored in the memory; determining an additional bit error rate between first audio fingerprint and the third audio fingerprint; determining that the additional bit error rate is less than the predefined bit error rate threshold value; and causing automated speech recognition processing to stop being performed to the third audio data.
9. The method of claim 1, further comprising: receiving, at the backend system, third audio
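The comparison steps recited in claim 1 and the oldest-fingerprint eviction in claim 7 can be sketched in Python as follows. This is an illustrative sketch, not the claimed implementation: the constants `SAME_TIME_WINDOW_S` and `BER_THRESHOLD` are hypothetical values (the claims only say "predetermined" and "predefined"), fingerprints are assumed to be equal-length `bytes` already produced by an FFT-based step that is not shown, and `flag_if_same_sound` and `RecentFingerprints` are names invented for illustration.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

# Hypothetical values; the claims do not state concrete numbers.
SAME_TIME_WINDOW_S = 0.5
BER_THRESHOLD = 0.1

Entry = Tuple[float, bytes]  # (time the audio was sent, audio fingerprint)


def bit_error_rate(fp_a: bytes, fp_b: bytes) -> float:
    """Number of differing bits divided by the total number of bits."""
    if not fp_a or len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must be non-empty and of equal length")
    differing = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return differing / (8 * len(fp_a))


def flag_if_same_sound(first: Entry, second: Entry, flagged: List[bytes]) -> bool:
    """Flag the first fingerprint if both clips were sent at about the same
    time and their bit error rate is below the threshold."""
    (t1, fp1), (t2, fp2) = first, second
    if abs(t1 - t2) >= SAME_TIME_WINDOW_S:
        return False
    if bit_error_rate(fp1, fp2) >= BER_THRESHOLD:
        return False
    flagged.append(fp1)
    return True


class RecentFingerprints:
    """Fixed-size buffer: adding to a full buffer deletes the oldest entry."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[Entry] = deque(maxlen=capacity)

    def add(self, sent_time: float, fingerprint: bytes) -> None:
        self._items.append((sent_time, fingerprint))

    def oldest(self) -> Optional[Entry]:
        return self._items[0] if self._items else None

    def __len__(self) -> int:
        return len(self._items)
```

A `deque` with `maxlen` is one simple way to get the "delete the oldest, add the newest" behavior; a time-based expiry would be an equally plausible reading of a recent-fingerprint list.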
of the speaker; Human-factor methodology · CPC title
Memory allocation or algorithm optimisation to reduce hardware requirements · CPC title
Speech classification or search · CPC title
Word spotting · CPC title
for comparison or discrimination · CPC title