Preventing of audio attacks
US-2018158453-A1 · Jun 7, 2018 · US
US11670299B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11670299-B2 |
| Application number | US-202117321999-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 17, 2021 |
| Priority date | Jun 26, 2019 |
| Publication date | Jun 6, 2023 |
| Grant date | Jun 6, 2023 |
Official abstract text for this publication.
A system processes audio data to detect when it includes a representation of a wakeword or of an acoustic event. The system may receive or determine acoustic features for the audio data, such as log-filterbank energy (LFBE). The acoustic features may be used by a first, wakeword-detection model to detect the wakeword; the output of this model may be further processed using a softmax function, to smooth it, and to detect spikes. The same acoustic features may also be used by a second, acoustic-event-detection model to detect the acoustic event; the output of this model may be further processed using a sigmoid function and a classifier. Another model may be used to extract additional features from the LFBE data; these additional features may be used by the other models.
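The two post-processing paths named in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the two-class logit layout (background, wakeword), the moving-average smoother, the local-maximum spike rule, and the mean-score threshold standing in for the classifier are all assumptions made for the example.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def smooth(track: np.ndarray, width: int = 3) -> np.ndarray:
    """Moving-average smoothing of a 1-D posterior track (assumed smoother)."""
    kernel = np.ones(width) / width
    return np.convolve(track, kernel, mode="same")


def find_spikes(track: np.ndarray, threshold: float) -> list:
    """Return indices of local maxima at or above the threshold (assumed rule)."""
    spikes = []
    for i in range(1, len(track) - 1):
        is_peak = track[i] > track[i - 1] and track[i] >= track[i + 1]
        if is_peak and track[i] >= threshold:
            spikes.append(i)
    return spikes


def wakeword_postprocess(logits: np.ndarray, threshold: float = 0.5) -> list:
    """Softmax, then smoothing, then spike detection on the wakeword posterior.

    `logits` has shape (frames, 2); column 1 is assumed to be the wakeword class.
    """
    posterior = softmax(logits)[:, 1]
    return find_spikes(smooth(posterior), threshold)


def event_postprocess(logits: np.ndarray, threshold: float = 0.5) -> list:
    """Sigmoid, then a stand-in classifier: mean score per event vs. a threshold.

    `logits` has shape (frames, events); each event is scored independently.
    """
    scores = sigmoid(logits).mean(axis=0)
    return [bool(s >= threshold) for s in scores]


# Hypothetical model outputs for seven frames (wakeword) and three frames (events).
wakeword_logits = np.array([[0.0, v] for v in [-5, -5, 5, 5, 5, -5, -5]], dtype=float)
event_logits = np.array([[4.0, -4.0], [4.0, -4.0], [-4.0, -4.0]])

spike_frames = wakeword_postprocess(wakeword_logits)
event_flags = event_postprocess(event_logits)
```

Softmax suits the wakeword path because its classes are mutually exclusive per frame, while sigmoid lets several acoustic events be scored independently within the same frames.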
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: determining a feature vector representing at least one frame of audio data; determining, using a first model and the feature vector, first output data corresponding to a likelihood that the at least one frame includes a representation of at least part of a word; and determining, using a second model different from the first model and the feature vector, second output data corresponding to a likelihood that the at least one frame includes a representation of at least part of a non-speech acoustic event, wherein determination of the second output data is performed independently of the first output data.
2. The computer-implemented method of claim 1, further comprising: processing the first output data using a normalization component to determine first probability data.
3. The computer-implemented method of claim 1, further comprising: processing the second output data using at least one activation function component to determine the second output data.
4. The computer-implemented method of claim 3, further comprising: processing the second output data using a classifier to detect an occurrence of the non-speech acoustic event.
5. The computer-implemented method of claim 1, wherein the non-speech acoustic event comprises a non-speech sound made by a human.
6. The computer-implemented method of claim 1, wherein the first output data corresponds to a likelihood that the at least one frame includes a representation of at least part of a first wakeword.
7. The computer-implemented method of claim 6, further comprising: determining, using the feature vector, third output data corresponding to a likelihood that the at least one frame includes a representation of at least part of a second wakeword.
8. The computer-implemented method of claim 1, further comprising: receiving the at least one frame of audio data; and processing the at least one frame of audio data using a feature-extraction model to determine the feature vector, the feature-extraction model configured to determine feature output data operable by both the first model and the second model, wherein determining the first output data comprises processing the feature vector using the first model, and wherein determining the second output data comprises processing the feature vector using the second model.
9. The computer-implemented method of claim 1, wherein the feature vector represents acoustic feature data and the method further comprises: processing the feature vector using a feature-extraction model to determine a second feature vector, the feature-extraction model configured to determine feature output data operable by both the first model and the second model, wherein determining the first output data comprises processing the second feature vector using the first model, and wherein determining the second output data comprises processing the second feature vector using the second model.
10. The computer-implemented method of claim 1, wherein: determining the first output data comprises: processing the feature vector using a feature extraction component to determine first feature data, and processing the first feature data using the first model to determine the first output data; and determining the second output data comprises: processing the feature vector using the feature extraction component to determine second feature data, and processing the second feature data using the second model to determine the second output data.
11. The computer-implemented method of claim 1, wherein: the feature vector represents acoustic feature data; the first model comprises a feature extraction component; determining the first output data comprises: processing the feature vector using the first model to determine a second feature vector, and using the second feature vector to determine the first output data; and determining the second output data comprises using the second feature vector and the second model to determine the second output data.
12. A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: determine a feature vector representing at least one frame of audio data; determine, using a first model and the feature vector, first output data corresponding to a likelihood that the at least one frame includes a representation of at least part of a wakeword; and determine, using a second model different from the first model and the feature vector, second output data corresponding to a likelihood that the at least one frame includes a representation of at least part of a non-speech acoustic event, wherein determination of the second output data is performed independently of the first output data.
13. The system of claim 12, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process the first output data using a normalization component to determine first probability data.
14. The system of claim 12, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process the second output data using at least one activation function component to determine the second output data.
15. The system of claim 14, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process the second output data using a classifier to detect an occurrence of the non-speech acoustic event.
16. The system of claim 12, wherein the non-speech acoustic event comprises a non-speech sound made by a human.
17. The system of claim 12, wherein the first output data corresponds to a likelihood that the at least one frame includes a representation of at least part of a first wakeword.
18. The system of claim 17, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine, using the feature vector, third output data corresponding to a likelihood that the at least one frame includes a representation of at least part of a second wakeword.
19. The system of claim 12, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: receive the at least one frame of audio data; and process the at least one frame of audio data using a feature-extraction model to determine the feature vector, the feature-extraction model configured to determine feature output data operable by both the first model and the second model, wherein the instructions that cause the system to determine the first output data comprise instructions that, when executed by the at least one processor, further cause the system to process the feature vector using the first model, and wherein the instructions that cause the system to determine the second output data comprise instructions that, when executed by the at least one processor, further cause the system to process the feature vector using the second model.
20. The system of claim 12, wherein the feature vector represents acoustic feature data and wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: process the feature v
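The arrangement recited in claims 1 and 8 through 10, where one feature-extraction stage feeds two models that run independently of each other, might look roughly like the sketch below. Everything here is hypothetical and chosen only for illustration: the `Linear` layer, the random weights, the `tanh` nonlinearity, and the dimensions (20 input features, 8 shared features, 2 wakeword classes, 3 event classes) do not come from the patent.

```python
import numpy as np


class Linear:
    """A tiny affine layer with random weights, standing in for a trained model."""

    def __init__(self, n_in: int, n_out: int, rng: np.random.Generator):
        self.w = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w + self.b


def run_frame(feature_vector, extractor, first_model, second_model):
    """Compute both outputs from one shared feature vector.

    The second output is computed from the shared features only; it never
    reads the first output, mirroring the independence recited in claim 1.
    """
    shared = np.tanh(extractor(feature_vector))  # shared feature output data
    first_output = first_model(shared)           # wakeword scores
    second_output = second_model(shared)         # acoustic-event scores
    return first_output, second_output


rng = np.random.default_rng(0)
extractor = Linear(20, 8, rng)       # hypothetical feature-extraction model
wakeword_model = Linear(8, 2, rng)   # hypothetical first model
event_model = Linear(8, 3, rng)      # hypothetical second model

frame_features = rng.standard_normal(20)  # stand-in for one frame of LFBE features
first, second = run_frame(frame_features, extractor, wakeword_model, event_model)

# Swapping the first model leaves the second output unchanged.
other_wakeword_model = Linear(8, 2, rng)
_, second_again = run_frame(frame_features, extractor, other_wakeword_model, event_model)
```

Sharing the extractor means the per-frame feature computation is done once, while the two heads stay decoupled so that either can be replaced or retrained without touching the other.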