Method and apparatus for sound event detection robust to frequency change
US-2019287550-A1 · Sep 19, 2019 · US
US10803885B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10803885-B1 |
| Application number | US-201816023923-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 29, 2018 |
| Priority date | Jun 29, 2018 |
| Publication date | Oct 13, 2020 |
| Grant date | Oct 13, 2020 |
Official abstract text for this publication.
An audio event detection system that processes audio data into audio feature data and processes the audio feature data using pre-configured candidate interval lengths to identify top candidate regions of the feature data that may include an audio event. The feature data from the top candidate regions are then scored by a classifier, where the score indicates a likelihood that the candidate region corresponds to a desired audio event. The scores are compared to a threshold, and if the threshold is satisfied, the top scoring candidate region is determined to include an audio event.
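The flow in the abstract (enumerate candidate regions at pre-configured interval lengths, keep the top candidates, score them with a classifier, compare against a threshold) can be sketched as below. This is a minimal illustration, not the patented implementation: the one-dimensional feature values, the stride, `top_k`, and both scoring functions are hypothetical stand-ins for the trained models the patent describes.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Window = Tuple[int, int]  # half-open [start, end) frame indices


def candidate_windows(num_frames: int, lengths: Sequence[int], stride: int = 1) -> List[Window]:
    """Enumerate every window of each pre-configured length (stride is an assumption)."""
    windows = []
    for length in lengths:
        for start in range(0, num_frames - length + 1, stride):
            windows.append((start, start + length))
    return windows


def detect_event(
    features: Sequence[float],
    lengths: Sequence[int],
    proposal_score: Callable[[Sequence[float]], float],
    classifier_score: Callable[[Sequence[float]], float],
    top_k: int = 4,
    threshold: float = 0.8,
) -> Optional[Window]:
    """Return the best-scoring candidate window, or None if no score meets the threshold."""
    windows = candidate_windows(len(features), lengths)
    if not windows:
        return None
    # Stage 1: rank all candidates and keep the top_k regions.
    ranked = sorted(windows, key=lambda w: proposal_score(features[w[0]:w[1]]), reverse=True)
    top = ranked[:top_k]
    # Stage 2: score each surviving region with the classifier.
    scored = [(classifier_score(features[s:e]), (s, e)) for s, e in top]
    best_score, best_window = max(scored)
    return best_window if best_score >= threshold else None


# Toy scorers (hypothetical): mean activation for proposals, and a classifier
# that favors regions covering the whole four-frame event.
def mean_score(segment: Sequence[float]) -> float:
    return sum(segment) / len(segment)


def coverage_score(segment: Sequence[float]) -> float:
    return sum(segment) / 4.0


features = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
detected = detect_event(features, lengths=(2, 4),
                        proposal_score=mean_score, classifier_score=coverage_score)
silent = detect_event([0.0] * 10, lengths=(2, 4),
                      proposal_score=mean_score, classifier_score=coverage_score)
```

With these toy scorers the four-frame window wins over the shorter windows nested inside it, which mirrors the comparison of nested windows of different lengths in the claims.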
Opening claim text (preview).
What is claimed is:

1. A method for detecting an audio event, the method comprising: receiving audio data; processing the audio data using a recurrent trained model to determine audio feature data; determining a first portion of the audio feature data corresponding to a first time window, wherein the first time window corresponds to a first pre-configured length of time; processing the first portion using a second trained model to determine a first score and an adjusted first time window; determining a second portion of the audio feature data corresponding to a second time window, wherein the second time window: corresponds to a second pre-configured length of time longer than the first pre-configured length of time, and includes, and is longer than, the first time window; processing the second portion using the second trained model to determine a second score and an adjusted second time window; determining an adjusted first portion of the audio feature data corresponding to the adjusted first time window; processing the adjusted first portion using a third trained model to determine a third score corresponding to a likelihood that an audio event is represented in the adjusted first portion; determining an adjusted second portion of the audio feature data corresponding to the adjusted second time window; processing the adjusted second portion using the third trained model to determine a fourth score corresponding to a likelihood that the audio event is represented in the adjusted second portion; determining the third score is higher than the fourth score; and storing an indication that the audio event occurred during the adjusted first time window.

2. The method of claim 1, further comprising: outputting, by the second trained model, first indicator data corresponding to the adjusted first time window; processing, by the third trained model, the first indicator data to determine the adjusted first portion; processing, by the third trained model, the adjusted first portion to determine a feature vector having a pre-established length; and processing the feature vector using at least one dense layer to determine the third score.

3. The method of claim 1, further comprising: determining a third portion of the audio feature data corresponding to a third time window, wherein the third time window: is different from the first time window, and corresponds to the first pre-configured length of time; processing the third portion using the second trained model to determine a fifth score and an adjusted third time window; determining a fourth portion of the audio feature data corresponding to a fourth time window, wherein the fourth time window: corresponds to the second pre-configured length of time, and is longer than the third time window; and processing the fourth portion using the second trained model to determine a fifth score and an adjusted fourth time window.

4. A method comprising: receiving audio data; processing the audio data using a recurrent trained model to determine audio feature data; determining a first portion of the audio feature data corresponding to a first time window; processing the first portion using a first model to determine a first score and an adjusted first time window; determining an adjusted first portion of the audio feature data corresponding to the adjusted first time window; and processing the adjusted first portion using a second model to determine a second score corresponding to a likelihood that an audio event is represented in the adjusted first portion.

5. The method of claim 4, wherein: the first model is configured to determine respective scores corresponding to segments of feature data for a plurality of pre-configured lengths of time including at least a first length of time and a second length of time; and the first time window corresponds to the first length of time.

6. The method of claim 4, wherein the first model comprises at least a first layer configured to determine a plurality of values corresponding to the first portion of the audio feature data and a second layer configured to output the first score.

7. The method of claim 5, further comprising: determining a second portion of the audio feature data corresponding to a second time window, wherein the second time window: corresponds to the second length of time, and includes, and is longer than, the first time window; processing the second portion using the first model to determine a third score and an adjusted second time window; and determining that the third score is less than the first score.

8. The method of claim 5, further comprising: determining a second portion of the audio feature data corresponding to a second time window, wherein the second time window: is different from the first time window, and corresponds to the first length of time; processing the second portion using the first model to determine a third score and an adjusted second time window; determining a third portion of the audio feature data corresponding to a third time window, wherein the third time window: corresponds to the second length of time, and includes, and is longer than, the second time window; and processing the third portion using the first model to determine a fourth score and an adjusted third time window.

9. The method of claim 8, further comprising: determining an adjusted second portion of the audio feature data corresponding to the adjusted second time window; processing the adjusted second portion using the second model to determine a fifth score corresponding to a likelihood that the audio event is represented in the adjusted second time window; determining an adjusted third portion of the audio feature data corresponding to the adjusted third time window; processing the adjusted third portion using the second model to determine a sixth score corresponding to a likelihood that the audio event is represented in the adjusted third time window; and determining that the second score is greater than the fifth score and the sixth score.

10. The method of claim 4, further comprising: receiving log filter bank energy data representing the audio data; and processing the log filter bank energy data using the recurrent trained model to determine the audio feature data.

11. The method of claim 4, wherein determining the first portion of the audio feature data comprises: determining a first plurality of audio feature data values corresponding to a center of the first time window; determining a second plurality of audio feature data values, the second plurality corresponding to a first portion of the first time window prior to the center; determining a third plurality of audio feature data values, the second plurality corresponding to a second portion of the first time window subsequent to the center; and including, in the first portion of audio feature data, the first plurality of audio feature data values, the second plurality of audio feature data values, and the third plurality of audio feature data values.

12. The method of claim 4, wherein the second model comprises a classifier including at least a first layer configured to combine the adjusted first portion of the audio feature data and a second layer, subsequent to the first layer, configured to output the second score.

13. The method of claim 4, further comprising: determining the second score is above a threshold; and causing an action to be performed in response to the second score being above the threshold.

14. The method of claim 13, wherein the action comprises storing an indicati
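
The two-stage flow of claim 4 (a first model returns a score and an adjusted time window; a second model scores the adjusted portion) can be sketched as follows. This is an illustration under stated assumptions only: `toy_first_model`, `toy_second_model`, `clamp_window`, and the average pooling used to reach a fixed-length vector (compare the pre-established length in claim 2) are hypothetical; the claims do not say how the trained models or the fixed-length vector are computed.

```python
from typing import Callable, List, Sequence, Tuple

Window = Tuple[int, int]  # half-open [start, end) frame indices


def clamp_window(window: Window, num_frames: int) -> Window:
    """Keep an adjusted window inside the feature sequence and at least one frame long."""
    start, end = window
    start = max(0, min(start, num_frames - 1))
    end = max(start + 1, min(end, num_frames))
    return start, end


def pool_to_fixed_length(segment: Sequence[float], length: int) -> List[float]:
    """Average-pool a non-empty, variable-length segment into `length` values (assumed method)."""
    n = len(segment)
    pooled = []
    for i in range(length):
        lo = i * n // length
        hi = max(lo + 1, (i + 1) * n // length)
        chunk = segment[lo:hi]
        pooled.append(sum(chunk) / len(chunk))
    return pooled


def two_stage_score(
    features: Sequence[float],
    window: Window,
    first_model: Callable[[Sequence[float], Window], Tuple[float, Window]],
    second_model: Callable[[Sequence[float]], float],
    vector_length: int = 4,
) -> Tuple[float, Window, float]:
    """Return (first score, adjusted window, second score) for one candidate window."""
    start, end = window
    first_score, adjusted = first_model(features[start:end], window)
    a_start, a_end = clamp_window(adjusted, len(features))
    vector = pool_to_fixed_length(features[a_start:a_end], vector_length)
    return first_score, (a_start, a_end), second_model(vector)


def toy_first_model(segment: Sequence[float], window: Window) -> Tuple[float, Window]:
    """Hypothetical stand-in: tighten the window around frames above 0.5."""
    start, _ = window
    active = [i for i, value in enumerate(segment) if value > 0.5]
    if not active:
        return 0.0, window
    return len(active) / len(segment), (start + active[0], start + active[-1] + 1)


def toy_second_model(vector: Sequence[float]) -> float:
    """Hypothetical stand-in for the classifier: mean of the pooled vector."""
    return sum(vector) / len(vector)


features = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
first_score, adjusted_window, second_score = two_stage_score(
    features, (1, 7), toy_first_model, toy_second_model
)
```

In this toy run the first model trims the silent frames at both ends of the window, so the second model scores only the tightened region.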
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Supervised learning · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Convolutional networks [CNN, ConvNet] · CPC title