Sequence models for audio scene recognition
US-10930301-B1 · Feb 23, 2021 · US
US11355138B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11355138-B2 |
| Application number | US-202016997249-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 19, 2020 |
| Priority date | Aug 27, 2019 |
| Publication date | Jun 7, 2022 |
| Grant date | Jun 7, 2022 |
Official abstract text for this publication.
A method is provided. Intermediate audio features are generated from respective segments of an input acoustic time series for a same scene. Using a nearest neighbor search, respective segments of the input acoustic time series are classified based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series. Each respective segment corresponds to a respective different acoustic window. The generating step includes learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series, dividing the same scene into the different windows having varying MFCC features, and feeding the MFCC features of each window into respective LSTM units such that a hidden state of each respective LSTM unit is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different windows.
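The inference path in the abstract (window the scene, pool per-window hidden states with attention, classify each window by nearest neighbor) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the LSTM is omitted and per-frame vectors stand in for its hidden states, the attention score is a plain dot product with a hypothetical `query` vector, and the per-window labels are combined by the majority vote that one of the dependent claims mentions. All function names are invented for this sketch.

```python
import math
from collections import Counter


def split_into_windows(frames, size, hop):
    """Split a list of per-frame feature vectors into windows.

    A hop smaller than size yields overlapping windows.
    """
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, hop)]


def attention_pool(hidden_states, query):
    """Return the softmax-weighted average of the hidden states.

    Assumption: scores are dot products with `query`; the source does not
    specify the attention form.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]


def nearest_label(embedding, index):
    """1-nearest-neighbor lookup over (embedding, label) pairs, Euclidean distance."""
    return min(index, key=lambda item: math.dist(embedding, item[0]))[1]


def classify_scene(frames, index, query, size=4, hop=2):
    """Label each window by nearest neighbor, then take a majority vote."""
    windows = split_into_windows(frames, size, hop)
    labels = [nearest_label(attention_pool(w, query), index) for w in windows]
    return Counter(labels).most_common(1)[0][0]
```

In a fuller version, `frames` would be replaced by LSTM hidden states computed from the MFCC features of each window, and `index` would hold embeddings of labeled historical segments.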
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for audio scene classification in an information retrieval system, comprising: generating intermediate audio features from respective segments of an input acoustic time series for a same scene captured by a sensor device; and classifying, using a nearest neighbor search, the respective segments of the input acoustic time series based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series, each of the respective segments corresponding to a respective different one of different acoustic windows; wherein said generating step comprises: learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series; dividing the same scene into the different acoustic windows having varying ones of the MFCC features; and feeding the MFCC features of each of the different acoustic windows into respective Long Short-Term Memory (LSTM) units such that a hidden state of each of the respective LSTM units is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different acoustic windows, and wherein the method further includes replacing a hardware device monitored by the sensor responsive to the final intermediate feature.
2. The computer-implemented method of claim 1, wherein the intermediate acoustic features both capture feature correlations between different acoustic windows in a same scene and isolate and weaken an effect of uninteresting features in the same scene.
3. The computer-implemented method of claim 1, wherein said classifying step comprises generating the final intermediate feature for each of the different acoustic windows by optimizing a triplet loss function to which is added a regularization parameter computed on each of the intermediate audio features to reduce an importance of the uninteresting features, and wherein the uninteresting features comprise silence.
4. The computer-implemented method of claim 3, wherein the triplet loss function adjusts a triplet selection algorithm to avoid using segments the uninteresting portions as silence and noise by using a silence and noise bias.
5. The computer-implemented method of claim 3, wherein the regularization parameter is computed on a last element of each of the intermediate audio features, the last element being a silence weight.
6. The computer-implemented method of claim 3, wherein the regularization parameter comprises a sum of silence weights and prevents all of the silence weights from simultaneously reaching a value of zero.
7. The computer-implemented method of claim 1, wherein an entirety of the same scene is divided into overlapping windows to exploit inter-window dependencies.
8. The computer-implemented method of claim 1, wherein each of the respective LSTM units comprise as many hidden states as time steps in a given current one of the windows.
9. The computer-implemented method of claim 1, further comprising preprocessing the input acoustic sequence by applying a Fast Fourier Transform (FFT) to each of the different acoustic windows to extract respective acoustic frequency energy levels therefor.
10. The computer-implemented method of claim 1, wherein the intermediate audio features are generated to isolate and weaken the effect of uninteresting features in the same scene using a triplet loss that pushes different classes farther apart than similar classes in a classification space.
11. The computer-implemented method of claim 1, further comprising computing an embedding of the input acoustic time series as the weighted average of each of the hidden states.
12. The computer-implemented method of claim 11, wherein the embedding is the final intermediate feature.
13. The computer-implemented method of claim 1, further comprising receiving a query segment, and finding a most similar historical segment using a nearest neighbor.
14. The computer-implemented method of claim 1, wherein said learning step learns the intermediate audio features by minimizing a loss function computed using the intermediate audio features of a randomly selected batch of segments from the input acoustic sequence.
15. The computer-implemented method of claim 1, wherein the final intermediate feature is determined by majority voting on classifications for the segments forming the input acoustic time series.
16. A computer program product for audio scene classification in an information retrieval system, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: generating intermediate audio features from respective segments of an input acoustic time series for a same scene captured by a sensor device; and classifying, using a nearest neighbor search, the respective segments of the input acoustic time series based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series, each of the respective segments corresponding to a respective different one of different acoustic windows; wherein said generating step comprises: learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series; dividing the same scene into the different acoustic windows having varying ones of the MFCC features; and feeding the MFCC features of each of the different acoustic windows into respective Long Short-Term Memory (LSTM) units such that a hidden state of each of the respective LSTM units is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different acoustic windows, and wherein the method further includes replacing a hardware device monitored by the sensor responsive to the final intermediate feature.
17. The computer program product of claim 16, wherein the intermediate acoustic features both capture feature correlations between different acoustic windows in a same scene and isolate and weaken an effect of uninteresting features in the same scene.
18. The computer program product of claim 16, wherein said classifying step comprises generating the final intermediate feature for each of the different acoustic windows by optimizing a triplet loss function to which is added a regularization parameter computed on each of the intermediate audio features to reduce an importance of the uninteresting features, and wherein the uninteresting features comprise silence.
19. The computer program product of claim 18, wherein the triplet loss function adjusts a triplet selection algorithm to avoid using segments the uninteresting portions as silence and noise by using a silence and noise bias.
20. A computer processing system for audio scene classification in an information retrieval system, comprising: a memory device for storing program code; and a hardware processor, operatively coupled to the memory device, for running the program code to generate intermediate audio features from respective segments of an input acoustic time series for a same scene captured by a sensor device; and classify, using a nearest neighbor search, the respective segments of the input acoustic time se
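The training objective described in claims 3, 5, 6 and 10 (a triplet loss plus a regularization term on silence weights) might look roughly like the sketch below. The hinge-style triplet loss is a common textbook form. The claims do not give a formula for the regularization term, so `silence_penalty` is a hypothetical placeholder, chosen only because it grows as the summed silence weights approach zero; `margin`, `lam` and `eps` are likewise assumed parameters. Following claim 5, the last element of each embedding is treated as its silence weight.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    Zero once the negative is at least `margin` farther from the anchor
    than the positive is.
    """
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)


def silence_penalty(embeddings, eps=1e-6):
    """Hypothetical regularizer (not from the source).

    Sums the silence weights (last element of each embedding) and returns
    a value that becomes large when that sum nears zero.
    """
    total = sum(abs(e[-1]) for e in embeddings)
    return 1.0 / (total + eps)


def regularized_triplet_loss(anchor, positive, negative, margin=1.0, lam=0.1):
    """Triplet loss plus the assumed silence-weight penalty, scaled by `lam`."""
    base = triplet_loss(anchor, positive, negative, margin)
    return base + lam * silence_penalty([anchor, positive, negative])
```

Any term with the same qualitative behavior could be substituted for `silence_penalty`; the point of the sketch is only how such a term attaches to the triplet loss.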
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
the extracted parameters being the cepstrum · CPC title