What technology area does this patent fall under?

Primary CPC classification G10L25/51. Mapped technology areas include Physics.

When was this patent published?

Publication date Jun 7, 2022 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Audio scene recognition using time series analysis

US11355138B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11355138-B2
Application number	US-202016997249-A
Country	US
Kind code	B2
Filing date	Aug 19, 2020
Priority date	Aug 27, 2019
Publication date	Jun 7, 2022
Grant date	Jun 7, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method is provided. Intermediate audio features are generated from respective segments of an input acoustic time series for a same scene. Using a nearest neighbor search, respective segments of the input acoustic time series are classified based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series. Each respective segment corresponds to a respective different acoustic window. The generating step includes learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series, dividing the same scene into the different windows having varying MFCC features, and feeding the MFCC features of each window into respective LSTM units such that a hidden state of each respective LSTM unit is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different windows.
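The classification stage described in the abstract (overlapping windows, a nearest neighbor search per segment, and a single label for the whole scene) can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the toy 2-D vectors, and the labels are hypothetical, the vectors stand in for the learned intermediate audio features (no MFCC extraction or LSTM is performed here), and the majority vote follows the option named in claim 15.

```python
from collections import Counter

import numpy as np


def split_into_windows(series, window, hop):
    """Split a 1-D series into windows; a hop smaller than window overlaps them."""
    starts = range(0, len(series) - window + 1, hop)
    return np.stack([series[s:s + window] for s in starts])


def nearest_label(query, bank, labels):
    """Return the label of the stored vector closest to query (Euclidean)."""
    distances = np.linalg.norm(bank - query, axis=1)
    return labels[int(np.argmin(distances))]


def classify_scene(segment_features, bank, labels):
    """Label each segment by 1-nearest-neighbor, then majority-vote."""
    votes = [nearest_label(f, bank, labels) for f in segment_features]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical historical segment features and their scene labels.
bank = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
labels = ["street", "street", "cafe", "cafe"]

# Hypothetical features for three segments of one input scene.
segments = np.array([[0.2, 0.1], [4.8, 5.2], [5.0, 5.1]])
scene = classify_scene(segments, bank, labels)

# Overlapping windows over a toy series: window of 4 samples, hop of 2.
windows = split_into_windows(np.arange(10), window=4, hop=2)
```

In this toy run two of the three segments fall nearest to "cafe" vectors, so the scene-level vote is "cafe". A real system would replace the brute-force distance scan with an indexed nearest neighbor search once the feature bank grows.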

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for audio scene classification in an information retrieval system, comprising: generating intermediate audio features from respective segments of an input acoustic time series for a same scene captured by a sensor device; and classifying, using a nearest neighbor search, the respective segments of the input acoustic time series based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series, each of the respective segments corresponding to a respective different one of different acoustic windows; wherein said generating step comprises: learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series; dividing the same scene into the different acoustic windows having varying ones of the MFCC features; and feeding the MFCC features of each of the different acoustic windows into respective Long Short-Term Memory (LSTM) units such that a hidden state of each of the respective LSTM units is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different acoustic windows, and wherein the method further includes replacing a hardware device monitored by the sensor responsive to the final intermediate feature.

2. The computer-implemented method of claim 1, wherein the intermediate acoustic features both capture feature correlations between different acoustic windows in a same scene and isolate and weaken an effect of uninteresting features in the same scene.

3. The computer-implemented method of claim 1, wherein said classifying step comprises generating the final intermediate feature for each of the different acoustic windows by optimizing a triplet loss function to which is added a regularization parameter computed on each of the intermediate audio features to reduce an importance of the uninteresting features, and wherein the uninteresting features comprise silence.

4. The computer-implemented method of claim 3, wherein the triplet loss function adjusts a triplet selection algorithm to avoid using segments the uninteresting portions as silence and noise by using a silence and noise bias.

5. The computer-implemented method of claim 3, wherein the regularization parameter is computed on a last element of each of the intermediate audio features, the last element being a silence weight.

6. The computer-implemented method of claim 3, wherein the regularization parameter comprises a sum of silence weights and prevents all of the silence weights from simultaneously reaching a value of zero.

7. The computer-implemented method of claim 1, wherein an entirety of the same scene is divided into overlapping windows to exploit inter-window dependencies.

8. The computer-implemented method of claim 1, wherein each of the respective LSTM units comprise as many hidden states as time steps in a given current one of the windows.

9. The computer-implemented method of claim 1, further comprising preprocessing the input acoustic sequence by applying a Fast Fourier Transform (FFT) to each of the different acoustic windows to extract respective acoustic frequency energy levels therefor.

10. The computer-implemented method of claim 1, wherein the intermediate audio features are generated to isolate and weaken the effect of uninteresting features in the same scene using a triplet loss that pushes different classes farther apart than similar classes in a classification space.

11. The computer-implemented method of claim 1, further comprising computing an embedding of the input acoustic time series as the weighted average of each of the hidden states.

12. The computer-implemented method of claim 11, wherein the embedding is the final intermediate feature.

13. The computer-implemented method of claim 1, further comprising receiving a query segment, and finding a most similar historical segment using a nearest neighbor.

14. The computer-implemented method of claim 1, wherein said learning step learns the intermediate audio features by minimizing a loss function computed using the intermediate audio features of a randomly selected batch of segments from the input acoustic sequence.

15. The computer-implemented method of claim 1, wherein the final intermediate feature is determined by majority voting on classifications for the segments forming the input acoustic time series.

16. A computer program product for audio scene classification in an information retrieval system, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: generating intermediate audio features from respective segments of an input acoustic time series for a same scene captured by a sensor device; and classifying, using a nearest neighbor search, the respective segments of the input acoustic time series based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series, each of the respective segments corresponding to a respective different one of different acoustic windows; wherein said generating step comprises: learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series; dividing the same scene into the different acoustic windows having varying ones of the MFCC features; and feeding the MFCC features of each of the different acoustic windows into respective Long Short-Term Memory (LSTM) units such that a hidden state of each of the respective LSTM units is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different acoustic windows, and wherein the method further includes replacing a hardware device monitored by the sensor responsive to the final intermediate feature.

17. The computer program product of claim 16, wherein the intermediate acoustic features both capture feature correlations between different acoustic windows in a same scene and isolate and weaken an effect of uninteresting features in the same scene.

18. The computer program product of claim 16, wherein said classifying step comprises generating the final intermediate feature for each of the different acoustic windows by optimizing a triplet loss function to which is added a regularization parameter computed on each of the intermediate audio features to reduce an importance of the uninteresting features, and wherein the uninteresting features comprise silence.

19. The computer program product of claim 18, wherein the triplet loss function adjusts a triplet selection algorithm to avoid using segments the uninteresting portions as silence and noise by using a silence and noise bias.

20. A computer processing system for audio scene classification in an information retrieval system, comprising: a memory device for storing program code; and a hardware processor, operatively coupled to the memory device, for running the program code to generate intermediate audio features from respective segments of an input acoustic time series for a same scene captured by a sensor device; and classify, using a nearest neighbor search, the respective segments of the input acoustic time se
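Two ideas in the claims lend themselves to a short illustration: the embedding computed as a weighted average of hidden states (claim 11) and the triplet loss that pushes different classes farther apart than similar ones (claim 10). The sketch below is a hedged example, not the claimed implementation. It assumes the weights come from a softmax over per-step attention scores and that distances are squared Euclidean; all function names and toy values are hypothetical. The silence-weight regularization of claims 3, 5, and 6 is omitted because its exact form is not given on this page.

```python
import numpy as np


def attention_pool(hidden_states, scores):
    """Weighted average of hidden states; weights are a softmax of scores.

    hidden_states: array of shape (time_steps, hidden_dim)
    scores: array of shape (time_steps,), one attention score per step
    """
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    weights = np.exp(shifted) / np.sum(np.exp(shifted))
    return weights @ hidden_states, weights


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge-style triplet loss on squared Euclidean distances.

    Zero once the negative is at least margin farther from the anchor
    than the positive is; positive otherwise.
    """
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical hidden states for two time steps, with equal attention scores.
hidden = np.array([[1.0, 0.0], [0.0, 1.0]])
embedding, weights = attention_pool(hidden, np.array([0.0, 0.0]))

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
easy_loss = triplet_loss(anchor, positive, np.array([3.0, 0.0]))  # negative far away
hard_loss = triplet_loss(anchor, positive, np.array([1.0, 0.0]))  # negative as close as positive
```

With equal scores the pooled embedding is the plain mean of the hidden states. The far negative already satisfies the margin, so it contributes no loss, while the negative that is as close as the positive is penalized by the full margin.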

Assignees

Inventors

Classifications

G06N3/045
Combinations of networks · CPC title
G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
G06N3/09
Supervised learning · CPC title
G10L25/24
the extracted parameters being the cepstrum · CPC title

Patent family

Related publications grouped by family.

View patent family 74659419

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11355138B2 cover?: A method is provided. Intermediate audio features are generated from respective segments of an input acoustic time series for a same scene. Using a nearest neighbor search, respective segments of the input acoustic time series are classified based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series. Each respective s…
Who is the assignee on this patent?: NEC Lab America Inc, NEC Corp