System and method for anomaly detection in unlabeled collections of audio recording

US12554943B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12554943-B2
Application number: US-202318222409-A
Country: US
Kind code: B2
Filing date: Jul 14, 2023
Priority date: Jul 14, 2023
Publication date: Feb 17, 2026
Grant date: Feb 17, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In some implementations, the device may include receiving a first and second audio dataset. In addition, the device may generate a first, a second, a third, and a fourth audio sample. Moreover, the device may include determining a level of similarity between the first and second audio samples. Also, the device may include combining the first and second audio samples into an audio pair. Further, the device may include training a machine learning model to map audio samples to a latent space visualization in view of time and the similarities between the first and second audio samples to yield a trained machine learning model. In addition, the device may include mapping, by the machine learning model, in the latent space visualization, the third and fourth audio samples, where placement of the third and fourth audio samples depends on the level of similarity between the third and fourth audio samples.
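The sampling and pairing steps described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the helper names (`segment_waveform`, `cosine_similarity`, `make_pair`), the toy sine-wave dataset, and the threshold value are all assumptions, and plain cosine similarity stands in for the similarity measure, which the claims attribute to a pretext classifier.

```python
import math

def segment_waveform(waveform, segment_len):
    """Split a raw waveform into non-overlapping time-domain segments."""
    last_start = len(waveform) - segment_len
    return [waveform[i:i + segment_len]
            for i in range(0, last_start + 1, segment_len)]

def cosine_similarity(a, b):
    """Stand-in similarity measure (assumption, not the pretext classifier)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def make_pair(sample_a, sample_b, threshold):
    """Return an audio pair only if similarity exceeds the threshold."""
    if cosine_similarity(sample_a, sample_b) > threshold:
        return (sample_a, sample_b)
    return None

# Hypothetical "first audio dataset": a sine wave with a period of 8 samples.
first_dataset = [math.sin(2 * math.pi * n / 8) for n in range(64)]

# Two distinct, non-overlapping samples, each smaller than the dataset.
first_sample, second_sample = segment_waveform(first_dataset, 32)[:2]

# The threshold of 0.9 is an illustrative value only.
pair = make_pair(first_sample, second_sample, threshold=0.9)
```

In this toy case both segments cover whole periods of the same sine wave, so they are paired. Pairs formed this way would then serve as training input for the model that maps samples into the latent space.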

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method for detecting an anomaly in audio data, comprising: receiving, from a microphone, a first audio dataset and a second audio dataset; generating, based on the first audio dataset, a first audio sample and a second audio sample, where each of the first and second audio samples are smaller than the first audio dataset and the first audio sample is distinct from the second audio sample, wherein the generating is conducted by sampling in a time domain by segmenting raw waveforms of the first audio sample and second audio sample; generating, based on the second audio dataset, a third audio sample and a fourth audio sample, where each of the third and fourth audio samples are smaller than the second audio dataset and the third audio sample is distinct from the fourth audio sample, wherein the generating is conducted by sampling in a time domain by segmenting raw waveforms; determining a level of similarity between the first audio sample and the second audio sample utilizing a pretext classifier that includes a neural network configured to predict a probability of one of the audio samples belonging to an associated cluster in response to identifying an anomaly score; combining the first audio sample and the second audio sample into an audio pair in response to the level of similarity between the first audio sample and the second audio sample being above a first predetermined threshold; training a machine learning model, based on the audio pair, to map audio samples to a latent space visualization in view of time and the similarities between the first audio sample and the second audio sample to yield a trained machine learning model; and mapping, by the trained machine learning model, in the latent space visualization, the third audio sample and the fourth audio sample where placement of the third audio sample and the fourth audio sample depends on the level of similarity of the third audio sample and the fourth audio sample, as determined by the trained machine learning model.

2. A computer-implemented method of claim 1, comprising: labeling the third audio sample and the fourth audio sample based on a number of clusters in the latent space visualization.

3. A computer-implemented method of claim 2, comprising: receiving a fifth audio sample; generating a probability score for the fifth audio sample wherein the probability score indicates a probability that the fifth audio sample is associated with the cluster; comparing the probability score with a second predetermined threshold; and associating the fifth audio sample with the cluster in response to the probability score being greater than the second predetermined threshold.

4. The computer-implemented method of claim 1, wherein the first audio sample and the second audio sample do not overlap in view of the first audio dataset.

5. The computer-implemented method of claim 1, wherein the training of the machine learning model is performed via a self-supervised contrastive learning objectives.

6. The computer-implemented method of claim 1, wherein the first audio dataset and the second audio dataset do not include human annotations.

7. The computer-implemented method of claim 1, wherein the mapping of the third audio sample and the fourth audio sample in the latent space visualization creates a cluster and the method further comprising: determining a shared attribute between the third audio sample and the fourth audio sample; and labeling the cluster based on the shared attributes of the third audio sample and the fourth audio sample.

8. A system for detecting an anomaly in audio data comprising: one or more processors configured to: receive, from a microphone, a first audio dataset and a second audio dataset; generate, based on the first audio dataset, a first audio sample and a second audio sample, where each of the first and second audio samples are smaller than the first audio dataset and the first audio sample is distinct from the second audio sample, wherein the generating is conducted by sampling in a time domain by segmenting raw waveforms of the first audio sample and second audio sample; generate, based on the second audio dataset, a third audio sample and a fourth audio sample, where each of the third and fourth audio samples are smaller than the second audio dataset and the third audio sample is distinct from the fourth audio sample, wherein the generating is conducted by sampling in a time domain by segmenting raw waveforms; determine a level of similarity between the first audio sample and the second audio sample utilizing a pretext classifier that includes a neural network configured to predict a probability of one of the audio samples belonging to an associated cluster in response to identifying an anomaly score; combine the first audio sample and the second audio sample into an audio pair in response to the level of similarity between the first audio sample and the second audio sample being above a first predetermined threshold; train a machine learning model, based on the audio pair, to map audio samples to a latent space visualization in view of time and the similarities between the first audio sample and the second audio sample to yield a trained machine learning model; and map, by the trained machine learning model, in the latent space visualization, the third audio sample and the fourth audio sample where placement of the third audio sample and the fourth audio sample depends on the level of similarity of the third audio sample and the fourth audio sample, as determined by the trained machine learning model.

9. The system of claim 8, wherein the mapping of the third audio sample and the fourth audio sample in the latent space visualization creates a cluster and the one or more processors, are further configured to: label the third audio sample and the fourth audio sample based on a number of clusters in the latent space visualization.

10. The system of claim 9, wherein the one or more processors are further configured to: receive a fifth audio sample; generate a probability score for the fifth audio sample wherein the probability score indicates a probability that the fifth audio sample is associated with the cluster; compare the probability score with a second predetermined threshold; and associate the fifth audio sample with the cluster in response to the probability score being greater than the second predetermined threshold.

11. The system of claim 8, wherein the first audio sample and the second audio sample do not overlap in view of the first audio dataset.

12. The system of claim 8, wherein the training of the machine learning model is performed via a self-supervised contrastive learning objectives.

13. The system of claim 8, wherein the first audio dataset and the second audio dataset do not include human annotations.

14. The system of claim 8, wherein the mapping of the third audio sample and the fourth audio sample in the latent space visualization creates a cluster and the one or more processors, are further configured to: determine a shared attribute between the third audio sample and the fourth audio sample; and label the cluster based on the shared attributes of the third audio sample and the fourth audio sample.

15. A non-transitory computer-readable medium storing a set of instructions for detecting an anomaly in audio data, the set of instructions comprising: one or more instructions that, when executed by one or more processors of a device, cause the device to: receive, from a microphone
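Claims 3 and 10 describe scoring a fifth audio sample against a cluster and associating it only when the probability score exceeds a second predetermined threshold. A minimal sketch of that comparison follows. The claims do not state how the probability is computed, so the softmax over negative squared distances to cluster centroids, the function names, the two-dimensional embeddings, and the threshold of 0.8 are all assumptions made for illustration.

```python
import math

def cluster_probabilities(embedding, centroids):
    """Assumed scoring rule: softmax over negative squared distances."""
    scores = [-sum((e - c) ** 2 for e, c in zip(embedding, centroid))
              for centroid in centroids]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def associate(embedding, centroids, threshold):
    """Return the index of the best cluster if its probability exceeds
    the threshold, otherwise None (no association is made)."""
    probs = cluster_probabilities(embedding, centroids)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] > threshold else None

# Hypothetical cluster centroids in a 2-D latent space.
centroids = [[0.0, 0.0], [4.0, 4.0]]

near_cluster_0 = associate([0.1, -0.1], centroids, threshold=0.8)
between_clusters = associate([2.0, 2.0], centroids, threshold=0.8)
```

Here the first embedding lies close to the first centroid and is associated with it, while the second is equidistant from both centroids, scores 0.5 for each, and is left unassociated. A sample that clears no cluster's threshold could plausibly be treated as an anomaly candidate, although the visible claim text does not spell that step out.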

Assignees

Inventors

Classifications

  • using neural networks · CPC title

  • G06F40/51 · Primary

    Translation evaluation · CPC title

  • G10L25/51 · Primary

    for comparison or discrimination · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12554943B2 cover?
In some implementations, the device may include receiving a first and second audio dataset. In addition, the device may generate a first, a second, a third, and a fourth audio sample. Moreover, the device may include determining a level of similarity between the first and second audio samples. Also, the device may include combining the first and second audio samples into an audio pair. Further,…
Who is the assignee on this patent?
Robert Bosch GmbH
What technology area does this patent fall under?
Primary CPC classification G06F40/51. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 17, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).