Method for training a neural network to describe an environment on the basis of an audio signal, and the corresponding neural network

US12288567B2 · US · B2

Patent metadata
Publication number: US-12288567-B2
Application number: US-202017792073-A
Country: US
Kind code: B2
Filing date: Jan 10, 2020
Priority date: Jan 10, 2020
Publication date: Apr 29, 2025
Grant date: Apr 29, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A neural network, a system using this neural network and a method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method including: obtaining audio and image training signals of a scene showing an environment with objects generating sounds, obtaining a target description of the environment seen on the image training signal, inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and comparing the target description of the environment with the training description of the environment.
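The four steps in the abstract (obtain paired audio and image signals, derive a target description from the image, run the audio through the network, compare the two descriptions) can be illustrated with a toy training loop. This is a minimal sketch under stated assumptions, not the patented implementation: `teacher_segment` is a hypothetical brightness threshold standing in for an image-based labeller, `AudioStudent` is a hypothetical per-pixel linear classifier, the comparison is assumed to be a cross-entropy loss, and all data values are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_segment(image_frame):
    """Hypothetical stand-in for an image-based labeller:
    a pixel gets class 1 if its brightness exceeds 0.5, else class 0."""
    return [1 if v > 0.5 else 0 for v in image_frame]

class AudioStudent:
    """Hypothetical toy model: one linear classifier per output pixel,
    fed only with audio features (never with the image)."""

    def __init__(self, n_features, n_pixels, n_classes):
        self.w = [[[0.0] * n_features for _ in range(n_classes)]
                  for _ in range(n_pixels)]

    def forward(self, audio):
        # Training description: class logits for every pixel.
        return [[sum(wf * x for wf, x in zip(wc, audio)) for wc in wp]
                for wp in self.w]

    def train_step(self, audio, target, lr=0.5):
        """Compare prediction with target (assumed cross-entropy loss)
        and apply one gradient-descent update. Returns the mean loss
        measured before the update."""
        logits = self.forward(audio)
        loss = 0.0
        for p, (pixel_logits, label) in enumerate(zip(logits, target)):
            probs = softmax(pixel_logits)
            loss -= math.log(probs[label])
            for c, prob in enumerate(probs):
                grad = prob - (1.0 if c == label else 0.0)
                for f, x in enumerate(audio):
                    self.w[p][c][f] -= lr * grad * x
        return loss / len(target)

# Made-up paired training sample: audio features and a 4-pixel image frame.
audio = [0.2, -0.4, 0.9]
image = [0.1, 0.8, 0.7, 0.3]

target = teacher_segment(image)            # target description from the image
student = AudioStudent(n_features=3, n_pixels=4, n_classes=2)
losses = [student.train_step(audio, target) for _ in range(20)]
print(target, round(losses[0], 4), losses[-1] < losses[0])
```

The image is used only to produce the target; at inference time such a model would receive audio alone, which is the point of the training scheme described above.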

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method comprising: obtaining audio and image training signals of a scene showing an environment with objects generating sounds, obtaining a target description of the environment seen on the image training signal, inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and comparing the target description of the environment with the training description of the environment, wherein the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.

2. The method of claim 1, wherein the audio training signal is acquired with a plurality of sound acquisition devices.

3. The method of claim 2, wherein the sound acquisition devices of the plurality of sound acquisition devices are all spaced apart from each other.

4. The method of claim 2, wherein at least one additional sound acquisition device is used to acquire an audio signal at a location which differs from the location of any one of the sound acquisition devices of the plurality of sound acquisition devices, the neural network being further configured to determine at least one predicted audio signal representative of the audio signal that is acquired by the at least one additional sound acquisition device, and the method further comprising comparing the predicted audio signal with an audio signal acquired by the at least one additional sound acquisition device.

5. The method of claim 1, wherein the audio training signal is acquired using at least one binaural sound acquisition device.

6. The method of claim 1, wherein the image training signal is acquired using a 360 degrees camera.

7. The method of claim 1, wherein the target description is obtained using at least one pre-trained neural network configured to receive an image signal as input and to output the target description.

8. A neural network trained using the method of claim 1.

9. The neural network of claim 8, comprising, for each possible audio signal to be used as input, four convolutional layers, a concatenation module for concatenating the outputs of every four convolutional layers, and an ASPP module.

10. A system comprising at least one sound acquisition device and a neural network in accordance with claim 8.

11. A vehicle comprising a system according to claim 10.

12. A system for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the system comprising: a module for obtaining audio and image training signals of a scene showing an environment with objects generating sounds, a module for obtaining a target description of the environment seen on the image training signal, a module for inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and a module for comparing the target description of the environment with the training description of the environment, wherein the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.

13. A non-transitory recording medium readable by a computer and having recorded thereon a computer program including instructions for executing a method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method comprising: obtaining audio and image training signals of a scene showing an environment with objects generating sounds, obtaining a target description of the environment seen on the image training signal, inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and comparing the target description of the environment with the training description of the environment, wherein the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.
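The layout recited in claim 9 (four convolutional layers per audio input, a concatenation module, then an ASPP module) can be traced with a small pure-Python sketch. Everything beyond that layout is an assumption made for illustration: the claim does not give kernel sizes, strides, dilation rates, activations, or input dimensionality, so the 1-D signals, stride-2 layers, ReLU, fixed kernels, and rates `(1, 2, 3)` below are hypothetical choices. ASPP (atrous spatial pyramid pooling) is modelled here only as parallel dilated convolutions whose outputs are kept side by side.

```python
def conv1d(signal, kernel, stride=1, dilation=1, padding=0):
    """1-D cross-correlation with zero padding, stride and dilation."""
    padded = [0.0] * padding + list(signal) + [0.0] * padding
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(k * padded[start + i * dilation] for i, k in enumerate(kernel))
        for start in range(0, len(padded) - span + 1, stride)
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def encode(signal, kernels):
    """Four convolutional layers applied in sequence to one audio input.
    Stride 2 with padding 1 (an assumption) halves the length each time."""
    assert len(kernels) == 4
    for kernel in kernels:
        signal = relu(conv1d(signal, kernel, stride=2, padding=1))
    return signal

def concatenate(encoded_inputs):
    """Stack the per-input encoder outputs as channels."""
    return [list(channel) for channel in encoded_inputs]

def aspp(channels, kernel, rates=(1, 2, 3)):
    """ASPP-style block: the same kernel applied at several dilation rates
    in parallel; each branch sums over input channels and keeps the length."""
    branches = []
    for rate in rates:
        per_channel = [conv1d(ch, kernel, dilation=rate, padding=rate)
                       for ch in channels]
        branches.append([sum(vals) for vals in zip(*per_channel)])
    return branches

def forward(audio_inputs, encoder_kernels, aspp_kernel):
    encoded = [encode(sig, ks) for sig, ks in zip(audio_inputs, encoder_kernels)]
    return aspp(concatenate(encoded), aspp_kernel)

# Two made-up audio inputs of 64 samples each (e.g. a left and a right channel).
left = [float(i % 5) for i in range(64)]
right = [float((i * 3) % 7) for i in range(64)]
kernels = [[0.25, 0.5, 0.25]] * 4          # hypothetical fixed smoothing kernels

out = forward([left, right], [kernels, kernels], [1.0, 0.0, -1.0])
print(len(out), len(out[0]))               # number of ASPP branches, branch length
```

With these assumed settings each encoder reduces 64 samples to 4, and the ASPP block returns one length-4 branch per dilation rate. A trained network would learn its kernels rather than use fixed ones.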

Assignees

Inventors

Classifications

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Terrestrial scenes (scenes under surveillance with static cameras G06V20/52; scenes perceived from the exterior of a vehicle G06V20/56; scenes perceived from the interior of a vehicle G06V20/59) · CPC title

  • structured as a network, e.g. client-server architectures · CPC title

  • Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title

  • using neural networks · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12288567B2 cover?
A neural network, a system using this neural network and a method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method including: obtaining audio and image training signals of a scene showing an environment with objects generating soun…
Who is the assignee on this patent?
Toyota Motor Europe, ETH Zuerich, Toyota Motor Co Ltd, and 1 more
What technology area does this patent fall under?
Primary CPC classification G10L25/30. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 29, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).