US12288567B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12288567-B2 |
| Application number | US-202017792073-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 10, 2020 |
| Priority date | Jan 10, 2020 |
| Publication date | Apr 29, 2025 |
| Grant date | Apr 29, 2025 |
Official abstract text for this publication.
A neural network, a system using this neural network and a method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method including: obtaining audio and image training signals of a scene showing an environment with objects generating sounds, obtaining a target description of the environment seen on the image training signal, inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and comparing the target description of the environment with the training description of the environment.
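The training procedure in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the synthetic scene generator `make_scene`, the stand-in `teacher_describe` (playing the role of whatever produces the target description from the image), the linear `student_describe` model, and the use of mean squared error as the comparison. The abstract itself only states that the target description and the training description are compared; it does not name a model or a loss.

```python
import random


def make_scene(rng):
    """Hypothetical paired sample: audio features and an image frame of one scene."""
    s = rng.random()               # hidden position of a sound source in [0, 1]
    audio = [s, 1.0 - s]           # toy two-channel audio energies
    image = [s, 0.5, 1.0 - s]      # toy three-pixel image frame
    return audio, image


def teacher_describe(image):
    """Stand-in for the source of the target description (sees the image only)."""
    return [2.0 * p for p in image]  # toy three-pixel "depth map"


def student_describe(weights, bias, audio):
    """Linear model that outputs a description from the audio signal only."""
    return [sum(w * a for w, a in zip(row, audio)) + b
            for row, b in zip(weights, bias)]


def compare(target, predicted):
    """Comparison step, here assumed to be a mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    weights = [[0.0, 0.0] for _ in range(3)]
    bias = [0.0, 0.0, 0.0]
    losses = []
    for _ in range(steps):
        audio, image = make_scene(rng)                        # obtain training signals
        target = teacher_describe(image)                      # target description
        predicted = student_describe(weights, bias, audio)    # training description
        losses.append(compare(target, predicted))             # compare the two
        n = len(target)
        for i in range(n):
            # Gradient of the mean squared error with respect to output i.
            grad = 2.0 * (predicted[i] - target[i]) / n
            for j in range(len(audio)):
                weights[i][j] -= lr * grad * audio[j]
            bias[i] -= lr * grad
    return weights, bias, losses
```

After `weights, bias, losses = train()`, calling `student_describe(weights, bias, [0.25, 0.75])` should land close to the teacher's output for the matching image, even though the student never receives the image.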
Opening claim text (preview).
The invention claimed is:
1. A method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method comprising: obtaining audio and image training signals of a scene showing an environment with objects generating sounds, obtaining a target description of the environment seen on the image training signal, inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and comparing the target description of the environment with the training description of the environment, wherein the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.
2. The method of claim 1, wherein the audio training signal is acquired with a plurality of sound acquisition devices.
3. The method of claim 2, wherein the sound acquisition devices of the plurality of sound acquisition devices are all spaced apart from each other.
4. The method of claim 2, wherein at least one additional sound acquisition device is used to acquire an audio signal at a location which differs from the location of any one of the sound acquisition devices of the plurality of sound acquisition devices, the neural network being further configured to determine at least one predicted audio signal representative of the audio signal that is acquired by the at least one additional sound acquisition device, and the method further comprising comparing the predicted audio signal with an audio signal acquired by the at least one additional sound acquisition device.
5. The method of claim 1, wherein the audio training signal is acquired using at least one binaural sound acquisition device.
6. The method of claim 1, wherein the image training signal is acquired using a 360 degrees camera.
7. The method of claim 1, wherein the target description is obtained using at least one pre-trained neural network configured to receive an image signal as input and to output the target description.
8. A neural network trained using the method of claim 1.
9. The neural network of claim 8, comprising, for each possible audio signal to be used as input, four convolutional layers, a concatenation module for concatenating the outputs of every four convolutional layers, and an ASPP module.
10. A system comprising at least one sound acquisition device and a neural network in accordance with claim 8.
11. A vehicle comprising a system according to claim 10.
12. A system for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the system comprising: a module for obtaining audio and image training signals of a scene showing an environment with objects generating sounds, a module for obtaining a target description of the environment seen on the image training signal, a module for inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and a module for comparing the target description of the environment with the training description of the environment, wherein the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.
13. A non-transitory recording medium readable by a computer and having recorded thereon a computer program including instructions for executing a method for training a neural network to output a description of the environment in the vicinity of at least one sound acquisition device on the basis of an audio signal acquired by the sound acquisition device, the method comprising: obtaining audio and image training signals of a scene showing an environment with objects generating sounds, obtaining a target description of the environment seen on the image training signal, inputting the audio training signal to the neural network so that the neural network outputs a training description of the environment, and comparing the target description of the environment with the training description of the environment, wherein the description of the environment, the target description of the environment, and the training description of the environment include at least one of a semantic segmentation of a frame of the image training signal or a depth map of a frame of the image training signal.
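The layout recited in claim 9 (one stack of four convolutional layers per audio input, a concatenation module, and an ASPP module) can be sketched on one-dimensional toy signals. This is a simplified illustration under stated assumptions, not the patented network: the helper names, the fixed kernels, the ReLU activations, the dilation rates, and the summing of ASPP branches are all hypothetical choices. ASPP is taken here to mean atrous spatial pyramid pooling, that is, parallel dilated convolutions applied to the same features.

```python
def conv1d(signal, kernel, dilation=1):
    """Zero-padded, same-length 1-D convolution with an optional dilation."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):   # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def relu(values):
    return [max(0.0, v) for v in values]


def encode(signal, kernels):
    """Four convolutional layers applied in sequence to one audio input."""
    assert len(kernels) == 4
    x = signal
    for kernel in kernels:
        x = relu(conv1d(x, kernel))
    return x


def concatenate(encoded_inputs):
    """Concatenation module: stack the per-input features as channels."""
    return list(encoded_inputs)


def aspp(channels, kernel, dilations=(1, 2, 4)):
    """ASPP-style module: parallel dilated convolutions over the same channels.

    Simplification: branch outputs are summed instead of being concatenated
    and mixed by a further learned layer.
    """
    length = len(channels[0])
    out = [0.0] * length
    for d in dilations:
        for channel in channels:
            branch = conv1d(channel, kernel, d)
            out = [o + b for o, b in zip(out, branch)]
    return out


def forward(audio_inputs, kernels, aspp_kernel):
    encoded = [encode(sig, kernels) for sig in audio_inputs]
    return aspp(concatenate(encoded), aspp_kernel)
```

With identity kernels the data flow is easy to follow by hand: `forward([[1, 2, 3], [4, 5, 6]], [[0, 1, 0]] * 4, [0, 1, 0])` sums the two encoded channels once per dilation branch.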
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Terrestrial scenes (scenes under surveillance with static cameras G06V20/52; scenes perceived from the exterior of a vehicle G06V20/56; scenes perceived from the interior of a vehicle G06V20/59) · CPC title
structured as a network, e.g. client-server architectures · CPC title
Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion · CPC title
using neural networks · CPC title