System and method for multi-spoken language detection
US-2020219492-A1 · Jul 9, 2020 · US
US11308329B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11308329-B2 |
| Application number | US-202016868805-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 7, 2020 |
| Priority date | May 7, 2020 |
| Publication date | Apr 19, 2022 |
| Grant date | Apr 19, 2022 |
Official abstract text for this publication.
A computer system is trained to understand audio-visual spatial correspondence using audio-visual clips having multi-channel audio. The computer system includes an audio subnetwork, video subnetwork, and pretext subnetwork. The audio subnetwork receives the two channels of audio from the audio-visual clips, and the video subnetwork receives the video frames from the audio-visual clips. In a subset of the audio-visual clips the audio-visual spatial relationship is misaligned, causing the audio-visual spatial cues for the audio and video to be incorrect. The audio subnetwork outputs an audio feature vector for each audio-visual clip, and the video subnetwork outputs a video feature vector for each audio-visual clip. The audio and video feature vectors for each audio-visual clip are merged and provided to the pretext subnetwork, which is configured to classify the merged vector as either having a misaligned audio-visual spatial relationship or not. The subnetworks are trained based on the loss calculated from the classification.
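The misalignment step described in the abstract can be illustrated with a small sketch. It assumes the two-channel case from the claims, where misaligning means switching the left and right audio channels, and it uses the 0.5 selection probability recited in claim 3. The function name `make_pretext_samples`, the `p_swap` and `seed` parameters, and the label convention (1 for misaligned, 0 for aligned) are illustrative assumptions, not taken from the patent.

```python
import random


def make_pretext_samples(clips, p_swap=0.5, seed=0):
    """Build (left, right, label) training triples from two-channel clips.

    clips: iterable of (left_channel, right_channel) pairs.
    A clip is selected for misalignment with probability p_swap; selected
    clips have their channels switched and receive label 1 (assumed
    convention), all other clips keep their channels and receive label 0.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    samples = []
    for left, right in clips:
        if rng.random() < p_swap:
            # Misaligned: audio cues now contradict the video frames.
            samples.append((right, left, 1))
        else:
            # Aligned: channels are left untouched.
            samples.append((left, right, 0))
    return samples
```

Because the labels come from the swap itself, no manual annotation is needed; the label is the known classification that the loss is later computed against.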
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for training a network to interpret audio-visual spatial correspondence, the method comprising: obtaining, by a preprocessing subsystem, a plurality of audio-visual samples; extracting, by the preprocessing subsystem for each audio-visual sample, a plurality of audio channels and a plurality of video frames; misaligning, by the preprocessing subsystem, an audio-visual spatial relationship in each of a first subset of the plurality of audio-visual samples; and for each of the plurality of audio-visual samples: calculating, with an audio subnetwork of the network, an audio feature vector for the respective audio-visual sample based on the plurality of audio channels of the respective audio-visual sample, wherein calculating the audio feature vector for the respective audio-visual sample comprises receiving, at the audio subnetwork, a first audio representation representing a first audio channel of the plurality of audio channels stacked with a second audio representation representing a second audio channel of the plurality of audio channels; calculating, with a visual subnetwork of the network, a visual feature vector for the respective audio-visual sample based on the plurality of video frames of the respective audio-visual sample; merging, with a merging subsystem of the network, the audio feature vector with the visual feature vector over a time domain to generate an audio-visual vector; classifying, with a pretext subnetwork of the network, the audio-visual vector into one of a first subgroup or a second subgroup, wherein the first subgroup classification indicates the audio-visual spatial relationship is misaligned in the respective audio-visual sample; and adjusting, by the network, parameters of the audio subnetwork, the visual subnetwork, and the pretext subnetwork based on a loss calculated based on classification of the audio-visual vector.

2. The method of claim 1, wherein misaligning the audio-visual spatial relationship comprises randomly selecting the first subset from the plurality of audio-visual samples.

3. The method of claim 1, wherein misaligning the audio-visual spatial relationship comprises randomly selecting the first subset from the plurality of audio-visual samples using a probability of 0.5.

4. The method of claim 1, wherein: the audio-visual samples comprise field-of-view video with binaural audio; the audio-visual spatial relationship between a first audio channel in a first audio-visual sample and the plurality of video frames in the first audio-visual sample represents, before the misaligning, audio generated from an object represented in a left portion of the plurality of video frames; the audio-visual spatial relationship between a second audio channel in the first audio-visual sample and the plurality of video frames in the first audio-visual sample represents, before the misaligning, audio generated from an object represented in a right portion of the plurality of video frames; and wherein the misaligning comprises switching the first audio channel and the second audio channel.

5. The method of claim 1, wherein the first audio representation is one of a spectrogram, a mel-spectrogram, or a raw audio waveform.

6. The method of claim 1, wherein classifying the audio-visual vector comprises average pooling across a single dimension.

7. The method of claim 1, wherein adjusting parameters of the audio subnetwork, the visual subnetwork, and the pretext subnetwork comprises: receiving, by a loss function subsystem, a known classification for the audio-visual vector; receiving, by the loss function subsystem, the classification of the audio-visual vector from the pretext subnetwork; calculating the loss based on a determination of whether the pretext subnetwork correctly classified the audio-visual vector using the known classification; and provide the loss to the audio subnetwork, the visual subnetwork, and the pretext subnetwork.

8. The method of claim 1, wherein merging the audio feature vector with the visual feature vector over the time domain comprises: reducing and flattening the visual feature vector without spatial pooling to generate a reduced visual feature vector; and merging the audio feature vector with the reduced visual feature vector over the time domain.

9. The method of claim 1, further comprising: misaligning, by the preprocessing subsystem, the audio-visual spatial relationship in each of a second subset of the plurality of audio-visual samples by modifying the plurality of audio channels; and realigning, by the preprocessing subsystem, the audio-visual spatial relationship in each of the second subset of the plurality of audio-visual samples by modifying the plurality of video frames.

10. The method of claim 1, wherein the audio-visual samples comprise 360-degree video and ambisonic audio.

11. A system for training a network to interpret audio-visual spatial correspondence, the system comprising: one or more processors; and a memory having stored thereon instructions that, upon execution by the one or more processors, cause the one or more processors to: receive a plurality of audio-visual samples; extract, for each audio-visual sample, a plurality of audio channels and a plurality of video frames; misalign an audio-visual spatial relationship in each of a first subset of the plurality of audio-visual samples; and for each of the plurality of audio-visual samples: calculate, with an audio subnetwork of the system, an audio feature vector for the respective audio-visual sample based on the plurality of audio channels of the respective audio-visual sample, wherein calculating the audio feature vector for the respective audio-visual sample comprises receiving, at the audio subnetwork, a plurality of mel-spectrograms, wherein each mel-spectrogram represents an audio channel of the plurality of audio channels, and wherein the plurality of mel-spectrograms are stacked; calculate, with a visual subnetwork of the system, a visual feature vector for the respective audio-visual sample based on the plurality of video frames of the respective audio-visual sample; merge, with a merging subsystem of the system, the audio feature vector with the visual feature vector over a time domain to generate an audio-visual vector; classify, with a pretext subnetwork of the system, the audio-visual vector into one of a first subgroup or a second subgroup, where the first subgroup classification indicates the audio-visual spatial relationship is misaligned in the respective audio-visual sample; and adjust parameters of the audio subnetwork, visual subnetwork, and pretext subnetwork based on a loss calculated based on the classifying the audio-visual vector.

12. The system of claim 11, wherein the instructions to misalign the audio-visual spatial relationship comprises further instructions that, upon execution by the one or more processors, causes the one or more processors to randomly select the first subset from the plurality of audio-visual samples using a probability of 0.5.

13. The system of claim 11, wherein the audio-visual samples comprise 360-degree video and ambisonic audio.

14. The system of claim 11, wherein the instructions to classify the audio-visual vector comprises further instructions that, upon execution by the one or more processors, causes the one or more processors to use average pooling across a single dimension.

15. The system of claim 11, wherein the instructions to adjust the parameters of the audio subnetwork, visual subnetwork, and pretext subnetwork comprises further instructions that, upon execution by the one or more processors, causes t
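The merge, pooling, classification, and loss steps recited in the claims can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: "merging over a time domain" is read here as concatenating the audio and visual features at each time step, the classifier is a hypothetical single linear layer with a sigmoid, and the loss is binary cross-entropy. The claims do not name a specific merge rule, classifier, or loss function, and all function names below are invented for the example.

```python
import math


def merge_over_time(audio_feats, visual_feats):
    """Concatenate audio and visual features per time step (assumed merge rule)."""
    if len(audio_feats) != len(visual_feats):
        raise ValueError("audio and visual features must share the time axis")
    return [a + v for a, v in zip(audio_feats, visual_feats)]


def average_pool_time(merged):
    """Average pooling across a single dimension, here the time axis."""
    steps = len(merged)
    width = len(merged[0])
    return [sum(step[i] for step in merged) / steps for i in range(width)]


def misaligned_probability(pooled, weights, bias):
    """Hypothetical pretext head: linear layer plus sigmoid."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(prob, label, eps=1e-12):
    """Loss comparing the predicted probability with the known label (0 or 1)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# Usage: two time steps, 1 audio feature and 2 visual features per step.
audio = [[1.0], [3.0]]
visual = [[2.0, 0.0], [4.0, 2.0]]
pooled = average_pool_time(merge_over_time(audio, visual))
prob = misaligned_probability(pooled, weights=[0.0, 0.0, 0.0], bias=0.0)
loss = binary_cross_entropy(prob, label=1)
```

In a real training setup the loss would be backpropagated to update the audio, visual, and pretext subnetworks together, which this sketch leaves out.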
Electronic adaptation of stereophonic sound system to listener position or orientation (H04S7/301 takes precedence) · CPC title
of extracted features · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title