Classifying audio scene using synthetic image features

US2021216817A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2021216817-A1
Application number  US-202016844930-A
Country             US
Kind code           A1
Filing date         Apr 9, 2020
Priority date       Jan 14, 2020
Publication date    Jul 15, 2021
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computing system includes an encoder that receives an input image and encodes the input image into real image features, a decoder that decodes the real image features into a reconstructed image, a generator that receives first audio data corresponding to the input image and generates first synthetic image features from the first audio data, and receives second audio data and generates second synthetic image features from the second audio data, a discriminator that receives both the real and synthetic image features and determines whether a target feature is real or synthetic, and a classifier that classifies a scene of the second audio data based on the second synthetic image features.
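
The data flow in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the five components are stand-in Python functions rather than neural networks, and every name here (`AudioVisualPipeline`, `training_forward`, the `toy_*` helpers, the "street" and "park" labels) is hypothetical and does not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Features = List[float]


@dataclass
class AudioVisualPipeline:
    """Wires together the five components named in the abstract."""
    encoder: Callable[[Sequence[float]], Features]    # image -> real image features
    decoder: Callable[[Features], List[float]]        # image features -> image
    generator: Callable[[Sequence[float]], Features]  # audio -> synthetic image features
    discriminator: Callable[[Features], float]        # features -> score that they are real
    classifier: Callable[[Features], str]             # synthetic features -> scene label

    def training_forward(self, image, paired_audio, unpaired_audio) -> Dict[str, object]:
        real = self.encoder(image)                     # real image features
        reconstructed = self.decoder(real)             # reconstructed image
        synthetic_1 = self.generator(paired_audio)     # first synthetic image features
        synthetic_2 = self.generator(unpaired_audio)   # second synthetic image features
        return {
            "reconstructed": reconstructed,
            "p_real_for_real": self.discriminator(real),
            "p_real_for_synthetic": self.discriminator(synthetic_1),
            "scene": self.classifier(synthetic_2),
        }


# Toy stand-ins (assumptions for illustration only, not learned models).
def toy_encoder(image):
    return [sum(image) / len(image), max(image) - min(image)]


def toy_decoder(features):
    mean, spread = features
    return [mean - spread / 2, mean + spread / 2]


def toy_generator(audio):
    energy = sum(x * x for x in audio) / len(audio)
    return [energy, 2 * energy ** 0.5]


def toy_discriminator(features):
    # Score in (0, 1]; higher means "looks more like real image features".
    return 1.0 / (1.0 + abs(features[0] - 0.5))


def toy_classifier(features):
    # Hypothetical scene labels chosen only for this example.
    return "street" if features[0] > 0.25 else "park"


pipeline = AudioVisualPipeline(
    toy_encoder, toy_decoder, toy_generator, toy_discriminator, toy_classifier
)
result = pipeline.training_forward(
    image=[0.2, 0.8], paired_audio=[0.1, -0.1], unpaired_audio=[0.9, -0.7]
)
print(result["scene"])
```

Note that the discriminator sees only the real features and the first (paired) synthetic features, while the classifier sees only the second (unpaired) synthetic features, which mirrors the wording of the abstract.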

First claim

Opening claim text (preview).

1. A computing system comprising: a processor having associated memory storing instructions that cause the processor to execute, at training time, for each of a plurality of input images: an encoder configured to receive an input image of the plurality of input images and encode the input image into real image features; a decoder configured to receive from the encoder the real image features and decode the real image features into a reconstructed image; a generator configured to receive first audio data corresponding to the input image and generate first synthetic image features from the first audio data, and to receive second audio data and generate second synthetic image features from the second audio data; a discriminator configured to receive the real image features and first synthetic image features and to output a determination of whether a target feature is real or synthetic; and a classifier configured to receive the second synthetic image features and classify a scene of the second audio data based on the second synthetic image features.

2. The computing system of claim 1, wherein the decoder is further configured to construct a first synthetic image from the first synthetic image features and a second synthetic image from the second synthetic image features.

3. The computing system of claim 2, wherein the processor is further configured to loop through: training the encoder and the decoder to increase a correlation of each of the reconstructed image and the first synthetic image to the respective input image; training the generator, based on the determination output by the discriminator; and training the discriminator while the encoder is fixed.

4. The computing system of claim 3, wherein the processor is further configured to train the classifier while the encoder, decoder, generator, and discriminator are fixed.

5. The computing system of claim 1, wherein the first audio data corresponds to the input image in an audio-visual pair recorded together, the second audio data is not paired with an image, and the first audio data and the second audio data are recordings generated at substantially different geographical locations.

6. The computing system of claim 1, wherein the encoder, the decoder, the generator, the discriminator, and the classifier constitute an audio-visual generative adversarial network, the encoder and the decoder include vector quantized variational autoencoder architecture, and the classifier includes convolutional neural network (CNN) architecture.

7. The computing system of claim 1, wherein the processor is further configured to execute, at runtime: the generator, which is further configured to generate third synthetic image features from third audio data; and the classifier, which is further configured to classify a scene of the third audio data based on the third synthetic image features.

8. The computing system of claim 7, wherein the processor is further configured to, at runtime: execute the decoder, which is further configured to receive the third synthetic image features and construct a third synthetic image from the third synthetic image features; and display the third synthetic image as a background image of a participant in a video chat, the third synthetic image including generic features relating to the classified scene of the third audio data and lacking private identifying features of a real-world background of the participant.

9. The computing system of claim 7, wherein the processor is further configured to use the classified scene of the third audio data as a factor in authentication of a user.

10. The computing system of claim 7, wherein the processor is further configured to augment a navigation service based on comparing the classified scene of the third audio data to a scene of one or more known locations.

11. A method comprising, at a processor at training time of a neural network, for each of a plurality of input images: receiving an input image of the plurality of input images and encoding the input image into real image features; decoding the real image features into a reconstructed image; receiving first audio data corresponding to the input image and generating first synthetic image features from the first audio data, and receiving second audio data and generating second synthetic image features from the second audio data; outputting a determination of whether a target feature, of the real image features and first synthetic image features, is real or synthetic; and classifying a scene of the second audio data based on the second synthetic image features.

12. The method of claim 11, further comprising constructing a first synthetic image from the first synthetic image features and a second synthetic image from the second synthetic image features.

13. The method of claim 12, further comprising looping through: training an encoder and a decoder to increase a correlation of each of the reconstructed image and the first synthetic image to the respective input image; training a generator to create the first synthetic image features, based on the determination output by a discriminator; and training the discriminator while the encoder is fixed.

14. The method of claim 13, further comprising training a classifier to classify the scene while the encoder, decoder, generator, and discriminator are fixed.

15. The method of claim 14, wherein the encoder, the decoder, the generator, the discriminator, and the classifier constitute an audio-visual generative adversarial network, the encoder and the decoder include vector quantized variational autoencoder architecture, and the classifier includes convolutional neural network (CNN) architecture.

16. The method of claim 11, wherein the first audio data corresponds to the input image in an audio-visual pair recorded together, the second audio data is not paired with an image, and the first audio data and the second audio data are recordings generated at substantially different geographical locations.

17. The method of claim 11, further comprising at the processor, at runtime: generating third synthetic image features from third audio data; and classifying a scene of the third audio data based on the third synthetic image features.

18. The method of claim 17, further comprising, at runtime: constructing a third synthetic image from the third synthetic image features; and displaying the third synthetic image as a background image of a participant in a video chat, the third synthetic image including generic features relating to the classified scene of the third audio data and lacking private identifying features of a real-world background of the participant.

19. The method of claim 17, further comprising using the classified scene of the third audio data as a factor in authentication of a user.

20. A computing system comprising: a processor having associated memory storing: a discriminator configured to determine whether a target feature is real or synthetic; a generator having been trained on an audio-visual pair of image data and first audio data with the discriminator; a classifier having been trained on second audio data; and instructions that cause the processor to execute, at runtime: the generator configured to generate synthetic image features from third audio data; and the classifier configured to classify a scene of the third audio data based on the synthetic image features.
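
The training order described in claims 3 and 4 (and mirrored in claims 13 and 14) can be sketched as a phase schedule. This is a rough illustration, not the patented implementation. The claims only state that the encoder is fixed while the discriminator trains, and that the encoder, decoder, generator, and discriminator are fixed while the classifier trains. Freezing every non-trained component in the other phases, and running the classifier phase once after the loop, are assumptions made here for simplicity. All names (`training_schedule`, `ADVERSARIAL_LOOP`, and so on) are hypothetical.

```python
from typing import FrozenSet, Iterator, List, Tuple

COMPONENTS: FrozenSet[str] = frozenset(
    {"encoder", "decoder", "generator", "discriminator", "classifier"}
)

# (phase name, components whose weights are updated in that phase)
ADVERSARIAL_LOOP: List[Tuple[str, FrozenSet[str]]] = [
    ("autoencoder", frozenset({"encoder", "decoder"})),
    ("generator", frozenset({"generator"})),
    ("discriminator", frozenset({"discriminator"})),  # encoder is frozen here
]
CLASSIFIER_PHASE: Tuple[str, FrozenSet[str]] = ("classifier", frozenset({"classifier"}))

Phase = Tuple[int, str, FrozenSet[str], FrozenSet[str]]


def training_schedule(num_loops: int) -> Iterator[Phase]:
    """Yield (loop index, phase name, trainable set, frozen set) in order."""
    for loop_index in range(num_loops):
        for name, trainable in ADVERSARIAL_LOOP:
            # Assumption: everything not being trained in a phase is frozen.
            yield loop_index, name, trainable, COMPONENTS - trainable
    # Assumption: the classifier is trained once, after the loop finishes.
    name, trainable = CLASSIFIER_PHASE
    yield num_loops, name, trainable, COMPONENTS - trainable


for loop_index, phase, trainable, frozen in training_schedule(2):
    print(loop_index, phase, "train:", sorted(trainable), "frozen:", sorted(frozen))
```

In a real framework, each phase would toggle gradient updates on the listed components before running its optimizer steps; the sketch only records which components would be trainable or frozen.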

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G10L25/51 (Primary)

    for comparison or discrimination · CPC title

  • Scenes; Scene-specific elements (control of digital cameras H04N23/60) · CPC title

  • using neural networks · CPC title

  • Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN] · CPC title

  • using classification, e.g. of video objects · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021216817A1 cover?
A computing system includes an encoder that receives an input image and encodes the input image into real image features, a decoder that decodes the real image features into a reconstructed image, a generator that receives first audio data corresponding to the input image and generates first synthetic image features from the first audio data, and receives second audio data and generates second …
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G10L25/51. Mapped technology areas include Physics.
When was this patent published?
Published Jul 15, 2021 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).