US2021216817A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021216817-A1 |
| Application number | US-202016844930-A |
| Country | US |
| Kind code | A1 |
| Filing date | Apr 9, 2020 |
| Priority date | Jan 14, 2020 |
| Publication date | Jul 15, 2021 |
| Grant date | — |
Official abstract text for this publication.
A computing system includes an encoder that receives an input image and encodes the input image into real image features, a decoder that decodes the real image features into a reconstructed image, a generator that receives first audio data corresponding to the input image and generates first synthetic image features from the first audio data, and receives second audio data and generates second synthetic image features from the second audio data, a discriminator that receives both the real and synthetic image features and determines whether a target feature is real or synthetic, and a classifier that classifies a scene of the second audio data based on the second synthetic image features.
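The data flow among the five components named in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the class `ToyAudioVisualGan`, the function `training_forward_pass`, the constant `FEATURE_DIM`, and the use of random linear maps in place of trained networks are all assumptions made purely for illustration. The sketch only shows which component consumes which input and produces which output.

```python
import random

FEATURE_DIM = 4  # hypothetical size of the shared image-feature space


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights (stand-in for a trained network)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ToyAudioVisualGan:
    """Toy stand-ins for the encoder, decoder, generator, discriminator, classifier."""

    def __init__(self, image_dim, audio_dim, num_scenes, seed=0):
        rng = random.Random(seed)
        self.num_scenes = num_scenes
        self.enc_w = random_matrix(FEATURE_DIM, image_dim, rng)
        self.dec_w = random_matrix(image_dim, FEATURE_DIM, rng)
        self.gen_w = random_matrix(FEATURE_DIM, audio_dim, rng)
        self.disc_w = random_matrix(1, FEATURE_DIM, rng)
        self.cls_w = random_matrix(num_scenes, FEATURE_DIM, rng)

    def encode(self, image):
        """Input image -> real image features."""
        return linear(self.enc_w, image)

    def decode(self, features):
        """Image features -> reconstructed (or synthetic) image."""
        return linear(self.dec_w, features)

    def generate(self, audio):
        """Audio data -> synthetic image features."""
        return linear(self.gen_w, audio)

    def discriminate(self, features):
        """Return True if the target feature is judged real, False if synthetic."""
        return linear(self.disc_w, features)[0] > 0.0

    def classify(self, features):
        """Return the index of the highest-scoring scene class."""
        scores = linear(self.cls_w, features)
        return max(range(self.num_scenes), key=lambda i: scores[i])


def training_forward_pass(model, image, paired_audio, unpaired_audio):
    """Trace one training-time pass through all five components."""
    real = model.encode(image)
    first_synth = model.generate(paired_audio)     # audio recorded with the image
    second_synth = model.generate(unpaired_audio)  # audio with no paired image
    return {
        "real_features": real,
        "reconstructed_image": model.decode(real),
        "first_synthetic_features": first_synth,
        "real_verdict": model.discriminate(real),
        "synthetic_verdict": model.discriminate(first_synth),
        "scene": model.classify(second_synth),
    }
```

Calling `training_forward_pass(ToyAudioVisualGan(6, 3, 5), [0.1] * 6, [0.2] * 3, [0.3] * 3)` returns a dictionary with one entry per output described in the abstract. Note that the discriminator sees the real and first synthetic features, while the classifier sees only the second synthetic features.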
Claim text for this publication.
1. A computing system comprising: a processor having associated memory storing instructions that cause the processor to execute, at training time, for each of a plurality of input images: an encoder configured to receive an input image of the plurality of input images and encode the input image into real image features; a decoder configured to receive from the encoder the real image features and decode the real image features into a reconstructed image; a generator configured to receive first audio data corresponding to the input image and generate first synthetic image features from the first audio data, and to receive second audio data and generate second synthetic image features from the second audio data; a discriminator configured to receive the real image features and first synthetic image features and to output a determination of whether a target feature is real or synthetic; and a classifier configured to receive the second synthetic image features and classify a scene of the second audio data based on the second synthetic image features.
2. The computing system of claim 1, wherein the decoder is further configured to construct a first synthetic image from the first synthetic image features and a second synthetic image from the second synthetic image features.
3. The computing system of claim 2, wherein the processor is further configured to loop through: training the encoder and the decoder to increase a correlation of each of the reconstructed image and the first synthetic image to the respective input image; training the generator, based on the determination output by the discriminator; and training the discriminator while the encoder is fixed.
4. The computing system of claim 3, wherein the processor is further configured to train the classifier while the encoder, decoder, generator, and discriminator are fixed.
5. The computing system of claim 1, wherein the first audio data corresponds to the input image in an audio-visual pair recorded together, the second audio data is not paired with an image, and the first audio data and the second audio data are recordings generated at substantially different geographical locations.
6. The computing system of claim 1, wherein the encoder, the decoder, the generator, the discriminator, and the classifier constitute an audio-visual generative adversarial network, the encoder and the decoder include vector quantized variational autoencoder architecture, and the classifier includes convolutional neural network (CNN) architecture.
7. The computing system of claim 1, wherein the processor is further configured to execute, at runtime: the generator, which is further configured to generate third synthetic image features from third audio data; and the classifier, which is further configured to classify a scene of the third audio data based on the third synthetic image features.
8. The computing system of claim 7, wherein the processor is further configured to, at runtime: execute the decoder, which is further configured to receive the third synthetic image features and construct a third synthetic image from the third synthetic image features; and display the third synthetic image as a background image of a participant in a video chat, the third synthetic image including generic features relating to the classified scene of the third audio data and lacking private identifying features of a real-world background of the participant.
9. The computing system of claim 7, wherein the processor is further configured to use the classified scene of the third audio data as a factor in authentication of a user.
10. The computing system of claim 7, wherein the processor is further configured to augment a navigation service based on comparing the classified scene of the third audio data to a scene of one or more known locations.
11. A method comprising, at a processor at training time of a neural network, for each of a plurality of input images: receiving an input image of the plurality of input images and encoding the input image into real image features; decoding the real image features into a reconstructed image; receiving first audio data corresponding to the input image and generating first synthetic image features from the first audio data, and receiving second audio data and generating second synthetic image features from the second audio data; outputting a determination of whether a target feature, of the real image features and first synthetic image features, is real or synthetic; and classifying a scene of the second audio data based on the second synthetic image features.
12. The method of claim 11, further comprising constructing a first synthetic image from the first synthetic image features and a second synthetic image from the second synthetic image features.
13. The method of claim 12, further comprising looping through: training an encoder and a decoder to increase a correlation of each of the reconstructed image and the first synthetic image to the respective input image; training a generator to create the first synthetic image features, based on the determination output by a discriminator; and training the discriminator while the encoder is fixed.
14. The method of claim 13, further comprising training a classifier to classify the scene while the encoder, decoder, generator, and discriminator are fixed.
15. The method of claim 14, wherein the encoder, the decoder, the generator, the discriminator, and the classifier constitute an audio-visual generative adversarial network, the encoder and the decoder include vector quantized variational autoencoder architecture, and the classifier includes convolutional neural network (CNN) architecture.
16. The method of claim 11, wherein the first audio data corresponds to the input image in an audio-visual pair recorded together, the second audio data is not paired with an image, and the first audio data and the second audio data are recordings generated at substantially different geographical locations.
17. The method of claim 11, further comprising at the processor, at runtime: generating third synthetic image features from third audio data; and classifying a scene of the third audio data based on the third synthetic image features.
18. The method of claim 17, further comprising, at runtime: constructing a third synthetic image from the third synthetic image features; and displaying the third synthetic image as a background image of a participant in a video chat, the third synthetic image including generic features relating to the classified scene of the third audio data and lacking private identifying features of a real-world background of the participant.
19. The method of claim 17, further comprising using the classified scene of the third audio data as a factor in authentication of a user.
20. A computing system comprising: a processor having associated memory storing: a discriminator configured to determine whether a target feature is real or synthetic; a generator having been trained on an audio-visual pair of image data and first audio data with the discriminator; a classifier having been trained on second audio data; and instructions that cause the processor to execute, at runtime: the generator configured to generate synthetic image features from third audio data; and the classifier configured to classify a scene of the third audio data based on the synthetic image features.
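The staged training recited in claims 3, 4, 13, and 14 can be summarized as a schedule of phases, each listing which components are trained and which are held fixed. The Python sketch below is an illustration under stated assumptions, not the claimed implementation: the names `Phase`, `ADVERSARIAL_LOOP`, `CLASSIFIER_PHASE`, and `training_schedule` are hypothetical, the `num_loops` parameter is invented, and running the classifier phase after the loop is an assumed ordering that the claims do not state. The `fixed` field records only what the claims explicitly say is fixed.

```python
from typing import Iterator, NamedTuple

COMPONENTS = frozenset(
    {"encoder", "decoder", "generator", "discriminator", "classifier"}
)


class Phase(NamedTuple):
    name: str
    trained: frozenset  # components updated in this phase
    fixed: frozenset    # components the claims explicitly state are fixed


# Claims 3 and 13: the three steps that are looped through.
ADVERSARIAL_LOOP = (
    Phase("autoencoder", frozenset({"encoder", "decoder"}), frozenset()),
    Phase("generator", frozenset({"generator"}), frozenset()),
    Phase("discriminator", frozenset({"discriminator"}), frozenset({"encoder"})),
)

# Claims 4 and 14: the classifier is trained while the other four are fixed.
CLASSIFIER_PHASE = Phase(
    "classifier", frozenset({"classifier"}), COMPONENTS - {"classifier"}
)


def training_schedule(num_loops: int) -> Iterator[Phase]:
    """Yield phases in order: num_loops passes of the loop, then the classifier.

    Placing the classifier phase last is an assumption made for this sketch.
    """
    for _ in range(num_loops):
        yield from ADVERSARIAL_LOOP
    yield CLASSIFIER_PHASE


if __name__ == "__main__":
    for phase in training_schedule(num_loops=2):
        print(phase.name, "trains", sorted(phase.trained), "fixed", sorted(phase.fixed))
```

Keeping the trained and fixed sets as data makes the constraint easy to check: no phase should list the same component in both sets, and a real training loop could use the `fixed` set to decide which parameters to exclude from the optimizer.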
for comparison or discrimination · CPC title
Scenes; Scene-specific elements (control of digital cameras H04N23/60) · CPC title
using neural networks · CPC title
Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN] · CPC title
using classification, e.g. of video objects · CPC title