US2022012425A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022012425-A1 |
| Application number | US-202016926525-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 10, 2020 |
| Priority date | Jul 10, 2020 |
| Publication date | Jan 13, 2022 |
| Grant date | — |
Official abstract text for this publication.
Described herein are embodiments of a framework named total correlation variational autoencoder (TC_VAE) that disentangles syntax and semantics by making use of total correlation penalties of KL divergences. One or more Kullback-Leibler (KL) divergence terms in a loss for a variational autoencoder are decomposed so that generated hidden variables may be separated. Embodiments of the TC_VAE framework were evaluated on semantic similarity tasks and syntactic similarity tasks. Experimental results show that better disentanglement between syntactic and semantic representations has been achieved compared with state-of-the-art (SOTA) results on the same data sets in similar settings.
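The decomposition mentioned in the abstract can be illustrated with the form commonly used in the total-correlation VAE literature. The notation below (encoder q, prior p, latent dimensions z_j, input x) is an assumption borrowed from that literature and does not appear in this record; the identity holds when the prior factorizes as p(z) = ∏_j p(z_j), and the loss actually claimed may group or weight the terms differently.

```latex
\mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\Big]
  = \underbrace{I_q(x; z)}_{\text{mutual information}}
  + \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)}_{\text{total correlation (TC)}}
  + \underbrace{\sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
```

Penalizing the middle term pushes the aggregate posterior q(z) toward a product of its marginals, which is the sense in which a TC penalty encourages the latent variables to separate.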
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for segmenting latent representations comprising: generating, using an embedding layer, a sequence of embeddings for a sequence of tokens; generating, using an attention layer, a sequence of attention masks based on the sequence of embeddings; generating a sequence of hidden variables based on the sequence of embeddings and the sequence of attention masks; generating, using a first encoder and a second encoder respectively, a first sequence of latent variables and a second sequence of latent variables based on the sequence of hidden variables; and inferring, using a decoder, a sequence of reconstructed tokens and a sequence of reconstructed attention masks based on at least information of the first and second sequences of latent variables.
2. The computer-implemented method of claim 1 wherein each hidden variable in the sequence of hidden variables is generated by an element-wise multiplication between an embedding of the sequence of embeddings and a corresponding attention masks of the sequence of attention masks.
3. The computer-implemented method of claim 1 wherein inferring a sequence of reconstructed tokens and a sequence of reconstructed attention masks based on at least information of at least the first and second sequences of latent variables comprising: combining a sequence of global latent variables with the first sequence of latent variables and the second sequence of latent variables to generate a first sequence of combined latent variables and a second sequence of combined latent variables respectively; receiving, at the decoder, the first sequence of combined latent variables and the second sequence of combined latent variables; and inferring the sequence of reconstructed tokens and the sequence of reconstructed attention masks.
4. The computer-implemented method of claim 3 further comprising: using the sequence of reconstructed tokens and the sequence of reconstructed attention masks to establish a loss to train at least the attention layer, the first encoder, the second encoder, and the decoder.
5. The computer-implemented method of claim 4 wherein the loss comprises one or more total correlation (TC) terms to enforce disentanglement of latent variables.
6. The computer-implemented method of claim 5 wherein the one or more TC terms comprise a first Kullback-Leibler (KL) divergence for the first encoder and a second KL divergence for the second encoder.
7. The computer-implemented method of claim 6 wherein the first KL divergence is a KL divergence between a distribution of the first sequence of combined latent variables and a product of a factorial distribution for each latent variable in the first sequence of latent variables and a factorial distribution for each global latent variable in the first combined sequence, the second KL divergence is a KL divergence between a distribution of the second sequence of combined latent variables and a product of a factorial distribution for each latent variable in the second sequence of latent variables and a factorial distribution for each global latent variable in the second combined sequence.
8. The computer-implemented method of claim 1 wherein the first encoder is a semantic encoder, the second encoder is a syntax encoder.
9. A system for segmenting latent representations comprising: one or more processors; and a non-transitory computer-readable medium or media comprising one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: generating a sequence of embeddings for a sequence of tokens; generating a sequence of attention masks based on the sequence of embeddings; generating a sequence of hidden variables based on the sequence of embeddings with the sequence of attention masks; generating respectively a first sequence of latent variables and a second sequence of latent variables based on the sequence of hidden variables; and inferring a sequence of reconstructed tokens and a sequence of reconstructed attention masks based on at least information of the first and second sequences of latent variables.
10. The system of claim 9 wherein each hidden variable in the sequence of hidden variables is generated by an element-wise multiplication between an embedding of the sequence of embeddings and a corresponding attention masks of the sequence of attention masks.
11. The system of claim 9 wherein inferring a sequence of reconstructed tokens and a sequence of reconstructed attention masks based on at least information of at least the first and second sequences of latent variables comprises steps of: combining a sequence of global latent variables with the first sequence of latent variables and the second sequence of latent variables to generate a first sequence of combined latent variables and a second sequence of combined latent variables respectively; receiving, at the decoder, the first sequence of combined latent variables and the second sequence of combined latent variables; and inferring the sequence of reconstructed tokens and the sequence of reconstructed attention masks.
12. The system of claim 11 wherein the non-transitory computer-readable medium or media further comprises one or more sets of instructions which, when executed by at least one of the one or more processors, causes steps to be performed comprising: using the sequence of reconstructed tokens and the sequence of reconstructed attention masks to establish a loss for system training.
13. The system of claim 12 wherein the loss comprises a total correlation (TC) terms to enforce disentanglement of latent variables, the one or more TC terms comprise a first Kullback-Leibler (KL) divergence for the first encoder and a second KL divergence for the second encoder.
14. The system of claim 13 wherein the first KL divergence is a KL divergence between a distribution of the first sequence of combined latent variables and a product of a factorial distribution for each latent variable in the first sequence of latent variables and a factorial distribution for each global latent variable in the first combined sequence, the second KL divergence is a KL divergence between a distribution of the second sequence of combined latent variables and a product of a factorial distribution for each latent variable in the second sequence of latent variables and a factorial distribution for each global latent variable in the second combined sequence.
15. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by at least one processor, causes steps for segmenting latent representations comprising: generating, using an embedding layer, a sequence of embeddings for a sequence of tokens; generating, using an attention layer, a sequence of attention masks based on the sequence of embeddings; generating a sequence of hidden variables based on the sequence of embeddings with the sequence of attention masks; generating, using a first encoder and a second encoder respectively, a first sequence of latent variables and a second sequence of latent variables based on the sequence of hidden variables; and inferring, using a decoder, a sequence of reconstructed tokens and a sequence of reconstructed attention masks based on at least information of at least the first and second sequences of latent variables.
16. The non-transitory computer-readable medium or media of claim 15 wherein each hidden variable in the sequence of hidden va
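
The data flow recited in the independent claims (embeddings, attention masks, element-wise masking, two encoders, one decoder) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the layer sizes, the sigmoid attention mask, the linear Gaussian encoders with reparameterization, the linear decoder, and every name in the code (`forward`, `encode`, `init_params`, the weight keys) are hypothetical choices made only for this example. The global latent variables and the training loss from the dependent claims are left out.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
VOCAB, EMB, LATENT, SEQ = 50, 16, 8, 6
rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng):
    """Random weights standing in for trained layers (assumption)."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "embed": w(VOCAB, EMB),        # embedding layer
        "attn": w(EMB, EMB),           # attention layer
        "sem_mu": w(EMB, LATENT), "sem_logvar": w(EMB, LATENT),  # first encoder
        "syn_mu": w(EMB, LATENT), "syn_logvar": w(EMB, LATENT),  # second encoder
        "dec_tok": w(2 * LATENT, VOCAB),   # decoder head for tokens
        "dec_mask": w(2 * LATENT, EMB),    # decoder head for attention masks
    }


def encode(hidden, w_mu, w_logvar, rng):
    """Linear Gaussian encoder with the reparameterization trick (assumed form)."""
    mu, logvar = hidden @ w_mu, hidden @ w_logvar
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)


def forward(tokens, p, rng):
    emb = p["embed"][tokens]                 # sequence of embeddings, (SEQ, EMB)
    mask = sigmoid(emb @ p["attn"])          # sequence of attention masks, (SEQ, EMB)
    hidden = emb * mask                      # element-wise multiplication (claim 2)
    z_sem = encode(hidden, p["sem_mu"], p["sem_logvar"], rng)  # first latent sequence
    z_syn = encode(hidden, p["syn_mu"], p["syn_logvar"], rng)  # second latent sequence
    z = np.concatenate([z_sem, z_syn], axis=-1)  # decoder sees both latent sequences
    recon_tokens = (z @ p["dec_tok"]).argmax(axis=-1)  # reconstructed tokens, (SEQ,)
    recon_mask = sigmoid(z @ p["dec_mask"])            # reconstructed masks, (SEQ, EMB)
    return {"emb": emb, "mask": mask, "hidden": hidden, "z_sem": z_sem,
            "z_syn": z_syn, "recon_tokens": recon_tokens, "recon_mask": recon_mask}


params = init_params(rng)
tokens = rng.integers(0, VOCAB, size=SEQ)    # a toy sequence of token ids
out = forward(tokens, params, rng)
print({name: value.shape for name, value in out.items()})
```

With untrained random weights the reconstructions are meaningless; the sketch only shows which tensor feeds which step and what shapes result.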
Combinations of networks · CPC title
Probabilistic or stochastic networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Generative networks · CPC title