Topical vector-quantized variational autoencoders for extractive summarization of video transcripts

US12147771B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12147771-B2
Application number  US-202117361878-A
Country             US
Kind code           B2
Filing date         Jun 29, 2021
Priority date       Jun 29, 2021
Publication date    Nov 19, 2024
Grant date          Nov 19, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

System and methods for a text summarization system are described. In one example, a text summarization system receives an input utterance and determines whether the utterance should be included in a summary of the text. The text summarization system includes an embedding network, a convolution network, an encoding component, and a summary component. The embedding network generates a semantic embedding of an utterance. The convolution network generates a plurality of feature vectors based on the semantic embedding. The encoding component identifies a plurality of latent codes respectively corresponding to the plurality of feature vectors. The summary component identifies a prominent code among the latent codes and to select the utterance as a summary utterance based on the prominent code.
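The encoding step described in the abstract can be pictured as a nearest-neighbor lookup: each feature vector is mapped to the index of its closest codebook entry. The following is a minimal sketch under stated assumptions, not the patented implementation. It uses Euclidean distance, which the claims name as one option, and the function names `nearest_code` and `quantize`, the two-dimensional vectors, and the three-entry codebook are all hypothetical values chosen for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to `feature`.

    Closeness is measured with Euclidean distance (an assumption for
    this sketch; other similarity measures would also fit the abstract).
    """
    return min(range(len(codebook)), key=lambda k: math.dist(feature, codebook[k]))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map every feature vector to the index of its nearest latent code."""
    return [nearest_code(f, codebook) for f in features]


# Toy data: a hypothetical 3-entry codebook in a 2-dimensional space.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical feature vectors, standing in for the convolution network output.
features = [(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]

codes = quantize(features, codebook)
print(codes)  # [1, 2, 0]
```

In this toy run each utterance would be represented by the list of code indices rather than by continuous vectors, which is what makes counting codes possible in the later summary step.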

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving text including an utterance; generating a semantic embedding of the utterance using an embedding network; generating a plurality of feature vectors based on the semantic embedding using a convolution network; identifying a first plurality of latent codes respectively corresponding to the plurality of feature vectors by identifying a closest latent code from a second plurality of latent codes of a codebook to each corresponding feature vector of the plurality of feature vectors, wherein the second plurality of latent codes of the codebook discretizes a semantic space based on a number of dimensions of the semantic space, and wherein the closest latent code is identified by computing a similarity between the closest latent code and the corresponding feature vector; identifying a prominent code among the first plurality of latent codes; and generating an indication that the utterance is a summary utterance based on the prominent code.

2. The method of claim 1, further comprising: receiving audio information; and converting the audio information to produce the text.

3. The method of claim 2, wherein: the audio information is received in a streaming format, and the utterance is selected as the summary utterance in real time.

4. The method of claim 2, further comprising: receiving video information; and identifying the audio information from the video information.

5. The method of claim 1, further comprising: identifying a plurality of summary utterances for the text; and generating a summary for the text based on the plurality of summary utterances.

6. The method of claim 1, further comprising: appending a sentence tag to the utterance, wherein the semantic embedding of the utterance corresponds to an output of the embedding network corresponding to the sentence tag.

7. The method of claim 1, wherein: a number of the latent codes in the second plurality of latent codes of the codebook is equal to a number of dimensions of the semantic embedding.

8. The method of claim 1, wherein: a number of dimensions of the first plurality of latent codes is equal to a number of filters of the convolution network.

9. The method of claim 1, further comprising: computing a Euclidean distance between each of the feature vectors and each of the second plurality of latent codes from the codebook, wherein the closest latent code is identified based on the Euclidean distance.

10. The method of claim 1, further comprising: identifying a plurality of text segments in the text; identifying a frequency for each latent code of the second plurality of latent codes from the codebook in each of the text segments; and identifying a set of prominent codes based on the frequency, wherein the prominent code is an element of the set of prominent codes.

11. The method of claim 10, further comprising: identifying a most frequent code from each of the text segments, wherein the set of prominent codes includes the most frequent code from each of the text segments.

12. The method of claim 10, further comprising: identifying a set of segment codes associated with a text segment associated with a predetermined location within the text; and refraining from including the set of segment codes in the set of prominent codes based on the association with the text segment, wherein the set of prominent codes includes the prominent code.

13. An apparatus comprising: an embedding network configured to generate a semantic embedding of an utterance; a convolution network generates a plurality of feature vectors based on the semantic embedding; and an encoding component configured to identify a plurality of first latent codes respectively corresponding to the plurality of feature vectors by identifying a closest latent code from a second plurality of latent codes of a codebook to each corresponding feature vector of the plurality of feature vectors, wherein the second plurality of latent codes discretizes a semantic space based on a number of dimensions of the semantic space, and wherein the closest latent code is identified by computing a similarity between the closest latent code and the corresponding feature vector; and a summary component configured to identify a prominent code among the first plurality of latent codes and to select the utterance as a summary utterance based on the prominent code.

14. The apparatus of claim 13, further comprising: an audio converter configured to receive audio information and convert the audio information to text, wherein the utterance is identified from the text.

15. The apparatus of claim 13, further comprising: a user interface configured to display the summary utterance.

16. The apparatus of claim 13, wherein: the summary component is further configured to generate a summary for a text based on the summary utterance.

17. A method of training a neural network, the method comprising: receiving a training set including an input utterance; generating a semantic embedding of the input utterance using an embedding network; generating a plurality of feature vectors based on the semantic embedding using a convolution network; identifying a first plurality of latent codes respectively corresponding to the plurality of feature vectors by identifying a closest latent code from a second plurality of latent codes of a codebook to each corresponding feature vector of the plurality of feature vectors, wherein the second plurality of latent codes discretizes a semantic space based on a number of dimensions of the semantic space, and wherein the closest latent code is identified by computing a similarity between the closest latent code and the corresponding feature vector; generating an output embedding based on the first plurality of latent codes using a convolutional decoder; generating an output text based on the output embedding; computing an autoencoder loss by comparing the input utterance and the output text; and updating parameters of the convolution network based on the autoencoder loss.

18. The method of claim 17, further comprising: computing a codebook loss by comparing each of the plurality of feature vectors with a corresponding latent code from the first plurality of latent codes, wherein the parameters are updated based on the codebook loss.

19. The method of claim 18, wherein: the codebook loss is based on a stop-gradient operator on the each of the plurality of feature vectors, a corresponding latent code from the plurality of latent codes, or both.

20. The method of claim 17, further comprising: updating the codebook based on the autoencoder loss.
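The frequency-based selection in claims 10 to 12 can be illustrated with a small counting example. This is a hedged sketch rather than the claimed method itself. It assumes every utterance has already been reduced to a list of codebook indices, takes the most frequent code of each segment as that segment's prominent code, and skips segments at chosen positions as a simplified reading of claim 12. The rule in `is_summary_utterance` (an utterance qualifies when its own most frequent code is prominent) is an assumption made for illustration, since the claims do not spell out the exact test. All function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Set

# segments[s][u] is the list of latent-code indices for utterance u of segment s.
Segments = Sequence[Sequence[Sequence[int]]]


def prominent_codes(segments: Segments, skip_segments: Iterable[int] = ()) -> Set[int]:
    """Collect the most frequent code of each segment, skipping chosen segments."""
    skipped = set(skip_segments)
    prominent: Set[int] = set()
    for index, utterances in enumerate(segments):
        if index in skipped:
            continue  # e.g. a segment at a predetermined location in the text
        counts = Counter(code for utterance in utterances for code in utterance)
        if counts:
            prominent.add(counts.most_common(1)[0][0])
    return prominent


def is_summary_utterance(utterance_codes: Sequence[int], prominent: Set[int]) -> bool:
    """Assumed rule: keep the utterance if its dominant code is prominent."""
    if not utterance_codes:
        return False
    dominant = Counter(utterance_codes).most_common(1)[0][0]
    return dominant in prominent


# Toy transcript: three segments of two utterances each (hypothetical codes).
segments = [
    [[3, 3, 1], [3, 2, 3]],  # segment 0: code 3 dominates
    [[5, 5, 4], [4, 5, 5]],  # segment 1: code 5 dominates
    [[7, 7, 7], [7, 0, 7]],  # segment 2: skipped in this example
]

prominent = prominent_codes(segments, skip_segments={2})
print(sorted(prominent))                           # [3, 5]
print(is_summary_utterance([3, 3, 1], prominent))  # True
print(is_summary_utterance([7, 0, 7], prominent))  # False
```

Skipping a whole segment is the simplest way to keep its codes out of the prominent set in this toy setup; a fuller treatment would also need to decide what happens when a skipped segment's code is the most frequent code elsewhere.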

Assignees

Adobe Inc

Inventors

Classifications

  • Recognition of textual entities · CPC title

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Semantic analysis · CPC title

  • G06F40/35 · Primary

    Discourse or dialogue representation · CPC title

  • G06F40/216 · Primary

    using statistical methods · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12147771B2 cover?
System and methods for a text summarization system are described. In one example, a text summarization system receives an input utterance and determines whether the utterance should be included in a summary of the text. The text summarization system includes an embedding network, a convolution network, an encoding component, and a summary component. The embedding network generates a semantic em…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/35. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 19, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).