Dense video captioning
US-10542270-B2 · Jan 21, 2020 · US
US12167100B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12167100-B2 |
| Application number | US-202017292627-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 23, 2020 |
| Priority date | Mar 21, 2019 |
| Publication date | Dec 10, 2024 |
| Grant date | Dec 10, 2024 |
Official abstract text for this publication.
Embodiments of the present disclosure provide a method, an apparatus, a device, and a medium for generating captioning information of multimedia data. The method includes extracting characteristic information of multimedia data to be processed, wherein the multimedia data comprises a video or an image; and generating a text caption of the multimedia data based on the extracted characteristic information. According to the method provided in the embodiments of the present disclosure, the accuracy of the generated text caption of the multimedia data can be effectively improved.
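The two steps named in the abstract, together with the length constraint that claim 1 adds (determine an application scenario, look up a caption length for it, generate the caption under that length), can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and does not come from the patent: the scenario names, the word limits, the rule used to pick a scenario, and the helper names `extract_features`, `determine_scenario`, `generate_caption`, and `caption_media`. The "features" are pre-annotated labels standing in for the output of a real vision model.

```python
from dataclasses import dataclass

# Hypothetical scenarios and word limits. The claim only states that each
# application scenario corresponds to a caption length.
SCENARIO_MAX_WORDS = {
    "notification": 5,
    "album_search": 12,
    "accessibility": 30,
}


@dataclass
class Features:
    labels: list[str]  # stand-in for extracted characteristic information
    is_video: bool


def extract_features(media: dict) -> Features:
    """Toy extractor: reads pre-annotated labels instead of running a model."""
    return Features(labels=list(media["labels"]), is_video=media["kind"] == "video")


def determine_scenario(media: dict) -> str:
    """Toy scenario analysis based on media kind and content richness (assumed rule)."""
    if media["kind"] == "video":
        return "accessibility"
    if len(media["labels"]) > 2:
        return "album_search"
    return "notification"


def generate_caption(features: Features, max_words: int) -> str:
    """Template-based caption, truncated to the length chosen for the scenario."""
    words = ["A", "video" if features.is_video else "photo", "showing"]
    for i, label in enumerate(features.labels):
        if i > 0:
            words.append("and")
        words.extend(label.split())
    return " ".join(words[:max_words])


def caption_media(media: dict) -> str:
    features = extract_features(media)
    scenario = determine_scenario(media)
    max_words = SCENARIO_MAX_WORDS[scenario]  # the "length information"
    return generate_caption(features, max_words)


print(caption_media({"kind": "image", "labels": ["a dog", "a red ball"]}))
# prints "A photo showing a dog"
```

A real system would replace the template with a learned decoder conditioned on the length. The sketch only shows how the scenario lookup sits between feature extraction and generation.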
Opening claim text (preview).
The invention claimed is:
1. A method for generating captioning information of multimedia data, comprising: extracting characteristic information of multimedia data to be processed, wherein the multimedia data comprises a video or an image; and generating a text caption of the multimedia data based on the extracted characteristic information, wherein the generating the text caption of the multimedia data based on the extracted characteristic information comprises: determining a first application scenario for the multimedia data by analyzing the multimedia data; obtaining length information of the text caption to be generated for the first application scenario, wherein the length information indicates at least one length of a plurality of lengths that respectively correspond to a plurality of application scenarios including the first application scenario; and generating the text caption based on the length information and the extracted characteristic information.
2. The method of claim 1, wherein the extracting characteristic information of the multimedia data to be processed comprises at least one of the following: extracting local visual features of targets contained in respective target regions of each image in the multimedia data; extracting semantic features of the multimedia data; extracting spatial-temporal visual features of the multimedia data when the multimedia data is a video; extracting global visual features of the multimedia data; extracting attribute features of the targets contained in the respective target regions of each image in the multimedia data; and extracting global attribute features of each image in the multimedia data.
3. The method of claim 2, wherein the characteristic information comprises the local visual features of the targets contained in respective target regions in each image of the multimedia data, and the generating the text caption of the multimedia data based on the extracted characteristic information, comprising: obtaining relationship features between the targets based on the local visual features of each target in the image; constructing a scene graph of the image based on the local visual features and the relationship features; obtaining graph convolution features of the image based on the scene graph of the image; and generating the text caption of the multimedia data based on the graph convolution features of each image of the multimedia data.
4. The method of claim 3, wherein the scene graph comprises a plurality of nodes and a plurality of edges, wherein one node represents a local visual feature of one target, and each of the plurality of edges represents the relationship feature between two connected nodes.
5. The method of claim 3, wherein the characteristic information comprises the attribute features of the targets contained in respective target regions of each image in the multimedia data; the constructing of the scene graph of the image based on the local visual features and the relationship features comprises: constructing the scene graph of the image based on the local visual features of each target, the relationship features between the targets, and the attribute features of each target, wherein one node in the scene graph represents the local visual features or attribute features of one target.
6. The method of claim 3, wherein, when the multimedia data is the video, the images of the multimedia data are a plurality of frames selected from the video, and when the target regions of two adjacent frames comprise the same targets, the scene graphs of the two adjacent frames have temporal edges between the nodes corresponding to the same target.
7. The method of claim 3, wherein the obtaining the graph convolution features of the image based on the scene graph of the image comprises: obtaining a target dimension of feature vector by encoding nodes and edges in the scene graph; and obtaining the graph convolution features by using a graph convolution network based on the obtained feature vector.
8. The method of claim 2, wherein when the characteristic information of the multimedia data comprises at least two of the local visual feature, the semantic feature, the spatial-temporal visual feature, and the global feature, the generating the text caption of the multimedia data based on the extracted characteristic information comprises: determining weights of each characteristic information; weighting each characteristic information based on the weights of each characteristic information; and generating the text caption of the multimedia data based on the weighted characteristic information.
9. The method of claim 2, wherein the generating the text caption of the multimedia data based on the extracted characteristic information comprises: encoding the obtained characteristic information by using self-attention-based encoder; inputting the encoded characteristic information to a decoder to generate the text caption of the multimedia data; wherein when the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder; when the multimedia data is a video, the self-attention-based encoder comprises a self-attention-based intra-frame encoder and/or a self-attention-based inter-frame encoder.
10. The method of claim 1, wherein the generating the text caption of the multimedia data based on the extracted characteristic information comprises: inputting the extracted characteristic information into a plurality of decoders, respectively; and generating the text caption of the multimedia data based on decoding results of the decoders.
11. The method of claim 1, wherein the text caption of the multimedia data is generated through a multimedia data captioning model, wherein the multimedia data captioning model is obtained by training in the following manner: obtaining training samples, wherein the training samples comprise a first sample multimedia data with captioning labels; training an initial captioning model based on the first sample multimedia data until a model loss function converges; and taking the trained captioning model as the multimedia data captioning model.
12. The method of claim 11, wherein the training samples further comprise a second sample multimedia data without the captioning labels, and the model loss function comprises a first loss function and a second loss function; the training the initial captioning model based on the first sample multimedia data until the model loss function converges comprises: training a preset captioning model based on the first sample multimedia data to obtain a value of the first loss function, and training the captioning model based on the second sample multimedia data to obtain a value of the second loss function; obtaining a value of the final loss function based on the value of the first loss function and the value of the second loss function; and training the captioning model based on the value of the final loss function until the final loss function converges.
13. An apparatus for generating captioning information of multimedia data, comprising: a memory storing one or more instructions; and a processor configured to execute the one or more instructions stored in the memory to: extract characteristic information of multimedia data to be processed, wherein the multimedia data comprises a video or an image; determine a first application scenario for the multimedia data by analyzing the multimedia data; obtain length information of the text caption to be generated for the first application scenario, wherein the length information indicates at least one length
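Claims 3, 4, and 7 describe building a scene graph whose nodes carry local visual features of targets and whose edges link related targets, encoding it to a target dimension, and passing it through a graph convolution network. The following is a minimal sketch of that flow in plain Python, under several assumptions that are not taken from the patent: edges are reduced to unweighted, undirected links (the claimed edges carry relationship features), each node gets a self-loop, the symmetric degree normalization common in the graph convolution literature is used, and the weight matrices `w_encode` and `w_gcn` are hand-picked rather than learned. The function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(num_nodes, edges):
    """Adjacency with self-loops, scaled by 1 / sqrt(deg_i * deg_j)."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)] for i in range(num_nodes)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0  # assumption: undirected, unweighted edges
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


def graph_convolution(node_features, edges, w_encode, w_gcn):
    """Encode nodes to the target dimension, then apply one graph convolution layer."""
    encoded = matmul(node_features, w_encode)
    a_hat = normalized_adjacency(len(node_features), edges)
    mixed = matmul(matmul(a_hat, encoded), w_gcn)
    return [[max(v, 0.0) for v in row] for row in mixed]  # ReLU


# Three detected targets with 2-dimensional toy features; target 1 is related
# to both target 0 and target 2.
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights for illustration

features = graph_convolution(node_features, edges, identity, identity)
print(len(features), len(features[0]))  # prints "3 2"
```

Each output row mixes a target's own feature with those of its neighbours in the scene graph, which is the property a downstream caption decoder would rely on. Handling the temporal edges of claim 6 would amount to adding links between nodes of adjacent frames before normalization.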
Adversarial learning · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Generative networks · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title