What technology area does this patent fall under?

Primary CPC classification H04N21/4884. Mapped technology areas include Electricity.

When was this patent published?

Publication date: Dec 10, 2024 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method, apparatus, device and medium for generating captioning information of multimedia data

US12167100B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12167100-B2
Application number	US-202017292627-A
Country	US
Kind code	B2
Filing date	Mar 23, 2020
Priority date	Mar 21, 2019
Publication date	Dec 10, 2024
Grant date	Dec 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the present disclosure provide a method, an apparatus, a device, and a medium for generating captioning information of multimedia data. The method includes extracting characteristic information of multimedia data to be processed, wherein the multimedia data comprises a video or an image; and generating a text caption of the multimedia data based on the extracted characteristic information. According to the method provided in the embodiments of the present disclosure, the accuracy of the generated text caption of the multimedia data can be effectively improved.

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for generating captioning information of multimedia data, comprising: extracting characteristic information of multimedia data to be processed, wherein the multimedia data comprises a video or an image; and generating a text caption of the multimedia data based on the extracted characteristic information, wherein the generating the text caption of the multimedia data based on the extracted characteristic information comprises: determining a first application scenario for the multimedia data by analyzing the multimedia data; obtaining length information of the text caption to be generated for the first application scenario, wherein the length information indicates at least one length of a plurality of lengths that respectively correspond to a plurality of application scenarios including the first application scenario; and generating the text caption based on the length information and the extracted characteristic information.

2. The method of claim 1, wherein the extracting characteristic information of the multimedia data to be processed comprises at least one of the following: extracting local visual features of targets contained in respective target regions of each image in the multimedia data; extracting semantic features of the multimedia data; extracting spatial-temporal visual features of the multimedia data when the multimedia data is a video; extracting global visual features of the multimedia data; extracting attribute features of the targets contained in the respective target regions of each image in the multimedia data; and extracting global attribute features of each image in the multimedia data.

3. The method of claim 2, wherein the characteristic information comprises the local visual features of the targets contained in respective target regions in each image of the multimedia data, and the generating the text caption of the multimedia data based on the extracted characteristic information, comprising: obtaining relationship features between the targets based on the local visual features of each target in the image; constructing a scene graph of the image based on the local visual features and the relationship features; obtaining graph convolution features of the image based on the scene graph of the image; and generating the text caption of the multimedia data based on the graph convolution features of each image of the multimedia data.

4. The method of claim 3, wherein the scene graph comprises a plurality of nodes and a plurality of edges, wherein one node represents a local visual feature of one target, and each of the plurality of edges represents the relationship feature between two connected nodes.

5. The method of claim 3, wherein the characteristic information comprises the attribute features of the targets contained in respective target regions of each image in the multimedia data; the constructing of the scene graph of the image based on the local visual features and the relationship features comprises: constructing the scene graph of the image based on the local visual features of each target, the relationship features between the targets, and the attribute features of each target, wherein one node in the scene graph represents the local visual features or attribute features of one target.

6. The method of claim 3, wherein, when the multimedia data is the video, the images of the multimedia data are a plurality of frames selected from the video, and when the target regions of two adjacent frames comprise the same targets, the scene graphs of the two adjacent frames have temporal edges between the nodes corresponding to the same target.

7. The method of claim 3, wherein the obtaining the graph convolution features of the image based on the scene graph of the image comprises: obtaining a target dimension of feature vector by encoding nodes and edges in the scene graph; and obtaining the graph convolution features by using a graph convolution network based on the obtained feature vector.

8. The method of claim 2, wherein when the characteristic information of the multimedia data comprises at least two of the local visual feature, the semantic feature, the spatial-temporal visual feature, and the global feature, the generating the text caption of the multimedia data based on the extracted characteristic information comprises: determining weights of each characteristic information; weighting each characteristic information based on the weights of each characteristic information; and generating the text caption of the multimedia data based on the weighted characteristic information.

9. The method of claim 2, wherein the generating the text caption of the multimedia data based on the extracted characteristic information comprises: encoding the obtained characteristic information by using self-attention-based encoder; inputting the encoded characteristic information to a decoder to generate the text caption of the multimedia data; wherein when the multimedia data is an image, the self-attention-based encoder is a self-attention-based intra-frame encoder; when the multimedia data is a video, the self-attention-based encoder comprises a self-attention-based intra-frame encoder and/or a self-attention-based inter-frame encoder.

10. The method of claim 1, wherein the generating the text caption of the multimedia data based on the extracted characteristic information comprises: inputting the extracted characteristic information into a plurality of decoders, respectively; and generating the text caption of the multimedia data based on decoding results of the decoders.

11. The method of claim 1, wherein the text caption of the multimedia data is generated through a multimedia data captioning model, wherein the multimedia data captioning model is obtained by training in the following manner: obtaining training samples, wherein the training samples comprise a first sample multimedia data with captioning labels; training an initial captioning model based on the first sample multimedia data until a model loss function converges; and taking the trained captioning model as the multimedia data captioning model.

12. The method of claim 11, wherein the training samples further comprise a second sample multimedia data without the captioning labels, and the model loss function comprises a first loss function and a second loss function; the training the initial captioning model based on the first sample multimedia data until the model loss function converges comprises: training a preset captioning model based on the first sample multimedia data to obtain a value of the first loss function, and training the captioning model based on the second sample multimedia data to obtain a value of the second loss function; obtaining a value of the final loss function based on the value of the first loss function and the value of the second loss function; and training the captioning model based on the value of the final loss function until the final loss function converges.

13. An apparatus for generating captioning information of multimedia data, comprising: a memory storing one or more instructions; and a processor configured to execute the one or more instructions stored in the memory to: extract characteristic information of multimedia data to be processed, wherein the multimedia data comprises a video or an image; determine a first application scenario for the multimedia data by analyzing the multimedia data; obtain length information of the text caption to be generated for the first application scenario, wherein the length information indicates at least one length
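Illustrative sketch (not part of the patent text). The scene graph described in claims 3, 4, and 6 can be pictured as a small data structure: nodes for detected targets, labeled edges for relationships within a frame, and temporal edges linking the same target across two adjacent frames. The Python below is a minimal sketch under simplifying assumptions: the names Node, SceneGraph, build_frame_graph, and link_adjacent_frames are hypothetical, plain strings stand in for the local visual features and relationship features, and matching targets by name is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """One target in one frame (a string stands in for its visual feature)."""
    frame: int
    target: str


@dataclass
class SceneGraph:
    """Nodes plus (source, label, destination) edges for a single frame."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)


def build_frame_graph(frame, relations):
    """Build a graph from (subject, predicate, object) triples of one frame."""
    graph = SceneGraph()
    for subject, predicate, obj in relations:
        src, dst = Node(frame, subject), Node(frame, obj)
        graph.nodes.update((src, dst))
        graph.edges.append((src, predicate, dst))
    return graph


def link_adjacent_frames(prev, curr):
    """Return temporal edges joining nodes that represent the same target
    in two adjacent frames (targets are matched by name in this sketch)."""
    prev_by_target = {node.target: node for node in prev.nodes}
    return [
        (prev_by_target[node.target], "temporal", node)
        for node in sorted(curr.nodes, key=lambda n: n.target)
        if node.target in prev_by_target
    ]
```

Calling build_frame_graph once per selected frame and then link_adjacent_frames on each consecutive pair gives one graph per frame plus the cross-frame edges; a real system would presumably store feature vectors on nodes and edges before passing them to a graph convolution network, as claim 7 describes.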

Assignees

Samsung Electronics Co Ltd

Inventors

Classifications

G06N3/094
Adversarial learning · CPC title
G06N3/0455
Auto-encoder networks; Encoder-decoder networks · CPC title
G06N3/0464
Convolutional networks [CNN, ConvNet] · CPC title
G06N3/0475
Generative networks · CPC title
G06N3/0895
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

Patent family

Related publications grouped by family.

View patent family 72521146

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12167100B2 cover?: Embodiments of the present disclosure provide a method, an apparatus, a device, and a medium for generating captioning information of multimedia data. The method includes extracting characteristic information of multimedia data to be processed, wherein the multimedia data comprises a video or an image; and generating a text caption of the multimedia data based on the extracted characteristic in…
Who is the assignee on this patent?: Samsung Electronics Co Ltd