What technology area does this patent fall under?

Primary CPC classification G06V20/41. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jan 13, 2022 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Spatial-temporal graph-to-sequence learning based grounded video descriptions

US2022012499A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2022012499-A1
Application number	US-202016926124-A
Country	US
Kind code	A1
Filing date	Jul 10, 2020
Priority date	Jul 10, 2020
Publication date	Jan 13, 2022
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for generating a grounded video description for a video input are provided. Hierarchical Attention based Spatial-Temporal Graph-to-Sequence Learning framework for producing a GVD is provided by generating an initial graph representing a plurality of object features in a plurality of frames of a received video input and generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function. The initial graph and the implicit graph are combined to form a refined graph and the refined graph is processed using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames. The grounded video description is generated for the received video input using at least the hierarchical graph of the plurality of features.
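
The graph-building steps in the abstract can be illustrated with a small sketch. This is not the patent's implementation: cosine similarity stands in for the unspecified "similarity function", and the `threshold` and `mix` parameters, the function names, and the toy feature values are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def implicit_graph(features, threshold=0.5):
    """Adjacency matrix from pairwise similarity; edges below threshold are dropped.

    Assumption: cosine similarity and a hard threshold are illustrative choices.
    """
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s = cosine(features[i], features[j])
                adj[i][j] = s if s >= threshold else 0.0
    return adj

def combine(initial, implicit, mix=0.5):
    """Refined graph as a convex combination of two adjacency matrices.

    Assumption: the mixing weight `mix` is hypothetical, not taken from the patent.
    """
    return [[mix * a + (1.0 - mix) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(initial, implicit)]

# Three toy object features (hypothetical values); objects 0 and 1 look alike.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical initial graph with a single undirected edge between objects 0 and 2.
initial = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
refined = combine(initial, implicit_graph(features))
```

In this toy case the refined graph keeps the explicit edge between objects 0 and 2 and gains a similarity edge between objects 0 and 1, each at half weight.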

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: generating an initial graph representing a plurality of object features in a plurality of frames of a video input; generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function; combining the initial graph and the implicit graph to form a refined graph; processing the refined graph, using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames; and generating a grounded video description for the received video input using at least the hierarchical graph.

2. The method of claim 1, wherein the initial graph comprises a plurality of subgraphs, wherein each subgraph of the plurality of subgraphs is associated with a frame of the plurality of frames and wherein generating the initial graph comprises: determining a plurality of object feature proposals for each subgraph; classifying each object feature proposal in each subgraph based on spatial information in the subgraph; and adding a temporal relationship edge between object features present in more than one subgraph.

3. The method of claim 2, wherein generating the initial graph comprises utilizing one of a k-nearest neighbor (KNN) algorithm or a pre-trained relation classifier.

4. The method of claim 2, wherein generating an implicit graph for the plurality of object features in the plurality of frames comprises: determining, using a weighted similarity function, a weight adjustment for each object feature proposal; and adding a temporal relationship edge between object features present in more than one subgraph based on the weight adjustment of the object feature proposal.

5. The method of claim 1, wherein combining the initial graph and the implicit graph to form a refined graph further comprises: aggregating object features present in a plurality of subgraphs of the refined graph to represent deep correlations of the object features in the refined graph.

6. The method of claim 1, wherein processing the refined graph to generate a hierarchical graph of the plurality of features for the plurality of frames comprises: determining a vector representation for each subgraph in the refined graph; calculating an attention score for each of the vector representations; calculating an attention score for each object feature in each subgraph; and producing a graph feature based on the attention scores.

7. The method of claim 6, wherein generating the grounded video description comprises: applying the graph feature in a language long short-term memory (LSTM) algorithm to determine the inclusion of a word associated with an object feature in the grounded video description.

8. A system comprising one or more computer processors and a memory containing a program which when executed by the computer processors performs an operation comprising: generating an initial graph representing a plurality of object features in a plurality of frames of a received video input; generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function; combining the initial graph and the implicit graph to form a refined graph; processing the refined graph, using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames; and generating a grounded video description for the received video input using at least the hierarchical graph of the plurality of features.

9. The system of claim 8, wherein the initial graph comprises a plurality of subgraphs, wherein each subgraph of the plurality of subgraphs is associated with a frame of the plurality of frames and wherein generating the initial graph comprises: determining a plurality of object feature proposals for each subgraph; classifying each object feature proposal in each subgraph based on spatial information in the subgraph; and adding a temporal relationship edge between object features present in more than one subgraph.

10. The system of claim 9, wherein generating the initial graph comprises utilizing one of a k-nearest neighbor (KNN) algorithm or a pre-trained relation classifier.

11. The system of claim 9, wherein generating an implicit graph for the plurality of object features in the plurality of frames comprises: determining, using a weighted similarity function, a weight adjustment for each object feature proposal; and adding a temporal relationship edge between object features present in more than one subgraph based on the weight adjustment of the object feature proposal.

12. The system of claim 8, wherein combining the initial graph and the implicit graph to form a refined graph further comprises: aggregating object features present in a plurality of subgraphs of the refined graph to represent deep correlations of the object features in the refined graph.

13. The system of claim 8, wherein processing the refined graph to generate a hierarchical graph of the plurality of features for the plurality of frames comprises: determining a vector representation for each subgraph in the refined graph; calculating an attention score for each of the vector representations; calculating an attention score for each object feature in each subgraph; and producing a graph feature based on the attention scores.

14. The system of claim 13, wherein generating the grounded video description comprises: applying the graph feature in a language long short-term memory (LSTM) algorithm to determine the inclusion of a word associated with an object feature in the grounded video description.

15. A computer program product for water bottle rental and sharing, the computer program product comprising: a computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code executable by one or more computer processors to perform an operation comprising: generating an initial graph representing a plurality of object features in a plurality of frames of a received video input; generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function; combining the initial graph and the implicit graph to form a refined graph; processing the refined graph, using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames; and generating a grounded video description for the received video input using at least the hierarchical graph of the plurality of features.

16. The computer program product of claim 15, wherein the initial graph comprises a plurality of subgraphs, wherein each subgraph of the plurality of subgraphs is associated with a frame of the plurality of frames and wherein generating the initial graph comprises: determining a plurality of object feature proposals for each subgraph; classifying each object feature proposal in each subgraph based on spatial information in the subgraph; and adding a temporal relationship edge between object features present in more than one subgraph.

17. The computer program product of claim 16, wherein generating an implicit graph for the plurality of object features in the plurality of frames comprises: determining, using a weighted similarity function, a weight adjustment for each object feature proposal; and adding a temporal relationship edge between object features present in more than one subgraph based o
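
The two-level attention recited in claims 6 and 13 (a score per subgraph vector, a score per object feature, then a single graph feature) might look roughly like the sketch below. The claims do not fix the pooling or scoring functions, so mean pooling, dot-product scoring against a `query` vector (for example a decoder state), and all function names here are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) for k in range(dim)]

def hierarchical_attention(subgraphs, query):
    """subgraphs: one list of object feature vectors per frame.

    Assumption: mean pooling and dot-product scores are illustrative stand-ins.
    """
    # Frame level: mean-pool each subgraph into one vector, then score it.
    pooled = [weighted_sum(objs, [1.0 / len(objs)] * len(objs)) for objs in subgraphs]
    frame_weights = softmax([dot(p, query) for p in pooled])
    # Object level: score every object feature within its own subgraph.
    attended = []
    for objs in subgraphs:
        obj_weights = softmax([dot(o, query) for o in objs])
        attended.append(weighted_sum(objs, obj_weights))
    # Graph feature: frame-weighted sum of the object-attended frame vectors.
    return weighted_sum(attended, frame_weights), frame_weights

# Two toy frames with two hypothetical object features each.
subgraphs = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
graph_feature, frame_weights = hierarchical_attention(subgraphs, query=[1.0, 0.0])
```

In a full captioning model, a vector like `graph_feature` would be passed to the language LSTM of claims 7 and 14 at each decoding step; that decoder is outside the scope of this sketch.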

Assignees

IBM

Inventors

Classifications

G06V20/41 (Primary)
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
G06V10/62
relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking · CPC title
G06V10/761
Proximity, similarity or dissimilarity measures · CPC title
G06V20/46
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
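
To show how a CPC symbol such as G06V20/41 relates to the "Physics" technology area, here is a minimal parsing sketch. It assumes the usual CPC layout of section letter, two-digit class, subclass letter, main group, and subgroup; the function name `parse_cpc` and the returned dictionary shape are hypothetical, and section Y is omitted for brevity.

```python
import re

# Titles of CPC sections A to H (section Y is left out of this sketch).
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}

# section, class digits, subclass letter, main group, subgroup
_CPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")

def parse_cpc(symbol):
    """Split a CPC symbol like 'G06V20/41' into its hierarchical parts."""
    match = _CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a recognised CPC symbol: {symbol!r}")
    section, cls, subclass, main_group, subgroup = match.groups()
    return {
        "section": section,
        "section_title": CPC_SECTIONS[section],
        "class": section + cls,
        "subclass": section + cls + subclass,
        "main_group": main_group,
        "subgroup": subgroup,
    }

info = parse_cpc("G06V20/41")
```

Here `info["section_title"]` is "Physics" and `info["subclass"]` is "G06V", which matches the technology area mapping shown on this page.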

Patent family

Related publications grouped by family.

View patent family 79172653

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2022012499A1 cover?: Techniques for generating a grounded video description for a video input are provided. Hierarchical Attention based Spatial-Temporal Graph-to-Sequence Learning framework for producing a GVD is provided by generating an initial graph representing a plurality of object features in a plurality of frames of a received video input and generating an implicit graph for the plurality of object features…
Who is the assignee on this patent?: IBM