Image processing apparatus and 3D model generation method
US-12148211-B2 · Nov 19, 2024 · US
US2022012499A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022012499-A1 |
| Application number | US-202016926124-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 10, 2020 |
| Priority date | Jul 10, 2020 |
| Publication date | Jan 13, 2022 |
| Grant date | — |
Official abstract text for this publication.
Techniques for generating a grounded video description for a video input are provided. Hierarchical Attention based Spatial-Temporal Graph-to-Sequence Learning framework for producing a GVD is provided by generating an initial graph representing a plurality of object features in a plurality of frames of a received video input and generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function. The initial graph and the implicit graph are combined to form a refined graph and the refined graph is processed using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames. The grounded video description is generated for the received video input using at least the hierarchical graph of the plurality of features.
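The graph-building steps in the abstract can be illustrated with a small sketch. It is not the patent's implementation: the abstract names only "a similarity function", so the use of cosine similarity, the weighted-sum combination, and the names `cosine`, `implicit_graph`, `refine`, and `mix` are all assumptions made for illustration. Graphs are plain adjacency matrices over object features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def implicit_graph(features):
    """Dense adjacency matrix whose entries are pairwise feature similarities."""
    n = len(features)
    return [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]

def refine(initial, implicit, mix=0.5):
    """Combine the two graphs as a weighted sum (hypothetical combination rule)."""
    n = len(initial)
    return [[mix * initial[i][j] + (1 - mix) * implicit[i][j] for j in range(n)]
            for i in range(n)]

# Three object features; the first two are identical, the third is orthogonal.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Initial graph: one explicit edge between objects 0 and 1.
initial = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
refined = refine(initial, implicit_graph(features))
```

In this toy input the explicit edge between objects 0 and 1 is reinforced by their high similarity, while the dissimilar object 2 stays disconnected from both.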
Opening claim text (preview).
What is claimed is:
1. A method comprising: generating an initial graph representing a plurality of object features in a plurality of frames of a video input; generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function; combining the initial graph and the implicit graph to form a refined graph; processing the refined graph, using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames; and generating a grounded video description for the received video input using at least the hierarchical graph.
2. The method of claim 1, wherein the initial graph comprises a plurality of subgraphs, wherein each subgraph of the plurality of subgraphs is associated with a frame of the plurality of frames and wherein generating the initial graph comprises: determining a plurality of object feature proposals for each subgraph; classifying each object feature proposal in each subgraph based on spatial information in the subgraph; and adding a temporal relationship edge between object features present in more than one subgraph.
3. The method of claim 2, wherein generating the initial graph comprises utilizing one of a k-nearest neighbor (KNN) algorithm or a pre-trained relation classifier.
4. The method of claim 2, wherein generating an implicit graph for the plurality of object features in the plurality of frames comprises: determining, using a weighted similarity function, a weight adjustment for each object feature proposal; and adding a temporal relationship edge between object features present in more than one subgraph based on the weight adjustment of the object feature proposal.
5. The method of claim 1, wherein combining the initial graph and the implicit graph to form a refined graph further comprises: aggregating object features present in a plurality of subgraphs of the refined graph to represent deep correlations of the object features in the refined graph.
6. The method of claim 1, wherein processing the refined graph to generate a hierarchical graph of the plurality of features for the plurality of frames comprises: determining a vector representation for each subgraph in the refined graph; calculating an attention score for each of the vector representations; calculating an attention score for each object feature in each subgraph; and producing a graph feature based on the attention scores.
7. The method of claim 6, wherein generating the grounded video description comprises: applying the graph feature in a language long short-term memory (LSTM) algorithm to determine the inclusion of a word associated with an object feature in the grounded video description.
8. A system comprising one or more computer processors and a memory containing a program which when executed by the computer processors performs an operation comprising: generating an initial graph representing a plurality of object features in a plurality of frames of a received video input; generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function; combining the initial graph and the implicit graph to form a refined graph; processing the refined graph, using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames; and generating a grounded video description for the received video input using at least the hierarchical graph of the plurality of features.
9. The system of claim 8, wherein the initial graph comprises a plurality of subgraphs, wherein each subgraph of the plurality of subgraphs is associated with a frame of the plurality of frames and wherein generating the initial graph comprises: determining a plurality of object feature proposals for each subgraph; classifying each object feature proposal in each subgraph based on spatial information in the subgraph; and adding a temporal relationship edge between object features present in more than one subgraph.
10. The system of claim 9, wherein generating the initial graph comprises utilizing one of a k-nearest neighbor (KNN) algorithm or a pre-trained relation classifier.
11. The system of claim 9, wherein generating an implicit graph for the plurality of object features in the plurality of frames comprises: determining, using a weighted similarity function, a weight adjustment for each object feature proposal; and adding a temporal relationship edge between object features present in more than one subgraph based on the weight adjustment of the object feature proposal.
12. The system of claim 8, wherein combining the initial graph and the implicit graph to form a refined graph further comprises: aggregating object features present in a plurality of subgraphs of the refined graph to represent deep correlations of the object features in the refined graph.
13. The system of claim 8, wherein processing the refined graph to generate a hierarchical graph of the plurality of features for the plurality of frames comprises: determining a vector representation for each subgraph in the refined graph; calculating an attention score for each of the vector representations; calculating an attention score for each object feature in each subgraph; and producing a graph feature based on the attention scores.
14. The system of claim 13, wherein generating the grounded video description comprises: applying the graph feature in a language long short-term memory (LSTM) algorithm to determine the inclusion of a word associated with an object feature in the grounded video description.
15. A computer program product comprising: a computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code executable by one or more computer processors to perform an operation comprising: generating an initial graph representing a plurality of object features in a plurality of frames of a received video input; generating an implicit graph for the plurality of object features in the plurality of frames using a similarity function; combining the initial graph and the implicit graph to form a refined graph; processing the refined graph, using attention processes, to generate an attended hierarchical graph of the plurality of object features for the plurality of frames; and generating a grounded video description for the received video input using at least the hierarchical graph of the plurality of features.
16. The computer program product of claim 15, wherein the initial graph comprises a plurality of subgraphs, wherein each subgraph of the plurality of subgraphs is associated with a frame of the plurality of frames and wherein generating the initial graph comprises: determining a plurality of object feature proposals for each subgraph; classifying each object feature proposal in each subgraph based on spatial information in the subgraph; and adding a temporal relationship edge between object features present in more than one subgraph.
17. The computer program product of claim 16, wherein generating an implicit graph for the plurality of object features in the plurality of frames comprises: determining, using a weighted similarity function, a weight adjustment for each object feature proposal; and adding a temporal relationship edge between object features present in more than one subgraph based o
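The two-level attention recited in claims 6 and 13 (a score per subgraph vector, a score per object feature, then a single graph feature) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: mean pooling as the subgraph vector, dot-product scoring against a `query` vector (imagined here as a decoder state), and the function names are all hypothetical choices that the claims do not specify.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    """Mean-pool object features into one vector per subgraph (assumed pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its attention weight."""
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*vectors)]

def graph_feature(subgraphs, query):
    """subgraphs: one list of object feature vectors per frame."""
    # Level 1: one vector per subgraph, scored against the query.
    frame_vectors = [mean_vector(nodes) for nodes in subgraphs]
    frame_weights = softmax([dot(query, fv) for fv in frame_vectors])
    # Level 2: score each object feature inside its own subgraph.
    attended = []
    for nodes in subgraphs:
        node_weights = softmax([dot(query, x) for x in nodes])
        attended.append(weighted_sum(node_weights, nodes))
    # Combine the attended subgraphs using the frame-level weights.
    return weighted_sum(frame_weights, attended), frame_weights

# Two frames with two object features each.
subgraphs = [[[1.0, 0.0], [1.0, 0.0]],
             [[0.0, 1.0], [0.0, 1.0]]]
uniform_feature, uniform_weights = graph_feature(subgraphs, [0.0, 0.0])
focused_feature, focused_weights = graph_feature(subgraphs, [1.0, 0.0])
```

With a zero query every score ties, so both frames contribute equally. With a query aligned to the first frame's features, the frame-level weights shift toward that frame and the resulting graph feature follows.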
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking · CPC title
Proximity, similarity or dissimilarity measures · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title