Active learning method for temporal action localization in untrimmed videos
US-2019325275-A1 · Oct 24, 2019 · US
US10949718B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10949718-B2 |
| Application number | US-201916406380-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 8, 2019 |
| Priority date | May 8, 2019 |
| Publication date | Mar 16, 2021 |
| Grant date | Mar 16, 2021 |
Official abstract text for this publication.
The systems and methods described herein may generate multi-modal embeddings with sub-symbolic features and symbolic features. The sub-symbolic embeddings may be generated with computer vision processing. The symbolic features may include mathematical representations of image content, which are enriched with information from background knowledge sources. The system may aggregate the sub-symbolic and symbolic features using aggregation techniques such as concatenation, averaging, summing, and/or maxing. The multi-modal embeddings may be included in a multi-modal embedding model and trained via supervised learning. Once the multi-modal embeddings are trained, the system may generate inferences based on linear algebra operations involving the multi-modal embeddings that are relevant to an inference response to the natural language question and input image.
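The aggregation step named in the abstract (concatenation, averaging, summing, and/or maxing) can be illustrated with a minimal sketch. The function names, the `AGGREGATORS` table, and the toy vectors below are hypothetical and are not taken from the patent. The sketch also assumes that the element-wise modes require both vectors to have the same length, which the abstract does not state.

```python
from typing import Callable, Dict, List

Vector = List[float]


def concatenate(symbolic: Vector, sub_symbolic: Vector) -> Vector:
    """Join the two vectors end to end; lengths may differ."""
    return symbolic + sub_symbolic


def _elementwise(op: Callable[[float, float], float],
                 symbolic: Vector, sub_symbolic: Vector) -> Vector:
    """Apply op to matching positions (assumption: equal lengths required)."""
    if len(symbolic) != len(sub_symbolic):
        raise ValueError("element-wise aggregation needs equal-length vectors")
    return [op(a, b) for a, b in zip(symbolic, sub_symbolic)]


# Hypothetical lookup table mapping a mode name to an aggregation function.
AGGREGATORS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "concatenate": concatenate,
    "average": lambda s, v: _elementwise(lambda a, b: (a + b) / 2.0, s, v),
    "sum": lambda s, v: _elementwise(lambda a, b: a + b, s, v),
    "max": lambda s, v: _elementwise(max, s, v),
}


def aggregate(symbolic: Vector, sub_symbolic: Vector,
              how: str = "concatenate") -> Vector:
    """Combine a symbolic and a sub-symbolic vector into one multi-modal vector."""
    return AGGREGATORS[how](symbolic, sub_symbolic)


# Toy vectors for illustration only.
symbolic = [0.5, -1.0, 2.0]
sub_symbolic = [1.5, 3.0, -2.0]
multi_modal = aggregate(symbolic, sub_symbolic, how="concatenate")
```

Concatenation keeps both modalities intact at the cost of a longer vector, while the element-wise modes keep the dimension fixed but mix the two sources.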
Opening claim text (preview).
What is claimed is:

1. A method for visual question inference, the method comprising: receiving an input image and a natural language query; determining content classifications for portions of the input image; generating a scene graph for the input image, the scene graph including the content classifications arranged in a graph data structure comprising nodes and edges, the nodes respectively representative of the content classifications for the input image and the edges representative of relationships between the content classifications; generating multi-modal embeddings based on the input image and the scene graph, the multi-modal embeddings being respectively associated with the nodes, the edges, or any combination thereof, wherein at least a portion of the multi-modal embeddings are generated by: determining symbolic embeddings for the content classifications of the input image, the symbolic embeddings representative of nodes of the scene graph, edges of the scene graph, or any combination thereof; determining a sub-symbolic embedding for the input image, the sub-symbolic embedding comprising an image feature vector for the input image; identifying separate portions of the image feature vector that are representative of the portions of the input image; generating weighted sub-symbolic embeddings for each of the content classifications by applying weight values to the separate portions of the image feature vector; aggregating the symbolic embeddings with the weighted sub-symbolic embeddings to form at least the portion of the multi-modal embeddings; generating a natural language response to the natural language query based on the multi-modal embeddings by: generating an inference query based on the natural language query, the inference query indicative of the at least one of the content classifications; selecting, from the multi-modal embeddings, particular multi-modal embeddings associated with at least one of the content classifications; determining an inference statement based on a distance measurement between the particular multi-modal embeddings; and determining the natural language response based on the inference statement; and displaying, in response to receipt of the natural language query and the input image, the natural language response.

2. The method of claim 1, wherein aggregating the symbolic embeddings with the weighted sub-symbolic embeddings to form the multi-modal embeddings further comprises: concatenating a first vector from the symbolic embeddings with a second vector from the weighted sub-symbolic embeddings to form a multi-modal vector.

3. The method of claim 1, wherein determining an inference statement based on a distance measurement between the particular multi-modal embeddings further comprises: generating a plurality of candidate statements, each of the candidate statements referencing at least one node and at least one edge of the scene graph; selecting, from the multi-modal embeddings, groups of multi-modal embeddings based on the at least one node and the at least one edge of the scene graph; determining respective scores for the plurality of candidate statements based on distance measurements between multi-modal embeddings in each of the groups; selecting, based on the respective scores, at least one of the candidate statements; and generating the natural language response based on the selected at least one of the selected candidate statements.

4. The method of claim 3, wherein selecting, based on the respective scores, at least one of the candidate statements further comprises: selecting a candidate statement associated with a highest one of the respective scores.

5. The method of claim 1, further comprising: enriching the scene graph by appending additional nodes to the scene graph with nodes being sourced from a background knowledge graph; and generating the multi-modal embeddings based on the input image and the enriched scene graph.

6. The method of claim 5, wherein enriching the scene graph by appending additional nodes to the scene graph with the additional nodes being sourced from a background knowledge graph further comprises: identifying, in the background knowledge graph, a first node of the scene graph that corresponds to a second node of the background knowledge graph; selecting further nodes of the background knowledge graph that are connected with the second node of the background knowledge graph, wherein the selected further nodes are not included in the non-enriched scene graph; and appending the selected further nodes to the scene graph.

7. The method of claim 1, further comprising: generating a graphical user interface, the graphical user interface comprising the input image and a text field; determining that the natural language query was inserted into the text field; and updating the graphical user interface to include the natural language response.

8. A system for visual question inference, the system comprising: a processor, the processor configured to: receive an input image and a natural language query; generate a scene graph for the input image, the scene graph comprising content classifications of image data for the input image, the content classifications being arranged in a graph data structure comprising nodes and edges, the nodes respectively representative of the content classifications for the input image and the edges representative of relationships between the content classifications; determine symbolic embeddings for the input image, the symbolic embeddings representative of nodes of the scene graph, edges of the scene graph, or any combination thereof; determine a sub-symbolic embeddings for the input image, the sub-symbolic embeddings comprising respective image feature vectors for the input image; aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings in a multi-modal embedding space; identify at least one node and at least one edge in the scene graph based on natural language included in the natural language query, the natural language text being indicative of at least one of the content classifications; select, from the multi-modal embeddings, particular multi-modal embeddings associated with the at least one of the content classifications; determine an inference statement based on a distance measurement between the selected multi-modal embeddings; generate a natural language response based on the inference statement; and display the natural language in response on a graphical user interface.

9. The system of claim 8, wherein to determine a sub-symbolic embeddings for the input image, the processor is further configured to: determine at least a portion of the input image that corresponds to the content classifications; generate an initial image feature vector for the input image; identify separate portions of the initial image feature vector, the separate portions of the initial image feature vector being representative of the at least the portion of the input image; apply weight values to the separate portions of the image feature vector; extract the separate weighted portions of the image feature vector; and generate the respective image feature vectors of the sub-symbolic embeddings, wherein each of the respective image feature vectors comprise a corresponding one of the separate weighted portions of the image feature vector.

10. The system of claim 8, wherein to aggregate the symbolic embeddings with the sub-symbolic embeddings to form multi-modal embeddings in a multi-modal embedding space, the processor is further configured to: concatenate a first feature vector from the symbolic embeddings with a second feature v
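
Two steps in the claims above can be sketched in code: selecting the highest-scoring candidate statement (claims 3 and 4) and enriching the scene graph from a background knowledge graph (claims 5 and 6). This is a hedged illustration only. The claims say scores are based on distance measurements but do not fix a formula, so the translation-style score used here (negative Euclidean distance between head plus edge and tail) is an assumption. All names and toy values (`score`, `best_statement`, `enrich_scene_graph`, the embeddings, the graph contents) are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Vector = List[float]
Triple = Tuple[str, str, str]  # (head node, edge label, tail node)


def score(statement: Triple, embeddings: Dict[str, Vector]) -> float:
    """Assumed scoring rule: the closer head + edge lies to tail, the higher the score."""
    head, edge, tail = (embeddings[name] for name in statement)
    translated = [h + r for h, r in zip(head, edge)]
    return -math.dist(translated, tail)


def best_statement(candidates: List[Triple],
                   embeddings: Dict[str, Vector]) -> Triple:
    """Pick the candidate statement with the highest score (claim 4)."""
    return max(candidates, key=lambda s: score(s, embeddings))


def enrich_scene_graph(scene_nodes: Set[str],
                       knowledge_graph: Dict[str, Set[str]]) -> Set[str]:
    """Append knowledge-graph neighbours of matching nodes (claims 5 and 6).

    Assumption: a scene-graph node corresponds to a knowledge-graph node
    when the two share the same label.
    """
    added: Set[str] = set()
    for node in scene_nodes:
        for neighbour in knowledge_graph.get(node, set()):
            if neighbour not in scene_nodes:  # skip nodes already in the scene graph
                added.add(neighbour)
    return scene_nodes | added


# Toy multi-modal embeddings and candidate statements, for illustration only.
embeddings = {"cat": [0.0, 0.0], "on": [1.0, 0.0],
              "mat": [1.0, 0.0], "dog": [5.0, 5.0]}
candidates = [("cat", "on", "mat"), ("cat", "on", "dog")]
chosen = best_statement(candidates, embeddings)

# Toy background knowledge graph as an adjacency map.
knowledge_graph = {"cat": {"animal", "mat"}, "dog": {"animal"}}
enriched = enrich_scene_graph({"cat", "mat"}, knowledge_graph)
```

In this toy setup the statement whose tail embedding sits exactly at head plus edge scores highest, and enrichment adds only the knowledge-graph neighbours that the scene graph does not already contain.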
Lexical analysis, e.g. tokenisation or collocates · CPC title
Classification techniques · CPC title
Syntactic or semantic context, e.g. balancing · CPC title
Classification techniques · CPC title
Natural language generation · CPC title