Attention Bottlenecks for Multimodal Fusion
US-2023177384-A1 · Jun 8, 2023 · US
US12536836B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12536836-B2 |
| Application number | US-202318152627-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 10, 2023 |
| Priority date | Jan 10, 2023 |
| Publication date | Jan 27, 2026 |
| Grant date | Jan 27, 2026 |
Official abstract text for this publication.
Implementations include actions of receiving an image; extracting a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object; obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI; processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings; processing the sets of semantic embeddings to provide respective sets of transition semantic embeddings; determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic HOI; and determining at least one predicted HOI represented within the image based on the scores.
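The abstract's step of obtaining semantic HOIs from a vector library can be pictured as pairing the detected subject and object with every verb in the library. The following is a minimal sketch under stated assumptions, not the patented implementation: the library layout, the toy three-dimensional vectors, and the function name `candidate_semantic_hois` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
HoiLabel = Tuple[str, str, str]                 # (subject, verb, object)
SemanticSet = Tuple[Vector, Vector, Vector]     # (subject, verb, object) word embeddings

# Hypothetical toy vector library. In practice the entries would be word
# embeddings produced by a word embedding model; these values are made up.
VECTOR_LIBRARY: Dict[str, Dict[str, Vector]] = {
    "subjects": {"person": [1.0, 0.0, 0.0]},
    "objects": {"bicycle": [0.0, 1.0, 0.0], "horse": [0.0, 0.9, 0.1]},
    "verbs": {
        "ride": [0.0, 0.0, 1.0],
        "hold": [0.5, 0.0, 0.5],
        "wash": [0.2, 0.2, 0.6],
    },
}


def candidate_semantic_hois(
    subject: str, obj: str, library: Dict[str, Dict[str, Vector]]
) -> Dict[HoiLabel, SemanticSet]:
    """Pair the detected subject and object with every verb in the library.

    Each candidate semantic HOI gets its own set of semantic embeddings.
    """
    subject_vec = library["subjects"][subject]
    object_vec = library["objects"][obj]
    candidates: Dict[HoiLabel, SemanticSet] = {}
    for verb, verb_vec in library["verbs"].items():
        candidates[(subject, verb, obj)] = (subject_vec, verb_vec, object_vec)
    return candidates


# The visual HOI is assumed to indicate subject "person" and object "bicycle".
candidates = candidate_semantic_hois("person", "bicycle", VECTOR_LIBRARY)
for label in candidates:
    print(label)
```

Enumerating one candidate per verb keeps the number of semantic HOIs to score equal to the number of verbs in the library for a given subject and object pair.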
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for determining human-object interactions (HOIs) in images, the method comprising:
   receiving an image;
   extracting, by a feature embedding model, a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object;
   obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI;
   processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings;
   processing, by the compositional model, the sets of semantic embeddings to provide respective sets of transition semantic embeddings;
   determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic HOI of the set of semantic HOIs, wherein determining the set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings comprises, for each set of transition semantic embeddings:
      determining a first distance between a transition visual embedding representative of the subject and a transition semantic embedding representative of the subject,
      determining a second distance between a transition visual embedding representative of the object and a transition semantic embedding representative of the object,
      determining a third distance between a transition visual embedding representative of bounding boxes and a transition semantic embedding representative of a verb, the bounding boxes bounding the subject and the object within the image,
      determining an aggregate distance using the first distance, the second distance, and the third distance; and
   determining at least one predicted HOI represented within the image based on the scores.

2. The method of claim 1, wherein the compositional model comprises a set of visual models and a set of language models, the set of visual models comprising a subject visual model, an object visual model, and a union visual model, and the set of language models comprising a subject language model, an object language model, and a verb language model.

3. The method of claim 1, wherein the vector library comprises a set of subject word embeddings, a set of object word embeddings, and a set of verb word embeddings.

4. The method of claim 1, wherein the vector library comprises word embeddings generated by processing labels of training data using a word embedding model, the training data being used to train the compositional model.

5. The method of claim 1, wherein determining at least one predicted HOI comprises: identifying a semantic HOI as having a highest score; and providing the semantic HOI with the highest score as the at least one predicted HOI.

6. The method of claim 1, wherein each of the first distance, the second distance, and the third distance is at least partially determined as a cosine distance.

7. A system, comprising: one or more processors; and a computer-readable storage device coupled to the one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for determining human-object interactions (HOIs) in images, the operations comprising:
   receiving an image;
   extracting, by a feature embedding model, a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object;
   obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI;
   processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings;
   processing, by the compositional model, the sets of semantic embeddings to provide respective sets of transition semantic embeddings;
   determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic HOI of the set of semantic HOIs, wherein determining the set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings comprises, for each set of transition semantic embeddings:
      determining a first distance between a transition visual embedding representative of the subject and a transition semantic embedding representative of the subject,
      determining a second distance between transition visual embedding representative of the object and a transition semantic embedding representative of the object,
      determining a third distance between a transition visual embedding representative of bounding boxes and a transition semantic embedding representative of a verb, the bounding boxes bounding the subject and the object within the image,
      determining an aggregate distance using the first distance, the second distance, and the third distance; and
   determining at least one predicted HOI represented within the image based on the scores.

8. The system of claim 7, wherein the compositional model comprises a set of visual models and a set of language models, the set of visual models comprising a subject visual model, an object visual model, and a union visual model, and the set of language models comprising a subject language model, an object language model, and a verb language model.

9. The system of claim 7, wherein the vector library comprises a set of subject word embeddings, a set of object word embeddings, and a set of verb word embeddings.

10. The system of claim 7, wherein the vector library comprises word embeddings generated by processing labels of training data using a word embedding model, the training data being used to train the compositional model.

11. The system of claim 7, wherein determining at least one predicted HOI comprises: identifying a semantic HOI as having a highest score; and providing the semantic HOI with the highest score as the at least one predicted HOI.

12. The system of claim 7, wherein each of the first distance, the second distance, and the third distance is at least partially determined as a cosine distance.

13. A non-transitory computer-readable storage media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for determining human-object interactions (HOIs) in images, the operations comprising:
   receiving an image;
   extracting, by a feature embedding model, a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object;
   obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI;
   processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings;
   processing, by the compositional model, the sets of semantic embeddings to provide respective sets of transition semantic embeddings;
   determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic H
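The scoring recited in the claims (three per-component distances, an aggregate distance, and selection of the highest-scoring semantic HOI) can be sketched as follows. This is an illustrative reading, not the claimed implementation. Two points are assumptions that the claim text does not specify: the aggregate is taken as a plain sum of the three cosine distances, and the score is taken as the negated aggregate so that a smaller distance yields a higher score. All function names and the toy two-dimensional embeddings are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]
# Transition visual embeddings: (subject, union of bounding boxes, object).
VisualSet = Tuple[Vector, Vector, Vector]
# Transition semantic embeddings: (subject, verb, object).
SemanticSet = Tuple[Vector, Vector, Vector]
HoiLabel = Tuple[str, str, str]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def aggregate_distance(visual: VisualSet, semantic: SemanticSet) -> float:
    """Combine the three distances; a plain sum is an assumption."""
    vis_subject, vis_union, vis_object = visual
    sem_subject, sem_verb, sem_object = semantic
    first = cosine_distance(vis_subject, sem_subject)   # subject vs. subject
    second = cosine_distance(vis_object, sem_object)    # object vs. object
    third = cosine_distance(vis_union, sem_verb)        # bounding boxes vs. verb
    return first + second + third


def predict_hoi(
    visual: VisualSet, candidates: Dict[HoiLabel, SemanticSet]
) -> Tuple[HoiLabel, Dict[HoiLabel, float]]:
    """Score every candidate and return the one with the highest score."""
    # Assumption: score = negated aggregate distance (higher means more similar).
    scores = {
        label: -aggregate_distance(visual, semantic)
        for label, semantic in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy transition embeddings, invented for illustration only.
visual = ([1.0, 0.1], [0.1, 1.0], [0.7, 0.7])
candidates = {
    ("person", "ride", "bicycle"): ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]),
    ("person", "wash", "bicycle"): ([1.0, 0.0], [1.0, 0.0], [1.0, 1.0]),
}
best, scores = predict_hoi(visual, candidates)
print(best, scores)
```

In this toy setup the two candidates share subject and object embeddings, so only the third distance (bounding-box embedding against verb embedding) separates them, which mirrors the role the claims give to the verb comparison.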
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Proximity, similarity or dissimilarity measures · CPC title
Semantic analysis · CPC title
Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title