Human object interaction detection using compositional model

US12536836B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12536836-B2
Application number  US-202318152627-A
Country             US
Kind code           B2
Filing date         Jan 10, 2023
Priority date       Jan 10, 2023
Publication date    Jan 27, 2026
Grant date          Jan 27, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations include actions of receiving an image; extracting a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object; obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI; processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings; processing the sets of semantic embeddings to provide respective sets of transition semantic embeddings; determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic HOI; and determining at least one predicted HOI represented within the image based on the scores.
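The "obtaining, using a vector library" step can be illustrated with a minimal sketch. It assumes the vector library is a plain mapping from labels to word embeddings, split into subjects, objects, and verbs; the names `VECTOR_LIBRARY` and `candidate_semantic_hois`, the labels, and the toy 3-dimensional vectors are all hypothetical and do not come from the patent.

```python
# Hypothetical vector library: label -> word embedding (toy 3-d vectors).
# Real word embeddings would come from a word embedding model.
VECTOR_LIBRARY = {
    "subjects": {"person": [1.0, 0.0, 0.0]},
    "objects": {"bicycle": [0.0, 1.0, 0.0], "horse": [0.0, 0.9, 0.1]},
    "verbs": {
        "ride": [0.0, 0.0, 1.0],
        "hold": [0.5, 0.0, 0.5],
        "wash": [0.2, 0.2, 0.6],
    },
}


def candidate_semantic_hois(subject, obj, library):
    """Pair the detected subject and object with every verb in the library.

    Returns one entry per candidate semantic HOI, each carrying its own
    set of semantic embeddings (subject, object, verb).
    """
    subject_vec = library["subjects"][subject]
    object_vec = library["objects"][obj]
    candidates = []
    for verb, verb_vec in library["verbs"].items():
        candidates.append(
            {
                "hoi": (subject, verb, obj),
                "embeddings": {
                    "subject": subject_vec,
                    "object": object_vec,
                    "verb": verb_vec,
                },
            }
        )
    return candidates


# Assume the visual HOI extracted from the image indicates person + bicycle.
candidates = candidate_semantic_hois("person", "bicycle", VECTOR_LIBRARY)
for candidate in candidates:
    print(candidate["hoi"])
```

Each candidate would then be passed through the compositional model and scored against the visual embeddings; the sketch stops before that step.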

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for determining human-object interactions (HOIs) in images, the method comprising: receiving an image; extracting, by a feature embedding model, a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object; obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI; processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings; processing, by the compositional model, the sets of semantic embeddings to provide respective sets of transition semantic embeddings; determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic HOI of the set of semantic HOIs, wherein determining the set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings comprises, for each set of transition semantic embeddings: determining a first distance between a transition visual embedding representative of the subject and a transition semantic embedding representative of the subject, determining a second distance between a transition visual embedding representative of the object and a transition semantic embedding representative of the object, determining a third distance between a transition visual embedding representative of bounding boxes and a transition semantic embedding representative of a verb, the bounding boxes bounding the subject and the object within the image, determining an aggregate distance using the first distance, the second distance, and the third distance; and determining at least one predicted HOI represented within the image based on the scores.

2. The method of claim 1, wherein the compositional model comprises a set of visual models and a set of language models, the set of visual models comprising a subject visual model, an object visual model, and a union visual model, and the set of language models comprising a subject language model, an object language model, and a verb language model.

3. The method of claim 1, wherein the vector library comprises a set of subject word embeddings, a set of object word embeddings, and a set of verb word embeddings.

4. The method of claim 1, wherein the vector library comprises word embeddings generated by processing labels of training data using a word embedding model, the training data being used to train the compositional model.

5. The method of claim 1, wherein determining at least one predicted HOI comprises: identifying a semantic HOI as having a highest score; and providing the semantic HOI with the highest score as the at least one predicted HOI.

6. The method of claim 1, wherein each of the first distance, the second distance, and the third distance is at least partially determined as a cosine distance.

7. A system, comprising: one or more processors; and a computer-readable storage device coupled to the one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for determining human-object interactions (HOIs) in images, the operations comprising: receiving an image; extracting, by a feature embedding model, a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object; obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI; processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings; processing, by the compositional model, the sets of semantic embeddings to provide respective sets of transition semantic embeddings; determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic HOI of the set of semantic HOIs, wherein determining the set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings comprises, for each set of transition semantic embeddings: determining a first distance between a transition visual embedding representative of the subject and a transition semantic embedding representative of the subject, determining a second distance between transition visual embedding representative of the object and a transition semantic embedding representative of the object, determining a third distance between a transition visual embedding representative of bounding boxes and a transition semantic embedding representative of a verb, the bounding boxes bounding the subject and the object within the image, determining an aggregate distance using the first distance, the second distance, and the third distance; and determining at least one predicted HOI represented within the image based on the scores.

8. The system of claim 7, wherein the compositional model comprises a set of visual models and a set of language models, the set of visual models comprising a subject visual model, an object visual model, and a union visual model, and the set of language models comprising a subject language model, an object language model, and a verb language model.

9. The system of claim 7, wherein the vector library comprises a set of subject word embeddings, a set of object word embeddings, and a set of verb word embeddings.

10. The system of claim 7, wherein the vector library comprises word embeddings generated by processing labels of training data using a word embedding model, the training data being used to train the compositional model.

11. The system of claim 7, wherein determining at least one predicted HOI comprises: identifying a semantic HOI as having a highest score; and providing the semantic HOI with the highest score as the at least one predicted HOI.

12. The system of claim 7, wherein each of the first distance, the second distance, and the third distance is at least partially determined as a cosine distance.

13. A non-transitory computer-readable storage media coupled to one or more processors and having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations for determining human-object interactions (HOIs) in images, the operations comprising: receiving an image; extracting, by a feature embedding model, a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object; obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic HOI; processing, by a compositional model, the set of visual embeddings to provide a set of transition visual embeddings; processing, by the compositional model, the sets of semantic embeddings to provide respective sets of transition semantic embeddings; determining a set of scores based on the set of transition visual embeddings and the sets of transition semantic embeddings, each score representing a degree of similarity between the visual HOI and a semantic H
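The scoring steps recited in the independent claims, together with the cosine distance of claims 6 and 12 and the highest-score selection of claims 5 and 11, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the transition embeddings are taken as already produced by the compositional model, the aggregate distance is assumed to be a plain sum (the claims only say it uses the three distances), and the score is assumed to be the negated aggregate distance. All function names and the toy 2-dimensional vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def aggregate_distance(visual, semantic):
    """Combine the three per-component distances.

    ASSUMPTION: a plain sum; the claims do not fix the aggregation rule.
    """
    first = cosine_distance(visual["subject"], semantic["subject"])
    second = cosine_distance(visual["object"], semantic["object"])
    # The union (bounding-box) visual embedding is compared with the verb.
    third = cosine_distance(visual["union"], semantic["verb"])
    return first + second + third


def predict_hoi(visual, candidates):
    """Return the candidate HOI with the highest score.

    ASSUMPTION: score = -aggregate distance, so smaller distance wins.
    """
    return max(
        candidates,
        key=lambda hoi: -aggregate_distance(visual, candidates[hoi]),
    )


# Toy transition embeddings (hypothetical values in a shared 2-d space).
visual = {"subject": [1.0, 0.0], "object": [0.0, 1.0], "union": [1.0, 1.0]}
candidates = {
    ("person", "ride", "bicycle"): {
        "subject": [1.0, 0.0],
        "object": [0.0, 1.0],
        "verb": [1.0, 1.0],
    },
    ("person", "wash", "bicycle"): {
        "subject": [1.0, 0.0],
        "object": [0.0, 1.0],
        "verb": [1.0, -1.0],
    },
}

best = predict_hoi(visual, candidates)
print(best)
```

In this toy setup only the verb embedding differs between the two candidates, so the third distance alone decides which semantic HOI is selected.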

Assignees

Accenture Global Solutions Ltd

Inventors

Classifications

  • Labelling scene content, e.g. deriving syntactic or semantic representations

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

  • G06V10/761 (Primary)

    Proximity, similarity or dissimilarity measures

  • Semantic analysis

  • G06V40/20 (Primary)

    Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12536836B2 cover?
Implementations include actions of receiving an image; extracting a visual HOI and a set of visual embeddings, the visual HOI indicating a subject and an object; obtaining, using a vector library, a set of semantic HOIs and sets of semantic embeddings based on the subject, the object and a set of verbs included in the vector library, each set of semantic embeddings corresponding to a semantic H…
Who is the assignee on this patent?
Accenture Global Solutions Ltd
What technology area does this patent fall under?
Primary CPC classification G06V10/761. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 27, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).