Keypoint based action localization

US12198397B2 · US · B2

Patent metadata
Publication number: US-12198397-B2
Application number: US-202217586284-A
Country: US
Kind code: B2
Filing date: Jan 27, 2022
Priority date: Jan 28, 2021
Publication date: Jan 14, 2025
Grant date: Jan 14, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method is provided for action localization. The method includes converting one or more video frames into person keypoints and object keypoints. The method further includes embedding position, timestamp, instance, and type information with the person keypoints and object keypoints to obtain keypoint embeddings. The method also includes predicting, by a hierarchical transformer encoder using the keypoint embeddings, human actions and bounding box information of when and where the human actions occur in the one or more video frames.
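The embedding step in the abstract can be illustrated with a minimal sketch. It assumes each keypoint carries four integer tokens (position, timestamp, instance, type), that each token is projected through its own lookup table (equivalent to a linear projection of a one-hot token), and that the four projections are summed. The table sizes, the embedding width, and the names `make_table` and `embed_keypoint` are hypothetical and not taken from the patent; the weights are random placeholders rather than learned values.

```python
import random

# All sizes and names here are illustrative assumptions, not values from the patent.
EMBED_DIM = 8


def make_table(num_tokens, dim, rng):
    """Embedding table: row i is the linear projection of one-hot token i."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def embed_keypoint(tokens, tables):
    """Sum the per-field embeddings of one keypoint into a single vector."""
    out = [0.0] * EMBED_DIM
    for field, token in tokens.items():
        row = tables[field][token]
        out = [a + b for a, b in zip(out, row)]
    return out


rng = random.Random(0)
tables = {
    "position": make_table(64, EMBED_DIM, rng),   # cells of an assumed 8x8 down-sampled grid
    "timestamp": make_table(16, EMBED_DIM, rng),  # frame offset from the first keyframe
    "instance": make_table(4, EMBED_DIM, rng),    # which person or object owns the keypoint
    "type": make_table(20, EMBED_DIM, rng),       # index of a body part or object keypoint type
}

# One keypoint described by its four tokens.
keypoint = {"position": 27, "timestamp": 3, "instance": 1, "type": 5}
embedding = embed_keypoint(keypoint, tables)
print(len(embedding))  # one vector of width EMBED_DIM per keypoint
```

In a trained system the tables would be learned parameters and the resulting vectors would be fed to the transformer encoder; the sketch only shows the project-and-sum structure.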

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for action localization, comprising: converting one or more video frames into person keypoints and object keypoints; embedding position, timestamp, instance, and type information with the person keypoints and object keypoints to obtain keypoint embeddings; and predicting, by a hierarchical transformer encoder using the keypoint embeddings, human actions and bounding box information of when and where the human actions occur in the one or more video frames, the embedding including converting the keypoints to tokens, and the predicting including projecting the tokens to embedding metrics and summing the embedding metrics to obtain an output keypoint embedding.

2. The computer-implemented method of claim 1, wherein said converting converts the one or more video frames into person keypoints in a form of human joint names for each detected person.

3. The computer-implemented method of claim 2, wherein said converting further comprises selecting a top N out of detected persons based on person detection confidence scores.

4. The computer-implemented method of claim 1, wherein said converting comprises extracting the object keypoints by subsampling a contour of an object mask detected by a Mask R-CNN.

5. The computer-implemented method of claim 4, wherein said converting further comprises selecting top N out of detected objects based on object detection confidence scores.

6. The computer-implemented method of claim 1, further comprising learning atomic actions from the person keypoints and the object keypoints.

7. The computer-implemented method of claim 1, wherein the position information comprises a down-sampled spatial location of each pixel coordinate.

8. The computer-implemented method of claim 1, wherein the timestamp information comprises a difference between a keypoint timestamp and a beginning keyframe timestamp.

9. The computer-implemented method of claim 1, wherein the instance information comprises a spatial correlation between the person keypoints and a person instance.

10. The computer-implemented method of claim 1, wherein the type information comprises a human body part name.

11. The computer-implemented method of claim 1, wherein the position, timestamp, instance, and type information comprise representative tokens that are linearly projected to a respective embedding metric and summed to obtain an output keypoint embedding through a transformer based Keypoint Embedding Network.

12. The computer-implemented method of claim 1, further comprising controlling a vehicle system for accident avoidance responsive to the predicted human actions and the bounding box information.

13. The computer-implemented method of claim 1, further comprising controlling a robotic system for collision avoidance responsive to the predicted human actions and the bounding box information.

14. A computer program product for action localization, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: converting, by a processor device of the computer, one or more video frames into person keypoints and object keypoints; embedding, by the processor device, position, timestamp, instance, and type information with the person keypoints and object keypoints to obtain keypoint embeddings; and predicting, by a hierarchical transformer encoder of the computer using the embedded keypoints, human actions and bounding box information of when and where the human actions occur in the one or more video frames, the embedding including converting the keypoints to tokens, and the predicting including projecting the tokens to embedding metrics and summing the embedding metrics to obtain an output keypoint embedding.

15. The computer program product of claim 14, wherein said converting converts the one or more video frames into person keypoints in a form of human joint names for each detected persons.

16. The computer program product of claim 15, wherein said converting further comprises selecting a top N out of the detected persons based on person detection confidence scores.

17. The computer program product of claim 14, wherein said converting comprises extracting the object keypoints by subsampling a contour of a mask detected by a Mask R-CNN.

18. The computer program product of claim 17, wherein said converting further comprises selecting top N out of detected objects based on object detection confidence scores.

19. The computer program product of claim 14, further comprising learning atomic actions in the person keypoints and the object keypoints.

20. A computer processing system for action localization, comprising: a memory device for storing program code; a processor device operatively coupled to the memory device for running the program code for: converting one or more video frames into person keypoints and object keypoints; embedding position, timestamp, instance, and type information with the person keypoints and object keypoints to obtain keypoint embeddings; and predicting, using a hierarchical transformer encoder that inputs the keypoint embeddings, human actions and bounding box information of when and where the human actions occur in the one or more video frames, the embedding including converting the keypoints to tokens, and the predicting including projecting the tokens to embedding metrics and summing the embedding metrics to obtain an output keypoint embedding.
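Several dependent claims describe small preprocessing steps: keeping the top N detections by confidence, subsampling an object mask contour, down-sampling pixel positions, and taking a timestamp relative to the beginning keyframe. The sketch below is one plausible reading of those steps, not the patented implementation. The function names, the dictionary layout of a detection, the uniform spacing used for subsampling, and the grid stride are all assumptions made for illustration; the claims do not specify them.

```python
def select_top_n(detections, n):
    """Keep the n detections with the highest confidence score."""
    return sorted(detections, key=lambda d: d["score"], reverse=True)[:n]


def subsample_contour(contour, k):
    """Pick k points spread evenly along an ordered contour (assumed uniform spacing)."""
    if k >= len(contour):
        return list(contour)
    step = len(contour) / k
    return [contour[int(i * step)] for i in range(k)]


def position_token(x, y, stride, grid_width):
    """Map a pixel coordinate to a cell index of a down-sampled grid."""
    return (y // stride) * grid_width + (x // stride)


def timestamp_token(keypoint_frame, first_keyframe):
    """Offset of a keypoint's frame from the beginning keyframe."""
    return keypoint_frame - first_keyframe


# Hypothetical detections; "id" and "score" are illustrative field names.
detections = [
    {"id": "a", "score": 0.9},
    {"id": "b", "score": 0.4},
    {"id": "c", "score": 0.7},
]
top = select_top_n(detections, 2)
print([d["id"] for d in top])  # ['a', 'c']

# A toy 8-point contour reduced to 4 object keypoints.
contour = [(i, 0) for i in range(8)]
sampled = subsample_contour(contour, 4)
print(sampled)  # [(0, 0), (2, 0), (4, 0), (6, 0)]

print(position_token(130, 70, stride=32, grid_width=10))  # 24
print(timestamp_token(12, 9))  # 3
```

The integer outputs of `position_token` and `timestamp_token` are the kind of tokens that a lookup-table embedding could consume.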

Assignees

Inventors

Classifications

  • Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components · CPC title

  • for active traffic, e.g. moving vehicles, pedestrians, bikes · CPC title

  • Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads · CPC title

  • using neural networks · CPC title

  • Obstacle · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12198397B2 cover?
A computer-implemented method is provided for action localization. The method includes converting one or more video frames into person keypoints and object keypoints. The method further includes embedding position, timestamp, instance, and type information with the person keypoints and object keypoints to obtain keypoint embeddings. The method also includes predicting, by a hierarchical transfo…
Who is the assignee on this patent?
NEC Lab America Inc, NEC Corp
What technology area does this patent fall under?
Primary CPC classification G06V10/26. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 14, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).