Who is the assignee on this patent?

Verdie Yannick, Yang Zihao, Sridhar Deepak, and 3 more

What technology area does this patent fall under?

Primary CPC classification G06F3/017. Mapped technology areas include Physics.

When was this patent published?

Publication date: Nov 19, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Methods and systems for 3D hand pose estimation from RGB images

US12148097B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12148097-B2
Application number	US-202218078832-A
Country	US
Kind code	B2
Filing date	Dec 9, 2022
Priority date	Dec 9, 2022
Publication date	Nov 19, 2024
Grant date	Nov 19, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for estimation of a 3D hand pose are disclosed. A 2D image containing a detected hand is processed using a U-net network to obtain a global feature vector and a heatmap for the keypoints of the hand. Information from the global feature vector and the heatmap are concatenated to obtain a set of input tokens that are processed using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first view. The first set of 2D keypoints are inputted as a query to a transformer decoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second view. The first and second sets of 2D keypoints are aggregated to output the set of estimated 3D keypoints.
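The abstract does not say how the two sets of 2D keypoints are aggregated into 3D keypoints. As a rough illustration only, the sketch below assumes the first view lies in the x-y plane and the second view in the z-y plane, so the two views share the y axis. The function name `aggregate_views` and this axis convention are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def aggregate_views(first_view: List[Point2D],
                    second_view: List[Point2D]) -> List[Point3D]:
    """Combine per-keypoint 2D estimates from two views into 3D points.

    ASSUMPTION (not from the patent): the first view holds (x, y) and the
    second view holds (z, y), i.e. the views are orthogonal and share y.
    """
    if len(first_view) != len(second_view):
        raise ValueError("both views must contain the same number of keypoints")

    keypoints_3d: List[Point3D] = []
    for (x, y_first), (z, y_second) in zip(first_view, second_view):
        # The shared axis is estimated twice, so average the two estimates.
        y = 0.5 * (y_first + y_second)
        keypoints_3d.append((x, y, z))
    return keypoints_3d
```

Under this assumed convention, each view contributes one unique coordinate and the shared coordinate is averaged. A real system could use a different geometry or a learned aggregation.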

First claim

Opening claim text (preview).

The invention claimed is:

1. A computing system comprising: a processing unit configured to execute instructions to cause the computing system to estimate a set of 3D keypoints representing a 3D hand pose by: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second 2D view; and aggregating the first and second sets of 2D keypoints to output the set of estimated 3D keypoints.

2. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: estimating a set of one or more bone lengths from the first set of 2D keypoints; processing the first set of 2D keypoints and the estimated set of one or more bone lengths using respective one-layer feedforward networks, to obtain a plurality of hand class estimations representing a handedness of the detected hand; and aggregating the plurality of hand class estimations to obtain an estimated hand class representing an estimated handedness of the detected hand.

3. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: rendering a 3D model representation of the detected hand by mapping the set of estimated 3D keypoints to a 3D mesh.

4. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: performing gesture recognition by processing the set of estimated 3D keypoints using a gesture recognition software module.

5. The computing system of claim 1, wherein the first set of 2D keypoints represent estimated 2D locations of the keypoints in the first 2D view corresponding to a view of the detected hand captured in the 2D image, and wherein the second set of 2D keypoints represent estimated 2D locations of the keypoints in the second 2D view corresponding to a different view of the detected hand not captured in the 2D image.

6. The computing system of claim 1, wherein concatenating information from the global feature vector and the heatmap to obtain a set of input tokens comprises: processing the global feature vector using a feedforward network to obtain a regressor representing a 2D global estimation of each respective keypoint; processing the heatmap using a spatial softmax layer to obtain a 2D heatmap estimation of each respective keypoint; and concatenating the global feature vector with the 2D global estimation and the 2D heatmap estimation to obtain the input token for each respective keypoint.

7. The computing system of claim 1, wherein the U-net network, the transformer encoder and the transformer decoder are trained together end-to-end, using a training dataset comprising 2D images annotated with keypoints corresponding to joints of a hand and without annotation of vertices corresponding to a 3D mesh of the hand.

8. The computing system of claim 1, wherein the computing system is one of: a mobile device; a smart appliance; an Internet of Things (IOT) device; an augmented reality (AR) device; or a virtual reality (VR) device.

9. A computer-implemented method for estimating a set of 3D keypoints representing a 3D hand pose comprising: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second 2D view; and aggregating the first and second sets of 2D keypoints to output the set of estimated 3D keypoints.

10. The method of claim 9, further comprising: estimating a set of one or more bone lengths from the first set of 2D keypoints; processing the first set of 2D keypoints and the estimated set of one or more bone lengths using respective one-layer feedforward networks, to obtain a plurality of hand class estimations representing a handedness of the detected hand; and aggregating the plurality of hand class estimations to obtain an estimated hand class representing an estimated handedness of the detected hand.

11. The method of claim 9, further comprising: rendering a 3D model representation of the detected hand by mapping the set of estimated 3D keypoints to a 3D mesh.

12. The method of claim 9, further comprising: performing gesture recognition by processing the set of estimated 3D keypoints using a gesture recognition software module.

13. The method of claim 9, wherein the first set of 2D keypoints represent estimated 2D locations of the keypoints in the first 2D view corresponding to a view of the detected hand captured in the 2D image, and wherein the second set of 2D keypoints represent estimated 2D locations of the keypoints in the second 2D view corresponding to a different view of the detected hand not captured in the 2D image.

14. The method of claim 9, wherein concatenating information from the global feature vector and the heatmap to obtain a set of input tokens comprises: processing the global feature vector using a feedforward network to obtain a regressor representing a 2D global estimation of each respective keypoint; processing the heatmap using a spatial softmax layer to obtain a 2D heatmap estimation of each respective keypoint; and concatenating the global feature vector with the 2D global estimation and the 2D heatmap estimation to obtain the input token for each respective keypoint.

15. The method of claim 9, wherein the U-net network, the transformer encoder and the transformer decoder are trained together end-to-end, using a training dataset comprising 2D images annotated with keypoints corresponding to joints of a hand and without annotation of vertices corresponding to a 3D mesh of the hand.

16. A non-transitory computer-readable medium having instructions encoded thereon, wherein the instructions are executable by a processing unit of a computing system to cause the computing system to perform operations of: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimat
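Two of the claimed steps are simple enough to sketch without a neural-network library: the spatial softmax that turns a keypoint heatmap into a 2D location (claims 6 and 14), and the bone-length estimate computed from 2D keypoints (claims 2 and 10). The following is a minimal sketch, not the patented implementation. The function names, the plain nested-list heatmap layout, and the `bones` index pairs are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def spatial_soft_argmax(heatmap: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return the softmax-weighted mean (x, y) pixel location of one heatmap.

    ASSUMPTION: the heatmap is a row-major grid of raw scores (rows index y).
    """
    # Subtract the peak score before exponentiating for numerical stability.
    peak = max(max(row) for row in heatmap)
    total = x_sum = y_sum = 0.0
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            weight = math.exp(score - peak)
            total += weight
            x_sum += weight * x
            y_sum += weight * y
    return x_sum / total, y_sum / total


def bone_lengths(keypoints: Sequence[Tuple[float, float]],
                 bones: Sequence[Tuple[int, int]]) -> List[float]:
    """Euclidean length of each bone, given as (start, end) keypoint indices.

    ASSUMPTION: the caller supplies the skeleton; the patent preview does not
    list which keypoints are connected.
    """
    return [math.dist(keypoints[a], keypoints[b]) for a, b in bones]
```

A uniform heatmap yields the grid centre and a sharply peaked heatmap yields (approximately) the peak location. In the claimed system the resulting 2D estimate is concatenated with other features to form an input token, and the bone lengths feed a handedness classifier. Neither of those later steps is modelled here.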

Assignees

Inventors

Classifications

G06V40/28
Recognition of hand or arm movements, e.g. recognition of deaf sign language (static hand signs G06V40/113) · CPC title
G06F3/017 · Primary
Gesture based interaction, e.g. based on a set of recognized hand gestures (interaction based on gestures traced on a digitiser G06F3/04883) · CPC title
G06T17/20 · Primary
Finite element generation, e.g. wire-frame surface description, {tesselation} · CPC title

Patent family

Related publications grouped by family.

View patent family 91381067

External sources


Related patents

Methods And Apparatus For Machine Learning To Analyze Musculo-Skeletal Rehabilitation From Images

Hand key point recognition model training method, hand key point recognition method and device

System and method to predict, prevent, and mitigate workplace injuries