Methods And Apparatus For Machine Learning To Analyze Musculo-Skeletal Rehabilitation From Images
US-2022386942-A1 · Dec 8, 2022 · US
US12148097B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12148097-B2 |
| Application number | US-202218078832-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 9, 2022 |
| Priority date | Dec 9, 2022 |
| Publication date | Nov 19, 2024 |
| Grant date | Nov 19, 2024 |
Official abstract text for this publication.
Methods and systems for estimation of a 3D hand pose are disclosed. A 2D image containing a detected hand is processed using a U-net network to obtain a global feature vector and a heatmap for the keypoints of the hand. Information from the global feature vector and the heatmap are concatenated to obtain a set of input tokens that are processed using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first view. The first set of 2D keypoints are inputted as a query to a transformer decoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second view. The first and second sets of 2D keypoints are aggregated to output the set of estimated 3D keypoints.
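The final aggregation step can be illustrated with a minimal sketch. The text shown here does not say how the two 2D views are combined, so this example assumes, purely for illustration, that the first view gives (x, y) and the second view is an orthogonal view giving (z, y), with the shared y coordinate averaged. The function name `aggregate_views` and the sample values are hypothetical.

```python
import numpy as np

def aggregate_views(first_xy: np.ndarray, second_zy: np.ndarray) -> np.ndarray:
    """Combine two 2D keypoint sets into one 3D keypoint set.

    ASSUMPTION (not from the patent text): the first view holds (x, y) and
    the second view holds (z, y) for the same K keypoints.
    """
    if first_xy.shape != second_zy.shape or first_xy.shape[-1] != 2:
        raise ValueError("expected two arrays of shape (K, 2)")
    x = first_xy[:, 0]
    z = second_zy[:, 0]
    # Both views observe y, so average the two estimates.
    y = 0.5 * (first_xy[:, 1] + second_zy[:, 1])
    return np.stack([x, y, z], axis=1)

# Two keypoints, made-up normalized coordinates.
first_view = np.array([[0.1, 0.5], [0.3, 0.7]])
second_view = np.array([[0.9, 0.5], [0.8, 0.9]])
keypoints_3d = aggregate_views(first_view, second_view)
print(keypoints_3d.shape)  # (2, 3)
```

Other aggregation schemes (for example, a learned layer over both sets) would fit the abstract's wording equally well; the averaging here is only the simplest choice.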
Opening claim text (preview).
The invention claimed is:
1. A computing system comprising: a processing unit configured to execute instructions to cause the computing system to estimate a set of 3D keypoints representing a 3D hand pose by: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second 2D view; and aggregating the first and second sets of 2D keypoints to output the set of estimated 3D keypoints.
2. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: estimating a set of one or more bone lengths from the first set of 2D keypoints; processing the first set of 2D keypoints and the estimated set of one or more bone lengths using respective one-layer feedforward networks, to obtain a plurality of hand class estimations representing a handedness of the detected hand; and aggregating the plurality of hand class estimations to obtain an estimated hand class representing an estimated handedness of the detected hand.
3. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: rendering a 3D model representation of the detected hand by mapping the set of estimated 3D keypoints to a 3D mesh.
4. The computing system of claim 1, wherein the processing unit is further configured to execute instructions to cause the computing system to perform operations of: performing gesture recognition by processing the set of estimated 3D keypoints using a gesture recognition software module.
5. The computing system of claim 1, wherein the first set of 2D keypoints represent estimated 2D locations of the keypoints in the first 2D view corresponding to a view of the detected hand captured in the 2D image, and wherein the second set of 2D keypoints represent estimated 2D locations of the keypoints in the second 2D view corresponding to a different view of the detected hand not captured in the 2D image.
6. The computing system of claim 1, wherein concatenating information from the global feature vector and the heatmap to obtain a set of input tokens comprises: processing the global feature vector using a feedforward network to obtain a regressor representing a 2D global estimation of each respective keypoint; processing the heatmap using a spatial softmax layer to obtain a 2D heatmap estimation of each respective keypoint; and concatenating the global feature vector with the 2D global estimation and the 2D heatmap estimation to obtain the input token for each respective keypoint.
7. The computing system of claim 1, wherein the U-net network, the transformer encoder and the transformer decoder are trained together end-to-end, using a training dataset comprising 2D images annotated with keypoints corresponding to joints of a hand and without annotation of vertices corresponding to a 3D mesh of the hand.
8. The computing system of claim 1, wherein the computing system is one of: a mobile device; a smart appliance; an Internet of Things (IOT) device; an augmented reality (AR) device; or a virtual reality (VR) device.
9. A computer-implemented method for estimating a set of 3D keypoints representing a 3D hand pose comprising: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimated 2D locations of the keypoints in a second 2D view; and aggregating the first and second sets of 2D keypoints to output the set of estimated 3D keypoints.
10. The method of claim 9, further comprising: estimating a set of one or more bone lengths from the first set of 2D keypoints; processing the first set of 2D keypoints and the estimated set of one or more bone lengths using respective one-layer feedforward networks, to obtain a plurality of hand class estimations representing a handedness of the detected hand; and aggregating the plurality of hand class estimations to obtain an estimated hand class representing an estimated handedness of the detected hand.
11. The method of claim 9, further comprising: rendering a 3D model representation of the detected hand by mapping the set of estimated 3D keypoints to a 3D mesh.
12. The method of claim 9, further comprising: performing gesture recognition by processing the set of estimated 3D keypoints using a gesture recognition software module.
13. The method of claim 9, wherein the first set of 2D keypoints represent estimated 2D locations of the keypoints in the first 2D view corresponding to a view of the detected hand captured in the 2D image, and wherein the second set of 2D keypoints represent estimated 2D locations of the keypoints in the second 2D view corresponding to a different view of the detected hand not captured in the 2D image.
14. The method of claim 9, wherein concatenating information from the global feature vector and the heatmap to obtain a set of input tokens comprises: processing the global feature vector using a feedforward network to obtain a regressor representing a 2D global estimation of each respective keypoint; processing the heatmap using a spatial softmax layer to obtain a 2D heatmap estimation of each respective keypoint; and concatenating the global feature vector with the 2D global estimation and the 2D heatmap estimation to obtain the input token for each respective keypoint.
15. The method of claim 9, wherein the U-net network, the transformer encoder and the transformer decoder are trained together end-to-end, using a training dataset comprising 2D images annotated with keypoints corresponding to joints of a hand and without annotation of vertices corresponding to a 3D mesh of the hand.
16. A non-transitory computer-readable medium having instructions encoded thereon, wherein the instructions are executable by a processing unit of a computing system to cause the computing system to perform operations of: processing a 2D image containing a detected hand using a U-net network to obtain a global feature vector and a heatmap for each of the keypoints; concatenating information from the global feature vector and the heatmap to obtain a set of input tokens; processing the input tokens using a transformer encoder to obtain a first set of 2D keypoints representing estimated 2D locations of the keypoints in a first 2D view; inputting the first set of 2D keypoints as a query to a transformer decoder, with cross-attention from the transformer encoder, to obtain a second set of 2D keypoints representing estimat
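Claims 6 and 14 describe building one input token per keypoint from the global feature vector, a 2D global estimation, and a 2D heatmap estimation obtained with a spatial softmax. The sketch below illustrates that step in NumPy under stated assumptions: the spatial softmax is implemented as a soft-argmax (expected pixel location), the global feature vector is repeated for every keypoint, and the feedforward regressor output is replaced by placeholder values. The names `spatial_softmax_2d` and `build_tokens`, and all sizes, are hypothetical.

```python
import numpy as np

def spatial_softmax_2d(heatmaps: np.ndarray) -> np.ndarray:
    """Soft-argmax: expected (x, y) under a softmax over each heatmap.

    heatmaps: (K, H, W) raw scores, one map per keypoint.
    returns:  (K, 2) pixel coordinates, x along width and y along height.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(k, h, w)
    # Marginalize over rows to get p(x), over columns to get p(y).
    x = (probs.sum(axis=1) * np.arange(w)).sum(axis=1)
    y = (probs.sum(axis=2) * np.arange(h)).sum(axis=1)
    return np.stack([x, y], axis=1)

def build_tokens(global_feature: np.ndarray,
                 global_xy: np.ndarray,
                 heatmap_xy: np.ndarray) -> np.ndarray:
    """Concatenate [global feature | global (x, y) | heatmap (x, y)] per keypoint.

    ASSUMPTION: the same global feature vector is repeated for each keypoint.
    """
    k = global_xy.shape[0]
    tiled = np.broadcast_to(global_feature, (k, global_feature.shape[0]))
    return np.concatenate([tiled, global_xy, heatmap_xy], axis=1)

# Two keypoints on a 4x5 grid, each with one sharp peak.
heatmaps = np.zeros((2, 4, 5))
heatmaps[0, 1, 3] = 50.0  # peak at x=3, y=1
heatmaps[1, 2, 0] = 50.0  # peak at x=0, y=2
heatmap_xy = spatial_softmax_2d(heatmaps)

global_feature = np.linspace(0.0, 1.0, 8)        # stand-in for the U-net feature
global_xy = np.array([[2.8, 1.1], [0.2, 2.1]])   # stand-in for the regressor output
tokens = build_tokens(global_feature, global_xy, heatmap_xy)
print(tokens.shape)  # (2, 12)
```

The soft-argmax keeps the coordinate estimate differentiable, which is consistent with the end-to-end training recited in claims 7 and 15, although the claims do not name this specific formulation.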
Recognition of hand or arm movements, e.g. recognition of deaf sign language (static hand signs G06V40/113) · CPC title
Gesture based interaction, e.g. based on a set of recognized hand gestures (interaction based on gestures traced on a digitiser G06F3/04883) · CPC title
Finite element generation, e.g. wire-frame surface description, {tesselation} · CPC title