Method and system for obtaining joint positions, and method and system for motion capture
US-2022108468-A1 · Apr 7, 2022 · US
US12033352B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12033352-B2 |
| Application number | US-202117537805-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 30, 2021 |
| Priority date | Dec 3, 2020 |
| Publication date | Jul 9, 2024 |
| Grant date | Jul 9, 2024 |
Official abstract text for this publication.
The present disclosure provides methods and systems that solve the technical problem of building an efficient, accurate, and lightweight 3-dimensional (3-D) pose estimation framework, which estimates the 3-D pose of an object present in an image for 3-D model registration using deep learning, by training a composite network model with both shape features and image features of the object. The composite network model includes a graph neural network (GNN) for capturing the shape features of the object and a convolution neural network (CNN) for capturing the image features of the object. The GNN uses local neighbourhood information through the image features of the object while maintaining the global shape property through the shape features of the object, to estimate the 3-D pose of the object.
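The fusion idea in the abstract can be illustrated with a minimal sketch, under loud assumptions: the two feature vectors below are hand-written stand-ins for the GNN and CNN outputs, a single linear layer stands in for the fully connected network, mean squared error stands in for the unspecified loss function, and the three target values are hypothetical pose values. None of the names (`linear`, `mse`, `train_step`) or numbers come from the patent.

```python
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mse(pred, target):
    """Mean squared error between predicted and reference pose values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_step(weights, bias, shape_feat, image_feat, target_pose, lr=0.05):
    """One gradient-descent step on the fused feature vector.

    Assumption: a single linear layer replaces the multi-block FCN.
    Returns the loss measured before the weight update.
    """
    # Concatenate the shape and image feature vectors.
    fused = shape_feat + image_feat
    pred = linear(weights, bias, fused)
    loss = mse(pred, target_pose)
    n = len(pred)
    for k in range(n):
        # Derivative of the MSE with respect to output k.
        grad_out = 2.0 * (pred[k] - target_pose[k]) / n
        for j in range(len(fused)):
            weights[k][j] -= lr * grad_out * fused[j]
        bias[k] -= lr * grad_out
    return loss


random.seed(0)
shape_feat = [0.2, -0.1, 0.4]    # hypothetical stand-in for the GNN output
image_feat = [0.5, 0.3]          # hypothetical stand-in for the CNN output
target_pose = [0.1, -0.2, 0.3]   # hypothetical reference pose values
weights = [[random.uniform(-0.1, 0.1) for _ in range(5)] for _ in range(3)]
bias = [0.0, 0.0, 0.0]

losses = [train_step(weights, bias, shape_feat, image_feat, target_pose)
          for _ in range(200)]
print(losses[0] > losses[-1])
```

In a real system the stand-in vectors would be produced by trainable GNN and CNN branches, and the gradient would also flow back into those branches so that the whole composite model is trained end to end.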
Opening claim text (preview).
What is claimed is:
1. A processor-implemented method comprising the steps of: receiving, via one or more hardware processors, (i) an RGB image, (ii) a three-dimensional (3-D) model, and (iii) 3-D pose values, of each object of a plurality of objects, from an image repository; generating, via the one or more hardware processors, a skeletal graph of each object of the plurality of objects, from the 3-D model associated with the object, using a transformation algorithm; generating, via the one or more hardware processors, an end-to-end model to estimate the 3-D pose of the object, by training a composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object at a time of the plurality of objects, wherein the composite network model comprises a graph neural network (GNN), a convolution neural network (CNN), and a fully connected network (FCN), and wherein training the composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object comprises: passing the skeletal graph of each object to the GNN, to generate a shape feature vector of each object; passing the RGB image of each object to the CNN, to generate an object feature vector of each object; concatenating the shape feature vector and the object feature vector of each object, to obtain a shape-image feature vector of each object; passing the shape-image feature vector of each object to the FCN, to generate a predicted pose feature vector of each object, wherein the predicted pose feature vector of each object comprises predicted 3-D pose values corresponding to the object; minimizing a loss function, wherein the loss function defines a loss between the predicted 3-D pose values, and the 3-D pose values of the object; and updating weights of the composite network model, based on the loss function.
2. The method of claim 1, further comprising: receiving, via the one or more hardware processors, (i) an input RGB image, and (ii) an input 3-D model, of an input object, whose 3-D pose is to be estimated; generating, via the one or more hardware processors, an input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm; and passing, via the one or more hardware processors, (i) the input RGB image, and (ii) the input skeletal graph of the input object, to the end-to-end model, to estimate an input pose feature vector of the input object, wherein the input pose feature vector comprises input 3-D pose values that defines the 3-D pose of the input object.
3. The method of claim 2, wherein generating the input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm, comprises: voxelizing the input 3-D model of the input object to obtain an input skeleton model of the input object, using a voxelization function of the transformation algorithm, wherein the input skeleton model of the input object comprises one or more one-dimensional input skeleton voxels associated with the input object; and transforming the input skeleton model of the input object to generate the input skeletal graph of the input object, using a skeleton-to-graph transformation function of the transformation algorithm.
4. The method of claim 1, wherein generating the skeletal graph of each object of the plurality of objects, from the 3-D model associated with the object, using the transformation algorithm, comprises: voxelizing the 3-D model associated with the object to obtain a skeleton model of the object, using a voxelization function of the transformation algorithm, wherein the skeleton model of the object comprises one or more one-dimensional skeleton voxels associated with the object; and transforming the skeleton model of the object to generate the skeletal graph of the corresponding object, using a skeleton-to-graph transformation function of the transformation algorithm.
5. The method of claim 1, wherein the graph neural network (GNN) works as a message passing network, and comprises 3 edge convolutional blocks, a sum pooling layer, and a graph encoder, each edge convolutional block comprises a edge convolution layer followed by three neural blocks, wherein each neural block comprises a fully connected layer, a batch normalization layer, and a Rectified Linear Unit (ReLU) layer.
6. The method of claim 1, wherein the convolution neural network (CNN) comprises a set of convolution layers, an average pooling layer and a CNN fully connected layer.
7. The method of claim 1, wherein the fully connected network (FCN) comprises three neural blocks, wherein each neural block comprises a fully connected layer, a batch normalization layer, and a Rectified Linear Unit (ReLU) layer.
8. A system comprising: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: receive (i) an RGB image, (ii) a three-dimensional (3-D) model, and (iii) 3-D pose values, of each object of a plurality of objects, from an image repository; generate a skeletal graph of each object of the plurality of objects, from the 3-D model associated with the object, using a transformation algorithm; generate an end-to-end model to estimate the 3-D pose of the object, by training a composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object at a time of the plurality of objects, wherein the composite network model comprises a graph neural network (GNN), a convolution neural network (CNN), and a fully connected network (FCN), and wherein training the composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object comprises: passing the skeletal graph of each object to the GNN, to generate a shape feature vector of each object; passing the RGB image of each object to the CNN, to generate an object feature vector of each object; concatenating the shape feature vector and the object feature vector of each object, to obtain a shape-image feature vector of each object; passing the shape-image feature vector of each object to the FCN, to generate a predicted pose feature vector of each object, wherein the predicted pose feature vector of each object comprises predicted 3-D pose values corresponding to the object; minimizing a loss function, wherein the loss function defines a loss between the predicted 3-D pose values, and the 3-D pose values of the object; and updating weights of the composite network model, based on the loss function.
9. The system of claim 8, wherein the one or more hardware processors are further configured to: receive (i) an input RGB image, and (ii) an input 3-D model, of an input object, whose 3-D pose is to be estimated; generate an input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm; and pass (i) the input RGB image, and (ii) the input skeletal graph of the input object, to the end-to-end model, to estimate an input pose feature vector of the input object, wherein the input pose feature vector comprises input 3-D pose values that defines the 3-D pose of the input object.
10. The system of claim 9, wherein the one or more hardware processors are configured to generate the input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm, by: voxelizing the input 3-D model of the input object to obtain an input skeleton model of the input object, using a voxelization function of the transformation algorithm, whe
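The skeleton-to-graph transformation named in the claims can be pictured with a small sketch. This excerpt does not say how the graph is built from the skeleton voxels, so the rule used here is an assumption: every occupied skeleton voxel becomes a node, and two nodes share an edge when their voxels are 26-neighbours (they differ by at most one step along each axis). The function name `skeleton_to_graph` and the sample coordinates are hypothetical, and the voxelization step is taken as already done.

```python
from itertools import product


def skeleton_to_graph(voxels):
    """Convert occupied skeleton voxels into (nodes, edges).

    Assumption (not taken from the patent text): voxels are integer
    (x, y, z) coordinates, and two voxels are linked when they are
    26-neighbours. Edges are undirected index pairs (i, j) with i < j.
    """
    nodes = sorted(set(voxels))
    index = {voxel: i for i, voxel in enumerate(nodes)}
    # All 26 neighbour offsets in a 3x3x3 block, excluding the centre.
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    edges = set()
    for (x, y, z), i in index.items():
        for dx, dy, dz in offsets:
            j = index.get((x + dx, y + dy, z + dz))
            if j is not None and i < j:
                edges.add((i, j))
    return nodes, sorted(edges)


# A hypothetical Y-shaped skeleton: a short stem that forks into two tips.
skeleton = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 1, 0), (3, -1, 0)]
nodes, edges = skeleton_to_graph(skeleton)
print(nodes)
print(edges)
```

The node coordinates can serve as per-node input features for a graph neural network, and the edge list defines which neighbours exchange messages. A production pipeline might also merge chains of voxels into single edges between junctions, which this sketch does not attempt.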
Artificial neural networks [ANN] · CPC title
Training; Learning · CPC title
Color image · CPC title
from multiple images · CPC title
Graphical representations · CPC title