Methods and systems for generating end-to-end model to estimate 3-dimensional(3-D) pose of object

US12033352B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12033352-B2
Application number  US-202117537805-A
Country             US
Kind code           B2
Filing date         Nov 30, 2021
Priority date       Dec 3, 2020
Publication date    Jul 9, 2024
Grant date          Jul 9, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure herein provides methods and systems that solves the technical problems of generating an efficient, accurate and light-weight 3-Dimensional (3-D) pose estimation framework for estimating the 3-D pose of an object present in an image used for the 3-dimensional (3D) model registration using deep learning, by training a composite network model with both shape features and image features of the object. The composite network model includes a graph neural network (GNN) for capturing the shape features of the object and a convolution neural network (CNN) for capturing the image features of the object. The graph neural network (GNN) utilizes the local neighbourhood information through the image features of the object and at the same time maintaining global shape property through the shape features of the object, to estimate the 3-D pose of the object.
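
As a rough illustration of the training idea summarized in the abstract, the sketch below fuses a shape feature vector with an image feature vector, predicts pose values, and updates weights from a loss. It is a minimal sketch under stated assumptions, not the patented implementation: the two feature vectors are hand-written stand-ins for GNN and CNN outputs, a single linear layer replaces the fully connected network, and mean squared error with plain gradient descent is assumed because the text here names only "a loss function". All function names, vector sizes, and the learning rate are hypothetical.

```python
from typing import List

Vector = List[float]


def concatenate(shape_vec: Vector, image_vec: Vector) -> Vector:
    """Fuse the two feature vectors into one shape-image feature vector."""
    return shape_vec + image_vec


def predict_pose(weights: List[Vector], bias: Vector, fused: Vector) -> Vector:
    """Single linear layer standing in for the fully connected network (assumption)."""
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]


def mse_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error between predicted and ground-truth pose values (assumption)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def train_step(weights: List[Vector], bias: Vector, fused: Vector,
               target: Vector, lr: float = 0.05) -> float:
    """Apply one gradient-descent update in place; return the loss before the update."""
    predicted = predict_pose(weights, bias, fused)
    loss = mse_loss(predicted, target)
    n = len(target)
    for k in range(n):
        # Derivative of the MSE with respect to output k.
        grad_out = 2.0 * (predicted[k] - target[k]) / n
        for j in range(len(fused)):
            weights[k][j] -= lr * grad_out * fused[j]
        bias[k] -= lr * grad_out
    return loss


# Hypothetical stand-ins for the GNN and CNN outputs of one object.
shape_vec = [0.2, -0.1, 0.4]
image_vec = [0.5, 0.3]
target_pose = [0.1, -0.2, 0.3]  # hypothetical ground-truth pose values

fused = concatenate(shape_vec, image_vec)
weights = [[0.0] * len(fused) for _ in target_pose]
bias = [0.0] * len(target_pose)
losses = [train_step(weights, bias, fused, target_pose) for _ in range(200)]
```

In a real system the gradient would also flow back through the GNN and CNN so that all three parts are trained end to end; the sketch updates only the final layer to keep the arithmetic visible.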

First claim

Opening claim text (preview).

What is claimed is:

1. A processor-implemented method comprising the steps of: receiving, via one or more hardware processors, (i) an RGB image, (ii) a three-dimensional (3-D) model, and (iii) 3-D pose values, of each object of a plurality of objects, from an image repository; generating, via the one or more hardware processors, a skeletal graph of each object of the plurality of objects, from the 3-D model associated with the object, using a transformation algorithm; generating, via the one or more hardware processors, an end-to-end model to estimate the 3-D pose of the object, by training a composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object at a time of the plurality of objects, wherein the composite network model comprises a graph neural network (GNN), a convolution neural network (CNN), and a fully connected network (FCN), and wherein training the composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object comprises: passing the skeletal graph of each object to the GNN, to generate a shape feature vector of each object; passing the RGB image of each object to the CNN, to generate an object feature vector of each object; concatenating the shape feature vector and the object feature vector of each object, to obtain a shape-image feature vector of each object; passing the shape-image feature vector of each object to the FCN, to generate a predicted pose feature vector of each object, wherein the predicted pose feature vector of each object comprises predicted 3-D pose values corresponding to the object; minimizing a loss function, wherein the loss function defines a loss between the predicted 3-D pose values, and the 3-D pose values of the object; and updating weights of the composite network model, based on the loss function.

2. The method of claim 1, further comprising: receiving, via the one or more hardware processors, (i) an input RGB image, and (ii) an input 3-D model, of an input object, whose 3-D pose is to be estimated; generating, via the one or more hardware processors, an input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm; and passing, via the one or more hardware processors, (i) the input RGB image, and (ii) the input skeletal graph of the input object, to the end-to-end model, to estimate an input pose feature vector of the input object, wherein the input pose feature vector comprises input 3-D pose values that defines the 3-D pose of the input object.

3. The method of claim 2, wherein generating the input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm, comprises: voxelizing the input 3-D model of the input object to obtain an input skeleton model of the input object, using a voxelization function of the transformation algorithm, wherein the input skeleton model of the input object comprises one or more one-dimensional input skeleton voxels associated with the input object; and transforming the input skeleton model of the input object to generate the input skeletal graph of the input object, using a skeleton-to-graph transformation function of the transformation algorithm.

4. The method of claim 1, wherein generating the skeletal graph of each object of the plurality of objects, from the 3-D model associated with the object, using the transformation algorithm, comprises: voxelizing the 3-D model associated with the object to obtain a skeleton model of the object, using a voxelization function of the transformation algorithm, wherein the skeleton model of the object comprises one or more one-dimensional skeleton voxels associated with the object; and transforming the skeleton model of the object to generate the skeletal graph of the corresponding object, using a skeleton-to-graph transformation function of the transformation algorithm.

5. The method of claim 1, wherein the graph neural network (GNN) works as a message passing network, and comprises 3 edge convolutional blocks, a sum pooling layer, and a graph encoder, each edge convolutional block comprises a edge convolution layer followed by three neural blocks, wherein each neural block comprises a fully connected layer, a batch normalization layer, and a Rectified Linear Unit (ReLU) layer.

6. The method of claim 1, wherein the convolution neural network (CNN) comprises a set of convolution layers, an average pooling layer and a CNN fully connected layer.

7. The method of claim 1, wherein the fully connected network (FCN) comprises three neural blocks, wherein each neural block comprises a fully connected layer, a batch normalization layer, and a Rectified Linear Unit (ReLU) layer.

8. A system comprising: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: receive (i) an RGB image, (ii) a three-dimensional (3-D) model, and (iii) 3-D pose values, of each object of a plurality of objects, from an image repository; generate a skeletal graph of each object of the plurality of objects, from the 3-D model associated with the object, using a transformation algorithm; generate an end-to-end model to estimate the 3-D pose of the object, by training a composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object at a time of the plurality of objects, wherein the composite network model comprises a graph neural network (GNN), a convolution neural network (CNN), and a fully connected network (FCN), and wherein training the composite network model with (i) the RGB image, (ii) the skeletal graph, and (iii) the 3-D pose values, of each object comprises: passing the skeletal graph of each object to the GNN, to generate a shape feature vector of each object; passing the RGB image of each object to the CNN, to generate an object feature vector of each object; concatenating the shape feature vector and the object feature vector of each object, to obtain a shape-image feature vector of each object; passing the shape-image feature vector of each object to the FCN, to generate a predicted pose feature vector of each object, wherein the predicted pose feature vector of each object comprises predicted 3-D pose values corresponding to the object; minimizing a loss function, wherein the loss function defines a loss between the predicted 3-D pose values, and the 3-D pose values of the object; and updating weights of the composite network model, based on the loss function.

9. The system of claim 8, wherein the one or more hardware processors are further configured to: receive (i) an input RGB image, and (ii) an input 3-D model, of an input object, whose 3-D pose is to be estimated; generate an input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm; and pass (i) the input RGB image, and (ii) the input skeletal graph of the input object, to the end-to-end model, to estimate an input pose feature vector of the input object, wherein the input pose feature vector comprises input 3-D pose values that defines the 3-D pose of the input object.

10. The system of claim 9, wherein the one or more hardware processors are configured to generate the input skeletal graph of the input object, from the input 3-D model, using the transformation algorithm, by: voxelizing the input 3-D model of the input object to obtain an input skeleton model of the input object, using a voxelization function of the transformation algorithm, whe
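
The claims describe transforming a skeleton model made of one-dimensional skeleton voxels into a skeletal graph, but the preview does not say how the skeleton-to-graph function works. The sketch below shows one plausible reading, offered only as an assumption: each skeleton voxel becomes a node, and two nodes are joined by an edge when their voxels touch under 26-connectivity (sharing a face, edge, or corner). The function name `skeleton_to_graph`, the integer-coordinate voxel representation, and the connectivity rule are all hypothetical choices for illustration.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

Voxel = Tuple[int, int, int]

# The 26 offsets to voxels that share a face, edge, or corner with a given voxel.
NEIGHBOUR_OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def skeleton_to_graph(voxels: Iterable[Voxel]) -> Tuple[List[Voxel], Set[Tuple[int, int]]]:
    """Build an undirected graph from skeleton voxels (assumed rule: 26-connectivity).

    Returns the sorted node list and a set of edges (i, j) with i < j,
    where i and j index into the node list.
    """
    nodes = sorted(set(voxels))
    index = {voxel: i for i, voxel in enumerate(nodes)}
    edges: Set[Tuple[int, int]] = set()
    for (x, y, z), i in index.items():
        for dx, dy, dz in NEIGHBOUR_OFFSETS:
            j = index.get((x + dx, y + dy, z + dz))
            # Store each undirected edge once, with the smaller index first.
            if j is not None and i < j:
                edges.add((i, j))
    return nodes, edges


# A small Y-shaped skeleton: a straight stem that splits into two branches.
skeleton = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 1, 0), (3, -1, 0)]
nodes, edges = skeleton_to_graph(skeleton)
```

Here the stem voxel `(2, 0, 0)` ends up connected to both branch voxels, so the branching point of the skeleton is preserved as a degree-3 node in the graph. Node coordinates could then serve as per-node input features for a graph neural network.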

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12033352B2 cover?
The present disclosure herein provides methods and systems that solves the technical problems of generating an efficient, accurate and light-weight 3-Dimensional (3-D) pose estimation framework for estimating the 3-D pose of an object present in an image used for the 3-dimensional (3D) model registration using deep learning, by training a composite network model with both shape features and ima…
Who is the assignee on this patent?
Tata Consultancy Services Ltd
What technology area does this patent fall under?
Primary CPC classification G06T7/75. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 9, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).