Apparatus and method of hand gesture recognition based on depth image
US-2017068849-A1 · Mar 9, 2017 · US
US10429944B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10429944-B2 |
| Application number | US-201816020245-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 27, 2018 |
| Priority date | Oct 7, 2017 |
| Publication date | Oct 1, 2019 |
| Grant date | Oct 1, 2019 |
Official abstract text for this publication.
This disclosure relates generally to hand-gesture recognition, and more particularly to system and method for detecting interaction of 3D dynamic hand gestures with frugal AR devices. In one embodiment, a method for hand-gesture recognition includes receiving frames of a media stream of a scene captured from a FPV of a user using RGB sensor communicably coupled to a wearable AR device. The media stream includes RGB image data associated with the frames of the scene. The scene comprises a dynamic hand gesture performed by the user. Temporal information associated with the dynamic hand gesture is estimated from the RGB image data by using a deep learning model. The estimated temporal information is associated with hand poses of the user and comprises key-points identified on user's hand in the frames. Based on said temporal information, the dynamic hand gesture is classified into predefined gesture classes by using multi-layered LSTM classification network.
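The abstract describes per-frame hand key-points that together form the temporal input to the LSTM classifier. A minimal sketch of how such a sequence might be laid out follows, assuming 21 key-points per frame (as recited in claim 1) and three coordinates per key-point (claim 3 refers to a 3D coordinates sequence). The function name `frames_to_sequence`, the flattening order, and the placeholder data are illustrative assumptions, not details taken from the patent.

```python
NUM_KEYPOINTS = 21  # claim 1: four key-points per finger plus one near the wrist
COORDS = 3          # assumption: one (x, y, z) triple per key-point


def frames_to_sequence(frames):
    """Flatten each frame's key-points into one feature vector per time step."""
    sequence = []
    for keypoints in frames:
        if len(keypoints) != NUM_KEYPOINTS:
            raise ValueError("expected 21 key-points per frame")
        # Concatenate (x, y, z) of every key-point: 21 * 3 = 63 values.
        sequence.append([c for point in keypoints for c in point])
    return sequence


# Hypothetical 4-frame clip with placeholder coordinates (not real detections).
frames = [
    [(float(t), float(k), 0.0) for k in range(NUM_KEYPOINTS)]
    for t in range(4)
]
sequence = frames_to_sequence(frames)
```

Under these assumptions, each time step carries 63 values, and the number of time steps equals the number of frames in the clip.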
Opening claim text (preview).
What is claimed is:

1. A processor-implemented method for hand-gesture recognition, the method comprising: receiving, via one or more hardware processors, a plurality of frames of a media stream of a scene captured from a first person view (FPV) of a user using at least one RGB sensor communicably coupled to a wearable Augmented reality (AR) device, the media stream comprising RGB image data associated with the plurality of frames of the scene, the scene comprising a dynamic hand gesture performed by the user; estimating, via the one or more hardware processors, a temporal information associated with the dynamic hand gesture from the RGB image data by using a deep learning model, the estimated temporal information being associated with hand poses of the user and comprising a plurality of key-points identified on user's hand in the plurality of frames, wherein the plurality of key-points comprises twenty one hand key-points, and wherein each key-point of the twenty one key points comprises four key points per finger and one key-point close to wrist of the user's hand, and wherein estimating the temporal information associated with the dynamic hand gesture comprises: estimating, a plurality of network-implicit 3D articulation priors using the deep learning model, the plurality of network-implicit 3D articulation priors comprising a plurality of key-points determined from a plurality of training sample RGB images of user's hand; and detecting, based on the plurality of network-implicit 3D articulation priors, the plurality of key-points on the user's hand in the plurality of frames; and classifying, by using a multi-layered Long Short Term memory (LSTM) classification network, the dynamic hand gesture into at least one predefined gesture class based on the temporal information associated with the plurality of key points, via the one or more hardware processors.

2. The method of claim 1, further comprising downscaling the plurality of frames upon capturing the media stream.

3. The method of claim 1, wherein the multi-layered LSTM classification network comprises: a first layer comprising a LSTM layer consisting of a plurality of LSTM cells to learn long-term dependencies and patterns in a 3D coordinates sequence of the plurality of key-points detected on the user's hand; a second layer comprising a flattening layer that makes the temporal data one-dimensional; and a third layer comprising a fully connected layer with output scores corresponding to each of the dynamic hand gestures, the output scores indicative of posterior probability corresponding to the each of the dynamic hand gestures for classification in the at least one predefined gesture class.

4. The method of claim 3, further comprising testing the LSTM classification network for classifying the dynamic hand gesture from amongst the plurality of dynamic hand gestures, wherein testing the LSTM classification network comprises: interpreting, by using a softmax activation function, output scores as unnormalized log probabilities and squashing the output scores to be between 0 and 1 using the following equation: $\sigma(s)_j = \dfrac{e^{s_j}}{\sum_{k=0}^{K-1} e^{s_k}}$ where, $K$ denotes number of classes, $s$ is a $K \times 1$ vector of scores, an input to softmax function, and $j$ is an index varying from $0$ to $K-1$, and $\sigma(s)$ is $K \times 1$ output vector denoting the posterior probabilities associated with each of the plurality of dynamic hand gestures.

5. The method of claim 3, further comprising training the LSTM classification network, wherein training the LSTM classification network comprises: computing cross-entropy loss $L_i$ of $i$th training sample of the plurality of training sample RGB images by using following equation: $L_i = -h_j \cdot \log(\sigma(s)_j)$ where $h$ is a $1 \times K$ vector denoting one-hot label of input comprising the plurality of training sample RGB images; and computing a mean of $L_i$ over the plurality of training sample images and propagating back in the LSTM classification network to fine tune the LSTM classification network in the training.

6. The method of claim 1, wherein upon classifying the 3D dynamic hand gesture into the at least one predefined gesture class, communicating the classified at least one predefined gesture class to a at least one of a device embodying the at least one RGB sensor and the wearable AR device, and enabling the device to trigger a pre-defined task.

7. A system for hand-gesture recognition, the system comprising: one or more memories; and one or more hardware processors, the one or more memories coupled to the one or more hardware processors, wherein the one or more hardware processors are capable of executing programmed instructions stored in the one or more memories to: receive a plurality of frames of a media stream of a scene captured from a first person view (FPV) of a user using at least one RGB sensor communicably coupled to a wearable AR device, the media stream comprising RGB image data associated with the plurality of frames of the scene, the scene comprising a dynamic hand gesture performed by the user; estimate a temporal information associated with the dynamic hand gesture from the RGB image data by using a deep learning model, the estimated temporal information being associated with hand poses of the user and comprising a plurality of key-points identified on user's hand in the plurality of frames; wherein the plurality of key-points comprises twenty one hand key-points, and wherein each key-point of the twenty one key points comprises four key points per finger and one key-point close to wrist of the user's hand, and wherein estimating the temporal information associated with the dynamic hand gesture comprises: estimating, a plurality of network-implicit 3D articulation priors using the deep learning model, the plurality of network-implicit 3D articulation priors comprising a plurality of key-points determined from a plurality of training sample RGB images of user's hand; and detecting, based on the plurality of network-implicit 3D articulation priors, the plurality of key-points on the user's hand in the plurality of frames; and classify, by using a multi-layered LSTM classification network, the dynamic hand gesture into at least one predefined gesture class based on the temporal information associated with the plurality of key points.

8. The system of claim 7, wherein the one or more hardware processors are further configured by the instructions to downscale the plurality of frames upon capturing the media stream.

9. The system of claim 8, wherein the multi-layered LSTM classification network comprises: a first layer comprising a LSTM layer consisting of a plurality of LSTM cells to learn long-term dependencies and patterns
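
The softmax and cross-entropy expressions recited in claims 4 and 5 can be illustrated with a short sketch. It uses plain Python rather than any particular deep learning framework, and the function names, the max-subtraction step for numerical stability, and the example scores are assumptions made here for illustration. The loss is read as the one-hot-weighted log probability, which reduces to the negative log of the probability assigned to the true class.

```python
import math


def softmax(scores):
    """Map K raw scores to K values in (0, 1): exp(s_j) / sum_k exp(s_k)."""
    # Subtracting the maximum is a numerical-stability choice made in this
    # sketch; it does not change the result of the formula.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """L_i = -h_j * log(softmax(s)_j), with h one-hot at index `label`."""
    return -math.log(softmax(scores)[label])


def mean_loss(batch):
    """Mean of L_i over a batch of (scores, label) pairs."""
    return sum(cross_entropy(s, y) for s, y in batch) / len(batch)


# Hypothetical output scores for K = 3 gesture classes.
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = probs.index(max(probs))
```

The predicted gesture class is the index with the largest posterior probability. During training, the mean loss over the batch is the quantity that would be back-propagated through the network.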
using neural networks · CPC title
Gesture based interaction, e.g. based on a set of recognized hand gestures (interaction based on gestures traced on a digitiser G06F3/04883) · CPC title
Classification techniques · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Arrangements for interaction with the human body, e.g. for user immersion in virtual reality (blind teaching G09B21/00) · CPC title