Methods for real-time skill assessment of multi-step tasks performed by hand movements using a video camera
US-2020167715-A1 · May 28, 2020 · US
US11106949B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11106949-B2 |
| Application number | US-201916362530-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 22, 2019 |
| Priority date | Mar 22, 2019 |
| Publication date | Aug 31, 2021 |
| Grant date | Aug 31, 2021 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A computing device, including a processor configured to receive a first video including a plurality of frames. For each frame, the processor may determine that a target region of the frame includes a target object. The processor may determine a surrounding region within which the target region is located. The surrounding region may be smaller than the frame. The processor may identify one or more features located in the surrounding region. From the one or more features, the processor may generate one or more manipulated object identifiers. For each of a plurality of pairs of frames, the processor may determine a respective manipulated object movement between a first manipulated object identifier of the first frame and a second manipulated object identifier of the second frame. The processor may classify at least one action performed in the first video based on the plurality of manipulated object movements.
Opening claim text (preview).
The invention claimed is: 1. A computing device comprising: a processor configured to: receive a first video including a plurality of frames; for each frame of the plurality of frames: determine that a target region of the frame includes a target object; determine a surrounding region within which the target region is located, wherein the surrounding region is smaller than the frame and the target region is smaller than the surrounding region; extract one or more features located in the surrounding region; and from the one or more features, generate one or more manipulated object identifiers; for each of a plurality of pairs of frames of the first video respectively including a first frame and a second frame, determine a respective manipulated object movement between a first manipulated object identifier of the first frame and a second manipulated object identifier of the second frame; and classify at least one action performed in the first video based on the manipulated object movements. 2. The computing device of claim 1 , wherein the target object is a hand. 3. The computing device of claim 2 , wherein the one or more manipulated object identifiers respectively identify one or more manipulated objects manipulated by the hand. 4. The computing device of claim 3 , wherein the processor is configured to classify the at least one action at least in part by inputting the manipulated object movements into a grasp classifier, wherein the grasp classifier is configured to output a grasp label indicating a grasp type with which the hand grasps the one or more manipulated objects. 5. The computing device of claim 4 , wherein the grasp classifier is a recurrent neural network. 6. The computing device of claim 2 , wherein the processor is configured to determine that the target region of the frame includes a hand at least in part by inputting the frame into a hand detector selected from the group consisting of a recurrent neural network, a three-dimensional convolutional neural network, and a temporal convolutional neural network. 7. The computing device of claim 1 , wherein the processor is further configured to: classify a plurality of actions performed in the first video; and segment the first video into a plurality of activity phases, wherein the plurality of activity phases are defined by one or more respective actions of the plurality of actions performed during that activity phase. 8. The computing device of claim 7 , wherein the processor is further configured to: generate a plurality of action labels respectively corresponding to the plurality of actions; and output a first video annotation including each action label of the plurality of action labels, wherein the action label of each action is matched to a respective activity phase in which that action is performed. 9. The computing device of claim 7 , wherein the processor is further configured to: receive a second video; classify a second video action performed in the second video; determine that the second video action matches an action of the plurality of actions identified in the first video; and output a second video annotation in response to the determination that the second video action matches the action. 10. The computing device of claim 9 , wherein the second video annotation includes a subsequent phase action label associated with a subsequent activity phase following a second video activity phase associated with the second video action. 11. The computing device of claim 1 , wherein the processor is configured to generate the one or more manipulated object identifiers at least in part by inputting the one or more features into a manipulated object classifier selected from the group consisting of a recurrent neural network, a three-dimensional convolutional neural network, and a temporal convolutional neural network. 12. The computing device of claim 1 , wherein each manipulated object movement is an optical flow. 13. A method for use with a computing device, the method comprising: receiving a first video including a plurality of frames; for each frame of the plurality of frames: determining that a target region of the frame includes a target object; determining a surrounding region within which the target region is located, wherein the surrounding region is smaller than the frame and the target region is smaller than the surrounding region; extracting one or more features located in the surrounding region; and from the one or more features, generating one or more manipulated object identifiers; for each of a plurality of pairs of frames of the first video respectively including a first frame and a second frame, determining a respective manipulated object movement between a first manipulated object identifier of the first frame and a second manipulated object identifier of the second frame; and classifying at least one action performed in the first video based on the manipulated object movements. 14. The method of claim 13 , wherein the target object is a hand. 15. The method of claim 14 , wherein the one or more manipulated object identifiers respectively identify one or more manipulated objects manipulated by the hand. 16. The method of claim 15 , wherein classifying the at least one action includes inputting the manipulated object movements into a grasp classifier, wherein the grasp classifier is configured to output a grasp label indicating a grasp type with which the hand grasps the one or more manipulated objects. 17. The method of claim 13 , further comprising: classifying a plurality of actions performed in the first video; and segmenting the first video into a plurality of activity phases, wherein the plurality of activity phases are defined by one or more respective actions of the plurality of actions performed during that activity phase. 18. The method of claim 17 , further comprising: generating a plurality of action labels respectively corresponding to the plurality of actions; and outputting a first video annotation including each action label of the plurality of action labels, wherein the action label of each action is matched to a respective activity phase in which that action is performed. 19. The method of claim 17 , further comprising: receiving a second video; classifying a second video action performed in the second video; determining that the second video action matches an action of the plurality of actions identified in the first video; and outputting a second video annotation in response to the determination that the second video action matches the action. 20. A computing device comprising: a processor configured to: receive a first video including a plurality of frames; for each frame of the plurality of frames: determine that a first target region of the frame includes a first hand and a second target region of the frame includes a second hand; determine a first surrounding region within which the first target region is located and a second surrounding region within which the second target region is located, wherein the first surrounding region and the second surrounding region are each smaller than the frame; identify one or more first surrounding region features located in the first surrounding region; identify one or more second surrounding region features located in the second surrounding region; and from the one or more first surrounding region features and/or the one or more second surrounding region features, generate one or more manipulated object identifiers th
Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title
based on the proximity to a decision surface, e.g. support vector machines · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.