Efficient human pose tracking in videos

US12165335B2 · US · B2

Patent metadata
Field                 Value
Publication number    US-12165335-B2
Application number    US-202318460335-A
Country               US
Kind code             B2
Filing date           Sep 1, 2023
Priority date         Nov 30, 2018
Publication date      Dec 10, 2024
Grant date            Dec 10, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, devices, media and methods are presented for a human pose tracking framework. The human pose tracking framework may identify a message with video frames, generate, using a composite convolutional neural network, joint data representing joint locations of a human depicted in the video frames, the generating of the joint data by the composite convolutional neural network done by a deep convolutional neural network operating on one portion of the video frames, a shallow convolutional neural network operating on another portion of the video frames, and tracking the joint locations using a one-shot learner neural network that is trained to track the joint locations based on a concatenation of feature maps and a convolutional pose machine. The human pose tracking framework may store the joint locations and cause presentation of a rendition of the joint locations on a user interface of a client device.
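
The split described in the abstract, with a deep network on one portion of the video frames and a shallow network on another, can be pictured as a routing step. The sketch below is illustrative only and is not the patented implementation: the function name `route_frames`, the parameter `num_deep`, and the assumption that the deep network handles only the leading frames while the shallow network handles every later frame are all hypothetical.

```python
from typing import List, Tuple


def route_frames(num_frames: int, num_deep: int = 1) -> List[Tuple[int, str]]:
    """Assign each frame index to the 'deep' or the 'shallow' network.

    Assumption (not stated in the patent text): the first `num_deep`
    frames go to the deep network, all remaining frames to the shallow one.
    """
    if num_frames < 0 or num_deep < 0:
        raise ValueError("counts must be non-negative")
    return [
        (index, "deep" if index < num_deep else "shallow")
        for index in range(num_frames)
    ]


# Example: a four-frame clip where only the first frame uses the deep network.
plan = route_frames(4)
print(plan)
```

Under this assumption the expensive network runs once per clip and the cheaper network handles the rest, which is one plausible reading of why the title calls the tracking "efficient".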

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: generating joint data representing a plurality of joint locations of a human depicted in a plurality of video frames comprising a first set of video frames and a second set of video frames, the generating of the joint data comprising tracking the plurality of joint locations using a learner neural network that is trained to track the plurality of joint locations based on a concatenation of: feature maps corresponding to the plurality of video frames; and pose estimation results produced by a trained convolutional pose machine, the pose estimation results corresponding to the plurality of joint locations in the plurality of video frames; and generating updated pose estimation results using a correlation between the first set of video frames and the second set of video frames, the correlation being computed based on the concatenation and the second set of video frames.

2. The method of claim 1, further comprising: operating on the first set of video frames using a first type of neural network.

3. The method of claim 2, further comprising: operating on the second set of video frames using a second type of neural network.

4. The method of claim 3, wherein the first type of neural network is different than the second type of neural network.

5. The method of claim 3, wherein the first type of network is a deep convolutional neural network and the second type of neural network is a shallow convolutional neural network.

6. The method of claim 5, wherein a number of layers in the deep convolutional neural network is at least five.

7. The method of claim 5, wherein the deep convolutional neural network has a deconvolution layer, the deconvolution layer up-sampling at least one feature map.

8. The method of claim 3, wherein the feature maps are produced by the first type of convolutional neural network and the second type of convolutional neural network.

9. The method of claim 1, wherein the learner neural network is a one shot learner neural network.

10. The method of claim 1, further comprising: generating, based on the concatenation, a template of key points representing the plurality of joint locations; and generating updated pose estimation results based on the template of key points.

11. The method of claim 10, wherein: computing the correlation between the first set of video frames and second set of video frames comprises using a correlation filter; and the template of key points is directly outputted to the correlation filter.

12. The method of claim 1, further comprising storing the updated pose estimation results of the human depicted in the plurality of video frames.

13. The method of claim 1, wherein: the first set of video frames includes a first video frame; and the second set of video frames includes at least one subsequent video frame that appears after the first video frame in a sequence of frames.

14. The method of claim 1, wherein the learner neural network is a convolutional neural network.

15. The method of claim 1, further comprising causing presentation of the updated pose estimation results of the human on a user interface of a client device.

16. A system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, configure the system to perform operations comprising: generating joint data representing a plurality of joint locations of a human depicted in a plurality of video frames comprising a first set of video frames and a second set of video frames, the generating of the joint data comprising tracking the plurality of joint locations using a learner neural network that is trained to track the plurality of joint locations based on a concatenation of: feature maps corresponding to the plurality of video frames; and pose estimation results produced by a trained convolutional pose machine, the pose estimation results corresponding to the plurality of joint locations in the plurality of video frames; and generating updated pose estimation results using a correlation between the first set of video frames and the second set of video frames, the correlation being computed based on the concatenation and the second set of video frames.

17. The system of claim 16, the operations further comprising: operating on the first set of video frames using a first type of neural network; and operating on the second set of video frames using a second type of neural network, wherein the first type of convolutional neural network is different than the second type of convolutional neural network.

18. The system of claim 17, wherein the first type of convolutional network is a deep convolutional neural network and the second type of convolutional neural network is a shallow convolutional neural network.

19. The system of claim 16, further comprising generating, based on the concatenation, a template of key points representing the plurality of joint locations; and wherein: computing the correlation between the first set of video frames and second set of video frames comprises using a correlation filter; and generating updated pose estimation results comprises generating updated pose estimation results based on the template of key points.

20. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a computer, cause the computer to perform operations comprising: generating, using a composite convolutional neural network, joint data representing a plurality of joint locations of a human depicted in a plurality of video frames comprising a first set of video frames and a second set of video frames, the generating of the joint data by the composite convolutional neural network comprising: generating joint data representing a plurality of joint locations of a human depicted in a plurality of video frames comprising a first set of video frames and a second set of video frames, the generating of the joint data comprising tracking the plurality of joint locations using a learner neural network that is trained to track the plurality of joint locations based on a concatenation of: feature maps corresponding to the plurality of video frames; and pose estimation results produced by a trained convolutional pose machine, the pose estimation results corresponding to the plurality of joint locations in the plurality of video frames; and generating updated pose estimation results using a correlation between the first set of video frames and the second set of video frames, the correlation being computed based on the concatenation and the second set of video frames.
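
The claims describe concatenating feature maps with pose estimation results, deriving a template of key points, and correlating that template against later frames to update the pose. The sketch below illustrates that general idea on toy data. It is a simplified stand-in under stated assumptions, not the claimed method: it uses plain cross-correlation instead of a trained learner network or correlation filter, and every name in it (`concat_channels`, `crop`, `correlate`, `track_joint`, `one_hot`) is hypothetical. It also assumes the later frame is available as a stack with the same channel layout as the first.

```python
from typing import List, Tuple

Grid = List[List[float]]   # one channel: rows x cols
Stack = List[Grid]         # channels x rows x cols


def concat_channels(feature_maps: Stack, pose_heatmaps: Stack) -> Stack:
    """Channel-wise concatenation of feature maps and pose heatmaps."""
    shapes = {(len(g), len(g[0])) for g in feature_maps + pose_heatmaps}
    if len(shapes) != 1:
        raise ValueError("all channels must share the same spatial size")
    return feature_maps + pose_heatmaps


def crop(stack: Stack, top: int, left: int, size: int) -> Stack:
    """Cut a size x size window out of every channel (the key-point template)."""
    return [[row[left:left + size] for row in g[top:top + size]] for g in stack]


def correlate(template: Stack, search: Stack) -> Grid:
    """Valid-mode cross-correlation, summed over channels."""
    th, tw = len(template[0]), len(template[0][0])
    sh, sw = len(search[0]), len(search[0][0])
    scores = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            total = 0.0
            for c in range(len(template)):
                for dy in range(th):
                    for dx in range(tw):
                        total += template[c][dy][dx] * search[c][y + dy][x + dx]
            row.append(total)
        scores.append(row)
    return scores


def track_joint(template: Stack, search: Stack) -> Tuple[int, int]:
    """Return the (row, col) in `search` where the template centre fits best."""
    scores = correlate(template, search)
    cells = [(y, x) for y in range(len(scores)) for x in range(len(scores[0]))]
    best_y, best_x = max(cells, key=lambda p: scores[p[0]][p[1]])
    half = len(template[0]) // 2  # offset from window corner to window centre
    return best_y + half, best_x + half


def one_hot(rows: int, cols: int, y: int, x: int) -> Grid:
    """Toy channel with a single active cell, standing in for real network output."""
    return [[1.0 if (r, c) == (y, x) else 0.0 for c in range(cols)] for r in range(rows)]


# Frame 1: the joint sits at (1, 1) in both the feature map and the heatmap.
first = concat_channels([one_hot(5, 5, 1, 1)], [one_hot(5, 5, 1, 1)])
template = crop(first, top=0, left=0, size=3)

# Frame 2: the same joint has moved to (2, 3).
second = concat_channels([one_hot(5, 5, 2, 3)], [one_hot(5, 5, 2, 3)])
print(track_joint(template, second))  # (2, 3)
```

A real system would operate on learned multi-channel feature maps and would normally normalise the correlation scores; the toy one-hot channels are used here only so the result can be checked by hand.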

Assignees

Snap Inc

Inventors

Classifications

  CPC titles:

  • involving control of end-device applications over a network

  • Recognition of whole body movements, e.g. for sport training

  • using neural networks

  • using classification, e.g. of video objects

  • Protocols

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12165335B2 cover?
It covers a human pose tracking framework. The framework generates joint data for a human depicted in video frames using a composite convolutional neural network, in which a deep network operates on one portion of the frames and a shallow network on another, and tracks the joint locations with a one-shot learner neural network trained on a concatenation of feature maps and convolutional pose machine results. It may also store the joint locations and present a rendition of them on a client device.
Who is the assignee on this patent?
Snap Inc
What technology area does this patent fall under?
Primary CPC classification G06T7/246. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 10, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).