Generating an avatar from real time image data
US-9508197-B2 · Nov 29, 2016 · US
US11315259B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11315259-B2 |
| Application number | US-202016949594-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 5, 2020 |
| Priority date | Nov 30, 2018 |
| Publication date | Apr 26, 2022 |
| Grant date | Apr 26, 2022 |
Official abstract text for this publication.
Systems, devices, media and methods are presented for a human pose tracking framework. The human pose tracking framework may identify a message with video frames, generate, using a composite convolutional neural network, joint data representing joint locations of a human depicted in the video frames, the generating of the joint data by the composite convolutional neural network done by a deep convolutional neural network operating on one portion of the video frames, a shallow convolutional neural network operating on another portion of the video frames, and tracking the joint locations using a one-shot learner neural network that is trained to track the joint locations based on a concatenation of feature maps and a convolutional pose machine. The human pose tracking framework may store the joint locations and cause presentation of a rendition of the joint locations on a user interface of a client device.
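The division of work between the two networks can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the first portion is the initial frame and the second portion is every later frame (the arrangement recited in dependent claim 3), and `deep_net`, `shallow_net`, `split_frames`, and `estimate_joints` are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

FrameT = TypeVar("FrameT")
JointsT = TypeVar("JointsT")


def split_frames(frames: Sequence[FrameT]) -> Tuple[List[FrameT], List[FrameT]]:
    """Split a clip into a first set (initial frame) and a second set (the rest).

    Assumption: this follows the arrangement in dependent claim 3; the
    abstract itself only says "one portion" and "another portion".
    """
    if not frames:
        raise ValueError("at least one video frame is required")
    return list(frames[:1]), list(frames[1:])


def estimate_joints(
    frames: Sequence[FrameT],
    deep_net: Callable[[FrameT], JointsT],
    shallow_net: Callable[[FrameT], JointsT],
) -> List[JointsT]:
    """Run the expensive network on the first set and the cheap one on the second."""
    first_set, second_set = split_frames(frames)
    results = [deep_net(frame) for frame in first_set]
    results.extend(shallow_net(frame) for frame in second_set)
    return results


if __name__ == "__main__":
    # Stand-in "networks" that only report which path handled the frame.
    print(estimate_joints(["f0", "f1", "f2"], lambda f: ("deep", f), lambda f: ("shallow", f)))
```

The point of such a split would be cost: the deep network runs once per clip, while the shallow network handles the bulk of the frames.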
Opening claim text (preview).
What is claimed is:

1. A method comprising: identifying, using one or more processors, a multimodal message comprising a plurality of video frames, the plurality of video frames comprising a first set of video frames and a second set of video frames; generating, using a composite convolutional neural network, joint data representing a plurality of joint locations of a human depicted in the plurality of video frames, the generating of the joint data by the composite convolutional neural network comprising: operating on the first set of video frames using a deep convolutional neural network; operating on the second set of video frames using a shallow convolutional neural network; and tracking the plurality of joint locations using a one-shot learner neural network that is trained to track the plurality of joint locations based on a concatenation of: feature maps comprising temporal information corresponding to the plurality of video frames; and a convolutional pose machine trained to produce pose estimation results corresponding to the plurality of joint locations in the plurality of video frames; generating, based on the concatenating, a template of key points representing the plurality of joint locations; generating updated pose estimation results using a correlation filter trained to compute a correlation between the first set of video frames and the second set of video frames using the template of key points and the second set of video frames; storing, using the one or more processors, the updated pose estimation results of the human depicted in the plurality of video frames; and causing presentation of a rendition of the updated pose estimation results of the human on a user interface of a client device.

2. The method of claim 1 wherein the feature maps are produced by the deep convolutional neural network and the shallow convolutional neural network.

3. The method of claim 1 wherein the first set of video frames comprises an initial video frame and the second set of video frames comprises subsequent video frames which follow the initial video frame.

4. The method of claim 1 wherein a number of layers in the deep convolutional neural network is at least five.

5. The method of claim 1 wherein the one-shot learner neural network directly outputs the template of key points.

6. The method of claim 5 wherein the one-shot learner neural network outputs the template of key points to the correlation filter.

7. The method of claim 1, wherein the rendition is a character icon of the human in the plurality of video frames.

8. A system comprising: a processor; and a memory storing instructions that, when executed by the processor, configure the system to perform operations comprising: identifying a multimodal message comprising a plurality of video frames, the plurality of video frames comprising a first set of video frames and a second set of video frames; generating, using a composite convolutional neural network, joint data representing a plurality of joint locations of a human depicted in the plurality of video frames, the generating of the joint data by the composite convolutional neural network comprising: operating on the first set of video frames using a deep convolutional neural network; operating on the second set of video frames using a shallow convolutional neural network; and tracking the plurality of joint locations using a one-shot learner neural network that is trained to track the plurality of joint locations based on a concatenation of: feature maps comprising temporal information corresponding to the plurality of video frames; and a convolutional pose machine trained to produce pose estimation results corresponding to the plurality of joint locations in the plurality of video frames; generating, based on the concatenating, a template of key points representing the plurality of joint locations; generating updated pose estimation results using a correlation filter trained to compute a correlation between the first set of video frames and the second set of video frames using the template of key points and the second set of video frames; storing the updated pose estimation results of the human depicted in the plurality of video frames; and causing presentation of a rendition of the updated pose estimation results of the human on a user interface of a client device.

9. The system of claim 8 wherein the feature maps are produced by the deep convolutional neural network and the shallow convolutional network.

10. The system of claim 8 wherein the first set of video frames comprises an initial video frame and the second set of video frames comprises subsequent video frames which follow the initial video frame.

11. The system of claim 8 wherein a number of layers in the deep convolutional neural network is at least five.

12. The system of claim 8 wherein the deep convolutional neural network contains at least one deconvolution layer.

13. The system of claim 8 wherein the one-shot learner neural network directly outputs a plurality of correlation filters.

14. The system of claim 8, wherein the rendition is a character icon of a user associated with the client device.

15. A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a computer, cause the computer to perform operations comprising: identifying a multimodal message comprising a plurality of video frames, the plurality of video frames comprising a first set of video frames and a second set of video frames; generating, using a composite convolutional neural network, joint data representing a plurality of joint locations of a human depicted in the plurality of video frames, the generating of the joint data by the composite convolutional neural network comprising: operating on the first set of video frames using a deep convolutional neural network; operating on the second set of video frames using a shallow convolutional neural network; and tracking the plurality of joint locations using a one-shot learner neural network that is trained to track the plurality of joint locations based on a concatenation of: feature maps comprising temporal information corresponding to the plurality of video frames; and a convolutional pose machine trained to produce pose estimation results corresponding to the plurality of joint locations in the plurality of video frames; generating, based on the concatenating, a template of key points representing the plurality of joint locations; generating updated pose estimation results using a correlation filter trained to compute a correlation between the first set of video frames and the second set of video frames using the template of key points and the second set of video frames; storing the updated pose estimation results of the human depicted in the plurality of video frames; and causing presentation of a rendition of the updated pose estimation results of the human on a user interface of a client device.

16. The non-transitory computer-readable storage medium of claim 15 wherein the feature maps are produced by the deep convolutional neural network and the shallow convolutional network.

17. The non-transitory computer-readable storage medium of claim 15 wherein the first set of video frames comprises an initial video frame and the second set of video frames comprises subsequent video frames which follow the initial video frame.

18. The non-transitory computer-readable storage medium of claim 15 wherein a number of layers in the deep convo
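
The correlation step recited in the independent claims (matching a template of key points against later frames) can be illustrated with a plain sliding-window cross-correlation. This is a simplified sketch under stated assumptions: the claims describe a trained correlation filter, whereas the template here is fixed, and `cross_correlate` and `peak_location` are hypothetical helper names that do not come from the patent.

```python
from typing import List, Tuple

Grid = List[List[float]]


def cross_correlate(search: Grid, template: Grid) -> Grid:
    """Slide `template` over `search` and return the response map (valid positions only)."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    if th > sh or tw > sw:
        raise ValueError("template must not be larger than the search region")
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            # Sum of element-wise products between the template and the window at (y, x).
            score = sum(
                search[y + i][x + j] * template[i][j]
                for i in range(th)
                for j in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def peak_location(response: Grid) -> Tuple[int, int]:
    """Return the (row, column) of the strongest response."""
    _, y, x = max(
        (value, y, x)
        for y, row in enumerate(response)
        for x, value in enumerate(row)
    )
    return y, x


if __name__ == "__main__":
    # A toy "feature map" of a later frame with a bright 2x2 blob.
    search_map = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    key_point_template = [[1.0, 1.0], [1.0, 1.0]]
    print(peak_location(cross_correlate(search_map, key_point_template)))
```

In a tracker built along these lines, the peak of the response map would serve as the updated estimate of where a joint has moved in the later frame.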
Recognition of whole body movements, e.g. for sport training · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
involving control of end-device applications over a network · CPC title
specially adapted for the location of the user terminal · CPC title