Method, system, and medium for identifying human behavior in a digital video using convolutional neural networks

US11625646B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11625646-B2
Application number  US-202016841227-A
Country             US
Kind code           B2
Filing date         Apr 6, 2020
Priority date       Apr 6, 2020
Publication date    Apr 11, 2023
Grant date          Apr 11, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method, processing system and processor-readable medium for classifying human behavior based on a sequence of frames of a digital video. A 2D convolutional neural network is used to identify key points on a human body, such as human body joints, visible within each frame. An encoded representation of the key points is created for each video frame. The sequence of encoded representations corresponding to the sequence of frames is processed by a 3D CNN trained to identify human behaviors based on key point positions varying over time.
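The data flow in the abstract can be illustrated with a minimal sketch: each frame's key points become one encoded representation, and the per-frame encodings are stacked along a time axis so that a downstream model can see how positions change from frame to frame. The key point names, the flat 2 x K encoding, and the helper names `encode_frame`, `stack_sequence`, and `shape` are assumptions made for illustration; they are not taken from the patent, and the 2D CNN detector and 3D CNN classifier themselves are not shown.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y) pixel position of one key point
Frame = Dict[str, Point]      # key point name -> position, for one video frame

# Hypothetical fixed ordering of key points (an assumption, not from the patent).
KEY_POINTS = ["head", "left_hand", "right_hand", "left_foot", "right_foot"]


def encode_frame(frame: Frame) -> List[List[float]]:
    """Encode one frame as 2 rows: row 0 holds X coordinates, row 1 holds Y."""
    xs = [frame[name][0] for name in KEY_POINTS]
    ys = [frame[name][1] for name in KEY_POINTS]
    return [xs, ys]


def stack_sequence(frames: List[Frame]) -> List[List[List[float]]]:
    """Stack per-frame encodings along a leading time axis (T x 2 x K)."""
    return [encode_frame(frame) for frame in frames]


def shape(volume: List[List[List[float]]]) -> Tuple[int, int, int]:
    """Return (frames, coordinate rows, key points) of a stacked volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))


# Two frames in which only the right hand moves upward (smaller Y is higher).
frames = [
    {"head": (50.0, 20.0), "left_hand": (30.0, 60.0), "right_hand": (70.0, 60.0),
     "left_foot": (40.0, 120.0), "right_foot": (60.0, 120.0)},
    {"head": (50.0, 20.0), "left_hand": (30.0, 60.0), "right_hand": (72.0, 35.0),
     "left_foot": (40.0, 120.0), "right_foot": (60.0, 120.0)},
]
volume = stack_sequence(frames)
print(shape(volume))                              # (2, 2, 5)
print([encoded[1][2] for encoded in volume])      # right-hand Y over time: [60.0, 35.0]
```

A model that convolves over the time axis as well as the spatial axes, which is the role the abstract assigns to the 3D CNN, would receive a volume like this one and could pick up the change in the right-hand Y coordinate between frames.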

First claim

Opening claim text (preview).

The invention claimed is:

1. A method, carried out by a processor executing computer program instructions, comprising: receiving at least one key point position set for a frame of a sequence of frames, the at least one key point position set including a key point position for each key point of a human body detected in the frame, each key point position corresponding to a location of a joint of the human body; generating an encoded representation for each key point position set of the at least one key point position set for the frame, each encoded representation comprising: an X matrix having a plurality of X pixel coordinates for the plurality of key point positions in the key point position set, a first X pixel coordinate and second X pixel coordinate being positioned within the matrix relative to each other based on a proximity relationship or movement relationship between a first joint of the human body and a second joint of the human body corresponding to the first X pixel coordinate and second X pixel coordinate respectively; and a Y matrix having a plurality of Y pixel coordinates for the plurality of key point positions in the key point position set, a first Y pixel coordinate and second Y pixel coordinate being positioned within the matrix relative to each other based on a proximity relationship or movement relationship between a first joint of the human body and a second joint of the human body corresponding to the first Y pixel coordinate and second Y pixel coordinate respectively; and providing the encoded representation for each of the at least one key point position set for the frame to a human behaviour classifier that includes a machine learned model that is configured to identify a behaviour of the human body based on the encoded representation for each key point position set and output the identified behavior of the human body.

2. The method of claim 1, further comprising: receiving a plurality of key point position sets, each key point position set correspond to one frame in the sequence of frames; and generating an encoded representation for each key point position set of the plurality of key point position sets; and providing the encoded representation to the human behaviour classifier that includes the machine learned model that is configured to identity a human behaviour based on the plurality of encoded representations and output the identified behavior of the human body.

3. The method of claim 2, further comprising: receiving the sequence of frames; and processing each respective frame in the sequence of frames to generate the key point position set corresponding to the respective frame.

4. The method of claim 3, wherein the key point position set is generated using a key points identifier configured to receive a bounding box for the human body comprising one or more pixel values of a plurality of pixels of the respective frame, process the bounding box to identify key points within the bounding box and generate a key point position for each key point, and generate the key point position set that includes the key point position for each key point identified in the frame.

5. The method of claim 1, wherein each encoded representation further comprises: a Z matrix having a plurality of Z depth coordinates for the plurality of key point positions in the key point position set, a first Z depth coordinate and second Z coordinate being positioned within the matrix relative to each other based on a proximity relationship or movement relationship between a first joint of the human body and a second joint of the human body corresponding to the first Z coordinate and second Z coordinate respectively.

6. A processing system, comprising: a processor; and a memory having stored thereon executable instructions that, when executed by the processor, cause the device to: receive at least one key point position set for a frame of a sequence of frame, the at least one key point position set including a key point position for each key point of a human body detected in the frame, each key point position corresponding to a location of the key point on the human body; generate an encoded representation for each key point position set of the at least one key point position set for the frame, each encoded representation comprising: an X matrix having a plurality of X pixel coordinates for the plurality of key point positions in the key point position set, a first X pixel coordinate and second X pixel coordinate being positioned within the matrix relative to each other based on a proximity relationship or movement relationship between a first joint of the human body and a second joint of the human body corresponding to the first X pixel coordinate and second X pixel coordinate respectively; and a Y matrix having a plurality of Y pixel coordinates for the plurality of key point positions in the key point position set, a first Y pixel coordinate and second Y pixel coordinate being positioned within the matrix relative to each other based on a proximity relationship or movement relationship between a first joint of the human body and a second joint of the human body corresponding to the first Y pixel coordinate and second Y pixel coordinate respectively; and provide the encoded representation for each of the at least one key point position set for the frame to a human behaviour classifier that includes a machine learned model that is configured to identify a behaviour of the human body based on the encoded representation for each key point position set and output the identified behavior of the human body.

7. The processing system of claim 6, wherein the executable instructions, when executed by the processor, further cause the device to: receive a plurality of key point position sets, each key point position set correspond to one frame in the sequence of frames; and generate an encoded representation for each key point position set of the plurality of key point position sets; and provide the encoded representation to the human behaviour classifier that includes the machine learned model that is configured to identity a human behaviour based on the plurality of encoded representations and output the identified behavior of the human body.

8. The processing system of claim 7, wherein the executable instructions, when executed by the processor, further cause the device to: receive the sequence of frames; and for each frame of the sequence of frames, generate the key point position set corresponding to the frame.

9. The processing system of claim 6, wherein the encoded representation is a matrix representation and wherein the machine learned model is a matrix machine learned model, and wherein each key point position corresponds to a joint of the human body.

10. The processing system of claim 6, wherein each encoded representation further comprises: a Z matrix having a plurality of Z depth coordinates for the plurality of key point positions in the key point position set, a first Z depth coordinate and second Z coordinate being positioned within the matrix relative to each other based on a proximity relationship or movement relationship between a first joint of the human body and a second joint of the human body corresponding to the first Z coordinate and second Z coordinate respectively.

11. A non-transitory processor-readable medium containing instructions which, when executed by a processor of a processing system cause the processing system to: receive at least one key point position set for a frame of a sequence of frame, the at least one key point position set including a key point position for each key point of a human body detected in the frame, each key point positio
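The X matrix and Y matrix recited in claims 1 and 6 (and the optional Z matrix of claims 5 and 10) can be pictured with a small sketch. It assumes a hypothetical joint grid, `LAYOUT`, in which joints that are physically connected sit in neighbouring cells, as one possible reading of the "proximity relationship" language. The grid itself, the function name `encode_key_points`, and the fill value used for undetected joints are illustrative assumptions and are not specified in the claim text shown here.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]

# Hypothetical layout: each row is one limb, ordered from torso outward, so
# joints that are adjacent on the body are adjacent in the matrix.
LAYOUT: List[List[str]] = [
    ["left_shoulder", "left_elbow", "left_wrist"],
    ["right_shoulder", "right_elbow", "right_wrist"],
    ["left_hip", "left_knee", "left_ankle"],
    ["right_hip", "right_knee", "right_ankle"],
]


def encode_key_points(
    positions: Dict[str, Tuple[float, ...]],
    with_depth: bool = False,
    missing: float = 0.0,
) -> List[Matrix]:
    """Return [X, Y] matrices (plus Z when with_depth is True).

    positions maps joint name -> (x, y) or (x, y, z); a z value must be
    present for every listed joint when with_depth is True. Joints absent
    from positions are filled with `missing` (an assumption of this sketch).
    """
    n_axes = 3 if with_depth else 2
    matrices = [[[missing for _ in row] for row in LAYOUT] for _ in range(n_axes)]
    for r, row in enumerate(LAYOUT):
        for c, joint in enumerate(row):
            if joint in positions:
                for axis in range(n_axes):
                    matrices[axis][r][c] = float(positions[joint][axis])
    return matrices


# Only the left arm was detected in this illustrative frame.
pose = {
    "left_shoulder": (120, 80),
    "left_elbow": (110, 130),
    "left_wrist": (105, 175),
}
x_matrix, y_matrix = encode_key_points(pose)
print(x_matrix[0])  # [120.0, 110.0, 105.0]
print(y_matrix[0])  # [80.0, 130.0, 175.0]
```

Keeping the same cell for the same joint in every frame means that a sequence of these matrices has a stable spatial arrangement, which is what lets a convolutional classifier relate neighbouring joints across time.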

Assignees

Huawei Cloud Computing Tech Co Ltd

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Supervised learning · CPC title

  • Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title

  • using neural networks · CPC title

  • Extraction of image or video features · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11625646B2 cover?
A method, processing system and processor-readable medium for classifying human behavior based on a sequence of frames of a digital video. A 2D convolutional neural network is used to identify key points on a human body, such as human body joints, visible within each frame. An encoded representation of the key points is created for each video frame. The sequence of encoded representations corre…
Who is the assignee on this patent?
Huawei Cloud Computing Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 11, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).