Skeleton-based action recognition using bi-directional spatial-temporal transformer

US11854305B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11854305-B2
Application number: US-202117315319-A
Country: US
Kind code: B2
Filing date: May 9, 2021
Priority date: May 9, 2021
Publication date: Dec 26, 2023
Grant date: Dec 26, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A bi-directional spatial-temporal transformer neural network (BDSTT) is trained to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames. Obtain a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints. Produce a spatially masked frame by masking the original coordinates of the skeletal joint. Provide the specific frame, the spatially masked frame, and at least one more frame to a coordinate prediction head of the BDSTT. Obtain, from the coordinate prediction head, a prediction of coordinates for the skeletal joint. Adjust parameters of the BDSTT until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.
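The masking and loss steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the frame layout (a list of `[x, y, z]` joints), the zero-valued `MASK_VALUE`, and the helper names `mask_joint` and `mean_squared_error` are assumptions made for illustration, and the `prediction` value stands in for whatever a coordinate prediction head would return.

```python
import copy

# Assumption: a frame is a list of joints, each joint an [x, y, z] list.
# Assumption: masking replaces the coordinates with zeros; the source does
# not say which placeholder value is used.
MASK_VALUE = [0.0, 0.0, 0.0]


def mask_joint(frame, joint_index):
    """Return a copy of `frame` with one joint's coordinates replaced by MASK_VALUE."""
    masked = copy.deepcopy(frame)
    masked[joint_index] = list(MASK_VALUE)
    return masked


def mean_squared_error(predicted, original):
    """Mean of the squared per-coordinate differences between two joint positions."""
    return sum((p - o) ** 2 for p, o in zip(predicted, original)) / len(original)


# Toy two-joint frame; the values are illustrative only.
frame = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
masked_frame = mask_joint(frame, joint_index=1)

# Hypothetical stand-in for the output of a coordinate prediction head.
prediction = [3.0, 4.0, 7.0]
loss = mean_squared_error(prediction, frame[1])
```

In a training loop, the network parameters would be adjusted to reduce `loss` until it stops improving, which is the convergence condition the abstract describes.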

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: instantiating a bi-directional spatial-temporal transformer neural network; and training the bi-directional spatial-temporal transformer neural network to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames by: obtaining a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints; producing a spatially masked frame from the specific frame by masking the original coordinates of the skeletal joint; providing the specific frame, the spatially masked frame, and at least one more of the plurality of frames to a coordinate prediction head of the bi-directional spatial-temporal transformer network; obtaining, from the coordinate prediction head, a prediction of coordinates for the skeletal joint in the spatially masked frame; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.

2. The method of claim 1, further comprising: training the bi-directional spatial-temporal transformer neural network to predict a correct time order of sequential coordinates of the skeletal joint by: producing a plurality of time-shuffled frames by time-shuffling the plurality of frames; providing the plurality of time-shuffled frames to a temporal classification head along with the plurality of frames; obtaining from the temporal classification head a prediction of correct time order for the plurality of time-shuffled frames; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a cross-entropy loss, between the prediction of correct time order and the plurality of frames, converges.

3. The method of claim 2, further comprising: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.

4. The method of claim 1, further comprising: training the bi-directional spatial-temporal transformer neural network to predict a correct spatial arrangement of coordinates of a plurality of skeletal joints by: producing a plurality of space-shuffled frames by spatially rearranging the plurality of joints in one or more of the frames; providing the plurality of space-shuffled frames to a spatial classification head along with the plurality of frames; obtaining from the spatial classification head a prediction of correct spatial arrangement for the plurality of joints in the plurality of space-shuffled frames; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a cross-entropy loss, between the prediction of correct spatial arrangement and the plurality of frames, converges.

5. The method of claim 4, further comprising: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.

6. The method of claim 1, further comprising: training the bi-directional spatial-temporal transformer neural network to predict a correct semantic coding of a plurality of skeletal joints by: producing a semantically masked frame from the specific frame by masking at least a part of a matrix of one-hot vectors corresponding to the plurality of joints in the specific frame; providing the semantically masked frame and the specific frame to a semantic prediction head of the bi-directional spatial-temporal transformer network; obtaining from the semantic prediction head a predicted matrix of one-hot vectors for the semantically masked frame; and adjusting parameters of the bi-directional spatial-temporal transformer network until a cross-entropy classification loss, between the predicted matrix of one-hot vectors and the matrix of one-hot vectors corresponding to the plurality of joints, converges.

7. The method of claim 6, further comprising: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.

8. A computer program product comprising one or more non-transitory computer readable storage media that embody computer executable instructions, which when executed by a computer cause the computer to perform a method comprising: instantiating a bi-directional spatial-temporal transformer neural network; and training the bi-directional spatial-temporal transformer neural network to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames by: obtaining a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints; producing a spatially masked frame from the specific frame by masking the original coordinates of the skeletal joint; providing the specific frame, the spatially masked frame, and at least one more of the plurality of frames to a coordinate prediction head of the bi-directional spatial-temporal transformer network; obtaining from the coordinate prediction head a prediction of coordinates for the skeletal joint in the spatially masked frame; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.

9. The computer program product of claim 8, wherein the method further comprises: training the bi-directional spatial-temporal transformer neural network to predict a correct time order of sequential coordinates of the skeletal joint by: producing a plurality of time-shuffled frames by time-shuffling the plurality of frames; providing the plurality of time-shuffled frames to a temporal classification head along with the plurality of frames; obtaining from the temporal classification head a prediction of correct time order for the plurality of time-shuffled frames; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a cross-entropy loss, between the prediction of correct time order and the plurality of frames, converges.

10. The computer program product of claim 9, wherein the method further comprises: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.

11. The computer program product of claim 8, wherein the method further comprises: training the bi-directional spatial-temporal transformer neural network to predict a correct spatial arrangement of coordinates of

Assignees

Inventors

Classifications

  • G06V40/23 · Primary

    Recognition of whole body movements, e.g. for sport training · CPC title

  • Arrangements for interaction with the human body, e.g. for user immersion in virtual reality (blind teaching G09B21/00) · CPC title

  • based on naturality criteria, e.g. with non-negative factorisation or negative correlation · CPC title

  • using feature-based methods, e.g. the tracking of corners or segments · CPC title

  • Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11854305B2 cover?
A bi-directional spatial-temporal transformer neural network (BDSTT) is trained to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames. Obtain a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints. Produce a spa…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06V40/23. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 26, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).