Action classification using deep embedded clustering
US-2021012100-A1 · Jan 14, 2021 · US
US11854305B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11854305-B2 |
| Application number | US-202117315319-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 9, 2021 |
| Priority date | May 9, 2021 |
| Publication date | Dec 26, 2023 |
| Grant date | Dec 26, 2023 |
Official abstract text for this publication.
A bi-directional spatial-temporal transformer neural network (BDSTT) is trained to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames. Obtain a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints. Produce a spatially masked frame by masking the original coordinates of the skeletal joint. Provide the specific frame, the spatially masked frame, and at least one more frame to a coordinate prediction head of the BDSTT. Obtain, from the coordinate prediction head, a prediction of coordinates for the skeletal joint. Adjust parameters of the BDSTT until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.
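The masking and mean-squared-error steps described in the abstract can be illustrated with a minimal sketch. It is not the patented BDSTT: `neighbor_mean_predictor` is a hypothetical stand-in for the coordinate prediction head that simply averages the joint's position in adjacent frames, and the zero-valued `MASK`, the type aliases, and all function names are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]
Frame = List[Coord]  # one (x, y, z) triple per joint

# Assumed mask value; the source does not say what a masked joint is replaced with.
MASK: Coord = (0.0, 0.0, 0.0)


def mask_joint(frame: Frame, joint: int) -> Frame:
    """Return a copy of `frame` with the coordinates of `joint` replaced by MASK."""
    masked = list(frame)
    masked[joint] = MASK
    return masked


def mse(pred: Coord, target: Coord) -> float:
    """Mean-squared error between a predicted and an original coordinate triple."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def neighbor_mean_predictor(frames: Sequence[Frame], t: int, joint: int) -> Coord:
    """Hypothetical stand-in for the coordinate prediction head (not a transformer).

    Averages the joint's coordinates in the frames adjacent to frame t,
    so it needs at least one neighboring frame.
    """
    neighbors = [frames[i][joint] for i in (t - 1, t + 1) if 0 <= i < len(frames)]
    n = len(neighbors)
    return (
        sum(c[0] for c in neighbors) / n,
        sum(c[1] for c in neighbors) / n,
        sum(c[2] for c in neighbors) / n,
    )


# Three frames, two joints each; joint 1 moves one unit along y per frame.
frames: List[Frame] = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)],
]
t, joint = 1, 1
target = frames[t][joint]                    # original coordinates
masked_frame = mask_joint(frames[t], joint)  # spatially masked frame
model_input = frames[:t] + [masked_frame] + frames[t + 1:]
prediction = neighbor_mean_predictor(model_input, t, joint)
loss = mse(prediction, target)               # the quantity training would drive down
print(prediction, loss)
```

In an actual training loop, the stand-in predictor would be replaced by the network and its parameters adjusted until this loss converges, as the abstract states.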
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: instantiating a bi-directional spatial-temporal transformer neural network; and training the bi-directional spatial-temporal transformer neural network to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames by: obtaining a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints; producing a spatially masked frame from the specific frame by masking the original coordinates of the skeletal joint; providing the specific frame, the spatially masked frame, and at least one more of the plurality of frames to a coordinate prediction head of the bi-directional spatial-temporal transformer network; obtaining, from the coordinate prediction head, a prediction of coordinates for the skeletal joint in the spatially masked frame; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.
2. The method of claim 1, further comprising: training the bi-directional spatial-temporal transformer neural network to predict a correct time order of sequential coordinates of the skeletal joint by: producing a plurality of time-shuffled frames by time-shuffling the plurality of frames; providing the plurality of time-shuffled frames to a temporal classification head along with the plurality of frames; obtaining from the temporal classification head a prediction of correct time order for the plurality of time-shuffled frames; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a cross-entropy loss, between the prediction of correct time order and the plurality of frames, converges.
3. The method of claim 2, further comprising: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.
4. The method of claim 1, further comprising: training the bi-directional spatial-temporal transformer neural network to predict a correct spatial arrangement of coordinates of a plurality of skeletal joints by: producing a plurality of space-shuffled frames by spatially rearranging the plurality of joints in one or more of the frames; providing the plurality of space-shuffled frames to a spatial classification head along with the plurality of frames; obtaining from the spatial classification head a prediction of correct spatial arrangement for the plurality of joints in the plurality of space-shuffled frames; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a cross-entropy loss, between the prediction of correct spatial arrangement and the plurality of frames, converges.
5. The method of claim 4, further comprising: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.
6. The method of claim 1, further comprising: training the bi-directional spatial-temporal transformer neural network to predict a correct semantic coding of a plurality of skeletal joints by: producing a semantically masked frame from the specific frame by masking at least a part of a matrix of one-hot vectors corresponding to the plurality of joints in the specific frame; providing the semantically masked frame and the specific frame to a semantic prediction head of the bi-directional spatial-temporal transformer network; obtaining from the semantic prediction head a predicted matrix of one-hot vectors for the semantically masked frame; and adjusting parameters of the bi-directional spatial-temporal transformer network until a cross-entropy classification loss, between the predicted matrix of one-hot vectors and the matrix of one-hot vectors corresponding to the plurality of joints, converges.
7. The method of claim 6, further comprising: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.
8. A computer program product comprising one or more non-transitory computer readable storage media that embody computer executable instructions, which when executed by a computer cause the computer to perform a method comprising: instantiating a bi-directional spatial-temporal transformer neural network; and training the bi-directional spatial-temporal transformer neural network to predict original coordinates of a skeletal joint in a specific frame through relative relationships of the skeletal joint to other joints and to the state of the skeletal joint in other frames by: obtaining a plurality of frames comprising coordinates of the skeletal joint and coordinates of other joints; producing a spatially masked frame from the specific frame by masking the original coordinates of the skeletal joint; providing the specific frame, the spatially masked frame, and at least one more of the plurality of frames to a coordinate prediction head of the bi-directional spatial-temporal transformer network; obtaining from the coordinate prediction head a prediction of coordinates for the skeletal joint in the spatially masked frame; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a mean-squared error, between the prediction of coordinates for the skeletal joint and the original coordinates of the skeletal joint, converges.
9. The computer program product of claim 8, wherein the method further comprises: training the bi-directional spatial-temporal transformer neural network to predict a correct time order of sequential coordinates of the skeletal joint by: producing a plurality of time-shuffled frames by time-shuffling the plurality of frames; providing the plurality of time-shuffled frames to a temporal classification head along with the plurality of frames; obtaining from the temporal classification head a prediction of correct time order for the plurality of time-shuffled frames; and adjusting parameters of the bi-directional spatial-temporal transformer neural network until a cross-entropy loss, between the prediction of correct time order and the plurality of frames, converges.
10. The computer program product of claim 9, wherein the method further comprises: detecting a skeletal joint motion sequence by applying the trained bi-directional spatial-temporal transformer neural network to a sequence of frames; and transmitting a control signal to at least one of an electromechanical device, an electrooptical device, and an electronic device, in response to the detected skeletal joint motion sequence.
11. The computer program product of claim 8, wherein the method further comprises: training the bi-directional spatial-temporal transformer neural network to predict a correct spatial arrangement of coordinates of
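The input transformations recited in the dependent claims (time-shuffling, space-shuffling, and semantic masking of a one-hot joint matrix) can be sketched as plain data operations. This is a hedged illustration only: the function names, the use of zeros as the semantic mask, and the way the labels are encoded (original indices per shuffled position) are assumptions, and no classification head is implemented here.

```python
import math
import random
from typing import List, Sequence, Set, Tuple

Coord = Tuple[float, float, float]
Frame = List[Coord]  # one (x, y, z) triple per joint


def time_shuffle(frames: List[Frame], rng: random.Random) -> Tuple[List[Frame], List[int]]:
    """Shuffle the frame order; also return the original time index of each
    shuffled position, which serves as the ordering label."""
    order = list(range(len(frames)))
    rng.shuffle(order)
    return [frames[i] for i in order], order


def space_shuffle(frame: Frame, rng: random.Random) -> Tuple[Frame, List[int]]:
    """Rearrange the joints within one frame; also return the original joint
    index now held at each position."""
    perm = list(range(len(frame)))
    rng.shuffle(perm)
    return [frame[j] for j in perm], perm


def one_hot_matrix(num_joints: int) -> List[List[int]]:
    """Matrix whose row i is the one-hot code identifying joint i."""
    return [[1 if i == j else 0 for j in range(num_joints)] for i in range(num_joints)]


def semantic_mask(matrix: Sequence[Sequence[int]], rows: Set[int]) -> List[List[int]]:
    """Zero out the one-hot rows of the selected joints (assumed mask: all zeros)."""
    return [[0] * len(r) if i in rows else list(r) for i, r in enumerate(matrix)]


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Cross-entropy of a predicted distribution against a single correct class."""
    return -math.log(probs[target_index])


# Four frames of three joints each; coordinates encode (time, joint, 0) for readability.
demo_frames: List[Frame] = [
    [(float(t), float(j), 0.0) for j in range(3)] for t in range(4)
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

shuffled, order = time_shuffle(demo_frames, rng)
permuted, perm = space_shuffle(demo_frames[0], rng)
masked_codes = semantic_mask(one_hot_matrix(3), {1})
print(order, perm, masked_codes)
```

Each transformation returns both the corrupted input and the label a head would be trained to recover, which is what lets the cross-entropy losses in the claims be computed without manual annotation.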
Recognition of whole body movements, e.g. for sport training · CPC title
Arrangements for interaction with the human body, e.g. for user immersion in virtual reality (blind teaching G09B21/00) · CPC title
based on naturality criteria, e.g. with non-negative factorisation or negative correlation · CPC title
using feature-based methods, e.g. the tracking of corners or segments · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title