Method for recognizing activities using separate spatial and temporal attention weights

US12586413B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12586413-B2
Application number: US-201917754685-A
Country: US
Kind code: B2
Filing date: Oct 9, 2019
Priority date: Oct 9, 2019
Publication date: Mar 24, 2026
Grant date: Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A device and a method for recognizing person activity in a sequence of frames (100) comprising: obtaining a set of consecutive 3D poses (103), obtaining a feature map (102), obtaining a vector of spatiotemporal features, obtaining a matrix of spatial attention weights, obtaining a matrix of temporal attention weights (110), modulating (106) the feature map using the matrix of spatial attention weights to obtain a spatially modulated feature map, modulating (111) the feature map using the vector of temporal attention weights to obtain a temporally modulated feature map, performing a convolution (114) of the spatially modulated feature map and of the temporally modulated feature map to obtain a convoluted feature map, performing a classification (115) using the convoluted feature map so as to determine the activity of the person in the video.
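
As a rough illustration of the modulation, pooling, and classification steps named in the abstract, the following Python sketch uses plain nested lists. It assumes a feature map indexed as [time][row][column][channel], one spatial weight per location, one temporal weight per instant, and that the two pooled maps are concatenated before the final convolution (on pooled features, a 1×1×1 convolution reduces to a linear layer). The function names, the concatenation step, and the toy sizes are illustrative assumptions and are not taken from the patent text.

```python
import math


def modulate_spatial(fmap, alpha):
    """Scale every channel at location (m, n) by alpha[m][n], at every time step."""
    return [[[[alpha[m][n] * v for v in fmap[t][m][n]]
              for n in range(len(fmap[t][m]))]
             for m in range(len(fmap[t]))]
            for t in range(len(fmap))]


def modulate_temporal(fmap, beta):
    """Scale every value in frame t by beta[t]."""
    return [[[[beta[t] * v for v in cell] for cell in row] for row in frame]
            for t, frame in enumerate(fmap)]


def global_average_pool(fmap):
    """Average over time and space, leaving one value per channel."""
    channels = len(fmap[0][0][0])
    count = len(fmap) * len(fmap[0]) * len(fmap[0][0])
    sums = [0.0] * channels
    for frame in fmap:
        for row in frame:
            for cell in row:
                for c, v in enumerate(cell):
                    sums[c] += v
    return [s / count for s in sums]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fmap, alpha, beta, weights, bias):
    """Fuse both modulated maps and return class probabilities.

    ASSUMPTION: the pooled spatial and temporal features are concatenated;
    `weights` (one row per class, 2 * channels columns) and `bias` stand in
    for the 1x1x1 convolution applied to the pooled features.
    """
    pooled = (global_average_pool(modulate_spatial(fmap, alpha))
              + global_average_pool(modulate_temporal(fmap, beta)))
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


if __name__ == "__main__":
    # Toy feature map: 2 time steps, 2x2 locations, 1 channel, all ones.
    toy = [[[[1.0] for _ in range(2)] for _ in range(2)] for _ in range(2)]
    probs = classify(toy, [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5],
                     [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(probs)
```

Keeping the two modulated maps separate until after pooling mirrors the abstract's wording, which treats the spatially and temporally modulated feature maps as distinct inputs to the convolution.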

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for recognizing person activity in a video comprising a sequence of frames, each frame showing at least a portion of the person, the method comprising:
    obtaining a set of consecutive 3D poses of the person using the sequence of frames, each of the consecutive 3D poses illustrating a posture of the person from a frame of the sequence of frames, and each of the consecutive 3D poses being associated with an instant in the sequence of frames,
    obtaining a feature map elaborated using a first encoder neural network configured to receive the sequence of frames as input and to output the feature map having dimensions associated with time, space, and a number of channels,
    obtaining a vector of spatiotemporal features using a second recurrent neural network configured to receive the set of consecutive 3D poses of the person as input,
    a third neural network receiving the vector of spatiotemporal features as input and outputting a matrix of spatial attention weights, wherein each weight indicates an importance of a location in the matrix, wherein the third neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a sigmoid layer,
    a fourth neural network, different from the third neural network, the fourth neural network receiving the vector of spatiotemporal features as input and outputting a matrix of temporal attention weights, wherein each weight indicates a saliency of an instant in the sequence of frames, wherein the fourth neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a Softmax layer,
    obtaining a spatially-modulated feature map by modulating the feature map using the matrix of spatial attention weights,
    obtaining a temporally-modulated feature map, different from the spatially-modulated feature map, by modulating the feature map using the matrix of temporal attention weights,
    performing a convolution of the spatially modulated feature map and of the temporally modulated feature map to obtain a convoluted feature map,
    performing a classification using the convoluted feature map so as to determine the activity of the person in the video.

2. The method according to claim 1, wherein the first encoder neural network includes a portion of an inflated 3D convolutional neural network.

3. The method according to claim 1, further comprising, prior to the performing the convolution:
    performing a Global Average Pooling on the spatially-modulated feature map; and
    performing a Global Average Pooling on the temporally modulated feature map.

4. The method according to claim 1, wherein the performing the convolution comprises performing a 1×1×1 convolution.

5. The method according to claim 1, wherein the performing the classification comprises using a Softmax function.

6. The method according to claim 1, wherein each of the consecutive 3D poses comprises a set of 3D coordinates (x_j) indicating positions of joints of a given skeleton.

7. The method according to claim 1, further comprising: a preliminary training step of at least one of the first encoder neural network, the second recurrent neural network, the third neural network, and the fourth neural network.

8. The method according to claim 1, further comprising: a preliminary training step comprising:
    determining a loss using a cross-entropy loss,
    determining a loss based on the matrix of spatial attention weights, and
    determining a loss based on the matrix of temporal attention weights.

9. A device for recognizing person activity in a video comprising a sequence of frames, each frame showing at least a portion of the person, the device comprising:
    a module for obtaining a set of consecutive 3D poses of the person using the sequence of frames, each of the consecutive 3D poses illustrating a posture of the person from a frame of the sequence of frames, and each of the consecutive 3D poses being associated with an instant in the sequence of frames,
    a first encoder neural network configured to receive the sequence of frames as input and to output the feature map having dimensions associated with time, space, and a number of channels,
    a second neural network configured to receive the set of consecutive 3D poses of the person as input and to output a vector of spatiotemporal features,
    a third neural network configured to receive the vector of spatiotemporal features as input and to output a matrix of spatial attention weights, wherein each weight indicates an importance of a location in the matrix, wherein the third neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a sigmoid layer,
    a fourth neural network, different from the third neural network, configured to receive the vector of spatiotemporal features as input and to output a matrix of temporal attention weights, wherein each weight indicates a saliency of an instant in the sequence of frames, wherein the fourth neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a Softmax layer,
    a module for obtaining a spatially-modulated feature map by modulating the feature map using the matrix of spatial attention weights,
    a module for obtaining a temporally-modulated feature map, different from the spatially-modulated feature map, by modulating the feature map using the matrix of temporal attention weights,
    a module for performing a convolution of the spatially modulated feature map and of the temporally modulated feature map to obtain a convoluted feature map,
    a module for performing a classification using the convoluted feature map so as to determine the activity of the person in the video.

10. A system comprising the device of claim 9 and comprising a video acquisition module configured to obtain the video.

11. A non-transitory computer-readable medium comprising instructions stored thereon that when executed by a processor cause the processor to execute instructions for executing the steps of the method according to claim 1.
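
The two attention networks recited in claims 1 and 9 share the same layer stack and differ only in the final activation: a sigmoid for the spatial weights and a Softmax for the temporal weights. The Python sketch below illustrates that stack with plain lists. The layer sizes, the random initialization, the reshaping of the spatial output into a rows × columns matrix, and all function names are assumptions made for illustration; the claims do not specify them.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(xs):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_dense(n_in, n_out, rng):
    """Hypothetical helper: small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def attention_head(features, layer1, layer2, final):
    """Fully connected -> tanh -> fully connected -> sigmoid or softmax."""
    hidden = [math.tanh(h) for h in dense(features, *layer1)]
    scores = dense(hidden, *layer2)
    if final == "sigmoid":
        return [sigmoid(s) for s in scores]
    if final == "softmax":
        return softmax(scores)
    raise ValueError("final must be 'sigmoid' or 'softmax'")


def spatial_attention(features, layer1, layer2, rows, cols):
    """Sigmoid head; ASSUMES layer2 emits rows * cols scores, one per location."""
    flat = attention_head(features, layer1, layer2, "sigmoid")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def temporal_attention(features, layer1, layer2):
    """Softmax head; ASSUMES one score per instant in the sequence."""
    return attention_head(features, layer1, layer2, "softmax")


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for the pose-derived vector of spatiotemporal features.
    pose_features = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    alpha = spatial_attention(pose_features, init_dense(8, 16, rng),
                              init_dense(16, 4, rng), rows=2, cols=2)
    beta = temporal_attention(pose_features, init_dense(8, 16, rng),
                              init_dense(16, 3, rng))
    print(alpha)
    print(beta)
```

With a sigmoid output, each spatial weight lies between 0 and 1 independently of the others, whereas the Softmax output makes the temporal weights compete and sum to 1 across instants.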

Assignees

Toyota Motor Europe, Toyota Motor Co Ltd

Inventors

Classifications

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Convolutional networks [CNN, ConvNet]

  • Combinations of networks

  • Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49)

  • Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586413B2 cover?
A device and a method for recognizing person activity in a sequence of frames (100) comprising: obtaining a set of consecutive 3D poses (103), obtaining a feature map (102), obtaining a vector of spatiotemporal features, obtaining a matrix of spatial attention weights, obtaining a matrix of temporal attention weights (110), modulating (106) the feature map using the matrix o…
Who is the assignee on this patent?
Toyota Motor Europe, Toyota Motor Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V40/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).