Neural network-based action detection
US-2020074227-A1 · Mar 5, 2020 · US
US12586413B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12586413-B2 |
| Application number | US-201917754685-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 9, 2019 |
| Priority date | Oct 9, 2019 |
| Publication date | Mar 24, 2026 |
| Grant date | Mar 24, 2026 |
Official abstract text for this publication.
A device and a method for recognizing person activity in a sequence of frames (100) comprising: obtaining a set of consecutive 3D poses (103), obtaining a feature map (102), obtaining a vector of spatiotemporal features, obtaining a matrix of spatial attention weights, obtaining a matrix of temporal attention weights (110), modulating (106) the feature map using the matrix of spatial attention weights to obtain a spatially modulated feature map, modulating (111) the feature map using the matrix of temporal attention weights to obtain a temporally modulated feature map, performing a convolution (114) of the spatially modulated feature map and of the temporally modulated feature map to obtain a convoluted feature map, performing a classification (115) using the convoluted feature map so as to determine the activity of the person in the video.
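The two modulation steps named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say how the feature map is "modulated"; the sketch assumes element-wise multiplication with broadcasting, a feature map laid out as (time, height, width, channels), one spatial weight per location, and one temporal weight per instant. The function name `modulate` and all shapes are hypothetical choices for illustration, not taken from the patent.

```python
import numpy as np

def modulate(feature_map, spatial_w, temporal_w):
    """Return (spatially modulated, temporally modulated) feature maps.

    feature_map: (T, H, W, C) array; layout is an assumption.
    spatial_w:   (H, W) array, one weight per spatial location.
    temporal_w:  (T,) array, one weight per instant.
    Modulation is assumed to be element-wise multiplication.
    """
    # Broadcast the (H, W) weights over time and channels.
    spatial_mod = feature_map * spatial_w[None, :, :, None]
    # Broadcast the (T,) weights over space and channels.
    temporal_mod = feature_map * temporal_w[:, None, None, None]
    return spatial_mod, temporal_mod

# Illustrative shapes only: 8 instants, a 7x7 grid, 16 channels.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 7, 7, 16))
spatial_w = rng.uniform(size=(7, 7))
temporal_w = np.full(8, 1.0 / 8.0)  # uniform temporal attention

spatial_mod, temporal_mod = modulate(fmap, spatial_w, temporal_w)
```

Both outputs keep the shape of the input feature map, which is what allows them to be combined by a later convolution.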
Opening claim text (preview).
The invention claimed is:

1. A method for recognizing person activity in a video comprising a sequence of frames, each frame showing at least a portion of the person, the method comprising:
obtaining a set of consecutive 3D poses of the person using the sequence of frames, each of the consecutive 3D poses illustrating a posture of the person from a frame of the sequence of frames, and each of the consecutive 3D poses being associated with an instant in the sequence of frames,
obtaining a feature map elaborated using a first encoder neural network configured to receive the sequence of frames as input and to output the feature map having dimensions associated with time, space, and a number of channels,
obtaining a vector of spatiotemporal features using a second recurrent neural network configured to receive the set of consecutive 3D poses of the person as input,
a third neural network receiving the vector of spatiotemporal features as input and outputting a matrix of spatial attention weights, wherein each weight indicates an importance of a location in the matrix, wherein the third neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a sigmoid layer,
a fourth neural network, different from the third neural network, the fourth neural network receiving the vector of spatiotemporal features as input and outputting a matrix of temporal attention weights, wherein each weight indicates a saliency of an instant in the sequence of frames, wherein the fourth neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a Softmax layer,
obtaining a spatially-modulated feature map by modulating the feature map using the matrix of spatial attention weights,
obtaining a temporally-modulated feature map, different from the spatially-modulated feature map, by modulating the feature map using the matrix of temporal attention weights,
performing a convolution of the spatially modulated feature map and of the temporally modulated feature map to obtain a convoluted feature map,
performing a classification using the convoluted feature map so as to determine the activity of the person in the video.

2. The method according to claim 1, wherein the first encoder neural network includes a portion of an inflated 3D convolutional neural network.

3. The method according to claim 1, further comprising, prior to the performing the convolution: performing a Global Average Pooling on the spatially-modulated feature map; and performing a Global Average Pooling on the temporally modulated feature map.

4. The method according to claim 1, wherein the performing the convolution comprises performing a 1×1×1 convolution.

5. The method according to claim 1, wherein the performing the classification comprises using a Softmax function.

6. The method according to claim 1, wherein each of the consecutive 3D poses comprises a set of 3D coordinates (x_j) indicating positions of joints of a given skeleton.

7. The method according to claim 1, further comprising: a preliminary training step of at least one of the first encoder neural network, the second recurrent neural network, the third neural network, and the fourth neural network.

8. The method according to claim 1, further comprising: a preliminary training step comprising: determining a loss using a cross-entropy loss, determining a loss based on the matrix of spatial attention weights, and determining a loss based on the matrix of temporal attention weights.

9. A device for recognizing person activity in a video comprising a sequence of frames, each frame showing at least a portion of the person, the device comprising:
a module for obtaining a set of consecutive 3D poses of the person using the sequence of frames, each of the consecutive 3D poses illustrating a posture of the person from a frame of the sequence of frames, and each of the consecutive 3D poses being associated with an instant in the sequence of frames,
a first encoder neural network configured to receive the sequence of frames as input and to output the feature map having dimensions associated with time, space, and a number of channels,
a second neural network configured to receive the set of consecutive 3D poses of the person as input and to output a vector of spatiotemporal features,
a third neural network configured to receive the vector of spatiotemporal features as input and to output a matrix of spatial attention weights, wherein each weight indicates an importance of a location in the matrix, wherein the third neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a sigmoid layer,
a fourth neural network, different from the third neural network, configured to receive the vector of spatiotemporal features as input and to output a matrix of temporal attention weights, wherein each weight indicates a saliency of an instant in the sequence of frames, wherein the fourth neural network comprises a first fully connected layer, a hyperbolic tangent layer, a second fully connected layer, and a Softmax layer,
a module for obtaining a spatially-modulated feature map by modulating the feature map using the matrix of spatial attention weights,
a module for obtaining a temporally-modulated feature map, different from the spatially-modulated feature map, by modulating the feature map using the matrix of temporal attention weights,
a module for performing a convolution of the spatially modulated feature map and of the temporally modulated feature map to obtain a convoluted feature map,
a module for performing a classification using the convoluted feature map so as to determine the activity of the person in the video.

10. A system comprising the device of claim 9 and comprising a video acquisition module configured to obtain the video.

11. A non-transitory computer-readable medium comprising instructions stored thereon that when executed by a processor cause the processor to execute instructions for executing the steps of the method according to claim 1.
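The layer sequences recited in claims 1 and 3 to 5 can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the patented implementation: all sizes and random weights are hypothetical, the temporal weights are treated as a vector with one entry per instant, modulation is assumed to be element-wise multiplication, and the two pooled maps are assumed to be concatenated along the channel axis before the 1×1×1 convolution (the claims do not specify how the two maps are combined). After Global Average Pooling the volume is 1×1×1, so a 1×1×1 convolution reduces to a linear map over channels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def attention_head(features, w1, b1, w2, b2, final):
    """Fully connected -> tanh -> fully connected -> final activation."""
    hidden = np.tanh(w1 @ features + b1)
    return final(w2 @ hidden + b2)

def classify(spatial_mod, temporal_mod, conv_w, conv_b):
    """Global Average Pooling, 1x1x1 convolution, Softmax classification."""
    # Pool each (T, H, W, C) map over time and space, leaving (C,).
    pooled = np.concatenate([spatial_mod.mean(axis=(0, 1, 2)),
                             temporal_mod.mean(axis=(0, 1, 2))])  # assumed concat
    # On a pooled 1x1x1 volume, a 1x1x1 convolution is a channel-wise linear map.
    return softmax(conv_w @ pooled + conv_b)

# Hypothetical sizes: T instants, HxW grid, C channels, D pose features.
T, H, W, C, D, HID, CLASSES = 8, 7, 7, 16, 32, 16, 5
rng = np.random.default_rng(0)
pose_features = rng.standard_normal(D)  # stands in for the recurrent network output
fmap = rng.standard_normal((T, H, W, C))  # stands in for the encoder output

def init(rows, cols):
    return 0.1 * rng.standard_normal((rows, cols)), np.zeros(rows)

w1s, b1s = init(HID, D); w2s, b2s = init(H * W, HID)  # spatial head
w1t, b1t = init(HID, D); w2t, b2t = init(T, HID)      # temporal head
conv_w, conv_b = init(CLASSES, 2 * C)

spatial_w = attention_head(pose_features, w1s, b1s, w2s, b2s, sigmoid).reshape(H, W)
temporal_w = attention_head(pose_features, w1t, b1t, w2t, b2t, softmax)

spatial_mod = fmap * spatial_w[None, :, :, None]
temporal_mod = fmap * temporal_w[:, None, None, None]
probs = classify(spatial_mod, temporal_mod, conv_w, conv_b)
```

The sigmoid head scores each location independently in the range 0 to 1, whereas the Softmax head makes the temporal weights compete and sum to one across instants.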
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Combinations of networks · CPC title
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title