Action recognition with high-order interaction through spatial-temporal object tracking
US-2021081673-A1 · Mar 18, 2021 · US
US11954910B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11954910-B2 |
| Application number | US-202017134315-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 26, 2020 |
| Priority date | Dec 26, 2020 |
| Publication date | Apr 9, 2024 |
| Grant date | Apr 9, 2024 |
Official abstract text for this publication.
Methods, apparatus, and systems for multi-resolution processing for video classification. A plurality of video frames of a video are obtained and a resolution for classifying each video frame of the plurality of video frames is determined by analyzing each video frame using a policy network. Based on the determined resolution, each video frame having a determined resolution is rescaled and each rescaled video frame is routed to a classifier of a backbone network that corresponds to the determined resolution. Each rescaled video frame is classified using the corresponding classifier of the backbone network to obtain a plurality of classifications and the classifications are averaged to determine an action classification of the video.
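The flow in the abstract (choose a resolution per frame, rescale, route to the matching classifier, average the results) can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the patent: `toy_policy` is a hand-written rule standing in for the trained policy network, `toy_classifier` stands in for the backbone classifiers, and the `RESOLUTIONS` values, the nearest-neighbour resize, and the two-class output are assumptions chosen only to keep the example small.

```python
from statistics import mean

# Hypothetical frame side lengths, one per classifier (not from the patent).
RESOLUTIONS = [8, 4, 2]


def toy_policy(frame):
    """Hypothetical stand-in for the policy network: returns a resolution index."""
    pixels = [p for row in frame for p in row]
    contrast = max(pixels) - min(pixels)
    if contrast > 0.66:
        return 0  # highest resolution
    if contrast > 0.33:
        return 1
    return 2  # lowest resolution


def rescale(frame, side):
    """Nearest-neighbour resize of a square 2D frame to side x side."""
    n = len(frame)
    return [[frame[r * n // side][c * n // side] for c in range(side)]
            for r in range(side)]


def toy_classifier(frame):
    """Hypothetical classifier: maps mean brightness to two class scores."""
    brightness = mean(p for row in frame for p in row)
    return [1.0 - brightness, brightness]


def classify_video(frames, policy, classifiers, resolutions):
    """Pick a resolution per frame, rescale, route, classify, then average."""
    per_frame = []
    for frame in frames:
        idx = policy(frame)                          # determine the resolution
        resized = rescale(frame, resolutions[idx])   # rescale the frame
        per_frame.append(classifiers[idx](resized))  # route and classify
    num_classes = len(per_frame[0])
    # Average the frame-level scores into one video-level score per class.
    return [mean(scores[k] for scores in per_frame) for k in range(num_classes)]


dark = [[0.0] * 8 for _ in range(8)]
bright = [[1.0] * 8 for _ in range(8)]
checker = [[float((r + c) % 2) for c in range(8)] for r in range(8)]
video_scores = classify_video(
    [dark, bright, checker], toy_policy, [toy_classifier] * 3, RESOLUTIONS
)
```

In this sketch the classifier list is indexed by the same value the policy returns, which mirrors the "classifier that corresponds to the determined resolution" wording. A real system would use one trained model per resolution rather than a shared toy function.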
Opening claim text (preview).
What is claimed is:

1. A method comprising: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the policy network has a feature extractor and is trained to determine the resolution targeted to action classification; rescaling, based on the determined resolution targeted for action classification, each video frame; routing each rescaled video frame to a classifier of a backbone network, wherein the classifier routed to corresponds to the determined resolution; classifying each rescaled video frame using the corresponding classifier of the backbone network to obtain a plurality of classifications; and averaging the classifications to determine an action classification of the video.

2. The method of claim 1, further comprising jointly training the policy network and a recognition model of the backbone network using standard backpropagation.

3. The method of claim 2, further comprising: sampling policies from a Gumbel Softmax distribution, enabling optimization of the policy network via the standard backpropagation; and optimizing the policy network based on the sampled policies.

4. The method of claim 1, wherein the policy network contains a long short-term memory (LSTM) module that performs the determination of the resolution.

5. The method of claim 1, wherein given a hidden state, the policy network estimates a policy distribution and samples an action $a_t \in \Omega = \{0, 1, \ldots, L+M-1\}$ via a Gumbel Softmax operation:

$$a_t \sim \mathrm{GUMBEL}(h_t, \theta_G)$$

wherein, when $a_t < L$, the video frame is rescaled to spatial resolution $3 \times H_{a_t} \times W_{a_t}$ and forwarded to the corresponding backbone network $\psi_{a_t}(\cdot\,; \theta_{\psi_{a_t}})$ to get a frame-level prediction,

$$y_t^{a_t} = \psi_{a_t}(I_t^{a_t}; \theta_{\psi_{a_t}})$$

wherein $I_t^{a_t} \in \mathbb{R}^{3 \times H_{a_t} \times W_{a_t}}$ is the rescaled video frame and $y_t^{a_t} \in \mathbb{R}^{C}$ is the frame-level prediction; and wherein, when the action $a_t \geq L$, the video frame is skipped for prediction and a subsequent $(F_{a_t - L} - 1)$ frames are skipped by the policy network, wherein $a_t$ represents an action at time $t$, $L$ is a number of resolutions, $M$ is a number of frames in a skipping sequence, $(H_x, W_x)$ is a frame resolution, $h_t$ is a hidden state, and $\theta_G$ denotes learnable parameters, and $\psi_{a_t}(\cdot\,; \theta_{\psi_{a_t}})$ is a corresponding backbone network.

6. The method of claim 1, further comprising substituting a differentiable sample from a corresponding Gumbel-Softmax distribution for an original non-differentiable sample from a discrete distribution.

7. The method of claim 6, further comprising generating, at each time step $t$, a logits $z \in \mathbb{R}^{L+M-1}$ from hidden states $h_t$ by a fully-connected layer $z = \mathrm{FC}(h_t, \theta_{FC})$ and using a Softmax to generate a categorical distribution $\pi_t$,

$$\pi_t = \left\{ p_i \;\middle|\; p_i = \frac{\exp(z_i)}{\sum_{j=0}^{L+M-1} \exp(z_j)} \right\}$$

wherein discrete samples from a categorical distribution are drawn as follows:

$$\hat{p} = \arg\max_i \left( \log p_i + G_i \right)$$

where $G_i = -\log(-\log U_i)$ is a standard Gumbel distribution with $U_i$ sampled from a uniform i.i.d. distribution $\mathrm{Unif}(0, 1)$.

8. The method of claim 7, further comprising approximating, during a backward pass, a gradient of the discrete samples by computing a gradient of a continuous softmax relaxation.

9. The method of claim 1, further comprising measuring a classification quality, during training, using a standard cross-entropy loss based on:

$$\mathcal{L}_{acc} = \mathbb{E}\left[ -y \log\left( \mathcal{F}(V; \Theta) \right) \right]$$

where $\Theta = \{\theta_{\Phi}, \theta_{LSTM}, \theta_G, \theta_{\psi_{L-2}}\}$ and $(V; y)$ is a training video sample with an associated one-hot encoded label vector, wherein $V$ is a video and $\mathcal{F}$ is a classifier.

10. The method of claim 1, wherein a prediction for a video frame having a lowest resolution is only processed once by the policy network.

11. A non-transitory computer readable medium comprising computer executable instructions which when executed by a computer cause the computer to perform the method of: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the policy network has a feature extractor and is trained to determine the resolution targeted to action classification; rescaling, based on the determined resolution, each video frame; routing each rescaled video frame to a classifier of a backbone network, wherein the classifier routed to corresponds to the determined resolution; classifying each rescaled video frame using the corresponding classifier of the backbone network to obtain a plurality of classifications; and averaging the classifications to determine an action classification of the video.

12. An apparatus comprising: a memory; and at least one processor, coupled to said memory, and operative to perform operations comprising: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the
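
The sampling described in claims 5 to 8 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the temperature `tau` is an assumed parameter of the continuous relaxation that the claims do not name, and the logits are made-up numbers rather than the output of a trained fully-connected layer.

```python
import math
import random


def log_softmax(logits):
    """Log of the categorical probabilities p_i = exp(z_i) / sum_j exp(z_j)."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]


def gumbel_noise(rng):
    """G = -log(-log U) with U drawn from Unif(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, which would break log(u)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Discrete sample: argmax_i (log p_i + G_i)."""
    scores = [lp + gumbel_noise(rng) for lp in log_softmax(logits)]
    return max(range(len(scores)), key=scores.__getitem__)


def gumbel_softmax_relaxation(logits, rng, tau=1.0):
    """Continuous relaxation: softmax((log p_i + G_i) / tau).

    tau is an assumed temperature parameter, not named in the claims.
    """
    scores = [(lp + gumbel_noise(rng)) / tau for lp in log_softmax(logits)]
    return [math.exp(s) for s in log_softmax(scores)]


def interpret_action(action, num_resolutions):
    """Claim 5 rule: a_t < L classifies at resolution a_t, otherwise skips."""
    if action < num_resolutions:
        return ("classify", action)
    return ("skip", action - num_resolutions)


rng = random.Random(0)
example_logits = [2.0, 0.5, -1.0, 0.0]  # made-up values for illustration
sampled_action = gumbel_max_sample(example_logits, rng)
decision = interpret_action(sampled_action, num_resolutions=3)
soft_sample = gumbel_softmax_relaxation(example_logits, rng, tau=0.5)
```

The argmax in `gumbel_max_sample` is not differentiable, which is why claim 8 approximates its gradient with that of a softmax relaxation during the backward pass. `gumbel_softmax_relaxation` shows only the forward computation of such a relaxation; an automatic differentiation framework would be needed to take its gradient.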
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title
Validation; Performance evaluation; Active pattern learning techniques · CPC title