Dynamic multi-resolution processing for video classification

US11954910B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11954910-B2
Application number  US-202017134315-A
Country             US
Kind code           B2
Filing date         Dec 26, 2020
Priority date       Dec 26, 2020
Publication date    Apr 9, 2024
Grant date          Apr 9, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, apparatus, and systems for multi-resolution processing for video classification. A plurality of video frames of a video are obtained and a resolution for classifying each video frame of the plurality of video frames is determined by analyzing each video frame using a policy network. Based on the determined resolution, each video frame having a determined resolution is rescaled and each rescaled video frame is routed to a classifier of a backbone network that corresponds to the determined resolution. Each rescaled video frame is classified using the corresponding classifier of the backbone network to obtain a plurality of classifications and the classifications are averaged to determine an action classification of the video.
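
The pipeline in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: `rescale`, `classify_video`, the toy 2-D frame type, and the callables passed as `policy` and `classifiers` are hypothetical stand-ins for the policy network and the per-resolution classifiers of the backbone network.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]   # toy grayscale frame: rows of pixel values (assumption)
Scores = List[float]        # per-class scores for one frame


def rescale(frame: Frame, size: int) -> Frame:
    """Nearest-neighbour resize of a 2-D frame to size x size (toy stand-in)."""
    h, w = len(frame), len(frame[0])
    return [[frame[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def classify_video(
    frames: Sequence[Frame],
    policy: Callable[[Frame], int],
    resolutions: Sequence[int],
    classifiers: Sequence[Callable[[Frame], Scores]],
) -> Scores:
    """Pick a resolution per frame, rescale, route, classify, then average."""
    per_frame: List[Scores] = []
    for frame in frames:
        k = policy(frame)                          # index of the chosen resolution
        resized = rescale(frame, resolutions[k])   # rescale to that resolution
        per_frame.append(classifiers[k](resized))  # route to the matching classifier
    num_classes = len(per_frame[0])
    # Average the frame-level scores to get the video-level classification.
    return [sum(s[c] for s in per_frame) / len(per_frame)
            for c in range(num_classes)]
```

With two resolutions and a policy that picks the higher one for only one of two frames, the result is the mean of one low-resolution and one high-resolution prediction. In the patent the policy is a trained network; here it is any function that returns a resolution index.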

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the policy network has a feature extractor and is trained to determine the resolution targeted to action classification; rescaling, based on the determined resolution targeted for action classification, each video frame; routing each rescaled video frame to a classifier of a backbone network, wherein the classifier routed to corresponds to the determined resolution; classifying each rescaled video frame using the corresponding classifier of the backbone network to obtain a plurality of classifications; and averaging the classifications to determine an action classification of the video.

2. The method of claim 1, further comprising jointly training the policy network and a recognition model of the backbone network using standard backpropagation.

3. The method of claim 2, further comprising: sampling policies from a Gumbel Softmax distribution, enabling optimization of the policy network via the standard backpropagation; and optimizing the policy network based on the sampled policies.

4. The method of claim 1, wherein the policy network contains a long short-term memory (LSTM) module that performs the determination of the resolution.

5. The method of claim 1, wherein given a hidden state, the policy network estimates a policy distribution and samples an action $a_t \in \Omega = \{0, 1, \ldots, L+M-1\}$ via a Gumbel Softmax operation:

$$a_t \sim \mathrm{GUMBEL}(h_t, \theta_G)$$

wherein, when $a_t < L$, the video frame is rescaled to spatial resolution $3 \times H_{a_t} \times W_{a_t}$ and forwarded to the corresponding backbone network $\psi_{a_t}(\cdot\,; \theta_{\psi_{a_t}})$ to get a frame-level prediction,

$$y_t^{a_t} = \psi_{a_t}(I_t^{a_t}; \theta_{\psi_{a_t}})$$

wherein $I_t^{a_t} \in \mathbb{R}^{3 \times H_{a_t} \times W_{a_t}}$ is the rescaled video frame and $y_t^{a_t} \in \mathbb{R}^{C}$ is the frame-level prediction; and wherein, when the action $a_t \geq L$, the video frame is skipped for prediction and a subsequent $(F_{a_t - L} - 1)$ frames are skipped by the policy network, wherein $a_t$ represents an action at time $t$, $L$ is a number of resolutions, $M$ is a number of frames in a skipping sequence, $(H_x, W_x)$ is a frame resolution, $h_t$ is a hidden state, and $\theta_G$ denotes learnable parameters, and $\psi_{a_t}(\cdot\,; \theta_{\psi_{a_t}})$ is a corresponding backbone network.

6. The method of claim 1, further comprising substituting a differentiable sample from a corresponding Gumbel-Softmax distribution for an original non-differentiable sample from a discrete distribution.

7. The method of claim 6, further comprising generating, at each time step $t$, a logits $z \in \mathbb{R}^{L+M-1}$ from hidden states $h_t$ by a fully-connected layer $z = \mathrm{FC}(h_t, \theta_{FC})$ and using a Softmax to generate a categorical distribution $\pi_t$,

$$\pi_t = \left\{ p_i \;\middle|\; p_i = \frac{\exp(z_i)}{\sum_{j=0}^{L+M-1} \exp(z_j)} \right\}$$

wherein discrete samples from a categorical distribution are drawn as follows:

$$\hat{p} = \arg\max_i \left( \log p_i + G_i \right)$$

where $G_i = -\log(-\log U_i)$ is a standard Gumbel distribution with $U_i$ sampled from a uniform i.i.d. distribution $\mathrm{Unif}(0, 1)$.

8. The method of claim 7, further comprising approximating, during a backward pass, a gradient of the discrete samples by computing a gradient of a continuous softmax relaxation.

9. The method of claim 1, further comprising measuring a classification quality, during training, using a standard cross-entropy loss based on:

$$\mathcal{L}_{acc} = \mathbb{E}\left[ -y \log\left( \mathcal{F}(V; \Theta) \right) \right]$$

where $\Theta = \{\theta_{\Phi}, \theta_{LSTM}, \theta_G, \theta_{\psi_{L-2}}\}$ and $(V; y)$ is a training video sample with an associated one-hot encoded label vector, wherein $V$ is a video and $\mathcal{F}$ is a classifier.

10. The method of claim 1, wherein a prediction for a video frame having a lowest resolution is only processed once by the policy network.

11. A non-transitory computer readable medium comprising computer executable instructions which when executed by a computer cause the computer to perform the method of: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the policy network has a feature extractor and is trained to determine the resolution targeted to action classification; rescaling, based on the determined resolution, each video frame; routing each rescaled video frame to a classifier of a backbone network, wherein the classifier routed to corresponds to the determined resolution; classifying each rescaled video frame using the corresponding classifier of the backbone network to obtain a plurality of classifications; and averaging the classifications to determine an action classification of the video.

12. An apparatus comprising: a memory; and at least one processor, coupled to said memory, and operative to perform operations comprising: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the…
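
The sampling step recited in claims 5, 7, and 8 can be illustrated with a short, standard-library Python sketch. It is a hedged illustration rather than the patented implementation: the function names, the temperature `tau`, and the example `skip_lengths` values are assumptions introduced here. The hard sample follows the arg max of log-probabilities plus Gumbel noise from claim 7, and the soft vector is one common form of the continuous softmax relaxation that claim 8 refers to.

```python
import math
import random
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log of the softmax probabilities p_i."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]


def gumbel_noise(rng: random.Random) -> float:
    """G = -log(-log U), with U ~ Unif(0, 1); U is clamped away from 0."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def gumbel_sample(logits: Sequence[float], rng: random.Random,
                  tau: float = 1.0) -> Tuple[int, List[float]]:
    """Return (hard sample, soft relaxation) for one time step.

    hard: arg max_i (log p_i + G_i), the discrete action.
    soft: softmax of the perturbed values divided by tau (tau > 0 is an
          assumed temperature), a differentiable surrogate for the backward pass.
    """
    perturbed = [lp + gumbel_noise(rng) for lp in log_softmax(logits)]
    hard = max(range(len(perturbed)), key=perturbed.__getitem__)
    soft = [math.exp(v) for v in log_softmax([v / tau for v in perturbed])]
    return hard, soft


def decode_action(action: int, num_resolutions: int,
                  skip_lengths: Sequence[int]) -> Tuple[str, int]:
    """Interpret an action as in claim 5.

    action < L: classify the frame at resolution index `action`.
    action >= L: skip the frame; the second value is the number of
                 subsequent frames also skipped, F[action - L] - 1.
    """
    if action < num_resolutions:
        return ("classify", action)
    return ("skip", skip_lengths[action - num_resolutions] - 1)
```

For example, with three resolutions and hypothetical skip lengths `[2, 4]`, `decode_action(4, 3, [2, 4])` means the current frame is skipped along with the next three. Drawing many hard samples from fixed logits reproduces the softmax probabilities, which is why the arg max form can stand in for direct categorical sampling.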

Assignees

Inventors

Classifications

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • G06V20/41 · Primary

    Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49) · CPC title

  • Validation; Performance evaluation; Active pattern learning techniques · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11954910B2 cover?
Methods, apparatus, and systems for multi-resolution processing for video classification. A plurality of video frames of a video are obtained and a resolution for classifying each video frame of the plurality of video frames is determined by analyzing each video frame using a policy network. Based on the determined resolution, each video frame having a determined resolution is rescaled and each…
Who is the assignee on this patent?
IBM and Massachusetts Institute of Technology
What technology area does this patent fall under?
Primary CPC classification G06V20/41. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 9, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).