Who is the assignee on this patent?

IBM and Massachusetts Institute of Technology.

What technology area does this patent fall under?

Primary CPC classification G06V20/41. Mapped technology areas include Physics.

When was this patent published?

Publication date Apr 9, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Dynamic multi-resolution processing for video classification

Patent metadata
Field	Value
Publication number	US-11954910-B2
Application number	US-202017134315-A
Country	US
Kind code	B2
Filing date	Dec 26, 2020
Priority date	Dec 26, 2020
Publication date	Apr 9, 2024
Grant date	Apr 9, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, apparatus, and systems for multi-resolution processing for video classification. A plurality of video frames of a video are obtained and a resolution for classifying each video frame of the plurality of video frames is determined by analyzing each video frame using a policy network. Based on the determined resolution, each video frame having a determined resolution is rescaled and each rescaled video frame is routed to a classifier of a backbone network that corresponds to the determined resolution. Each rescaled video frame is classified using the corresponding classifier of the backbone network to obtain a plurality of classifications and the classifications are averaged to determine an action classification of the video.
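The flow described in the abstract (pick a resolution per frame, rescale, route to the matching classifier, average the results) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `RESOLUTIONS`, `toy_policy`, and `make_toy_classifier` are hypothetical stand-ins for the trained policy network and backbone classifiers, and the variance threshold used to choose a resolution is invented purely for the example.

```python
import numpy as np

# Hypothetical candidate resolutions (square side lengths), highest first.
RESOLUTIONS = [224, 168, 112]

def rescale(frame, side):
    """Nearest-neighbour resize of an (H, W, 3) frame to (side, side, 3)."""
    h, w = frame.shape[:2]
    rows = np.arange(side) * h // side
    cols = np.arange(side) * w // side
    return frame[rows][:, cols]

def toy_policy(frame):
    """Stand-in for the policy network: returns an index into RESOLUTIONS.

    Assumption for illustration only: busier frames get a higher resolution.
    """
    spread = frame.std()
    if spread > 0.25:
        return 0
    if spread > 0.10:
        return 1
    return 2

def make_toy_classifier(side, num_classes, seed):
    """Stand-in for one resolution-specific classifier of the backbone."""
    weights = np.random.default_rng(seed).normal(size=(3, num_classes))

    def classify(frame):
        assert frame.shape == (side, side, 3)
        logits = frame.mean(axis=(0, 1)) @ weights
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()  # per-frame class probabilities

    return classify

def classify_video(frames, policy, classifiers):
    """Route each frame to the classifier for its chosen resolution, then average."""
    predictions = []
    for frame in frames:
        k = policy(frame)
        predictions.append(classifiers[k](rescale(frame, RESOLUTIONS[k])))
    return np.mean(predictions, axis=0)

# Usage: four random 240x320 RGB frames and five hypothetical action classes.
rng = np.random.default_rng(0)
frames = [rng.random((240, 320, 3)) for _ in range(4)]
classifiers = [make_toy_classifier(s, 5, seed=i) for i, s in enumerate(RESOLUTIONS)]
video_probs = classify_video(frames, toy_policy, classifiers)
print(video_probs.argmax())
```

Keeping one classifier per resolution mirrors the routing step in the abstract; a real system would replace the toy functions with trained networks.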

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the policy network has a feature extractor and is trained to determine the resolution targeted to action classification; rescaling, based on the determined resolution targeted for action classification, each video frame; routing each rescaled video frame to a classifier of a backbone network, wherein the classifier routed to corresponds to the determined resolution; classifying each rescaled video frame using the corresponding classifier of the backbone network to obtain a plurality of classifications; and averaging the classifications to determine an action classification of the video.

2. The method of claim 1, further comprising jointly training the policy network and a recognition model of the backbone network using standard backpropagation.

3. The method of claim 2, further comprising: sampling policies from a Gumbel Softmax distribution, enabling optimization of the policy network via the standard backpropagation; and optimizing the policy network based on the sampled policies.

4. The method of claim 1, wherein the policy network contains a long short-term memory (LSTM) module that performs the determination of the resolution.

5. The method of claim 1, wherein given a hidden state, the policy network estimates a policy distribution and samples an action $a_t \in \Omega = \{0, 1, \ldots, L+M-1\}$ via a Gumbel Softmax operation:

$$a_t \sim \mathrm{GUMBEL}(h_t, \theta_G)$$

wherein, when $a_t < L$, the video frame is rescaled to spatial resolution $3 \times H_{a_t} \times W_{a_t}$ and forwarded to the corresponding backbone network $\psi_{a_t}(\cdot; \theta_{\psi_{a_t}})$ to get a frame-level prediction,

$$y_t^{a_t} = \psi_{a_t}(I_t^{a_t}; \theta_{\psi_{a_t}})$$

wherein $I_t^{a_t} \in \mathbb{R}^{3 \times H_{a_t} \times W_{a_t}}$ is the rescaled video frame and $y_t^{a_t} \in \mathbb{R}^{C}$ is the frame-level prediction; and wherein, when the action $a_t \geq L$, the video frame is skipped for prediction and a subsequent $(F_{a_t - L} - 1)$ frames are skipped by the policy network, wherein $a_t$ represents an action at time $t$, $L$ is a number of resolutions, $M$ is a number of frames in a skipping sequence, $(H_x, W_x)$ is a frame resolution, $h_t$ is a hidden state, and $\theta_G$ denotes learnable parameters, and $\psi_{a_t}(\cdot; \theta_{\psi_{a_t}})$ is a corresponding backbone network.

6. The method of claim 1, further comprising substituting a differentiable sample from a corresponding Gumbel-Softmax distribution for an original non-differentiable sample from a discrete distribution.

7. The method of claim 6, further comprising generating, at each time step $t$, a logits $z \in \mathbb{R}^{L+M-1}$ from hidden states $h_t$ by a fully-connected layer $z = \mathrm{FC}(h_t, \theta_{FC})$ and using a Softmax to generate a categorical distribution $\pi_t$,

$$\pi_t = \left\{ p_i \;\middle|\; p_i = \frac{\exp(z_i)}{\sum_{j=0}^{L+M-1} \exp(z_j)} \right\}$$

wherein discrete samples from a categorical distribution are drawn as follows:

$$\hat{p} = \arg\max_i \left( \log p_i + G_i \right)$$

where $G_i = -\log(-\log U_i)$ is a standard Gumbel distribution with $U_i$ sampled from a uniform i.i.d. distribution $\mathrm{Unif}(0, 1)$.

8. The method of claim 7, further comprising approximating, during a backward pass, a gradient of the discrete samples by computing a gradient of a continuous softmax relaxation.

9. The method of claim 1, further comprising measuring a classification quality, during training, using a standard cross-entropy loss based on:

$$\mathcal{L}_{acc} = \mathbb{E}\left[ -y \log\left( \mathcal{P}(V; \Theta) \right) \right]$$

where $\Theta = \{\theta_{\Phi}, \theta_{LSTM}, \theta_G, \theta_{\psi_{L-2}}\}$ and $(V; y)$ is a training video sample with an associated one-hot encoded label vector, wherein $V$ is a video and $\mathcal{P}$ is a classifier.

10. The method of claim 1, wherein a prediction for a video frame having a lowest resolution is only processed once by the policy network.

11. A non-transitory computer readable medium comprising computer executable instructions which when executed by a computer cause the computer to perform the method of: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the policy network has a feature extractor and is trained to determine the resolution targeted to action classification; rescaling, based on the determined resolution, each video frame; routing each rescaled video frame to a classifier of a backbone network, wherein the classifier routed to corresponds to the determined resolution; classifying each rescaled video frame using the corresponding classifier of the backbone network to obtain a plurality of classifications; and averaging the classifications to determine an action classification of the video.

12. An apparatus comprising: a memory; and at least one processor, coupled to said memory, and operative to perform operations comprising: obtaining a plurality of video frames of a video; determining a resolution targeted for action classification for classifying each video frame of the plurality of video frames by analyzing each video frame using a policy network, wherein the
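The sampling step recited in claims 5 to 8 can be illustrated with a short NumPy sketch. It is a hedged reading of the claim text, not the patentee's code: the function names, the temperature `tau` of the relaxation, and the `skip_lengths` list (standing in for the values F in claim 5) are assumptions introduced for this example.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log of the categorical distribution p_i in claim 7."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

def sample_gumbel(shape, rng, eps=1e-20):
    """Draw G_i = -log(-log U_i) with U_i ~ Unif(0, 1); eps guards against log(0)."""
    u = rng.uniform(0.0, 1.0, size=shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_max_sample(logits, rng):
    """Discrete sample: argmax_i (log p_i + G_i)."""
    return int(np.argmax(log_softmax(logits) + sample_gumbel(logits.shape, rng)))

def gumbel_softmax_relaxation(logits, rng, tau=1.0):
    """Continuous softmax relaxation of the discrete sample (cf. claim 8).

    tau is a temperature parameter assumed here; it is not named in the claims.
    """
    y = (log_softmax(logits) + sample_gumbel(logits.shape, rng)) / tau
    exp = np.exp(y - np.max(y))
    return exp / exp.sum()

def interpret_action(action, num_resolutions, skip_lengths):
    """Map a sampled action to a decision, following the two cases in claim 5.

    action < L: classify at resolution index `action`.
    action >= L: skip this frame plus (F[action - L] - 1) subsequent frames.
    """
    if action < num_resolutions:
        return ("classify", action)
    return ("skip", skip_lengths[action - num_resolutions] - 1)

# Usage with hypothetical logits for L = 3 resolutions and two skip options.
rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0, 0.0, 0.3])
action = gumbel_max_sample(logits, rng)
print(interpret_action(action, num_resolutions=3, skip_lengths=[2, 4]))
print(gumbel_softmax_relaxation(logits, rng, tau=0.5))
```

Adding Gumbel noise to the log-probabilities and taking the argmax yields samples distributed according to the softmax probabilities, which is why the relaxation can be substituted for the discrete sample during the backward pass.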

Classifications

G06N3/09
Supervised learning
G06N3/0464
Convolutional networks [CNN, ConvNet]
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
G06V20/41 (Primary)
Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (segmenting video sequences G06V20/49)
G06F18/217
Validation; Performance evaluation; Active pattern learning techniques

Patent family

Related publications grouped by family.

View patent family 82218697

Related patents

Action recognition with high-order interaction through spatial-temporal object tracking

Techniques for evaluating compressed motion video quality

Method for analysing media content

Reducing image resolution in deep convolutional networks
