Vision transformers leveraging temporal redundancy
US-2025095349-A1 · Mar 20, 2025 · US
US12518403B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12518403-B2 |
| Application number | US-202318480127-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 3, 2023 |
| Priority date | Nov 2, 2022 |
| Publication date | Jan 6, 2026 |
| Grant date | Jan 6, 2026 |
Official abstract text for this publication.
A method for multi-object tracking from video. The method includes the following steps: (1) capturing frames from the streaming source and preprocessing the data; (2) extracting video features with one of three choices: a 3D-CNN backbone followed by a Transformer Encoder, a Video Transformer Encoder, or a 2D-CNN Encoder with a stack of frames as input followed by a Transformer Encoder; (3) multi-object tracking using a new end-to-end multi-task deep learning model named JDAT (Joint Detection Association Transformer), then post-processing and updating the tracking state with a Temporal Aggregation Module (TAM). The deep learning models in step 2 and step 3 are trained simultaneously end-to-end with a loss function that is accumulated over multiple timesteps (Collective Average Loss, CAL). The model can also be pretrained first on a weakly labeled image dataset in a self-supervised manner, then fine-tuned on supervised video datasets with full tracking labels.
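The abstract states only that the training loss is accumulated over multiple timesteps; it gives no formula. The following is a minimal sketch of one plausible reading, assuming each timestep yields a detection loss and an association loss that are combined by a weighted sum and then averaged over the window. The function name, the `association_weight` parameter, and the weighted-sum-then-mean form are all assumptions for illustration, not the patent's definition of CAL.

```python
from typing import Sequence, Tuple


def collective_average_loss(
    per_step_losses: Sequence[Tuple[float, float]],
    association_weight: float = 1.0,
) -> float:
    """Accumulate a multi-task loss over timesteps and average it.

    Each entry is (detection_loss, association_loss) for one timestep.
    ASSUMPTION: the weighted sum and the plain mean are illustrative;
    the patent text does not specify how the losses are combined.
    """
    if not per_step_losses:
        raise ValueError("need at least one timestep")
    total = 0.0
    for detection_loss, association_loss in per_step_losses:
        # Accumulate the combined loss of this timestep.
        total += detection_loss + association_weight * association_loss
    # Average over the number of timesteps in the window.
    return total / len(per_step_losses)


# Hypothetical losses for a three-timestep training window.
window = [(1.0, 0.5), (0.5, 0.25), (0.25, 0.0)]
loss = collective_average_loss(window)
```

Averaging over the window, rather than backpropagating after every frame, is one way a single gradient step could reflect errors that only show up after several frames, such as identity switches.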
Opening claim text (preview).
The invention claimed is:

1. A method for multi-object tracking from video, comprising the steps of:

Step 1: capturing frames from a streaming source and preprocessing the data; a Camera Stream Capturing module captures consecutive frames from a data stream from the streaming source, samples the required frames, pre-processes them, concatenates nearby frames into a cube of frames, then puts them into a Frames Cube Input Queue as input for the next step; the preprocessing step standardizes every stream to the same frame rate and resolution;

Step 2: video feature extraction; a feature extraction module takes a cube of adjacent frames as input to a deep learning model capable of extracting features in both space and time (spatial-temporal features), comprising three choices for the deep learning feature extractor: (i) a 3D-CNN backbone followed by a Transformer Encoder, (ii) a Video Transformer Encoder, and (iii) a 2D-CNN encoder with a stack of frames as input followed by a Transformer Encoder;

Step 3: multi-object tracking, post-processing and updating tracking state; an object tracking module performs object detection and association based on the input features of the frames of interest, combined with a current tracking state; this step includes three sub-steps as follows: (3.1) object detection and association using a multi-task end-to-end deep learning model, named JDAT (Joint Detection Association Transformer), composed of three key components: a Transformer Decoder model, a Feature Relation Transformer model and a Differentiable Matching Layer; (3.2) post-processing and updating the tracking state; (3.3) updating the tracking state using a Temporal Aggregation Module (TAM);

in which the deep learning models in step 2 and step 3 are trained simultaneously end-to-end with a loss function that is accumulated over multiple timesteps (Collective Average Loss, CAL).

2.
The method for tracking multiple objects from video of claim 1, wherein: in step 2, a different approach is employed to extract features of a frame of interest, which combines information from nearby frames (both before and after) to enhance the extracted features of a target frame; specifically, the feature extraction module takes a cube of adjacent frames as input to a deep learning model capable of extracting features in both space and time (spatial-temporal features), comprising three choices for the deep learning feature extractor: (1) a 3D-CNN backbone followed by a Transformer Encoder, (2) a Video Transformer Encoder, and (3) a 2D-CNN encoder with a stack of frames as input followed by a Transformer Encoder.

3. The method for tracking multiple objects from video of claim 2, wherein: in step 2, for the two model choices of 3D-CNN or Video Transformer, the deep learning model takes as input a sequence of K frames sampled around a sequence of T frames of interest (K ≥ T; typically K is a multiple of T), represented by a matrix $I \in \mathbb{R}^{K \times H \times W \times C}$; the output of the model is the extracted feature of the T frames of interest, represented by a matrix $F \in \mathbb{R}^{T \times \frac{H}{R} \times \frac{W}{R} \times D}$ (typically, K = 32), in which the matrix $F_i \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times D}$ is considered the feature of the i-th frame; this model choice extracts the features of T frames in a single inference step (parallelism), instead of T inference steps as in traditional methods using a 2D-CNN over each frame.

4. The method for tracking multiple objects from video of claim 3, wherein: in step 2, use 3D-CNN models such as Temporal Shift Module (TSM), SlowFast, X3D, MoViNet, etc.
for video feature extraction, use a Transformer encoder model immediately after the 3D-CNN to increase the ability to extract global features and resolve the weaknesses of CNN models, thus improving accuracy in later steps; the Transformer part contains N effective linear self-attention layers (e.g., 2 ≤ N ≤ 8), which slightly increases the computational cost but significantly improves the ability to extract global, context-aware interactions thanks to the Transformer's attention mechanism.

5. The method for tracking multiple objects from video of claim 3, wherein: in step 2, for a Video Transformer model, use models such as Swin Transformer and MViT (Multi-scale Vision Transformer) for video feature extraction; these models can effectively extract features in both space and time with global context thanks to the Transformer architecture.

6. The method for tracking multiple objects from video of claim 2, wherein: in step 2, for the choice of using 2D-CNN variants, extract the feature of a single frame Y of interest; specifically, sample K frames around that frame Y of interest and stack the K frames along the channel dimension, obtaining an input matrix $I \in \mathbb{R}^{H \times W \times (C \times K)}$; this matrix becomes the input for common 2D-CNN models such as ResNet, MobileNet, etc., for feature extraction; the output is a feature matrix $F \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times D}$ considered as the feature of the frame Y; this feature matrix is then fed as input to a Transformer Encoder model to be enhanced with global context information; this additional Transformer Encoder is effective when K is small (e.g., K ≤ 3) without any change to the 2D-CNN architecture, also benefits from various models pre-trained on large datasets, selects and extracts features only for the frames of interest, and ensures that features of all frames of interest are extracted for the next processing step.

7.
The method for tracking multiple objects from video of claim 1, wherein: in step 3, sub-step 3.1, the proposed model is the Joint Detection Association Transformer (JDAT), which performs object detection and associates newly detected objects with a list of kept tracks (a track is the tracking state corresponding to a specific object instance; each track can be represented by that instance's properties such as identifier, location, velocity, size, recognition features, etc.) and allows end-to-end training; this model is composed of three key components: (i) a Transformer Decoder model, (ii) a Feature Relation Transformer and (iii) a Differentiable Matching Layer.

8. The method for tracking multiple objects from video of claim 7, wherein: the Transformer Decoder model is built on the idea of DETR (Detection Transformer), an approach to object detection with the Transformer architecture; DETR's output is set-based and uses a parallel decoding mechanism instead of autoregressive decoding; the set of input queries is constructed as a union of two sets: (1) the object queries set, which inherits the idea from DETR and consists of $N_{objs}$ vectors (typically, $N_{objs} = 100$), and (2) the track queries set, which consists of $N_{tracks}$ vectors, each vector representing a track that is considered to be active at the current time,
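The query construction described in claim 8 can be sketched as follows: a fixed set of object queries is joined with one query per currently active track before being passed to the decoder. This is a minimal illustration in plain Python, not the patented implementation. The `Track` class, the `build_decoder_queries` function, the ordering (object queries first), and the `slot_ids` bookkeeping are hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


@dataclass
class Track:
    """Hypothetical track record; the field names are assumptions."""
    track_id: int
    query: Vector   # embedding that represents this track
    active: bool    # whether the track is considered active right now


def build_decoder_queries(
    object_queries: Sequence[Vector],
    tracks: Sequence[Track],
) -> Tuple[List[Vector], List[Optional[int]]]:
    """Join object queries with the queries of active tracks.

    Returns the combined query list and, per slot, the track id it
    belongs to (None for object-query slots, which can only yield
    newly detected objects).
    """
    active = [t for t in tracks if t.active]
    queries = list(object_queries) + [t.query for t in active]
    slot_ids: List[Optional[int]] = [None] * len(object_queries)
    slot_ids += [t.track_id for t in active]
    return queries, slot_ids


# Usage with the typical value from the claim (N_objs = 100) and a toy
# embedding size of 4 (the embedding size is an assumption).
N_OBJS, DIM = 100, 4
object_queries = [[0.0] * DIM for _ in range(N_OBJS)]
tracks = [
    Track(track_id=7, query=[1.0] * DIM, active=True),
    Track(track_id=9, query=[2.0] * DIM, active=False),  # inactive, so skipped
    Track(track_id=12, query=[3.0] * DIM, active=True),
]
queries, slot_ids = build_decoder_queries(object_queries, tracks)
print(len(queries))  # 102: 100 object queries plus 2 active track queries
```

Because all queries are decoded in parallel, keeping a slot-to-track mapping alongside the query list is one simple way to tell continued tracks apart from new detections in the decoder output.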
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
using neural networks · CPC title
Video; Image sequence · CPC title
Artificial neural networks [ANN] · CPC title
Training; Learning · CPC title