What technology area does this patent fall under?

Primary CPC classification G06V20/46. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jan 6, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Deep learning method for multiple object tracking from video

Patent metadata
Field	Value
Publication number	US-12518403-B2
Application number	US-202318480127-A
Country	US
Kind code	B2
Filing date	Oct 3, 2023
Priority date	Nov 2, 2022
Publication date	Jan 6, 2026
Grant date	Jan 6, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for multi-object tracking from video. The method includes the following steps: (1) Capturing frames from the streaming source and preprocess the data; (2) Extract video features with three choices: a 3D-CNN backbone followed by a Transformer Encoder, a Video Transformer Encoder, a 2D-CNN Encoder with a stack of frames as input followed by a Transformer Encoder; (3) Multi-object tracking using a new end-to-end multi-task deep learning model named JDAT (Joint Detection Association Transformer), then post-processing and updating tracking state with Temporal Aggregation Module (TAM). The deep learning models in step 2 and step 3 are trained simultaneously end-to-end with a loss function that is accumulated over multiple timesteps (Collective Average Loss—CAL). Also, the model can be pretrained with weakly labeled image dataset in a self-supervised learning manner first, then finetuned on supervised video datasets with full tracking labels.
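
The abstract names the Collective Average Loss (CAL) but this page does not give its formula. As a minimal sketch only, the Python below assumes one plausible reading: per-timestep losses are summed over a clip and divided once by the total number of ground-truth objects in those timesteps. The function name, its arguments, and the choice of normalisation are illustrative assumptions, not the patent's definition.

```python
from typing import Sequence


def collective_average_loss(
    step_losses: Sequence[float],
    step_object_counts: Sequence[int],
) -> float:
    """Accumulate per-timestep losses and normalise once over the whole clip.

    ASSUMPTION: dividing by the total object count is an illustrative choice;
    the page only says the loss is "accumulated over multiple timesteps".
    """
    if len(step_losses) != len(step_object_counts):
        raise ValueError("one object count is required per timestep")
    total_objects = sum(step_object_counts)
    if total_objects <= 0:
        raise ValueError("at least one ground-truth object is required")
    return sum(step_losses) / total_objects


# Hypothetical clip of three timesteps with 2, 3 and 5 labelled objects.
clip_loss = collective_average_loss([1.0, 1.5, 2.5], [2, 3, 5])
print(clip_loss)
```

Under this assumed normalisation every labelled object carries equal weight across the clip, whereas averaging each frame separately would give every frame equal weight. Which of the two the patent intends is not stated on this page.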

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for multi-object tracking from video, comprising the steps of:

Step 1: capturing frames from a streaming source and preprocess the data; a Camera Stream Capturing module capture consecutive frames from a data stream from the streaming source, samples the required frames, pre-processes them, concatenate nearby frames into cube of frames, then puts them into a Frames Cube Input Queue as input for the next step, the preprocessing step standardizes every stream to the same frame rate and resolution;

Step 2: video feature extraction; a feature extraction module takes a cube of adjacent frames as input to a deep learning model capable of extracting features in both space and time (spatial-temporal features) comprising three choices for deep learning feature extractor: (i) a 3D-CNN backbone followed by a Transformer Encoder (ii) a Video Transformer Encoder and (iii) a 2D-CNN encoder with a stack of frames as input followed by a Transformer Encoder;

Step 3: multi-object tracking, post-processing and updating tracking state; a object tracking module performs object detection and association based on the input features of the frames of interest, combined with a current tracking state, this step includes three sub-steps as follows:
(3.1) object detection and association using a multi-task end-to-end deep learning model, named JDAT (Joint Detection Association Transformer), composed of three key components: a Transformer Decoder model, a Feature Relation Transformer model and a Differentiable Matching Layer;
(3.2) post-processing and update tracking state;
(3.3) Update tracking state using a Temporal Aggregation Module (TAM),

in which, the deep learning models in step 2 and step 3 are trained simultaneously end-to-end with a loss function that is accumulated over multiple timesteps (Collective Average Loss—CAL).

2. The method for tracking multiple objects from video of claim 1, wherein: in step 2, employing a different approach to extract features of a frame of interest, which combine information from nearby frames (both before or after) to enhance the extracted features of a target frame, specifically, the feature extraction module takes a cube of adjacent frames as input to a deep learning model with capable of extracting feature in both space and time (spatial-temporal feature), specifically, comprising three choices for deep learning feature extractor: (1) a 3D-CNN backbone followed by a Transformer encoder (2) a Video Transformer Encoder and (3) a 2D-CNN encoder with a stack of frames as input followed by a Transformer Encoder.

3. The method for tracking multiple objects from video of claim 2, wherein: in step 2, for the two model choices of 3D-CNN or Video Transformer, the deep learning model takes input as a sequence of K frames sampled around a sequence of T frames of interest (K≥T, typically K is a multiplier of T), represented by a matrix $I \in \mathbb{R}^{K \times H \times W \times C}$, the output of the model is the extracted feature of T frames of interest, represented by a matrix $F \in \mathbb{R}^{T \times \frac{H}{R} \times \frac{W}{R} \times D}$ (typically, K=32), in which the matrix $F_i \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times D}$ is considered as the feature of the $i$th frame wherein this model's choice extract features of T frames in a single inference step (parallelism), instead of T inference steps as traditional methods using 2D-CNN over each frame.

4. The method for tracking multiple objects from video of claim 3, wherein: in step 2, use 3D-CNNs models such a Temporal Shift Module (TSM), SlowFast, X3D, MoviNet, etc. for video feature extraction, use a Transformer encoder model immediately after the 3D-CNN to increase the ability to extract global features, resolve the weaknesses of CNN models thus improve accuracy in later step, the Transformer part contain N effective linear self-attention layers (e.g., 2≤N≤8), which slightly increases the computational cost but significantly improves the ability to extract global, context-aware interaction thanks to the Transformer's attention mechanism.

5. The method for tracking multiple objects from video of claim 3, wherein: in step 2, for a Video Transformer model, use models such as Swin Transformer and MVIT (Multi-scale Vision Transformer) for video feature extraction, these models can effectively extract features in both space and time with global context thanks to Transformer architecture.

6. The method for tracking multiple objects from video of claim 2, wherein: in step 2, for the choice of using 2D-CNN variants, extract feature of a single frame Y of interest, specifically, sample K frames around that frame Y of interest, stack the K frames along the channel dimension, obtaining a input matrix $I \in \mathbb{R}^{H \times W \times (C \times K)}$, this matrix becomes the input for common 2D-CNN models such as Resnet, MobileNet, etc., for feature extraction, the output is a feature matrix $F \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times D}$ considered as the feature of the frame Y, these feature matrix is then fed as input to a Transformer Encoder model to be enhanced with the global context information, this additional Transformer Encoder is effective when K is small (e.g. K≤3) without any change to the 2D-CNN architecture, also gaining benefit from various pre-trained models on large datasets, only select and extract feature of the frames of interest, and ensures that features of all frames of interest are extracted for the next processing step.

7. The method for tracking multiple objects from video of claim 1, wherein: in step 3, step 3.1, the proposed model is Joint Detection Association Transformer (JDAT), which performs object detection and associates newly detected objects with a list of keeping tracks (track is a term of tracking state corresponding to a specific object instance, each track can be represented by that instance's properties such as: identifier, location, velocity, size, recognition features, etc.) and allows end-to-end training, this model is composed of three key components: (i) a Transformer Decoder model, (ii) a Feature Relation Transformer and (iii) a Differentiable Matching Layer.

8. The method for tracking multiple objects from video of claim 7, wherein: the Transformer Decoder model is built based on the idea of DETR (Detection Transformer), an approach for object detection with the Transformer architecture, DETR's output is set-based, use parallel decoding mechanism instead of autoregressive decoding, the set of input queries is constructed as a union of two sets: (1) the object queries set inherits the idea from DETR, consisting of $N_{objs}$ vectors (typically, $N_{objs}$=100) and (2) the track queries set consists of $N_{tracks}$ vectors, each vector representing a track that is considered to be active at the current time,
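
Claims 6 and 8 describe two simple data-layout steps: stacking K frames along the channel dimension, and forming the decoder input as the union of object queries and track queries. A minimal pure-Python sketch of both follows. The function names, the nested-list representation, and the frame-by-frame channel order are assumptions made for illustration; a practical implementation would likely use a tensor library.

```python
from typing import List

Pixel = List[float]        # C channel values for one pixel
Frame = List[List[Pixel]]  # H x W x C, as nested lists


def make_frame(height: int, width: int, channels: int, fill: float) -> Frame:
    """Build a constant-valued frame (hypothetical helper for the demo)."""
    return [[[fill] * channels for _ in range(width)] for _ in range(height)]


def stack_frames_along_channels(frames: List[Frame]) -> Frame:
    """Turn K frames of H x W x C into one H x W x (C*K) input (claim 6).

    ASSUMPTION: channels are ordered frame by frame; the claim does not say.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same H x W")
    return [
        [[value for frame in frames for value in frame[y][x]] for x in range(width)]
        for y in range(height)
    ]


def build_query_set(
    object_queries: List[List[float]],
    track_queries: List[List[float]],
) -> List[List[float]]:
    """Concatenate object queries and active track queries (claim 8)."""
    return list(object_queries) + list(track_queries)


# K = 3 frames of size 2 x 2 with C = 3 channels, filled with 0.0, 1.0, 2.0.
frames = [make_frame(2, 2, 3, float(k)) for k in range(3)]
stacked = stack_frames_along_channels(frames)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # H, W, C*K

# N_objs = 100 object queries plus two active tracks, each of dimension 4.
queries = build_query_set([[0.0] * 4 for _ in range(100)], [[1.0] * 4, [1.0] * 4])
print(len(queries))
```

Because the stacked input keeps the spatial size H x W and only widens the channel axis, an off-the-shelf 2D-CNN needs a change only in the input channel count of its first layer, which is consistent with the claim's remark about reusing pre-trained models.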

Assignees

Viettel Group

Inventors

Classifications

G06V20/46 (Primary)
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
G06V10/82
using neural networks · CPC title
G06T2207/10016
Video; Image sequence · CPC title
G06T2207/20084
Artificial neural networks [ANN] · CPC title
G06T2207/20081
Training; Learning · CPC title

Patent family

Related publications grouped by family.

View patent family 90833965

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12518403B2 cover?: A method for multi-object tracking from video. The method includes the following steps: (1) Capturing frames from the streaming source and preprocess the data; (2) Extract video features with three choices: a 3D-CNN backbone followed by a Transformer Encoder, a Video Transformer Encoder, a 2D-CNN Encoder with a stack of frames as input followed by a Transformer Encoder; (3) Multi-object trackin…
Who is the assignee on this patent?: Viettel Group

Related patents

Vision transformers leveraging temporal redundancy

Apparatus and method with quantizing of a target tracking model

Method and apparatus for video action classification

Efficient video processing via temporal progressive learning

Method and system for real-time target tracking based on deep learning
