Deep learning method for multiple object tracking from video

US12518403B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12518403-B2
Application number  US-202318480127-A
Country             US
Kind code           B2
Filing date         Oct 3, 2023
Priority date       Nov 2, 2022
Publication date    Jan 6, 2026
Grant date          Jan 6, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for multi-object tracking from video. The method includes the following steps: (1) Capturing frames from the streaming source and preprocess the data; (2) Extract video features with three choices: a 3D-CNN backbone followed by a Transformer Encoder, a Video Transformer Encoder, a 2D-CNN Encoder with a stack of frames as input followed by a Transformer Encoder; (3) Multi-object tracking using a new end-to-end multi-task deep learning model named JDAT (Joint Detection Association Transformer), then post-processing and updating tracking state with Temporal Aggregation Module (TAM). The deep learning models in step 2 and step 3 are trained simultaneously end-to-end with a loss function that is accumulated over multiple timesteps (Collective Average Loss—CAL). Also, the model can be pretrained with weakly labeled image dataset in a self-supervised learning manner first, then finetuned on supervised video datasets with full tracking labels.
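
The abstract says only that the training loss is "accumulated over multiple timesteps" (Collective Average Loss, CAL); it does not give a formula. The sketch below is one plausible reading, not the patent's definition: per-timestep losses are summed over a clip and normalised once by the total number of ground-truth objects. The function name, its arguments, and the normalisation are all assumptions made for illustration.

```python
from typing import Sequence


def collective_average_loss(
    step_losses: Sequence[float],
    step_object_counts: Sequence[int],
) -> float:
    """Accumulate per-timestep losses and normalise once over the whole clip.

    step_losses[t]        -- summed tracking loss at timestep t (hypothetical input)
    step_object_counts[t] -- number of ground-truth objects at timestep t

    Assumption: the accumulated loss is divided by the total object count.
    The patent text does not specify the normalisation.
    """
    if len(step_losses) != len(step_object_counts):
        raise ValueError("one loss and one object count are needed per timestep")
    total_objects = sum(step_object_counts)
    # max(..., 1) avoids division by zero for clips with no annotated objects.
    return sum(step_losses) / max(total_objects, 1)


# Three timesteps with 2, 3 and 1 annotated objects respectively.
clip_loss = collective_average_loss([2.0, 3.0, 1.0], [2, 3, 1])
```

Normalising once per clip, rather than averaging per frame, keeps frames with many objects from being down-weighted relative to sparse frames; whether the patented method does this is not stated on this page.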

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for multi-object tracking from video, comprising the steps of:

Step 1: capturing frames from a streaming source and preprocess the data; a Camera Stream Capturing module capture consecutive frames from a data stream from the streaming source, samples the required frames, pre-processes them, concatenate nearby frames into cube of frames, then puts them into a Frames Cube Input Queue as input for the next step, the preprocessing step standardizes every stream to the same frame rate and resolution;

Step 2: video feature extraction; a feature extraction module takes a cube of adjacent frames as input to a deep learning model capable of extracting features in both space and time (spatial-temporal features) comprising three choices for deep learning feature extractor: (i) a 3D-CNN backbone followed by a Transformer Encoder (ii) a Video Transformer Encoder and (iii) a 2D-CNN encoder with a stack of frames as input followed by a Transformer Encoder;

Step 3: multi-object tracking, post-processing and updating tracking state; a object tracking module performs object detection and association based on the input features of the frames of interest, combined with a current tracking state, this step includes three sub-steps as follows: (3.1) object detection and association using a multi-task end-to-end deep learning model, named JDAT (Joint Detection Association Transformer), composed of three key components: a Transformer Decoder model, a Feature Relation Transformer model and a Differentiable Matching Layer; (3.2) post-processing and update tracking state; (3.3) Update tracking state using a Temporal Aggregation Module (TAM), in which, the deep learning models in step 2 and step 3 are trained simultaneously end-to-end with a loss function that is accumulated over multiple timesteps (Collective Average Loss—CAL).

2. The method for tracking multiple objects from video of claim 1, wherein: in step 2, employing a different approach to extract features of a frame of interest, which combine information from nearby frames (both before or after) to enhance the extracted features of a target frame, specifically, the feature extraction module takes a cube of adjacent frames as input to a deep learning model with capable of extracting feature in both space and time (spatial-temporal feature), specifically, comprising three choices for deep learning feature extractor: (1) a 3D-CNN backbone followed by a Transformer encoder (2) a Video Transformer Encoder and (3) a 2D-CNN encoder with a stack of frames as input followed by a Transformer Encoder.

3. The method for tracking multiple objects from video of claim 2, wherein: in step 2, for the two model choices of 3D-CNN or Video Transformer, the deep learning model takes input as a sequence of K frames sampled around a sequence of T frames of interest (K≥T, typically K is a multiplier of T), represented by a matrix $I \in \mathbb{R}^{K \times H \times W \times C}$, the output of the model is the extracted feature of T frames of interest, represented by a matrix $F \in \mathbb{R}^{T \times \frac{H}{R} \times \frac{W}{R} \times D}$ (typically, K=32), in which the matrix $F_i \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times D}$ is considered as the feature of the $i$-th frame wherein this model's choice extract features of T frames in a single inference step (parallelism), instead of T inference steps as traditional methods using 2D-CNN over each frame.

4. The method for tracking multiple objects from video of claim 3, wherein: in step 2, use 3D-CNNs models such a Temporal Shift Module (TSM), SlowFast, X3D, MoviNet, etc. for video feature extraction, use a Transformer encoder model immediately after the 3D-CNN to increase the ability to extract global features, resolve the weaknesses of CNN models thus improve accuracy in later step, the Transformer part contain N effective linear self-attention layers (e.g., 2≤N≤8), which slightly increases the computational cost but significantly improves the ability to extract global, context-aware interaction thanks to the Transformer's attention mechanism.

5. The method for tracking multiple objects from video of claim 3, wherein: in step 2, for a Video Transformer model, use models such as Swin Transformer and MVIT (Multi-scale Vision Transformer) for video feature extraction, these models can effectively extract features in both space and time with global context thanks to Transformer architecture.

6. The method for tracking multiple objects from video of claim 2, wherein: in step 2, for the choice of using 2D-CNN variants, extract feature of a single frame Y of interest, specifically, sample K frames around that frame Y of interest, stack the K frames along the channel dimension, obtaining a input matrix $I \in \mathbb{R}^{H \times W \times (C \times K)}$, this matrix becomes the input for common 2D-CNN models such as Resnet, MobileNet, etc., for feature extraction, the output is a feature matrix $F \in \mathbb{R}^{\frac{H}{R} \times \frac{W}{R} \times D}$ considered as the feature of the frame Y, these feature matrix is then fed as input to a Transformer Encoder model to be enhanced with the global context information, this additional Transformer Encoder is effective when K is small (e.g. K≤3) without any change to the 2D-CNN architecture, also gaining benefit from various pre-trained models on large datasets, only select and extract feature of the frames of interest, and ensures that features of all frames of interest are extracted for the next processing step.

7. The method for tracking multiple objects from video of claim 1, wherein: in step 3, step 3.1, the proposed model is Joint Detection Association Transformer (JDAT), which performs object detection and associates newly detected objects with a list of keeping tracks (track is a term of tracking state corresponding to a specific object instance, each track can be represented by that instance's properties such as: identifier, location, velocity, size, recognition features, etc.) and allows end-to-end training, this model is composed of three key components: (i) a Transformer Decoder model, (ii) a Feature Relation Transformer and (iii) a Differentiable Matching Layer.

8. The method for tracking multiple objects from video of claim 7, wherein: the Transformer Decoder model is built based on the idea of DETR (Detection Transformer), an approach for object detection with the Transformer architecture, DETR's output is set-based, use parallel decoding mechanism instead of autoregressive decoding, the set of input queries is constructed as a union of two sets: (1) the object queries set inherits the idea from DETR, consisting of $N_{objs}$ vectors (typically, $N_{objs} = 100$) and (2) the track queries set consists of $N_{tracks}$ vectors, each vector representing a track that is considered to be active at the current time,
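
Step 1 of claim 1 (sample frames, pre-process them, group nearby frames into a cube, and queue the cubes) can be illustrated with a minimal standard-library sketch. The claim does not say whether cubes overlap or how a trailing partial cube is handled, so the choices below (non-overlapping cubes, partial cube dropped) are assumptions, as are the names `fill_cube_queue`, `sample_every`, and `preprocess`. Integers stand in for decoded frames.

```python
from queue import Queue
from typing import Callable, Iterable, List, TypeVar

Frame = TypeVar("Frame")


def fill_cube_queue(
    stream: Iterable[Frame],
    cube_size: int,
    sample_every: int = 1,
    preprocess: Callable[[Frame], Frame] = lambda frame: frame,
) -> "Queue[List[Frame]]":
    """Sample frames from a stream and enqueue them as fixed-size cubes.

    Assumptions (not stated in the claim): cubes do not overlap, and a
    trailing partial cube is dropped. `sample_every` is a stand-in for
    frame-rate standardisation; `preprocess` is a stand-in for resizing.
    """
    if cube_size < 1 or sample_every < 1:
        raise ValueError("cube_size and sample_every must be positive")
    cube_queue: "Queue[List[Frame]]" = Queue()
    cube: List[Frame] = []
    for index, frame in enumerate(stream):
        if index % sample_every != 0:
            continue  # frame not selected by the sampling step
        cube.append(preprocess(frame))
        if len(cube) == cube_size:
            cube_queue.put(cube)  # hand a full cube to the next stage
            cube = []
    return cube_queue


# 20 frames, keep every 2nd one, group the kept frames into cubes of 4.
cubes = fill_cube_queue(range(20), cube_size=4, sample_every=2)
first_cube = cubes.get()
second_cube = cubes.get()
```

A real pipeline would run the capture loop in its own thread and let the feature extractor consume the queue concurrently; the single-threaded form here only shows the grouping logic.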

Assignees

Viettel Group

Inventors

Classifications

  • G06V20/46 · Primary

    Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title

  • using neural networks · CPC title

  • Video; Image sequence · CPC title

  • Artificial neural networks [ANN] · CPC title

  • Training; Learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12518403B2 cover?
A method for multi-object tracking from video. The method includes the following steps: (1) Capturing frames from the streaming source and preprocess the data; (2) Extract video features with three choices: a 3D-CNN backbone followed by a Transformer Encoder, a Video Transformer Encoder, a 2D-CNN Encoder with a stack of frames as input followed by a Transformer Encoder; (3) Multi-object trackin…
Who is the assignee on this patent?
Viettel Group
What technology area does this patent fall under?
Primary CPC classification G06V20/46. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 6, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).