Machine learning framework applied in a semi-supervised setting to perform instance tracking in a sequence of image frames

US12400341B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12400341-B2
Application number: US-202217570254-A
Country: US
Kind code: B2
Filing date: Jan 6, 2022
Priority date: Jan 8, 2021
Publication date: Aug 26, 2025
Grant date: Aug 26, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and system are provided for tracking instances within a sequence of video frames. The method includes the steps of processing an image frame by a backbone network to generate a set of feature maps, processing the set of feature maps by one or more prediction heads, and analyzing the embedding features corresponding to a set of instances in two or more image frames of the sequence of video frames to establish a one-to-one correlation between instances in different image frames. The one or more prediction heads includes an embedding head configured to generate a set of embedding features corresponding to one or more instances of an object identified in the image frame. The method may also include training the one or more prediction heads using a set of annotated image frames and/or a plurality of sequences of unlabeled video frames.
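The matching step summarized in the abstract can be illustrated with a small sketch. It follows the wording of claims 4, 11 and 12 (average each instance's embedding vectors into a center representation, then compare centers by cosine similarity). The function names, the dictionary layout, and the greedy one-to-one assignment are assumptions made for illustration; the text shown on this page does not say how the one-to-one correlation is resolved.

```python
import math

def center(vectors):
    """Average equal-length embedding vectors into one center representation."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_instances(frame1, frame2):
    """Map each instance id in frame1 to at most one instance id in frame2.

    frame1, frame2: dict of instance id -> list of embedding vectors
    (hypothetical layout). Greedy assignment is an assumption, not taken
    from the patent text.
    """
    centers1 = {k: center(v) for k, v in frame1.items()}
    centers2 = {k: center(v) for k, v in frame2.items()}
    # Score every cross-frame pair, then accept pairs from most to least similar.
    scored = sorted(
        ((cosine(a, b), i, j) for i, a in centers1.items() for j, b in centers2.items()),
        reverse=True,
    )
    matches, used = {}, set()
    for _, i, j in scored:
        if i not in matches and j not in used:
            matches[i] = j
            used.add(j)
    return matches

frame1 = {"a": [[1.0, 0.0], [0.9, 0.1]], "b": [[0.0, 1.0], [0.1, 0.9]]}
frame2 = {"x": [[0.1, 1.0]], "y": [[1.0, 0.2]]}
print(match_instances(frame1, frame2))
```

Greedy matching is used here only because it is short; an optimal assignment solver could be substituted without changing the center and similarity steps.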

First claim

Opening claim text (preview).

What is claimed is:

1. A method for tracking instances of objects within a sequence of video frames, the method comprising: processing a first frame from the sequence of video frames by a backbone network to generate a first set of feature maps; processing a second frame from the sequence of video frames by the backbone network to generate a second set of feature maps; generating, by an embedding head, a first set of embedding vectors, wherein the embedding head generates the first set of embedding vectors by processing the first set of feature maps, wherein each embedding vector of the first set of embedding vectors corresponds to an instance of an object identified in the first frame; generating, by the embedding head, a second set of embedding vectors, wherein the embedding head generates the second set of embedding vectors by processing the second set of feature maps, wherein each embedding vector of the second set of embedding vectors corresponds to an instance of an object identified in the second frame; and generating, by a prediction head, a first predicted heatmap of keypoint locations for the first frame and a second predicted heatmap of keypoint locations for the second frame, wherein the prediction head generates the first predicted heatmap by processing the first set of feature maps and the prediction head generates the second predicted heatmap by processing the second set of feature maps, wherein the embedding head comprises a keypoint embedding head including an encoder-decoder structure, wherein the encoder-decoder structure includes an encoder comprising a convolutional layer and a decoder comprising a de-convolutional layer.

2. The method of claim 1, further comprising processing the set of feature maps by a classification head and a shape regression head configured to provide a pose estimation for each instance of an object based on a plurality of keypoints.

3. The method of claim 1, further comprising: concatenating the first predicted heatmap of keypoint locations and the first set of feature maps to produce first keypoint embedding head input; concatenating the second predicted heatmap of keypoint locations and the second set of feature maps to produce second keypoint embedding head input; generating, by the keypoint embedding head by processing the first keypoint embedding head input, a first set of keypoint embedding vectors; generating, by the keypoint embedding head by processing the second keypoint embedding head input, a second set of keypoint embedding vectors; and analyzing the first set of keypoint embedding vectors and the second set of keypoint embedding vectors to perform pose tracking of an instance of an object.

4. A method for tracking instances of objects within a sequence of video frames, the method comprising: processing a first frame from the sequence of video frames by a backbone network to generate a first set of feature maps; processing a second frame from the sequence of video frames by the backbone network to generate a second set of feature maps; generating, by an embedding head, a first set of embedding vectors, wherein the embedding head generates the first set of embedding vectors by processing the first set of feature maps, wherein each embedding vector of the first set of embedding vectors corresponds to an instance of an object identified in the first frame; generating, by the embedding head, a second set of embedding vectors, wherein the embedding head generates the second set of embedding vectors by processing the second set of feature maps, wherein each embedding vector of the second set of embedding vectors corresponds to an instance of an object identified in the second frame; and comparing a first center representation, obtained by averaging embedding vectors from the first set of embedding vectors, to a second center representation, obtained by averaging embedding vectors from the second set of embedding vectors, to establish a one-to-one correlation between the instance of the object in the first frame and the instance of the object in the second frame.

5. The method of claim 4, further comprising: predicting, by a classification head, a location of instances of objects in the first frame and the second frame; and predicting, by a mask head, a pixel level segmentation mask for each instance of an object identified in the first frame and for each instance of an object identified in the second frame.

6. The method of claim 4, wherein the backbone network comprises a feature pyramid network, wherein the first set of feature maps comprises a first plurality of feature maps of different spatial resolutions, and wherein the second set of feature maps comprises a second plurality of feature maps of different spatial resolutions.

7. The method of claim 4, further comprising training the embedding head using a set of annotated image frames and/or a plurality of sequences of unlabeled video frames.

8. The method of claim 7, wherein training the embedding head comprises minimizing an instance contrastive loss term.

9. The method of claim 8, wherein training the embedding head further comprises enforcing maximum entropy regularization for a similarity matrix.

10. The method of claim 7, wherein training the embedding head comprises minimizing a cycle loss term calculated based on a forward affinity matrix and a reverse affinity matrix corresponding to a sequence of video frames.

11. The method of claim 4, wherein the comparing the first center representation to the second center representation comprises computing a similarity value for the first center representation and the second center representation.

12. The method of claim 11, wherein the similarity value is a cosine similarity value.

13. A system for tracking instances of objects within a sequence of video frames, comprising: a non-transitory computer-readable memory; and at least one processor configured to: implement a plurality of neural networks including: a backbone network configured to: process a first frame from the sequence of video frames to generate a first set of feature maps, and process a second frame from the sequence of video frames to generate a second set of feature maps, and an embedding head configured to: generate, by processing the first set of feature maps, a first set of embedding vectors, each embedding vector of the first set of embedding vectors corresponding to an instance of an object identified in the first frame, and generate, by processing the second set of feature maps, a second set of embedding vectors, each embedding vector of the second set of embedding vectors corresponding to an instance of an object identified in the second frame, and compare a first center representation, obtained by averaging embedding vectors from the first set of embedding vectors, to a second center representation, obtained by averaging embedding vectors from the second set of embedding vectors, to establish a one-to-one correlation between the instance of the object in the first frame and the instance of the object in the second frame.

14. The system of claim 13, wherein the plurality of neural networks further comprises: a classification head configured to predict a location of instances of objects in the first frame and the second frame; and a mask head configured to predict a pixel level segmentation mask for each instance of an object identified in the first frame and for each instance of an object identified in the second frame.

15. The system of claim 13, wherein the embedding head comprises a keypoint embedding head including an encoder-decoder st
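Claim 10 refers to a cycle loss computed from a forward affinity matrix and a reverse affinity matrix, but the claim text on this page does not give a formula. The sketch below shows one common way such a term can be built, offered only as an assumption: row-wise softmax affinities in each direction, multiplied into a round-trip matrix whose diagonal should be close to 1 when each instance maps back to itself. The function names, the `temperature` parameter, and the negative-log-diagonal form are all hypothetical choices.

```python
import math

def softmax_rows(matrix, temperature=0.1):
    """Turn each row of similarity scores into probabilities (affinities)."""
    out = []
    for row in matrix:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp((x - peak) / temperature) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def cycle_loss(similarity, temperature=0.1):
    """Assumed cycle loss: mean negative log of the round-trip diagonal.

    similarity[i][j] is the similarity between instance i in the first
    frame and instance j in the second frame.
    """
    forward = softmax_rows(similarity, temperature)             # frame 1 -> frame 2
    reverse = softmax_rows(transpose(similarity), temperature)  # frame 2 -> frame 1
    round_trip = matmul(forward, reverse)                       # frame 1 -> 2 -> 1
    n = len(round_trip)
    return -sum(math.log(round_trip[i][i]) for i in range(n)) / n

consistent = [[1.0, 0.0], [0.0, 1.0]]
ambiguous = [[0.5, 0.5], [0.5, 0.5]]
print(cycle_loss(consistent), cycle_loss(ambiguous))
```

With clearly separated instances the round trip returns to its starting instance and the loss is near zero; with indistinguishable instances every path is equally likely and the loss rises, which is why no labels are needed to compute it on unlabeled video.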

Assignees

Inventors

Classifications

  • Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title

  • Training; Learning · CPC title

  • Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform · CPC title

  • Artificial neural networks [ANN] · CPC title

  • Video; Image sequence · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12400341B2 cover?
A method and system are provided for tracking instances within a sequence of video frames. The method includes the steps of processing an image frame by a backbone network to generate a set of feature maps, processing the set of feature maps by one or more prediction heads, and analyzing the embedding features corresponding to a set of instances in two or more image frames of the sequence of vi…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification G06T7/248. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 26, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).