Self-supervised multimodal representation learning with cascade positive example mining

US12400449B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12400449-B2
Application number: US-202217940599-A
Country: US
Kind code: B2
Filing date: Sep 8, 2022
Priority date: Sep 14, 2021
Publication date: Aug 26, 2025
Grant date: Aug 26, 2025

Abstract

Official abstract text for this publication.

A method for model training and deployment includes training, by a processor, a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting the learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase. The method further includes updating the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities. The method also includes fine-tuning the trained model for a downstream task. The method additionally includes deploying the trained model for a target application inference for the downstream task.
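The progressive training described in the abstract can be illustrated with a minimal sketch. The abstract only states that each phase uses more positive instances, resets the learning rate schedule, and inherits weights from the previous phase's checkpoint. The cosine schedule shape, the positive counts `(1, 2, 4)`, the step counts, and the names `cosine_lr` and `progressive_training` are all assumptions made for illustration, and the weight update is a placeholder rather than a real gradient step.

```python
import math

def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr at step 0 toward zero (assumed schedule shape)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def progressive_training(positives_per_phase=(1, 2, 4), steps_per_phase=3, base_lr=0.1):
    """Run hypothetical phases; return the final checkpoint and a per-step log."""
    checkpoint = {"weights": 0.0}  # stand-in for real model weights
    log = []
    for phase, num_positives in enumerate(positives_per_phase):
        weights = checkpoint["weights"]  # inherit weights from the previous phase
        for step in range(steps_per_phase):  # step restarts at 0, so the schedule resets
            lr = cosine_lr(step, steps_per_phase, base_lr)
            weights += lr * num_positives  # placeholder update, not a real gradient step
            log.append((phase, num_positives, step, lr))
        checkpoint = {"weights": weights}  # saved for the next phase to inherit
    return checkpoint, log

checkpoint, log = progressive_training()
for phase, num_positives, step, lr in log:
    print(f"phase={phase} positives={num_positives} step={step} lr={lr:.4f}")
```

In this sketch the learning rate returns to `base_lr` at the first step of every phase, while the weights carry over, which is the combination the abstract describes.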

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for model training and deployment, comprising: training, by a hardware processor, a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting a learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase; updating the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities; fine-tuning the trained model for a downstream task; and deploying the trained model for a target application inference for the downstream task.

2. The computer-implemented method of claim 1, wherein the training further comprises, in a feature space: pulling together positive feature pairs in a same or different modalities; and repelling apart negative feature pairs from spatiotemporally manipulated frames.

3. The computer-implemented method of claim 1, wherein the target application inference for the downstream task comprises action recognition involving a transformation from an input video to an output textual label indicative of the content of the input video.

4. The computer-implemented method of claim 1, wherein the self-supervised contrastive loss comprises a dot product and a temperature hyper-parameter to adjust a scale of the dot product.

5. The computer-implemented method of claim 1, wherein said training step iteratively selects a respective top-k similar instances at each of the phases, and remaining ones of the respective top-k similar instances at a final stage are used to form a positive set.

6. The computer-implemented method of claim 1, wherein the different modalities comprise decompressed RGB pixels, encoding residuals from frame differences, and motion vectors.

7. The computer-implemented method of claim 1, wherein two video clips from a same sequence comprise a positive pair, and two video clips from different video sequences comprise a negative pair for said training step.

8. The computer-implemented method of claim 1, wherein fine-tuning the trained model for a downstream task comprises using downstream task labels to fine-tune a pretrained model.

9. A computer program product for model training and deployment, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: training, by a hardware processor of the computer, a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting a learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase; updating, by the hardware processor, the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities; fine-tuning, by the hardware processor, the trained model for a downstream task; and deploying, by the hardware processor, the trained model for a target application inference for the downstream task.

10. The computer program product of claim 9, wherein the training further comprises, in a feature space: pulling together positive feature pairs in a same or different modalities; and repelling apart negative feature pairs from spatiotemporally manipulated frames.

11. The computer program product of claim 9, wherein the target application inference for the downstream task comprises action recognition involving a transformation from an input video to an output textual label indicative of the content of the input video.

12. The computer program product of claim 9, wherein the self-supervised contrastive loss comprises a dot product and a temperature hyper-parameter to adjust a scale of the dot product.

13. The computer program product of claim 9, wherein said training step iteratively selects a respective top-k similar instances at each of the phases, and remaining ones of the respective top-k similar instances at a final stage are used to form a positive set.

14. The computer program product of claim 9, wherein the different modalities comprise decompressed RGB pixels, encoding residuals from frame differences, and motion vectors.

15. The computer program product of claim 9, wherein two video clips from a same sequence comprise a positive pair, and two video clips from different video sequences comprise a negative pair for said training step.

16. The computer program product of claim 9, wherein fine-tuning the trained model for a downstream task comprises using downstream task labels to fine-tune a pretrained model.

17. A computer processing system for model training and deployment, comprising: a memory device for storing program code; and a hardware processor operatively coupled to the memory device for running the program code to: train a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting a learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase; update the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities; fine-tune the trained model for a downstream task; and deploy the trained model for a target application inference for the downstream task.

18. The computer processing system of claim 17, wherein the hardware processor further runs the program code such that the training, in a feature space, involves pulling together positive feature pairs in a same or different modalities, and repelling apart negative feature pairs from spatiotemporally manipulated frames.

19. The computer processing system of claim 17, wherein the target application inference for the downstream task comprises action recognition involving a transformation from an input video to an output textual label indicative of the content of the input video.

20. The computer processing system of claim 17, wherein the self-supervised contrastive loss comprises a dot product and a temperature hyper-parameter to adjust a scale of the dot product.
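The claims describe two pieces that fit together: a cascade of top-k selections across modalities that leaves a positive set, and a contrastive loss built from a dot product scaled by a temperature. The following is a minimal sketch of both, not the patented implementation. The function names, the toy 2-D features, the per-stage values of k, the use of cosine similarity, and the specific loss form (average log-softmax over positives against all candidates) are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]

def cascade_knn(anchor_idx, features_by_modality, ks):
    """Filter candidates stage by stage, one modality per stage.

    features_by_modality: one list of feature vectors per modality, same clip order.
    ks: how many candidates survive each stage (assumed to shrink stage to stage).
    """
    num_clips = len(features_by_modality[0])
    candidates = [i for i in range(num_clips) if i != anchor_idx]
    for feats, k in zip(features_by_modality, ks):
        feats = [normalize(f) for f in feats]
        query = feats[anchor_idx]
        # Rank the surviving candidates by cosine similarity in this modality.
        candidates.sort(key=lambda i: dot(query, feats[i]), reverse=True)
        candidates = candidates[:k]
    return candidates  # survivors of the final stage form the positive set

def multi_positive_loss(anchor, positives, negatives, temperature=0.1):
    """Average negative log-softmax of each positive over all candidates."""
    anchor = normalize(anchor)
    pos = [dot(anchor, normalize(p)) / temperature for p in positives]
    neg = [dot(anchor, normalize(n)) / temperature for n in negatives]
    logits = pos + neg
    peak = max(logits)  # subtract the max for numerical stability
    log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(x - log_denom for x in pos) / len(pos)

# Toy features for five clips in two hypothetical modalities.
rgb = [[1, 0], [0.9, 0.1], [0.8, 0.3], [0, 1], [-1, 0]]
motion = [[1, 0], [0, 1], [0.9, 0.2], [1, 0.1], [0, -1]]

positive_ids = cascade_knn(0, [rgb, motion], ks=(2, 1))
negative_ids = [i for i in range(1, 5) if i not in positive_ids]
loss = multi_positive_loss(
    rgb[0],
    [rgb[i] for i in positive_ids],
    [rgb[i] for i in negative_ids],
)
print(positive_ids, loss)
```

Here a clip must rank highly in the RGB stage to be considered at all in the motion stage, so only clips that look similar in both modalities end up as positives. A smaller `temperature` sharpens the softmax and penalizes hard negatives more strongly.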

Assignees

NEC Lab America Inc, NEC Corp

Inventors

Classifications

  • Feature selection, e.g. selecting representative features from a multi-dimensional feature space · CPC title

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • G06V20/46 (Primary)

    Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title

  • G06V20/49 (Primary)

    Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12400449B2 cover?
A method for model training and deployment includes training, by a processor, a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting the learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from …
Who is the assignee on this patent?
NEC Lab America Inc, NEC Corp
What technology area does this patent fall under?
Primary CPC classification G06V20/46. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 26, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).