Multimodal video summarization
US-2024404283-A1 · Dec 5, 2024 · US
US12400449B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12400449-B2 |
| Application number | US-202217940599-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 8, 2022 |
| Priority date | Sep 14, 2021 |
| Publication date | Aug 26, 2025 |
| Grant date | Aug 26, 2025 |
Official abstract text for this publication.
A method for model training and deployment includes training, by a processor, a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting the learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase. The method further includes updating the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities. The method also includes fine-tuning the trained model for a downstream task. The method additionally includes deploying the trained model for a target application inference for the downstream task.
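The phased schedule in the abstract can be illustrated with a small sketch. It is not the patented implementation: the cosine schedule, the phase sizes, and the names `cosine_lr`, `progressive_training`, and `toy_train_step` are assumptions made for illustration. It shows only the three properties the abstract names: the number of positives grows per phase, the learning rate schedule restarts in each phase, and each phase starts from the previous phase's checkpoint.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate within one phase (assumed schedule shape)."""
    progress = step / max(1, total_steps - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def progressive_training(positives_per_phase, steps_per_phase, train_step, init_weights):
    """Run training phases with an increasing number of positive instances.

    Each phase copies the previous phase's checkpoint (weight inheritance) and
    counts steps from zero again, which restarts the learning rate schedule.
    """
    checkpoint = dict(init_weights)
    log = []
    for phase, num_pos in enumerate(positives_per_phase):
        weights = dict(checkpoint)  # inherit weights from the previous phase
        for step in range(steps_per_phase):  # step restarts at 0: schedule reset
            lr = cosine_lr(step, steps_per_phase)
            weights = train_step(weights, num_pos, lr)
            log.append((phase, num_pos, step, lr))
        checkpoint = dict(weights)  # checkpoint handed to the next phase
    return checkpoint, log


def toy_train_step(weights, num_pos, lr):
    """Hypothetical stand-in for one optimizer step on the contrastive loss."""
    return {"w": weights["w"] - lr * 0.1 * num_pos}


if __name__ == "__main__":
    final, history = progressive_training([1, 2, 4], 3, toy_train_step, {"w": 1.0})
    for phase, num_pos, step, lr in history:
        print(f"phase={phase} positives={num_pos} step={step} lr={lr:.4f}")
    print("final weights:", final)
```

In a real pipeline `train_step` would wrap an optimizer update on the contrastive loss, and the checkpoint would be a saved model state rather than a dictionary.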
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for model training and deployment, comprising: training, by a hardware processor, a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting a learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase; updating the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities; fine-tuning the trained model for a downstream task; and deploying the trained model for a target application inference for the downstream task.
2. The computer-implemented method of claim 1, wherein the training further comprises, in a feature space: pulling together positive feature pairs in a same or different modalities; and repelling apart negative feature pairs from spatiotemporally manipulated frames.
3. The computer-implemented method of claim 1, wherein the target application inference for the downstream task comprises action recognition involving a transformation from an input video to an output textual label indicative of the content of the input video.
4. The computer-implemented method of claim 1, wherein the self-supervised contrastive loss comprises a dot product and a temperature hyper-parameter to adjust a scale of the dot product.
5. The computer-implemented method of claim 1, wherein said training step iteratively selects a respective top-k similar instances at each of the phases, and remaining ones of the respective top-k similar instances at a final stage are used to form a positive set.
6. The computer-implemented method of claim 1, wherein the different modalities comprise decompressed RGB pixels, encoding residuals from frame differences, and motion vectors.
7. The computer-implemented method of claim 1, wherein two video clips from a same sequence comprise a positive pair, and two video clips from different video sequences comprise a negative pair for said training step.
8. The computer-implemented method of claim 1, wherein fine-tuning the trained model for a downstream task comprises using downstream task labels to fine-tune a pretrained model.
9. A computer program product for model training and deployment, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: training, by a hardware processor of the computer, a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting a learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase; updating, by the hardware processor, the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities; fine-tuning, by the hardware processor, the trained model for a downstream task; and deploying, by the hardware processor, the trained model for a target application inference for the downstream task.
10. The computer program product of claim 9, wherein the training further comprises, in a feature space: pulling together positive feature pairs in a same or different modalities; and repelling apart negative feature pairs from spatiotemporally manipulated frames.
11. The computer program product of claim 9, wherein the target application inference for the downstream task comprises action recognition involving a transformation from an input video to an output textual label indicative of the content of the input video.
12. The computer program product of claim 9, wherein the self-supervised contrastive loss comprises a dot product and a temperature hyper-parameter to adjust a scale of the dot product.
13. The computer program product of claim 9, wherein said training step iteratively selects a respective top-k similar instances at each of the phases, and remaining ones of the respective top-k similar instances at a final stage are used to form a positive set.
14. The computer program product of claim 9, wherein the different modalities comprise decompressed RGB pixels, encoding residuals from frame differences, and motion vectors.
15. The computer program product of claim 9, wherein two video clips from a same sequence comprise a positive pair, and two video clips from different video sequences comprise a negative pair for said training step.
16. The computer program product of claim 9, wherein fine-tuning the trained model for a downstream task comprises using downstream task labels to fine-tune a pretrained model.
17. A computer processing system for model training and deployment, comprising: a memory device for storing program code; and a hardware processor operatively coupled to the memory device for running the program code to: train a model to learn video representations with a self-supervised contrastive loss by performing progressive training in phases with an incremental number of positive instances from one or more video sequences, resetting a learning rate schedule in each of the phases, and inheriting model weights from a checkpoint from a previous training phase; update the trained model with the self-supervised contrastive loss given multiple positive instances obtained from Cascade K-Nearest Neighbor mining of the one or more video sequences by extracting features in different modalities to compute similarities between the one or more video sequences and selecting a top-k similar instances with features in different modalities; fine-tune the trained model for a downstream task; and deploy the trained model for a target application inference for the downstream task.
18. The computer processing system of claim 17, wherein the hardware processor further runs the program code such that the training, in a feature space, involves pulling together positive feature pairs in a same or different modalities, and repelling apart negative feature pairs from spatiotemporally manipulated frames.
19. The computer processing system of claim 17, wherein the target application inference for the downstream task comprises action recognition involving a transformation from an input video to an output textual label indicative of the content of the input video.
20. The computer processing system of claim 17, wherein the self-supervised contrastive loss comprises a dot product and a temperature hyper-parameter to adjust a scale of the dot product.
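Two mechanisms recited in the claims lend themselves to a short illustration: the cascaded top-k selection across modalities and the contrastive loss built from a dot product scaled by a temperature. The sketch below is a minimal reading under stated assumptions, not the claimed implementation. The claims do not fix a formula, so the InfoNCE-style form, cosine similarity as the ranking measure, the order of modalities, the per-stage `ks`, and every function name are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity, with 0.0 returned for zero-length vectors."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def cascade_knn(query_idx, features_by_modality, ks):
    """Filter candidates one modality at a time, keeping the top-k at each stage.

    features_by_modality: one feature list per modality, aligned by clip index.
    ks: the k used at each stage (assumed here to shrink stage by stage).
    The survivors of the final stage form the positive set for the query clip.
    """
    candidates = [i for i in range(len(features_by_modality[0])) if i != query_idx]
    for feats, k in zip(features_by_modality, ks):
        query = feats[query_idx]
        candidates.sort(key=lambda i: cosine(query, feats[i]), reverse=True)
        candidates = candidates[:k]
    return candidates


def multi_positive_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss averaged over positives (assumed form).

    Each logit is a dot product divided by the temperature hyper-parameter.
    """
    neg_logits = [dot(anchor, n) / temperature for n in negatives]
    losses = []
    for p in positives:
        pos_logit = dot(anchor, p) / temperature
        logits = [pos_logit] + neg_logits
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - pos_logit)
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy features for four clips in two modalities (values are made up).
    rgb = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
    motion = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
    positives_idx = cascade_knn(0, [rgb, motion], ks=[2, 1])
    print("mined positives for clip 0:", positives_idx)
    loss = multi_positive_contrastive_loss(
        rgb[0], [rgb[i] for i in positives_idx], [rgb[3]], temperature=0.1
    )
    print("loss:", loss)
```

A lower temperature sharpens the softmax over dot products, so hard negatives contribute more to the loss. In practice the features would be normalized embeddings produced by the model rather than hand-written vectors.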
Feature selection, e.g. selecting representative features from a multi-dimensional feature space · CPC title
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title