Moment localization in media stream

US12406500B2 · US · B2

Patent metadata
Publication number: US-12406500-B2
Application number: US-202017768815-A
Country: US
Kind code: B2
Filing date: Oct 19, 2020
Priority date: Nov 1, 2019
Publication date: Sep 2, 2025
Grant date: Sep 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Various implementations of the subject matter relate to moment localization in media stream. In some implementations, a two-dimensional temporal feature map representing a plurality of moments within a media stream is extracted from the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments. A correlation between the plurality of moments and an action in the media stream is determined based on the two-dimensional temporal feature map.
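The start/end indexing described in the abstract can be illustrated with a minimal sketch. It assumes the media stream has already been split into clips with one feature vector each, and it assumes max pooling to summarize the clips inside a moment; the abstract does not name a pooling operation, and the function name `build_2d_temporal_map` is hypothetical.

```python
from typing import List, Optional

Vector = List[float]


def build_2d_temporal_map(clip_features: List[Vector]) -> List[List[Optional[Vector]]]:
    """Return an N x N grid indexed as grid[start][end].

    The first index is the clip where a moment starts and the second is the
    clip where it ends (inclusive). Cells with start > end are not valid
    moments and stay None.
    """
    n = len(clip_features)
    grid: List[List[Optional[Vector]]] = [[None] * n for _ in range(n)]
    for start in range(n):
        pooled = list(clip_features[start])
        grid[start][start] = list(pooled)
        for end in range(start + 1, n):
            # Assumption: element-wise max pooling over the clips in the moment.
            pooled = [max(a, b) for a, b in zip(pooled, clip_features[end])]
            grid[start][end] = list(pooled)
    return grid


# Three clips with two-dimensional toy features (illustrative values only).
clips = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.3]]
tmap = build_2d_temporal_map(clips)
```

Here `tmap[0][2]` describes the moment covering all three clips, while `tmap[1][0]` stays empty because a moment cannot end before it starts. Only the upper triangle of the grid is populated.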

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method, comprising: extracting, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments; encoding a sentence feature extracted from an input; fusing the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map; applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, the convolutional layer comprises a dilated convolution and strides of the dilated convolution are configured to increase as lengths of the respective moments increase; generating a temporal adjacent network using the fused two-dimensional temporal map and the further feature map; determining, using the temporal adjacent network, a correlation between the plurality of moments and an action in the media stream; and identifying a matching a set of candidate moments for the input using the temporal adjacent network.

2. The method of claim 1, wherein extracting the two-dimensional temporal feature map comprises: segmenting the media stream into a plurality of clips; extracting features of respective ones of the plurality of clips to obtain a feature map of the media stream; and extracting, from features of one or more clips corresponding to a moment of the plurality of moments in the feature map of the media stream, features of this moment as a part of the two-dimensional temporal feature map.

3. The method of claim 1, wherein determining the correlation comprises: sampling the plurality of moments at respective sample rates to determine a plurality of candidate moments, wherein the sample rates are adaptively adjusted based on lengths of respective ones of the plurality of moments; and determining a correlation between the plurality of candidate moments and the action in the media stream.

4. The method of claim 3, wherein the sample rates are configured to decrease as the lengths of the respective moments increase.

5. The method of claim 1, wherein determining the correlation comprises: applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map; and determining, based on the further feature map, scores of correlation between the plurality of moments and the action in the media stream.

6. The method of claim 1, wherein determining the correlation comprises: in response to receiving a query for a particular action in the media stream, extracting a feature vector of the query; and determining the correlation based on the feature vector of the query and the two-dimensional temporal feature map.

7. The method of claim 6, wherein determining the correlation comprises: fusing the feature vector of the query and the two-dimensional temporal feature map to generate a further two-dimensional temporal feature map having a same dimension as the two-dimensional temporal feature map; and determining, based on the further two-dimensional temporal feature map, the correlation between the plurality of moments and the particular action.

8. The method of claim 7, wherein fusing the feature vector of the query and the two-dimensional temporal feature map comprises: generating the further two-dimensional temporal feature map by applying a Hadamard product to the feature vector of the query and the two-dimensional temporal feature map.

9. The method of claim 6, wherein the query comprises a natural language query.

10. The method of claim 1, wherein the media stream comprises an untrimmed media stream.

11. A device comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform acts comprising: extracting, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments; encoding a sentence feature extracted from an input; fusing the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map; applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, the convolutional layer comprises a dilated convolution and strides of the dilated convolution are configured to increase as lengths of the respective moments increase; generating a temporal adjacent network using the fused two-dimensional temporal map and the further feature map; determining, using the temporal adjacent network, a correlation between the plurality of moments and an action in the media stream; and identifying a matching a set of candidate moments for the input using the temporal adjacent network.

12. The device of claim 11, wherein extracting the two-dimensional temporal feature map comprises: segmenting the media stream into a plurality of clips; extracting features of respective ones of the plurality of clips to obtain a feature map of the media stream; and extracting, from features of one or more clips corresponding to a moment of the plurality of moments in the feature map of the media stream, features of this moment as a part of the two-dimensional temporal feature map.

13. The device of claim 11, wherein determining the correlation comprises: sampling the plurality of moments at respective sample rates to determine a plurality of candidate moments, wherein the sample rates are adaptively adjusted based on lengths of respective ones of the plurality of moments; and determining a correlation between the plurality of candidate moments and the action in the media stream.

14. At least one non-transitory machine-readable medium comprising computer-executable instructions which, when executed by a device, cause the device to perform operations to: extract, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments; encode a sentence feature extracted from an input; fuse the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map; apply a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, the convolutional layer comprises a dilated convolution and strides of the dilated convolution are configured to increase as lengths of the respective moments increase; generate a temporal adjacent network using the fused two-dimensional temporal map and the further feature map; determine, using the temporal adjacent network, a correlation be …
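Two of the dependent claims lend themselves to a short illustration: length-dependent sampling of candidate moments (claims 3 and 4) and Hadamard-product fusion of a query vector with moment features (claim 8). The sketch below is a loose reading of those claims, not the patented implementation. The doubling schedule in `sampling_stride`, the `dense_limit` parameter, and all function names are assumptions introduced here; the claims state only that sample rates decrease as moment length increases.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Moment = Tuple[int, int]  # (start clip index, end clip index), inclusive


def sampling_stride(length: int, dense_limit: int = 4) -> int:
    """Gap between kept moments of a given length, measured in clips.

    Assumed schedule: stride 1 up to dense_limit, then doubling each time the
    length doubles, so the sample rate (1 / stride) falls for longer moments.
    """
    stride, limit = 1, dense_limit
    while length > limit:
        stride *= 2
        limit *= 2
    return stride


def candidate_moments(num_clips: int, dense_limit: int = 4) -> List[Moment]:
    """Enumerate (start, end) pairs, thinning out the longer moments."""
    candidates: List[Moment] = []
    for start in range(num_clips):
        for end in range(start, num_clips):
            length = end - start + 1
            stride = sampling_stride(length, dense_limit)
            # Keep a moment only if its start and length sit on the stride grid.
            if start % stride == 0 and length % stride == 0:
                candidates.append((start, end))
    return candidates


def hadamard_fuse(query: Vector, moment_feature: Vector) -> Vector:
    """Element-wise (Hadamard) product of a query vector and a moment feature."""
    return [q * m for q, m in zip(query, moment_feature)]


def fuse_map(query: Vector, moment_features: Dict[Moment, Vector]) -> Dict[Moment, Vector]:
    """Fuse the query into every moment; the result keeps the same keys and sizes."""
    return {moment: hadamard_fuse(query, feat) for moment, feat in moment_features.items()}


candidates = candidate_moments(8)
fused = hadamard_fuse([1.0, 0.0], [3.0, 5.0])
```

With 8 clips there are 36 possible moments, and this schedule keeps every moment of up to 4 clips while retaining only a subset of the longer ones. Because the Hadamard product is element-wise, the fused map has the same shape as the input map, which matches the "same dimension" wording in claim 7.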

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking · CPC title

  • Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title

  • Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title

  • Text processing (natural language analysis G06F40/20; semantic analysis G06F40/30; processing or translation of natural language G06F40/40) · CPC title

  • using natural language analysis · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12406500B2 cover?
Various implementations of the subject matter relate to moment localization in media stream. In some implementations, a two-dimensional temporal feature map representing a plurality of moments within a media stream is extracted from the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments …
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06V20/48. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 2, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).