Systems and methods for increasing hardware accelerator performance in neural network applications
US-2023108883-A1 · Apr 6, 2023 · US
US12406500B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12406500-B2 |
| Application number | US-202017768815-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 19, 2020 |
| Priority date | Nov 1, 2019 |
| Publication date | Sep 2, 2025 |
| Grant date | Sep 2, 2025 |
Official abstract text for this publication.
Various implementations of the subject matter relate to moment localization in a media stream. In some implementations, a two-dimensional temporal feature map representing a plurality of moments within a media stream is extracted from the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments. A correlation between the plurality of moments and an action in the media stream is determined based on the two-dimensional temporal feature map.
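The start/end indexing described in the abstract can be illustrated with a minimal sketch. It assumes the media stream has already been split into clips with one feature vector per clip, and it uses element-wise max pooling to summarize the clips inside a moment. The pooling choice, the function names, and the toy data are assumptions made for illustration and are not taken from the patent text.

```python
from typing import List, Optional

Vector = List[float]
Grid = List[List[Optional[Vector]]]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def build_2d_temporal_map(clip_features: List[Vector]) -> Grid:
    """Return an N x N grid in which cell [start][end] describes clips start..end.

    The first index is the start clip of a moment and the second index is its
    end clip. Cells with start > end are not valid moments and stay None.
    Max pooling is an assumed way to summarize a moment, not the patent's.
    """
    n = len(clip_features)
    grid: Grid = [[None] * n for _ in range(n)]
    for start in range(n):
        for end in range(start, n):
            grid[start][end] = max_pool(clip_features[start:end + 1])
    return grid


# Toy input: three clips with two-dimensional features (hypothetical values).
clips = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0]]
feature_map = build_2d_temporal_map(clips)
```

In this layout every candidate moment occupies one cell in the upper triangle of the grid, so moments that share a start or end clip sit next to each other.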
Opening claim text (preview).
The invention claimed is:
1. A computer-implemented method, comprising: extracting, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments; encoding a sentence feature extracted from an input; fusing the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map; applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, the convolutional layer comprises a dilated convolution and strides of the dilated convolution are configured to increase as lengths of the respective moments increase; generating a temporal adjacent network using the fused two-dimensional temporal map and the further feature map; determining, using the temporal adjacent network, a correlation between the plurality of moments and an action in the media stream; and identifying a matching a set of candidate moments for the input using the temporal adjacent network.
2. The method of claim 1, wherein extracting the two-dimensional temporal feature map comprises: segmenting the media stream into a plurality of clips; extracting features of respective ones of the plurality of clips to obtain a feature map of the media stream; and extracting, from features of one or more clips corresponding to a moment of the plurality of moments in the feature map of the media stream, features of this moment as a part of the two-dimensional temporal feature map.
3. The method of claim 1, wherein determining the correlation comprises: sampling the plurality of moments at respective sample rates to determine a plurality of candidate moments, wherein the sample rates are adaptively adjusted based on lengths of respective ones of the plurality of moments; and determining a correlation between the plurality of candidate moments and the action in the media stream.
4. The method of claim 3, wherein the sample rates are configured to decrease as the lengths of the respective moments increase.
5. The method of claim 1, wherein determining the correlation comprises: applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map; and determining, based on the further feature map, scores of correlation between the plurality of moments and the action in the media stream.
6. The method of claim 1, wherein determining the correlation comprises: in response to receiving a query for a particular action in the media stream, extracting a feature vector of the query; and determining the correlation based on the feature vector of the query and the two-dimensional temporal feature map.
7. The method of claim 6, wherein determining the correlation comprises: fusing the feature vector of the query and the two-dimensional temporal feature map to generate a further two-dimensional temporal feature map having a same dimension as the two-dimensional temporal feature map; and determining, based on the further two-dimensional temporal feature map, the correlation between the plurality of moments and the particular action.
8. The method of claim 7, wherein fusing the feature vector of the query and the two-dimensional temporal feature map comprises: generating the further two-dimensional temporal feature map by applying a Hadamard product to the feature vector of the query and the two-dimensional temporal feature map.
9. The method of claim 6, wherein the query comprises a natural language query.
10. The method of claim 1, wherein the media stream comprises an untrimmed media stream.
11. A device comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform acts comprising: extracting, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments; encoding a sentence feature extracted from an input; fusing the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map; applying a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, the convolutional layer comprises a dilated convolution and strides of the dilated convolution are configured to increase as lengths of the respective moments increase; generating a temporal adjacent network using the fused two-dimensional temporal map and the further feature map; determining, using the temporal adjacent network, a correlation between the plurality of moments and an action in the media stream; and identifying a matching a set of candidate moments for the input using the temporal adjacent network.
12. The device of claim 11, wherein extracting the two-dimensional temporal feature map comprises: segmenting the media stream into a plurality of clips; extracting features of respective ones of the plurality of clips to obtain a feature map of the media stream; and extracting, from features of one or more clips corresponding to a moment of the plurality of moments in the feature map of the media stream, features of this moment as a part of the two-dimensional temporal feature map.
13. The device of claim 11, wherein determining the correlation comprises: sampling the plurality of moments at respective sample rates to determine a plurality of candidate moments, wherein the sample rates are adaptively adjusted based on lengths of respective ones of the plurality of moments; and determining a correlation between the plurality of candidate moments and the action in the media stream.
14. At least one non-transitory machine-readable medium comprising computer-executable instructions which, when executed by a device, cause the device to perform operations to: extract, from a media stream, a two-dimensional temporal feature map representing a plurality of moments within the media stream, wherein the two-dimensional temporal feature map comprises a first dimension representing a start of a respective one of the plurality of moments and a second dimension representing an end of a respective one of the plurality of moments; encode a sentence feature extracted from an input; fuse the encoded sentence feature with the two-dimensional temporal feature map into a unified subspace as a fused two-dimensional temporal map; apply a convolutional layer to the two-dimensional temporal feature map to obtain a further feature map having a same dimension as the two-dimensional temporal feature map, the convolutional layer comprises a dilated convolution and strides of the dilated convolution are configured to increase as lengths of the respective moments increase; generate a temporal adjacent network using the fused two-dimensional temporal map and the further feature map; determine, using the temporal adjacent network, a correlation be
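Two of the dependent claims lend themselves to a small illustration: the Hadamard-product fusion of claim 8 and the length-dependent sampling of claims 3 and 4. The sketch below is one possible reading under stated assumptions. It assumes the query vector has already been projected to the same length as each cell of the map. The stride-doubling rule and the names `hadamard_fuse`, `sample_candidates`, and `dense_limit` are hypothetical and do not appear in the claims.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Grid = List[List[Optional[Vector]]]


def hadamard_fuse(feature_map: Grid, query: Vector) -> Grid:
    """Multiply every valid cell element-wise by the query vector.

    The output grid has the same shape as the input; invalid cells stay None.
    """
    return [
        [None if cell is None else [q * f for q, f in zip(query, cell)]
         for cell in row]
        for row in feature_map
    ]


def sample_candidates(num_clips: int, dense_limit: int = 2) -> List[Tuple[int, int]]:
    """Enumerate (start, end) moments, sampling longer moments more sparsely.

    Assumed rule: moments up to dense_limit clips long are all kept; each time
    the length bound doubles, the sampling stride doubles too, so the sample
    rate decreases as moment length increases.
    """
    candidates = []
    for start in range(num_clips):
        for end in range(start, num_clips):
            length = end - start + 1
            stride, limit = 1, dense_limit
            while length > limit:
                stride *= 2
                limit *= 2
            # Keep only moments aligned to the stride for their length band.
            if start % stride == 0 and length % stride == 0:
                candidates.append((start, end))
    return candidates


# Toy 2 x 2 map (hypothetical values); cell [1][0] is an invalid moment.
toy_map: Grid = [
    [[1.0, 2.0], [3.0, 4.0]],
    [None, [5.0, 6.0]],
]
fused = hadamard_fuse(toy_map, [2.0, 0.5])
candidates = sample_candidates(8)
```

With eight clips, exhaustive enumeration would give 36 moments, whereas this assumed rule keeps every short moment and only a few aligned long ones.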
relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking · CPC title
Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title
Text processing (natural language analysis G06F40/20; semantic analysis G06F40/30; processing or translation of natural language G06F40/40) · CPC title
using natural language analysis · CPC title