Video sequence selection method, computer device, and storage medium

US12008810B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12008810-B2
Application number  US-202117225969-A
Country             US
Kind code           B2
Filing date         Apr 8, 2021
Priority date       Mar 5, 2019
Publication date    Jun 11, 2024
Grant date          Jun 11, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This application discloses a video sequence selection method, applicable to a computer device, the method including: receiving a to-be-matched video and a to-be-matched text, the to-be-matched text having a to-be-matched text feature sequence; invoking a spatiotemporal candidate region generator to extract a spatiotemporal candidate region set from the to-be-matched video, the spatiotemporal candidate region set including N spatiotemporal candidate regions; performing feature extraction on each spatiotemporal candidate region by using a convolutional neural network, to obtain N to-be-matched video feature sequences; invoking an attention-based interactor to obtain a matching score corresponding to each spatiotemporal candidate region, the matching score being used for representing a matching relationship between the spatiotemporal candidate region and the to-be-matched text; and selecting a target spatiotemporal candidate region from the spatiotemporal candidate region set according to the matching score corresponding to each spatiotemporal candidate region, and outputting the target spatiotemporal candidate region. In this application, an association between the video and the text in time sequence is considered during matching, thereby increasing a degree of matching between a video sequence and the text.
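The final selection step in the abstract can be illustrated with a minimal sketch. It assumes the candidate regions and their feature sequences have already been extracted; `Candidate`, `select_target`, and `toy_interactor` are hypothetical names introduced here for illustration, and the toy scorer (a dot product of mean feature vectors) only stands in for the attention-based interactor, which the patent does not define this way.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Rows = List[List[float]]  # one feature row per time step (or per word)


@dataclass
class Candidate:
    """Hypothetical container for one spatiotemporal candidate region."""
    region_id: int
    features: Rows  # the to-be-matched video feature sequence for this region


def select_target(
    candidates: Sequence[Candidate],
    text_features: Rows,
    interactor: Callable[[Rows, Rows], float],
) -> Candidate:
    """Score every candidate against the text and return the highest scorer."""
    if not candidates:
        raise ValueError("the candidate set must contain at least one region (N >= 1)")
    scored = [(interactor(c.features, text_features), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]


def mean_vector(rows: Rows) -> List[float]:
    """Average the rows of a feature sequence column by column."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def toy_interactor(video_rows: Rows, text_rows: Rows) -> float:
    """Assumed stand-in scorer: dot product of the two mean feature vectors."""
    v = mean_vector(video_rows)
    q = mean_vector(text_rows)
    return sum(a * b for a, b in zip(v, q))
```

Passing the scorer in as a callable keeps the selection logic independent of how the matching score is computed, so a trained interactor could replace the toy one without changing `select_target`.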

First claim

Opening claim text (preview).

What is claimed is:

1. A video sequence selection method, applicable to a computer device, the method comprising:
    receiving, by the computer device, a to-be-matched video and a to-be-matched text, wherein the to-be-matched text is not part of the to-be-matched video, the to-be-matched video comprising a plurality of frames, the to-be-matched text comprising at least one word, and the to-be-matched text having a to-be-matched text feature sequence corresponding to a target object;
    invoking, by the computer device, a spatiotemporal candidate region generator to extract a spatiotemporal candidate region set from the to-be-matched video, the spatiotemporal candidate region set comprising N spatiotemporal candidate regions, N being an integer greater than or equal to 1, and each spatiotemporal candidate region corresponding to images within a respective video sequence in the to-be-matched video that include a candidate object;
    performing, by the computer device, feature extraction on each spatiotemporal candidate region in the spatiotemporal candidate region set by using a convolutional neural network, to obtain N to-be-matched video feature sequences, each to-be-matched video feature sequence corresponding to a respective spatiotemporal candidate region in the spatiotemporal candidate region set and representing a respective candidate object in the respective spatiotemporal candidate region;
    invoking, by the computer device, an attention-based interactor to obtain a matching score corresponding to each spatiotemporal candidate region, the interactor being configured to process the to-be-matched video feature sequence and the to-be-matched text feature sequence, and the matching score being used for representing a matching relationship between a respective candidate object in the spatiotemporal candidate region and the target object corresponding to the to-be-matched text; and
    selecting, by the computer device, from the spatiotemporal candidate region set, a target spatiotemporal candidate region having a highest matching score outputted by the interactor, and outputting the target spatiotemporal candidate region as representing the target object corresponding to the to-be-matched text.

2. The method according to claim 1, wherein the invoking, by the computer device, a spatiotemporal candidate region generator to extract a spatiotemporal candidate region set from the to-be-matched video comprises:
    invoking, by the computer device, the spatiotemporal candidate region generator to obtain a candidate region and a confidence score of each frame in the to-be-matched video, each candidate region corresponding to a respective confidence score;
    invoking, by the computer device, the spatiotemporal candidate region generator to obtain a degree of overlap of similar image content between every two adjacent frames in the to-be-matched video; and
    invoking, by the computer device, the spatiotemporal candidate region generator to generate the spatiotemporal candidate region set according to the candidate region and the confidence score of each frame and the overlap degrees.

3. The method according to claim 1, wherein the invoking, by the computer device, an attention-based interactor to obtain a matching score corresponding to each spatiotemporal candidate region comprises:
    invoking, by the computer device for each spatiotemporal candidate region, an encoder of the interactor to encode the to-be-matched video feature sequence corresponding to the spatiotemporal candidate region, to obtain a visual feature set, the visual feature set comprising at least one visual feature of a candidate object in the spatiotemporal candidate region;
    invoking, by the computer device, the encoder of the interactor to encode the to-be-matched text feature sequence, to obtain a text feature set, the text feature set comprising at least one text feature of the target object;
    invoking, by the computer device, the interactor to determine a visual text feature set according to the visual feature set and the text feature set, the visual text feature set comprising at least one visual text feature, the visual text feature representing a visual feature-based text feature; and
    invoking, by the computer device, the interactor to determine the matching score corresponding to the candidate object in the spatiotemporal candidate region and the target object according to the visual text feature set and the visual feature set.

4. The method according to claim 3, wherein the invoking, by the computer device, an encoder of the interactor to encode the to-be-matched video feature sequence corresponding to the spatiotemporal candidate region, to obtain a visual feature set comprises:
    calculating the visual feature set in the following manner:
        $H^p = \{h_t^p\}_{t=1}^{t_p}$, and $h_t^p = \mathrm{LSTM}_p(f_t^p, h_{t-1}^p)$,
    $H^p$ representing the visual feature set, $h_t^p$ representing a t-th visual feature in the visual feature set, $t_p$ representing a time step in the spatiotemporal candidate region, $h_{t-1}^p$ representing a (t−1)-th visual feature in the visual feature set, $\mathrm{LSTM}_p(\cdot)$ representing a first long short-term memory (LSTM) network encoder, and $f_t^p$ representing a t-th row of features in the to-be-matched video feature sequence; and
    the invoking, by the computer device, the encoder of the interactor to encode the to-be-matched text feature sequence, to obtain a text feature set comprises:
    calculating the text feature set in the following manner:
        $H^q = \{h_t^q\}_{t=1}^{t_q}$, and $h_t^q = \mathrm{LSTM}_q(f_t^q, h_{t-1}^q)$,
    $H^q$ representing the text feature set, $h_t^q$ representing a t-th text feature in the text feature set, $t_q$ representing a word quantity of the to-be-matched text, $h_{t-1}^q$ representing a (t−1)-th text feature in the text feature set, $\mathrm{LSTM}_q(\cdot)$ representing a second LSTM encoder, and $f_t^q$ representing a t-th row of features in the to-be-matched text feature sequence.

5. The method according to claim 3, wherein the invoking, by the computer device, the interactor to determine a visual text feature set according to the visual feature set and the text feature set comprises:
    invoking, by the computer device, the interactor to calculate an attention weight of the text feature corresponding to the visual feature according to the visual feature set and the text feature set;
    invoking, by the computer device, the interactor to calculate a normalized attention weight of the text feature corresponding to the visual feature according to the attention weight; and
    invoking, by the computer device, the interactor to calculate the visual text feature set according to the normalized attention weight and the text feature.

6. The method according to claim 5, wherein the invoking, by the computer device, the interactor to calculate an attention weight of the text feature corresponding to the visual feature according to the visual feature set and the text feature set comprises:
    calculate the attention weight in the following manner:
        $e_{i,j} = w^T \tanh(W_q h_j^q + W_p h_i^p + b_1) + b_2$;
    $e_{i,j}$ representing an attention weight of a j-th text feature corresponding to an i-th visual feature, $h_j^q$ representing the j-th text feature, $h_i^p$ representing the i-th visual feature, $w^T$ representing a first model parameter, $W_q$ representing a second model parameter, $W_p$ representing a third model parameter, $b_1$ representing a fourth model parameter, $b_2$ representing a fifth model parameter, and $\tanh(\cdot)$ representing a hyperbolic tangent function;
    the invoking, by the computer device, the interactor to calculate a normalized attention weight of the text feature corresponding to the visual feature according to the atte
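The attention steps in claims 5 and 6 can be sketched in plain Python. The attention-weight formula follows claim 6 as recited. The normalization formula is cut off in the preview text, so the softmax used here is an assumption, as is the weighted sum that produces the visual text feature. All function names are hypothetical, and the parameters would be learned in practice rather than supplied by hand.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attention_weights(h_p_i: Vector, H_q: List[Vector], w: Vector,
                      W_q: Matrix, W_p: Matrix, b1: Vector, b2: float) -> List[float]:
    """Compute e_{i,j} = w^T tanh(W_q h_j^q + W_p h_i^p + b_1) + b_2 for every j."""
    proj_p = matvec(W_p, h_p_i)  # the visual projection is shared across all j
    weights = []
    for h_q_j in H_q:
        proj_q = matvec(W_q, h_q_j)
        hidden = [math.tanh(q + p + b) for q, p, b in zip(proj_q, proj_p, b1)]
        weights.append(sum(wk * hk for wk, hk in zip(w, hidden)) + b2)
    return weights


def softmax(xs: List[float]) -> List[float]:
    """Assumed normalization: softmax over the text positions j."""
    peak = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def visual_text_feature(h_p_i: Vector, H_q: List[Vector], w: Vector,
                        W_q: Matrix, W_p: Matrix, b1: Vector, b2: float) -> Vector:
    """Assumed combination: attention-weighted sum of the text features."""
    a = softmax(attention_weights(h_p_i, H_q, w, W_q, W_p, b1, b2))
    dim = len(H_q[0])
    return [sum(a_j * h_q_j[k] for a_j, h_q_j in zip(a, H_q)) for k in range(dim)]
```

Calling `visual_text_feature` once per visual feature yields one visual text feature per time step, which matches the claim's description of a visual text feature set aligned with the visual feature set.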

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Classifications

  • Supervised learning · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Engine management systems · CPC title

  • Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN] · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12008810B2 cover?
This application discloses a video sequence selection method, applicable to a computer device, the method including: receiving a to-be-matched video and a to-be-matched text, the to-be-matched text having a to-be-matched text feature sequence; invoking a spatiotemporal candidate region generator to extract a spatiotemporal candidate region set from the to-be-matched video, the spatiotemporal ca…
Who is the assignee on this patent?
Tencent Tech Shenzhen Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V20/46. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 11, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).