Video processing method and apparatus

US12579797B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12579797-B2
Application number: US-202217852684-A
Country: US
Kind code: B2
Filing date: Jun 29, 2022
Priority date: Dec 31, 2019
Publication date: Mar 17, 2026
Grant date: Mar 17, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A video clip location technology in the field of computer vision pertaining to artificial intelligence that provides a video processing method and apparatus. The method includes: obtaining a semantic feature of an input sentence; performing semantic enhancement on a video frame based on the semantic feature to obtain a video feature of the video frame, where the video feature includes the semantic feature; and determining, based on the semantic feature and the video feature, whether a video clip to which the video frame belongs is a target video clip corresponding to the input sentence. The method helps improve accuracy of recognizing a target video clip corresponding to an input sentence.
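The three steps in the abstract (sentence feature, semantic enhancement of each frame, clip decision) can be pictured with a small sketch. Everything below is an illustrative assumption rather than the patented implementation: the sigmoid gating in `enhance`, the cosine-similarity score, the `threshold` value, and all function names are hypothetical stand-ins chosen only to make the data flow concrete.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enhance(frame_feature, semantic_feature):
    """Hypothetical semantic enhancement: scale each frame channel by a
    sigmoid gate computed from the matching sentence-feature channel."""
    return [f / (1.0 + math.exp(-s)) for f, s in zip(frame_feature, semantic_feature)]


def clip_score(frame_features, semantic_feature):
    """Mean similarity between the sentence feature and each enhanced frame."""
    if not frame_features:
        return 0.0
    scores = [cosine(enhance(f, semantic_feature), semantic_feature)
              for f in frame_features]
    return sum(scores) / len(scores)


def is_target_clip(frame_features, semantic_feature, threshold=0.5):
    """Assumed decision rule: the clip matches when its score reaches the threshold."""
    return clip_score(frame_features, semantic_feature) >= threshold


# Toy data: a 3-channel sentence feature and two single-frame "clips".
sentence = [2.0, -2.0, 0.0]
matching_clip = [[1.0, 0.0, 0.0]]
other_clip = [[0.0, 1.0, 0.0]]
print(is_target_clip(matching_clip, sentence))  # True
print(is_target_clip(other_clip, sentence))     # False
```

In a real system the sentence and frame features would come from trained networks; the sketch only shows how a sentence-conditioned frame feature feeds the final match decision.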

First claim

Opening claim text (preview).

What is claimed is:

1. A video processing method comprising: extracting, with a convolutional neural network (CNN) combined with a recurrent neural network (RNN), a semantic feature of an input sentence; performing semantic enhancement on a video frame based on the semantic feature to obtain a video feature of the video frame, wherein the video feature is obtained through convolution processing using a convolution kernel determined based on the semantic feature, and wherein the semantic enhancement comprises contextual interaction among video frames by exchanging features between the video frame and at least one other video frame in the same video via a context interaction submodule; and determining, based on the semantic feature and the video feature, whether a video clip to which the video frame belongs is a target video clip corresponding to the input sentence.

2. The video processing method according to claim 1, wherein performing the semantic enhancement on the video frame based on the semantic feature, to obtain the video feature of the video frame further comprises: determining a word corresponding to the video frame in the input sentence; and performing semantic enhancement on the video frame based on a semantic feature of the word corresponding to the video frame to obtain the video feature of the video frame.

3. The video processing method according to claim 1, wherein performing the semantic enhancement on the video frame based on the semantic feature, to obtain the video feature of the video frame in the video further comprises: performing feature extraction, with the CNN, on the video frame based on the semantic feature, to obtain the video feature of the video frame.

4. The video processing method according to claim 1, further comprising: obtaining an initial video feature of the video frame; and performing the semantic enhancement on the video frame based on the semantic feature, to obtain the video feature of the video frame further comprises: performing semantic enhancement on the initial video feature based on the semantic feature, to obtain the video feature of the video frame.

5. The video processing method according to claim 1, further comprising: performing feature fusion on the video feature of the video frame by using a video feature of at least one other video frame to obtain a fused video feature of the video frame, wherein the at least one other video frame and the video frame belong to a same video; and determining, based on the semantic feature and the video feature, whether the video clip to which the video frame belongs is the target video clip corresponding to the input sentence further comprises: determining, based on the semantic feature and the fused video feature, whether the video clip to which the video frame belongs is the target video clip corresponding to the input sentence.

6. The video processing method according to claim 1, wherein determining, based on the semantic feature and the video feature, whether the video clip to which the video frame belongs is the target video clip corresponding to the input sentence further comprises: determining a hierarchical structure of the video clip in time domain based on the video feature; and determining, based on the semantic feature and the hierarchical structure, whether the video clip is the target video clip corresponding to the input sentence.

7. A video processing apparatus, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to perform: extracting, with a convolutional neural network (CNN) combined with a recurrent neural network (RNN), a semantic feature of an input sentence; performing semantic enhancement on a video frame based on the semantic feature, to obtain a video feature of the video frame, wherein the video feature is obtained through convolution processing using a convolution kernel determined based on the semantic feature, and wherein the semantic enhancement comprises contextual interaction among video frames by exchanging features between the video frame and at least one other video frame in the same video via a context interaction submodule; and determining, based on the semantic feature and the video feature, whether a video clip to which the video frame belongs is a target video clip corresponding to the input sentence.

8. The video processing apparatus according to claim 7, wherein performing the semantic enhancement on the video frame based on the semantic feature, to obtain the video feature of the video frame further comprises: determining a word corresponding to the video frame in the input sentence; and performing semantic enhancement on the video frame based on a semantic feature of the word corresponding to the video frame to obtain the video feature of the video frame.

9. The video processing apparatus according to claim 7, wherein performing the semantic enhancement on the video frame based on the semantic feature, to obtain the video feature of the video frame in the video further comprises: performing feature extraction, with the CNN, on the video frame based on the semantic feature, to obtain the video feature of the video frame.

10. The video processing apparatus according to claim 7, wherein the processor is further configured to: obtain an initial video feature of the video frame; and performing the semantic enhancement on the video frame based on the semantic feature, to obtain the video feature of the video frame further comprises: performing semantic enhancement on the initial video feature based on the semantic feature, to obtain the video feature of the video frame.

11. The video processing apparatus according to claim 7, wherein the processor is further configured to: perform feature fusion on the video feature of the video frame by using a video feature of at least one other video frame to obtain a fused video feature of the video frame, wherein the at least one other video frame and the video frame belong to a same video; and determining, based on the semantic feature and the video feature, whether the video clip to which the video frame belongs is the target video clip corresponding to the input sentence further comprises: determining, based on the semantic feature and the fused video feature, whether the video clip to which the video frame belongs is the target video clip corresponding to the input sentence.

12. The video processing apparatus according to claim 7, wherein determining, based on the semantic feature and the video feature, whether the video clip to which the video frame belongs is the target video clip corresponding to the input sentence further comprises: determining a hierarchical structure of the video clip in time domain based on the video feature; and determining, based on the semantic feature and the hierarchical structure, whether the video clip is the target video clip corresponding to the input sentence.

13. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable medium stores program code to be executed by a device, and the program code is used for performing: extracting, with a convolutional neural network (CNN) combined with a recurrent neural network (RNN), a semantic feature of an input sentence; performing semantic enhancement on a video frame based on the semantic feature, to obtain a video feature of the video frame, wherein the video feature is obtained through convolution processing using a convolution kernel determined based on the semantic feature, and wh
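Two ideas from the claims lend themselves to a small illustration: a convolution kernel "determined based on the semantic feature" that also exchanges features between neighbouring frames (claim 1), and a hierarchical structure of the clip in the time domain (claim 6). The sketch below is one possible reading under stated assumptions, not the patent's actual design: the linear map in `kernel_from_semantics`, the softmax normalisation, the zero padding, the pairwise-averaging pyramid, and every name are hypothetical.

```python
import math


def kernel_from_semantics(semantic_feature, weights):
    """Assumed kernel generator: one linear layer (one row of `weights` per
    kernel tap) followed by a softmax so the taps sum to 1."""
    logits = [sum(w * s for w, s in zip(row, semantic_feature)) for row in weights]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_conv(frames, kernel):
    """Slide the (odd-length) kernel along the time axis with zero padding,
    using the cross-correlation convention common in CNNs. Each output frame
    mixes in features of its temporal neighbours, a simple form of context
    interaction among frames."""
    half = len(kernel) // 2
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        mixed = [0.0] * dim
        for k, tap in enumerate(kernel):
            src = t + k - half
            if 0 <= src < len(frames):
                for d in range(dim):
                    mixed[d] += tap * frames[src][d]
        out.append(mixed)
    return out


def temporal_pyramid(frames):
    """Assumed hierarchy: average adjacent frames level by level until a
    single clip-level feature remains."""
    levels = [frames]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = []
        for i in range(0, len(prev), 2):
            group = prev[i:i + 2]
            nxt.append([sum(col) / len(group) for col in zip(*group)])
        levels.append(nxt)
    return levels


# Toy data: three 1-channel frames, a 2-channel sentence feature, and an
# all-zero weight matrix, which yields a uniform 3-tap kernel.
frames = [[3.0], [6.0], [9.0]]
kernel = kernel_from_semantics([1.0, -1.0], [[0.0, 0.0]] * 3)
enhanced = semantic_conv(frames, kernel)
pyramid = temporal_pyramid(frames)
print(len(enhanced), len(pyramid))  # 3 3
```

A trained model would learn `weights` so that different sentences produce different kernels; the fixed zero matrix here only keeps the example easy to follow.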

Assignees

Huawei Tech Co Ltd

Inventors

Classifications

  • Matching video sequences · CPC title

  • G06V20/46 · Primary

    Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title

  • of extracted features · CPC title

  • Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

  • of extracted features · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12579797B2 cover?
A video clip location technology in the field of computer vision pertaining to artificial intelligence that provides a video processing method and apparatus. The method includes: obtaining a semantic feature of an input sentence; performing semantic enhancement on a video frame based on the semantic feature to obtain a video feature of the video frame, where the video feature includes the seman…
Who is the assignee on this patent?
Huawei Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06V20/46. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 17, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).