Method and apparatus for detecting temporal action of video, electronic device and storage medium

US11615140B2 · US · B2

Patent metadata
Publication number: US-11615140-B2
Application number: US-202117144523-A
Country: US
Kind code: B2
Filing date: Jan 8, 2021
Priority date: Jan 10, 2020
Publication date: Mar 28, 2023
Grant date: Mar 28, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes screening, by a video-clip screening module in a video description model, a plurality of video proposal clips acquired from a video to be analyzed, to acquire a plurality of video clips suitable for description. The plural video proposal clips acquired from the video to be analyzed may be screened by the video-clip screening module to acquire the plural video clips suitable for description; and then, each video clip is described by a video-clip describing module, thus avoiding description of all the video proposal clips, only describing the screened video clips which have strong correlation with the video and are suitable for description, removing the interference of the description of the video clips which are not suitable for description in the description of the video, guaranteeing the accuracy of the final descriptions of the video clips, and improving the quality of the descriptions of the video clips.
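The screen-then-describe flow summarized in the abstract can be illustrated with a minimal Python sketch. Everything named here is a hypothetical placeholder rather than the patent's implementation: the `Clip` type, `screen_clips`, `describe_clips`, the toy scoring and describing functions, and the 0.5 threshold are all assumptions made for illustration. The only behavior taken from the source is that proposal clips are filtered first and only the retained clips are described.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Clip:
    """A proposal clip, identified by start and end time in seconds (assumed layout)."""
    start: float
    end: float


def screen_clips(
    proposals: Sequence[Clip],
    score_fn: Callable[[Clip], float],
    threshold: float = 0.5,  # assumed value; the source only says "preset"
) -> List[Clip]:
    """Keep the proposals whose score is strictly greater than the threshold."""
    return [clip for clip in proposals if score_fn(clip) > threshold]


def describe_clips(
    clips: Sequence[Clip],
    describe_fn: Callable[[Clip], str],
) -> List[Tuple[Clip, str]]:
    """Describe only the clips that survived screening."""
    return [(clip, describe_fn(clip)) for clip in clips]


# Toy stand-ins for the trained screening and describing modules (hypothetical).
def toy_score(clip: Clip) -> float:
    # Longer clips score higher, capped at 1.0.
    return min(1.0, (clip.end - clip.start) / 10.0)


def toy_describe(clip: Clip) -> str:
    return f"clip from {clip.start:.1f}s to {clip.end:.1f}s"


proposals = [Clip(0.0, 2.0), Clip(3.0, 12.0), Clip(5.0, 11.0)]
kept = screen_clips(proposals, toy_score)
descriptions = describe_clips(kept, toy_describe)
```

In this toy run the 2-second clip scores 0.2 and is dropped, so the describing step never sees it. In a real system the scoring function would be a trained classifier and the describing function a captioning model.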

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating descriptions of video clips, comprising: screening, by a video-clip screening module in a video description model, a plurality of video proposal clips acquired from a video to be analyzed, so as to acquire a plurality of video clips suitable for description; and describing each video clip by a video-clip describing module in the video description model.

2. The method according to claim 1, wherein the video-clip screening module and the video-clip describing module in the video description model are trained jointly.

3. The method according to claim 2, wherein before the screening, by a video-clip screening module in a video description model, a plurality of pre-acquired video proposal clips, so as to acquire a plurality of video clips suitable for description, the method further comprises: extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed; or acquiring the plural video proposal clips manually extracted from the video to be analyzed; further, the video-clip proposing module, the video-clip screening module and the video-clip describing module in the video description model are trained jointly if the video description model further comprises the video-clip proposing module.

4. The method according to claim 3, wherein the extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed comprises: extracting each video frame in the video to be analyzed; extracting video frame features in the video frames by at least one of a pre-trained first sub-model, a pre-trained second sub-model and a pre-trained third sub-model respectively to obtain corresponding video frame feature sequences, wherein at least one video frame feature sequence is obtained in total; for each video frame feature sequence, acquiring a corresponding clip confidence map by a pre-trained confidence statistical model, wherein at least one clip confidence map is obtained in total; and acquiring the plural video proposal clips in the video to be analyzed according to the at least one clip confidence map.

5. The method according to claim 4, wherein the acquiring the plural video proposal clips in the video to be analyzed according to the at least one clip confidence map comprises: if only one clip confidence map is comprised, acquiring top N video clips according to the decreasing confidences of the video clips in the clip confidence map as the corresponding video proposal clips; and if at least two clip confidence maps are comprised, performing weighted fusion on the confidences of the same clips in the at least two clip confidence maps to obtain the fused confidences of the clips; and acquiring top N video clips according to the decreasing fused confidences of the clips as the corresponding video proposal clips.

6. The method according to claim 4, wherein the screening, by a video-clip screening module in a video description model, a plurality of video proposal clips, so as to acquire a plurality of video clips suitable for description comprises: acquiring the feature of the video to be analyzed; acquiring the feature of each video proposal clip; and screening the plural video clips suitable for description from the plural video proposal clips using a pre-trained classification model, the feature of the video to be analyzed and the feature of each video proposal clip.

7. The method according to claim 6, wherein the screening the plural video clips suitable for description from the plural video proposal clips using a pre-trained classification model, the feature of the video to be analyzed and the feature of each video proposal clip comprises: inputting the feature of each of the plural video proposal clips and the feature of the video to be analyzed into the classification model, and acquiring a probability value output by the classification model; judging whether the output probability value is greater than a preset probability threshold; and if yes, determining the video proposal clip as one video clip suitable for description, wherein the plural video clips suitable for description are obtained in total.

8. A method for training a video description model, comprising: independently pre-training a video-clip screening module and a video-clip describing module in the video description model; and jointly training the pre-trained video-clip screening module and the pre-trained video-clip describing module.

9. The method according to claim 8, wherein if the video description model further comprises a video-clip proposing module, the method further comprises: independently pre-training the video-clip proposing module in the video description model; and jointly training the pre-trained video-clip proposing module, the pre-trained video-clip screening module and the pre-trained video-clip describing module.

10. The method according to claim 9, wherein the jointly training the pre-trained video-clip proposing module, the pre-trained video-clip screening module and the pre-trained video-clip describing module comprises: keeping any two of the video-clip proposing module, the video-clip screening module and the video-clip describing module fixed in sequence, and training the third module with a reinforcement learning method until the three modules are trained.

11. An electronic device, comprising: at least one processor; and a memory communicatively connected with the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform a method for generating descriptions of video clips according to claim 1.

12. The electronic device according to claim 11, wherein the video-clip screening module and the video-clip describing module in the video description model are trained jointly.

13. The electronic device according to claim 12, wherein before the screening, by a video-clip screening module in a video description model, a plurality of pre-acquired video proposal clips, so as to acquire a plurality of video clips suitable for description, the method further comprises: extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed; or acquiring the plural video proposal clips manually extracted from the video to be analyzed; further, the video-clip proposing module, the video-clip screening module and the video-clip describing module in the video description model are trained jointly if the video description model further comprises the video-clip proposing module.

14. The electronic device according to claim 13, wherein the extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed comprises: extracting each video frame in the video to be analyzed; extracting video frame features in the video frames by at least one of a pre-trained first sub-model, a pre-trained second sub-model and a pre-trained third sub-model respectively to obtain corresponding video frame feature sequences, wherein at least one video frame feature sequence is obtained in total; for each video frame feature sequence, acquiring a corresponding clip confidence map by a pre-trained confidence statistical model, wherein at least one clip confidence map is obtained in total; and acquiring the plural video proposal c
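The proposal-selection step described in claim 5 (rank by confidence when there is one clip confidence map, otherwise fuse the confidences of the same clips with weights and then rank) might be sketched in Python as follows. This is only a hedged illustration: representing a confidence map as a dictionary keyed by `(start_index, end_index)`, the function names, the equal default weights, and the choice to fuse only clips present in every map are assumptions, not details given in the claims.

```python
from typing import Dict, List, Optional, Sequence, Tuple

ClipKey = Tuple[int, int]             # (start_index, end_index); assumed representation
ConfidenceMap = Dict[ClipKey, float]  # clip -> confidence; assumed representation


def fuse_confidence_maps(
    maps: Sequence[ConfidenceMap],
    weights: Sequence[float],
) -> ConfidenceMap:
    """Weighted sum of the confidences of clips that appear in every map."""
    if not maps or len(maps) != len(weights):
        raise ValueError("need one weight per confidence map")
    shared = set(maps[0]).intersection(*maps[1:])
    return {
        clip: sum(w * m[clip] for w, m in zip(weights, maps))
        for clip in shared
    }


def top_n_proposals(
    maps: Sequence[ConfidenceMap],
    n: int,
    weights: Optional[Sequence[float]] = None,
) -> List[ClipKey]:
    """Return the N clips with the highest (fused) confidence, in decreasing order."""
    if len(maps) == 1:
        scores = maps[0]
    else:
        if weights is None:
            # Equal weighting is an assumption; the claim only says "weighted fusion".
            weights = [1.0 / len(maps)] * len(maps)
        scores = fuse_confidence_maps(maps, weights)
    # Sort by decreasing confidence; break ties by clip key for determinism.
    ranked = sorted(scores, key=lambda clip: (-scores[clip], clip))
    return ranked[:n]


map_a = {(0, 4): 0.9, (2, 6): 0.4, (5, 9): 0.7}
map_b = {(0, 4): 0.5, (2, 6): 0.8, (5, 9): 0.6}

single = top_n_proposals([map_a], n=2)
fused = top_n_proposals([map_a, map_b], n=2, weights=[0.5, 0.5])
```

With equal weights the fused confidences are roughly 0.70, 0.60 and 0.65, so both calls select clips `(0, 4)` and `(5, 9)`. Changing the weights toward `map_b` would promote `(2, 6)` instead.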

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • Characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Reinforcement learning

  • Supervised learning

  • Convolutional networks [CNN, ConvNet]

  • Auto-encoder networks; Encoder-decoder networks

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11615140B2 cover?
A method includes screening, by a video-clip screening module in a video description model, a plurality of video proposal clips acquired from a video to be analyzed, to acquire a plurality of video clips suitable for description. The plural video proposal clips acquired from the video to be analyzed may be screened by the video-clip screening module to acquire the plural video clips suitable fo…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification H04N21/84. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Mar 28, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).