Method and system to generate targeted captions and summarize long, continuous media files
US-2018225519-A1 · Aug 9, 2018 · US
US11615140B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11615140-B2 |
| Application number | US-202117144523-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 8, 2021 |
| Priority date | Jan 10, 2020 |
| Publication date | Mar 28, 2023 |
| Grant date | Mar 28, 2023 |
Official abstract text for this publication.
A method includes screening, by a video-clip screening module in a video description model, a plurality of video proposal clips acquired from a video to be analyzed, to acquire a plurality of video clips suitable for description. The plural video proposal clips acquired from the video to be analyzed may be screened by the video-clip screening module to acquire the plural video clips suitable for description; and then, each video clip is described by a video-clip describing module, thus avoiding description of all the video proposal clips, only describing the screened video clips which have strong correlation with the video and are suitable for description, removing the interference of the description of the video clips which are not suitable for description in the description of the video, guaranteeing the accuracy of the final descriptions of the video clips, and improving the quality of the descriptions of the video clips.
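The screen-then-describe flow in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: the `(start, end)` clip representation, the `score_fn` and `describe_fn` callables standing in for the trained screening and describing modules, and the toy threshold value. The probability-threshold test mirrors the one recited in claim 7.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical clip representation: (start_sec, end_sec).
Clip = Tuple[float, float]


def screen_clips(
    proposals: Sequence[Clip],
    score_fn: Callable[[Clip], float],
    threshold: float = 0.5,
) -> List[Clip]:
    """Keep only proposals whose score is greater than the threshold.

    score_fn is a stand-in for the screening module's classifier output.
    """
    return [clip for clip in proposals if score_fn(clip) > threshold]


def describe_clips(
    clips: Sequence[Clip],
    describe_fn: Callable[[Clip], str],
) -> List[Tuple[Clip, str]]:
    """Describe each screened clip; describe_fn stands in for the describing module."""
    return [(clip, describe_fn(clip)) for clip in clips]


# Toy stand-ins (assumptions for illustration only): longer clips score higher.
def toy_score(clip: Clip) -> float:
    return min(1.0, (clip[1] - clip[0]) / 8.0)


def toy_describe(clip: Clip) -> str:
    return f"event from {clip[0]:.1f}s to {clip[1]:.1f}s"


proposals = [(0.0, 4.0), (3.0, 3.5), (10.0, 18.0)]
kept = screen_clips(proposals, toy_score, threshold=0.4)
captions = describe_clips(kept, toy_describe)
```

Only the clips that pass screening reach the describing step, which is the point the abstract makes about avoiding descriptions of unsuitable proposals.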
Opening claim text (preview).
What is claimed is:
1. A method for generating descriptions of video clips, comprising: screening, by a video-clip screening module in a video description model, a plurality of video proposal clips acquired from a video to be analyzed, so as to acquire a plurality of video clips suitable for description; and describing each video clip by a video-clip describing module in the video description model.
2. The method according to claim 1, wherein the video-clip screening module and the video-clip describing module in the video description model are trained jointly.
3. The method according to claim 2, wherein before the screening, by a video-clip screening module in a video description model, a plurality of pre-acquired video proposal clips, so as to acquire a plurality of video clips suitable for description, the method further comprises: extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed; or acquiring the plural video proposal clips manually extracted from the video to be analyzed; further, the video-clip proposing module, the video-clip screening module and the video-clip describing module in the video description model are trained jointly if the video description model further comprises the video-clip proposing module.
4. The method according to claim 3, wherein the extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed comprises: extracting each video frame in the video to be analyzed; extracting video frame features in the video frames by at least one of a pre-trained first sub-model, a pre-trained second sub-model and a pre-trained third sub-model respectively to obtain corresponding video frame feature sequences, wherein at least one video frame feature sequence is obtained in total; for each video frame feature sequence, acquiring a corresponding clip confidence map by a pre-trained confidence statistical model, wherein at least one clip confidence map is obtained in total; and acquiring the plural video proposal clips in the video to be analyzed according to the at least one clip confidence map.
5. The method according to claim 4, wherein the acquiring the plural video proposal clips in the video to be analyzed according to the at least one clip confidence map comprises: if only one clip confidence map is comprised, acquiring top N video clips according to the decreasing confidences of the video clips in the clip confidence map as the corresponding video proposal clips; and if at least two clip confidence maps are comprised, performing weighted fusion on the confidences of the same clips in the at least two clip confidence maps to obtain the fused confidences of the clips; and acquiring top N video clips according to the decreasing fused confidences of the clips as the corresponding video proposal clips.
6. The method according to claim 4, wherein the screening, by a video-clip screening module in a video description model, a plurality of video proposal clips, so as to acquire a plurality of video clips suitable for description comprises: acquiring the feature of the video to be analyzed; acquiring the feature of each video proposal clip; and screening the plural video clips suitable for description from the plural video proposal clips using a pre-trained classification model, the feature of the video to be analyzed and the feature of each video proposal clip.
7. The method according to claim 6, wherein the screening the plural video clips suitable for description from the plural video proposal clips using a pre-trained classification model, the feature of the video to be analyzed and the feature of each video proposal clip comprises: inputting the feature of each of the plural video proposal clips and the feature of the video to be analyzed into the classification model, and acquiring a probability value output by the classification model; judging whether the output probability value is greater than a preset probability threshold; and if yes, determining the video proposal clip as one video clip suitable for description, wherein the plural video clips suitable for description are obtained in total.
8. A method for training a video description model, comprising: independently pre-training a video-clip screening module and a video-clip describing module in the video description model; and jointly training the pre-trained video-clip screening module and the pre-trained video-clip describing module.
9. The method according to claim 8, wherein if the video description model further comprises a video-clip proposing module, the method further comprises: independently pre-training the video-clip proposing module in the video description model; and jointly training the pre-trained video-clip proposing module, the pre-trained video-clip screening module and the pre-trained video-clip describing module.
10. The method according to claim 9, wherein the jointly training the pre-trained video-clip proposing module, the pre-trained video-clip screening module and the pre-trained video-clip describing module comprises: keeping any two of the video-clip proposing module, the video-clip screening module and the video-clip describing module fixed in sequence, and training the third module with a reinforcement learning method until the three modules are trained.
11. An electronic device, comprising: at least one processor; and a memory communicatively connected with the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform a method for generating descriptions of video clips according to claim 1.
12. The electronic device according to claim 11, wherein the video-clip screening module and the video-clip describing module in the video description model are trained jointly.
13. The electronic device according to claim 12, wherein before the screening, by a video-clip screening module in a video description model, a plurality of pre-acquired video proposal clips, so as to acquire a plurality of video clips suitable for description, the method further comprises: extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed; or acquiring the plural video proposal clips manually extracted from the video to be analyzed; further, the video-clip proposing module, the video-clip screening module and the video-clip describing module in the video description model are trained jointly if the video description model further comprises the video-clip proposing module.
14. The electronic device according to claim 13, wherein the extracting, by a video-clip proposing module in the pre-trained video description model, the plural video proposal clips from the video to be analyzed comprises: extracting each video frame in the video to be analyzed; extracting video frame features in the video frames by at least one of a pre-trained first sub-model, a pre-trained second sub-model and a pre-trained third sub-model respectively to obtain corresponding video frame feature sequences, wherein at least one video frame feature sequence is obtained in total; for each video frame feature sequence, acquiring a corresponding clip confidence map by a pre-trained confidence statistical model, wherein at least one clip confidence map is obtained in total; and acquiring the plural video proposal c
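The proposal-selection step recited in claim 5 (take the top N clips from a single confidence map, or fuse several maps by weighted sum and then take the top N) could look roughly like the Python sketch below. The dictionary layout for a confidence map, the function name `top_n_proposals`, the choice to fuse only clips present in every map, and the example weights are all assumptions made for illustration; the claim does not specify them.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: a confidence map assigns a score to each
# (start_frame, end_frame) clip.
Clip = Tuple[int, int]
ConfidenceMap = Dict[Clip, float]


def top_n_proposals(
    maps: Sequence[ConfidenceMap],
    weights: Sequence[float],
    n: int,
) -> List[Tuple[Clip, float]]:
    """Return the n highest-confidence clips, fusing maps when several are given."""
    if not maps or len(maps) != len(weights):
        raise ValueError("need one weight per confidence map")
    if len(maps) == 1:
        # Single map: rank by its confidences directly.
        fused = dict(maps[0])
    else:
        # Assumption: only clips that appear in every map are fused.
        shared = set(maps[0]).intersection(*maps[1:])
        fused = {
            clip: sum(w * m[clip] for w, m in zip(weights, maps))
            for clip in shared
        }
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Toy confidence maps, e.g. one per feature sub-model (values are made up).
map_a = {(0, 10): 0.9, (5, 20): 0.4, (15, 30): 0.7}
map_b = {(0, 10): 0.5, (5, 20): 0.8, (15, 30): 0.6}

fused_top = top_n_proposals([map_a, map_b], weights=[0.5, 0.5], n=2)
single_top = top_n_proposals([map_a], weights=[1.0], n=1)
```

With equal weights, the clip `(5, 20)` ranks high in `map_b` alone but drops out of the top two after fusion, which shows how fusion changes the proposal set compared with any single map.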
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Reinforcement learning · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title