Method for processing audio and video information, electronic device and storage medium
US-2022148313-A1 · May 12, 2022 · US
US12112539B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12112539-B2 |
| Application number | US-202117450158-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 6, 2021 |
| Priority date | Nov 27, 2020 |
| Publication date | Oct 8, 2024 |
| Grant date | Oct 8, 2024 |
Official abstract text for this publication.
A video processing method, an electronic device and a storage medium are provided, which relate to the field of artificial intelligence, and in particular to the fields of deep learning, model training, knowledge mapping, video processing and the like. The method includes: acquiring a plurality of first video frames, and performing fine-grained splitting on the plurality of first video frames to obtain a plurality of second video frames; performing feature encoding on the plurality of second video frames according to multi-mode information related to the plurality of second video frames, to obtain feature fusion information for characterizing fusion of the multi-mode information; and performing similarity matching on the plurality of second video frames according to the feature fusion information, and obtaining a target video according to a result of the similarity matching.
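The splitting step in the abstract can be illustrated with a minimal sketch. The abstract does not say how the split is computed, so everything below is an assumption made for illustration: each frame is represented by a normalized color histogram, and a new segment starts whenever the L1 distance between consecutive histograms exceeds a threshold. The names `histogram_distance`, `split_on_color_change` and the default `threshold` are hypothetical and do not come from the patent.

```python
from typing import List, Sequence


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two normalized color histograms (assumed frame descriptor)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def split_on_color_change(
    histograms: List[Sequence[float]], threshold: float = 0.5
) -> List[List[int]]:
    """Group consecutive frame indices into segments.

    A new segment starts when the histogram distance to the previous
    frame exceeds `threshold` (a hypothetical stand-in for a shot or
    color change parameter).
    """
    if not histograms:
        return []
    segments = [[0]]
    for i in range(1, len(histograms)):
        if histogram_distance(histograms[i - 1], histograms[i]) > threshold:
            segments.append([i])      # abrupt color change: open a new segment
        else:
            segments[-1].append(i)    # similar colors: extend the current segment
    return segments


if __name__ == "__main__":
    # Two reddish frames followed by two bluish frames (toy 2-bin histograms).
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(split_on_color_change(frames))
```

In this sketch each resulting segment plays the role of one of the "second video frames"; a real system would likely use a learned or more robust shot-boundary detector.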
Opening claim text (preview).
What is claimed is:
1. A video processing method, comprising: acquiring a plurality of first video frames, and performing fine-grained splitting on the plurality of first video frames to obtain a plurality of second video frames; performing feature encoding on the plurality of second video frames according to multi-mode information related to the plurality of second video frames, to obtain feature fusion information for characterizing fusion of the multi-mode information; performing similarity matching on the plurality of second video frames according to the feature fusion information, and obtaining a target video according to a result of the similarity matching; identifying the multi-mode information from the plurality of second video frames according to a pre-trained first neural network model, wherein the identifying the multi-mode information from the plurality of second video frames according to the pre-trained first neural network model comprises: identifying knowledge map information according to a knowledge map extractor in the first neural network model; identifying text information according to a text extractor in the first neural network model; identifying audio information according to an audio extractor in the first neural network model; identifying hue information according to a hue extractor in the first neural network model; identifying object information according to an object extractor in the first neural network model; identifying action information according to an action extractor in the first neural network model; and wherein the multi-mode information comprises: at least one of the knowledge map information, the text information, the audio information, the hue information, the object information, and the action information; distinguishing respective types of information in the multi-mode information according to a second neural network model; identifying time sequence information related to the multi-mode information according to a third neural network model; and fusing output results of the first neural network model, the second neural network model, and the third neural network model to obtain the feature fusion information.
2. The method of claim 1, wherein the acquiring the plurality of first video frames, and performing the fine-grained splitting on the plurality of first video frames to obtain the plurality of second video frames comprises: performing the fine-grained splitting on the plurality of first video frames according to a parameter for characterizing shot and color transformation to obtain the plurality of second video frames.
3. The method of claim 1, wherein the performing the feature encoding on the plurality of second video frames according to the multi-mode information related to the plurality of second video frames, to obtain the feature fusion information for characterizing the fusion of the multi-mode information, comprises: performing feature extraction and feature fusion processing on the plurality of second video frames according to the multi-mode information to obtain the feature fusion information.
4. The method of claim 1, wherein the performing the similarity matching on the plurality of second video frames according to the feature fusion information, and obtaining the target video according to the result of the similarity matching, comprises: scoring similarities of the plurality of second video frames according to the feature fusion information, and taking a result of the scoring as the result of the similarity matching; and in a case that the result of the similarity matching is that adjacent video frames for a same event content are similar, performing video merging on the adjacent video frames until completing merging of the plurality of second video frames according to the adjacent video frames, respectively, and obtaining the target video according to a result of the video merging.
5. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein, the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform operations of: acquiring a plurality of first video frames, and performing fine-grained splitting on the plurality of first video frames to obtain a plurality of second video frames; performing feature encoding on the plurality of second video frames according to multi-mode information related to the plurality of second video frames, to obtain feature fusion information for characterizing fusion of the multi-mode information; performing similarity matching on the plurality of second video frames according to the feature fusion information, and obtaining a target video according to a result of the similarity matching; identifying the multi-mode information from the plurality of second video frames according to a pre-trained first neural network model, wherein when the instructions are executed by the at least one processor to enable the at least one processor to identify the multi-mode information from the plurality of second video frames according to a pre-trained first neural network model, the instructions are executed by the at least one processor to enable the at least one processor to specifically perform operations of: identifying knowledge map information according to a knowledge map extractor in the first neural network model; identifying text information according to a text extractor in the first neural network model; identifying audio information according to an audio extractor in the first neural network model; identifying hue information according to a hue extractor in the first neural network model; identifying object information according to an object extractor in the first neural network model; identifying action information according to an action extractor in the first neural network model; and wherein the multi-mode information comprises: at least one of the knowledge map information, the text information, the audio information, the hue information, the object information, and the action information; distinguishing respective types of information in the multi-mode information according to a second neural network model; identifying time sequence information related to the multi-mode information according to a third neural network model; and fusing output results of the first neural network model, the second neural network model, and the third neural network model to obtain the feature fusion information.
6. The electronic device of claim 5, wherein when the instructions are executed by the at least one processor to enable the at least one processor to acquire the plurality of first video frames, and perform the fine-grained splitting on the plurality of first video frames to obtain the plurality of second video frames, the instructions are executed by the at least one processor to enable the at least one processor to specifically perform an operation of: performing the fine-grained splitting on the plurality of first video frames according to a parameter for characterizing shot and color transformation to obtain the plurality of second video frames.
7. The electronic device of claim 5, wherein when the instructions are executed by the at least one processor to enable the at least one processor to perform the feature encoding on the plurality of second video frames according to the multi-mode information related to the plurality of second video frames, to obtain the feature fusion information for characterizing the fusion of the multi-mode information, the instructions are executed by the at least one processor to enable the at least one processor to speci
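The scoring and merging described in claim 4 can be sketched as follows. The claims do not specify how features are fused or how similarity is scored, so this illustration makes two labeled assumptions: fusion is plain concatenation of per-modality feature vectors (a stand-in for fusing the outputs of the three neural network models in claim 1), and similarity is cosine similarity compared against a fixed threshold. The function names and the default threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def fuse(modalities: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate per-modality vectors in sorted key order (assumed fusion rule)."""
    fused: List[float] = []
    for name in sorted(modalities):
        fused.extend(modalities[name])
    return fused


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_adjacent(
    features: List[Sequence[float]], threshold: float = 0.8
) -> List[List[int]]:
    """Merge neighbouring segments whose fused features are similar.

    Each segment is compared only with its immediate predecessor;
    a score at or above `threshold` (hypothetical value) merges the two.
    """
    if not features:
        return []
    groups = [[0]]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) >= threshold:
            groups[-1].append(i)   # same event content: merge into current group
        else:
            groups.append([i])     # dissimilar: start a new group
    return groups


if __name__ == "__main__":
    segments = [
        fuse({"audio": [1.0, 0.0], "text": [0.0, 1.0]}),
        fuse({"audio": [1.0, 0.0], "text": [0.0, 0.9]}),
        fuse({"audio": [0.0, 1.0], "text": [1.0, 0.0]}),
    ]
    print(merge_adjacent(segments))
```

Each output group lists the indices of adjacent segments that would be merged into one clip; in this sketch the target video corresponds to the sequence of merged groups.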
Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes · CPC title
of extracted features · CPC title
Combinations of networks · CPC title
Fusion techniques · CPC title
Matching criteria, e.g. proximity measures · CPC title