Real scene image editing method based on hierarchically classified text guidance
US-2025005825-A1 · Jan 2, 2025 · US
US12394445B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12394445-B2 |
| Application number | US-202418411880-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 12, 2024 |
| Priority date | Jan 12, 2024 |
| Publication date | Aug 19, 2025 |
| Grant date | Aug 19, 2025 |
Official abstract text for this publication.
The present disclosure describes techniques for generating representations of editing components using a machine learning model. Images and guidance tokens are input into a first sub-model of the machine learning model. The machine learning model is trained to distinguish the editing components from raw materials and generate the representations of the editing components. Tokens corresponding to the images are generated by the first sub-model based on the images and the guidance tokens. The tokens corresponding to the images and the guidance tokens are input into a second sub-model of the machine learning model. The second sub-model comprises a cross-attention mechanism. An embedding indicative of at least one editing component is generated based on the tokens corresponding to the images and the guidance tokens by the second sub-model.
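The abstract's second sub-model can be illustrated with a small sketch of cross-attention, in which a query token first attends over the guidance tokens and the result then attends over the tokens produced for the images. This is a minimal sketch under stated assumptions, not the patented implementation: it uses a single attention head, omits the learned query/key/value projections, residual connections, and stacked layers, and every name and number in it (`cross_attention`, `guidance_tokens`, `image_tokens`, `query_token`, the toy vectors) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query token.

    query: list of floats of length d.
    keys, values: lists of token vectors, each of length d.
    Returns the attention-weighted sum of `values` and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Hypothetical toy data: two guidance tokens and two per-image tokens, d = 2.
guidance_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.9, 0.1], [0.8, 0.2]]
query_token = [1.0, 0.0]

# Stage 1 (assumed): the query attends over guidance tokens as keys and values.
prior, prior_weights = cross_attention(query_token, guidance_tokens, guidance_tokens)

# Stage 2 (assumed): the stage-1 output attends over the image tokens.
embedding, image_weights = cross_attention(prior, image_tokens, image_tokens)
```

Because each attention output is a weighted average of its value tokens, the resulting `embedding` stays inside the range spanned by the image tokens. A real model would add learned projections and repeat both stages over several layers.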
Opening claim text (preview).
What is claimed is:
1. A method of generating representations of editing components using a machine learning model, comprising: inputting images and guidance tokens into a first sub-model of the machine learning model, wherein the images comprise content of raw materials and at least one editing component applied on the raw materials, wherein the guidance tokens provide prior knowledge of possible editing components, and wherein the machine learning model is trained to distinguish the editing components from raw materials and generate the representations of the editing components; generating tokens corresponding to the images by the first sub-model based on the images and the guidance tokens; inputting the tokens corresponding to the images and the guidance tokens into a second sub-model of the machine learning model, wherein the second sub-model comprises a cross-attention mechanism; and generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model.
2. The method of claim 1, further comprising: generating a dataset of editing components, wherein each video in the dataset is rendered by applying each single editing component on both image materials and video materials, wherein the dataset of editing components enables to learn universal representations of different editing components.
3. The method of claim 2, wherein the machine learning model is trained on at least a subset of the dataset of editing components.
4. The method of claim 3, further comprising: guiding a process of training the machine learning model by a contrastive learning loss, wherein the contrastive learning loss is applied to pull positive samples closer while pushing negative samples away in embedding space.
5. The method of claim 2, wherein the dataset of editing components comprises different types of editing components, and wherein the different types of editing components comprise video effect, animation, transition, filter, sticker, and text.
6. The method of claim 1, wherein the first sub-model comprises a spatial encoder, and wherein the method further comprises: dividing each input image into patches by the spatial encoder; generating patch embedding by a linear projection layer of the spatial encoder; generating image tokens by adding positional embedding to each patch embedding inputting the guidance tokens to the spatial encoder; concatenating a class token to the image tokens and the guidance token to aggregate information; and generating an output class token corresponding to each input image by a plurality of transformer layers with multi-head self-attention.
7. The method of claim 6, wherein the first sub-model further comprises a temporal encoder, and wherein the method further comprises: determining a temporal correlation between the input images by the temporal encoder, wherein the temporal encoder comprises a plurality of self-attention transformer blocks.
8. The method of claim 1, wherein the generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model further comprising: adopting the guidance tokens as key-value tokens of a first transformer block of the second sub-model; extracting prior knowledge of editing component embedding by feeding a query token to the first transformer block; feeding a token output from the first transformer block and the tokens corresponding to the images output from the first sub-model into a second transformer block, wherein the second sub-model comprises a plurality of layers of the first and second transformer blocks; and generating the embedding indicative of the at least one editing component by the plurality of layers of the first and second transformer blocks.
9. The method of claim 1, further comprising: building dynamic embedding queues to store recently generated embedding corresponding to the editing components, wherein the embedding queues enable to provide prior knowledge of the editing components.
10. The method of claim 1, further comprising: adopting embedding centers corresponding to different types of editing components as the guidance tokens.
11. A system of generating representations of editing components using a machine learning model, comprising: at least one processor; and at least one memory communicatively coupled to the at least one processor and comprising computer-readable instructions that upon execution by the at least one processor cause the at least one processor to perform operations comprising: inputting images and guidance tokens into a first sub-model of the machine learning model, wherein the images comprise content of raw materials and at least one editing component applied on the raw materials, wherein the guidance tokens provide prior knowledge of possible editing components, and wherein the machine learning model is trained to distinguish the editing components from raw materials and generate the representations of the editing components; generating tokens corresponding to the images by the first sub-model based on the images and the guidance tokens; inputting the tokens corresponding to the images and the guidance tokens into a second sub-model of the machine learning model, wherein the second sub-model comprises a cross-attention mechanism; and generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model.
12. The system of claim 11, the operations further comprising: generating a dataset of editing components, wherein each video in the dataset is rendered by applying each single editing component on both image materials and video materials, wherein the dataset of editing components enables to learn universal representations of different editing components.
13. The system of claim 12, wherein the machine learning model is trained on at least a subset of the dataset of editing components, and wherein the operations further comprise: guiding a process of training the machine learning model by a contrastive learning loss, wherein the contrastive learning loss is applied to pull positive samples closer while pushing negative samples away in embedding space.
14. The system of claim 11, wherein the first sub-model comprises a spatial encoder and a temporal encoder, and wherein the operations further comprise: dividing each input image into patches by the spatial encoder; generating patch embedding by a linear projection layer of the spatial encoder; generating image tokens by adding positional embedding to each patch embedding inputting the guidance tokens to the spatial encoder; concatenating a class token to the image tokens and the guidance token to aggregate information; generating an output class token corresponding to each input image by a plurality of transformer layers with multi-head self-attention; and determining a temporal correlation between the input images by the temporal encoder, wherein the temporal encoder comprises a plurality of self-attention transformer blocks.
15. The system of claim 11, wherein the generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model further comprising: adopting the guidance tokens as key-value tokens of a first transformer block of the second sub-model; extracting prior knowledge of editing component embedding by feeding a query t
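Claims 4 and 13 recite a contrastive learning loss that pulls positive samples closer and pushes negative samples away in embedding space, without naming a specific formula. One common loss with that behavior is the InfoNCE form, sketched below as an assumption rather than as the loss the patent uses. The function names, the cosine similarity measure, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding (assumed formulation).

    The loss is the negative log-probability of picking the positive
    among the positive and all negatives, using temperature-scaled
    cosine similarities as logits.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Hypothetical embeddings: the positive shares the anchor's editing
# component; the negatives stand in for other editing components.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_matched = info_nce(anchor, positive, negatives)
# Swapping a negative into the positive slot should raise the loss.
loss_mismatched = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

In this toy setup `loss_matched` is smaller than `loss_mismatched`, which is the direction training would push: embeddings of the same editing component move together while embeddings of different components move apart.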
Electronic editing of digitised analogue information signals, e.g. audio or video signals · CPC title
using neural networks · CPC title