Generating representations of editing components using a machine learning model

US12394445B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12394445-B2
Application number: US-202418411880-A
Country: US
Kind code: B2
Filing date: Jan 12, 2024
Priority date: Jan 12, 2024
Publication date: Aug 19, 2025
Grant date: Aug 19, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure describes techniques for generating representations of editing components using a machine learning model. Images and guidance tokens are input into a first sub-model of the machine learning model. The machine learning model is trained to distinguish the editing components from raw materials and generate the representations of the editing components. Tokens corresponding to the images are generated by the first sub-model based on the images and the guidance tokens. The tokens corresponding to the images and the guidance tokens are input into a second sub-model of the machine learning model. The second sub-model comprises a cross-attention mechanism. An embedding indicative of at least one editing component is generated based on the tokens corresponding to the images and the guidance tokens by the second sub-model.
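The second stage described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: it assumes single-head scaled dot-product attention with no learned projections, and the names `cross_attention` and `embed_editing_component` are hypothetical. It shows a query token first attending over the guidance tokens, then the result attending over the image tokens produced by the first sub-model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, key_value_tokens):
    """Attend one query vector over a list of tokens.

    Assumption: keys and values are the tokens themselves (identity
    projections); a trained model would use learned weight matrices.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, token)) / math.sqrt(dim)
        for token in key_value_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of the value tokens, one output entry per dimension.
    return [
        sum(w * token[i] for w, token in zip(weights, key_value_tokens))
        for i in range(dim)
    ]


def embed_editing_component(query_token, guidance_tokens, image_tokens):
    """Hypothetical two-step pass: guidance tokens first, image tokens second."""
    prior = cross_attention(query_token, guidance_tokens)
    return cross_attention(prior, image_tokens)


# Toy usage with 2-dimensional tokens (values are illustrative only).
embedding = embed_editing_component(
    query_token=[0.5, 0.5],
    guidance_tokens=[[1.0, 0.0], [0.0, 1.0]],
    image_tokens=[[0.2, 0.8], [0.9, 0.1], [0.4, 0.4]],
)
```

The output has the same dimensionality as the query token, so it can serve as the single embedding for the editing component.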

First claim

Opening claim text (preview).

What is claimed is:

1. A method of generating representations of editing components using a machine learning model, comprising: inputting images and guidance tokens into a first sub-model of the machine learning model, wherein the images comprise content of raw materials and at least one editing component applied on the raw materials, wherein the guidance tokens provide prior knowledge of possible editing components, and wherein the machine learning model is trained to distinguish the editing components from raw materials and generate the representations of the editing components; generating tokens corresponding to the images by the first sub-model based on the images and the guidance tokens; inputting the tokens corresponding to the images and the guidance tokens into a second sub-model of the machine learning model, wherein the second sub-model comprises a cross-attention mechanism; and generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model.

2. The method of claim 1, further comprising: generating a dataset of editing components, wherein each video in the dataset is rendered by applying each single editing component on both image materials and video materials, wherein the dataset of editing components enables to learn universal representations of different editing components.

3. The method of claim 2, wherein the machine learning model is trained on at least a subset of the dataset of editing components.

4. The method of claim 3, further comprising: guiding a process of training the machine learning model by a contrastive learning loss, wherein the contrastive learning loss is applied to pull positive samples closer while pushing negative samples away in embedding space.

5. The method of claim 2, wherein the dataset of editing components comprises different types of editing components, and wherein the different types of editing components comprise video effect, animation, transition, filter, sticker, and text.

6. The method of claim 1, wherein the first sub-model comprises a spatial encoder, and wherein the method further comprises: dividing each input image into patches by the spatial encoder; generating patch embedding by a linear projection layer of the spatial encoder; generating image tokens by adding positional embedding to each patch embedding; inputting the guidance tokens to the spatial encoder; concatenating a class token to the image tokens and the guidance token to aggregate information; and generating an output class token corresponding to each input image by a plurality of transformer layers with multi-head self-attention.

7. The method of claim 6, wherein the first sub-model further comprises a temporal encoder, and wherein the method further comprises: determining a temporal correlation between the input images by the temporal encoder, wherein the temporal encoder comprises a plurality of self-attention transformer blocks.

8. The method of claim 1, wherein the generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model further comprising: adopting the guidance tokens as key-value tokens of a first transformer block of the second sub-model; extracting prior knowledge of editing component embedding by feeding a query token to the first transformer block; feeding a token output from the first transformer block and the tokens corresponding to the images output from the first sub-model into a second transformer block, wherein the second sub-model comprises a plurality of layers of the first and second transformer blocks; and generating the embedding indicative of the at least one editing component by the plurality of layers of the first and second transformer blocks.

9. The method of claim 1, further comprising: building dynamic embedding queues to store recently generated embedding corresponding to the editing components, wherein the embedding queues enable to provide prior knowledge of the editing components.

10. The method of claim 1, further comprising: adopting embedding centers corresponding to different types of editing components as the guidance tokens.

11. A system of generating representations of editing components using a machine learning model, comprising: at least one processor; and at least one memory communicatively coupled to the at least one processor and comprising computer-readable instructions that upon execution by the at least one processor cause the at least one processor to perform operations comprising: inputting images and guidance tokens into a first sub-model of the machine learning model, wherein the images comprise content of raw materials and at least one editing component applied on the raw materials, wherein the guidance tokens provide prior knowledge of possible editing components, and wherein the machine learning model is trained to distinguish the editing components from raw materials and generate the representations of the editing components; generating tokens corresponding to the images by the first sub-model based on the images and the guidance tokens; inputting the tokens corresponding to the images and the guidance tokens into a second sub-model of the machine learning model, wherein the second sub-model comprises a cross-attention mechanism; and generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model.

12. The system of claim 11, the operations further comprising: generating a dataset of editing components, wherein each video in the dataset is rendered by applying each single editing component on both image materials and video materials, wherein the dataset of editing components enables to learn universal representations of different editing components.

13. The system of claim 12, wherein the machine learning model is trained on at least a subset of the dataset of editing components, and wherein the operations further comprise: guiding a process of training the machine learning model by a contrastive learning loss, wherein the contrastive learning loss is applied to pull positive samples closer while pushing negative samples away in embedding space.

14. The system of claim 11, wherein the first sub-model comprises a spatial encoder and a temporal encoder, and wherein the operations further comprise: dividing each input image into patches by the spatial encoder; generating patch embedding by a linear projection layer of the spatial encoder; generating image tokens by adding positional embedding to each patch embedding; inputting the guidance tokens to the spatial encoder; concatenating a class token to the image tokens and the guidance token to aggregate information; generating an output class token corresponding to each input image by a plurality of transformer layers with multi-head self-attention; and determining a temporal correlation between the input images by the temporal encoder, wherein the temporal encoder comprises a plurality of self-attention transformer blocks.

15. The system of claim 11, wherein the generating an embedding indicative of the at least one editing component based on the tokens corresponding to the images and the guidance tokens by the second sub-model further comprising: adopting the guidance tokens as key-value tokens of a first transformer block of the second sub-model; extracting prior knowledge of editing component embedding by feeding a query t
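Claims 4 and 13 refer to a contrastive learning loss that pulls positive samples closer and pushes negative samples away in embedding space, without naming a specific formula. As one possible illustration, the sketch below uses an InfoNCE-style loss over cosine similarities. The choice of InfoNCE, the `temperature` value, and the function names are assumptions made for this example, not details taken from the patent.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax probability of the positive pair.

    The loss shrinks as the anchor moves toward the positive sample and
    away from the negative samples. `temperature` is an assumed value.
    """
    pos_logit = cosine_similarity(anchor, positive) / temperature
    neg_logits = [cosine_similarity(anchor, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - pos_logit


# Toy usage: embeddings of the same editing component act as the positive
# pair; embeddings of other components act as negatives (values illustrative).
anchor = [1.0, 0.0]
well_separated = info_nce_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
poorly_separated = info_nce_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

In this toy setup `well_separated` is smaller than `poorly_separated`, which is the behavior the claims describe: the loss rewards embeddings where the positive sits nearer the anchor than the negatives do.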

Assignees

Lemon Inc, Beijing Zitiao Network Technology Co Ltd

Inventors

Classifications

  • G11B27/031 (Primary)

    Electronic editing of digitised analogue information signals, e.g. audio or video signals · CPC title

  • G06V10/82 (Primary)

    using neural networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12394445B2 cover?
The present disclosure describes techniques for generating representations of editing components using a machine learning model. Images and guidance tokens are input into a first sub-model of the machine learning model. The machine learning model is trained to distinguish the editing components from raw materials and generate the representations of the editing components. Tokens corresponding t…
Who is the assignee on this patent?
Lemon Inc, Beijing Zitiao Network Technology Co Ltd
What technology area does this patent fall under?
Primary CPC classification G11B27/031. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 19, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).