US12494004B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12494004-B2 |
| Application number | US-202318350876-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 12, 2023 |
| Priority date | Mar 8, 2023 |
| Publication date | Dec 9, 2025 |
| Grant date | Dec 9, 2025 |
Official abstract text for this publication.
Embodiments described herein provide a feedback based instructional image editing framework that employs a diffusion process to follow user instruction for image editing. A diffusion model is fine-tuned using a reward model, which may be trained via human annotation. The training of the reward model may be done by having the image editing model output a number of images, which a human annotator ranks based on their alignment with the original image and a given instruction.
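The abstract says the reward model may be trained from human rankings of candidate edits, but it does not state the training objective. As an illustration only, the sketch below assumes a pairwise logistic (Bradley-Terry style) ranking loss, which is a common choice for learning from rankings and is not confirmed by the source. The function names `ranking_to_pairs` and `pairwise_ranking_loss` and the `scores` dictionary are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_ids):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of every pair
    is the candidate the annotator ranked higher.
    """
    return list(combinations(ranked_ids, 2))


def pairwise_ranking_loss(scores, ranked_ids):
    """Mean of -log(sigmoid(s_preferred - s_rejected)) over all pairs.

    `scores` maps a candidate id to the reward model's scalar output.
    ASSUMPTION: this loss is a stand-in; the source does not name one.
    """
    pairs = ranking_to_pairs(ranked_ids)
    total = 0.0
    for better, worse in pairs:
        margin = scores[better] - scores[worse]
        # log(1 + exp(-margin)) is identical to -log(sigmoid(margin))
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


# Annotator ranked candidate "a" best, then "b", then "c".
ranking = ["a", "b", "c"]
agreeing = {"a": 2.0, "b": 1.0, "c": 0.0}      # scores match the ranking
disagreeing = {"a": 0.0, "b": 1.0, "c": 2.0}   # scores invert the ranking
print(pairwise_ranking_loss(agreeing, ranking) < pairwise_ranking_loss(disagreeing, ranking))
```

Under this assumed objective, the loss falls as the reward model's scores come to agree with the annotator's ordering, which is the behavior the abstract describes at a high level.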
Opening claim text (preview).
What is claimed is:
1. A method of training a neural network based instructional image editing model, the method comprising: receiving, via a data interface, a training dataset comprising an input image, an editing instruction, and an edited image; generating a noisy latent image representation of the edited image by gradually adding noise to a latent representation of the edited image; generating, by the neural network based instructional image editing model, an estimated noise from the noisy latent image representation based on the input image and the editing instruction; computing, by a neural network based reward model, a reward score indicative of an alignment level between the edited image and the input image according to the editing instruction; computing a loss objective based on the added noise, the estimated noise, and the reward score; and training the neural network based instructional image editing model based on the computed loss objective via backpropagation.
2. The method of claim 1, further comprising: receiving, via the data interface, a second training dataset comprising a second input image, and a second editing instruction; generating, by the neural network based instructional image editing model, a plurality of candidate edited images based on the second input image and the second editing instruction; displaying the plurality of candidate edited images on a display; receiving an indication of a quality associated with the plurality of candidate edited images; and training the neural network based reward model based on the indication.
3. The method of claim 2, wherein the indication comprises a ranking of the plurality of candidate edited images.
4. The method of claim 1, wherein the neural network based instructional image editing model comprises a series of neural network based denoising models, wherein each neural network based denoising model generates a respective estimated noise from an input image representations, and wherein the estimated noise from the noisy latent image representation is one of the respective estimated noise.
5. The method of claim 1, wherein the computing the loss objective comprises weighting the loss objective based on the reward score.
6. The method of claim 1, further comprising: modifying the editing instruction based on the reward score.
7. The method of claim 6, wherein the modifying comprises appending text to the editing instruction including a value based on the reward score.
8. The method of claim 1, further comprising: scaling and rounding the reward score to an integer over a predetermined range of values.
9. The method of claim 1, wherein the generating the noisy latent image representation of the edited image comprises: encoding, via the encoder, the edited image into a latent representation of the edited image; and adding a generated noise to the latent representation of the edited image.
10. A system for training a neural network based instructional image editing model, the system comprising: a memory that stores the neural network based instructional image editing model and a plurality of processor-executable instructions; a communication interface that receives a training dataset comprising an input image, an editing instruction, and an edited image; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating a noisy latent image representation of the edited image by gradually adding noise to a latent representation of the edited image; generating, by the neural network based instructional image editing model, an estimated noise from the noisy latent image representation based on the input image and the editing instruction; computing, by a neural network based reward model, a reward score indicative of an alignment level between the edited image and the input image according to the editing instruction; computing a loss objective based on the added noise, the estimated noise, and the reward score; and training the neural network based instructional image editing model based on the computed loss objective via backpropagation.
11. The system of claim 10, the operations further comprising: receiving, via the communication interface, a second training dataset comprising a second input image, and a second editing instruction; generating, by the neural network based instructional image editing model, a plurality of candidate edited images based on the second input image and the second editing instruction; displaying the plurality of candidate edited images on a display; receiving an indication of a quality associated with the plurality of candidate edited images; and training the neural network based reward model based on the indication.
12. The system of claim 11, wherein the indication comprises a ranking of the plurality of candidate edited images.
13. The system of claim 10, wherein the neural network based instructional image editing model comprises a series of neural network based denoising models, wherein each neural network based denoising model generates a respective estimated noise from an input image representations, and wherein the estimated noise from the noisy latent image representation is one of the respective estimated noise.
14. The system of claim 10, wherein the computing the loss objective comprises weighting the loss objective based on the reward score.
15. The system of claim 10, the operations further comprising: modifying the editing instruction based on the reward score.
16. The system of claim 15, wherein the modifying comprises appending text to the editing instruction including a value based on the reward score.
17. The system of claim 10, the operations further comprising: scaling and rounding the reward score to an integer over a predetermined range of values.
18. The system of claim 10, wherein the generating the noisy latent image representation of the edited image comprises: encoding, via the encoder, the edited image into a latent representation of the edited image; and adding a generated noise to the latent representation of the edited image.
19. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a training dataset comprising an input image, an editing instruction, and an edited image; generating a noisy latent image representation of the edited image by gradually adding noise to a latent representation of the edited image; generating, by a neural network based instructional image editing model, an estimated noise from the noisy latent image representation based on the input image and the editing instruction; computing, by a neural network based reward model, a reward score indicative of an alignment level between the edited image and the input image according to the editing instruction; computing a loss objective based on the added noise, the estimated noise, and the reward score; and training the neural network based instructional image editing model based on the computed loss objective via backpropagation.
20. The non-transitory machine-readable medium of claim 19, the operations further comprising: receiving, via the data interface, a second training dataset
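The claim language on noising, reward-weighted loss, and reward-conditioned instructions can be made concrete with a toy sketch. Everything below is an assumption for illustration, not the patented implementation: latents are plain Python lists rather than tensors, the noising step uses the standard closed-form diffusion expression with a hypothetical `alpha_bar` parameter, the weighting is a simple product of reward and mean squared error, and the appended sentence and the 0 to 5 range are invented examples of "a value based on the reward score".

```python
import math
import random


def add_noise(latent, alpha_bar, rng):
    """Noise a latent: z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps.

    ASSUMPTION: closed-form noising as in standard diffusion training.
    Returns the noisy latent and the noise that was added.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
             for z, e in zip(latent, noise)]
    return noisy, noise


def reward_weighted_loss(added_noise, estimated_noise, reward):
    """Mean squared error between added and estimated noise, scaled by reward.

    ASSUMPTION: a plain product is one way to weight a loss by a reward.
    """
    mse = sum((a - b) ** 2 for a, b in zip(added_noise, estimated_noise))
    return reward * mse / len(added_noise)


def quantize_reward(reward, levels=5):
    """Clip a reward to [0, 1], scale it, and round half up to 0..levels."""
    clipped = min(max(reward, 0.0), 1.0)
    return int(clipped * levels + 0.5)


def condition_instruction(instruction, reward, levels=5):
    """Append the quantized reward to the editing instruction as text.

    The wording of the appended sentence is hypothetical.
    """
    score = quantize_reward(reward, levels)
    return f"{instruction} The image quality is {score} out of {levels}."


rng = random.Random(0)
noisy, eps = add_noise([0.3, -1.2, 0.8], alpha_bar=0.7, rng=rng)
estimated = [0.0 for _ in eps]  # stand-in for the editing model's prediction
print(reward_weighted_loss(eps, estimated, reward=0.82))
print(condition_instruction("make the sky purple.", 0.82))
```

In a real training loop the estimated noise would come from the image editing model conditioned on the input image and the instruction, and the resulting loss would be backpropagated through that model. The sketch only shows how the three quantities named in the independent claims (added noise, estimated noise, reward score) could combine into one scalar.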
Denoising; Smoothing · CPC title
Training; Learning · CPC title
Artificial neural networks [ANN] · CPC title
using machine learning, e.g. neural networks · CPC title
Creating or editing images; Combining images with text · CPC title