Predicting video edits from text-based conversations using neural networks

US12238451B2 · US · B2

Patent metadata
Publication number: US-12238451-B2
Application number: US-202218055301-A
Country: US
Kind code: B2
Filing date: Nov 14, 2022
Priority date: Nov 14, 2022
Publication date: Feb 25, 2025
Grant date: Feb 25, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are disclosed for predicting, using neural networks, editing operations for application to a video sequence based on processing conversational messages by a video editing system. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving an input including a video sequence and text sentences, the text sentences describing a modification to the video sequence, mapping, by a first neural network, content of the text sentences describing the modification to the video sequence to a candidate editing operation, processing, by a second neural network, the video sequence to predict parameter values for the candidate editing operation, and generating a modified video sequence by applying the candidate editing operation with the predicted parameter values to the video sequence.
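The two-stage flow in the abstract (text selects the operation, video content supplies the parameter values) can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in rather than the patented implementation: `map_text_to_operation` uses a keyword lookup in place of the first neural network, `predict_parameters` uses a simple mean-intensity heuristic in place of the second, and the frame representation, the `gain` parameter, and the 0.5 target intensity are assumptions chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List

Frame = List[List[float]]  # assumed: grayscale frame, pixel intensities in [0, 1]
Video = List[Frame]


@dataclass
class EditResult:
    operation: str
    parameters: Dict[str, float]
    video: Video


def map_text_to_operation(sentences: List[str]) -> str:
    """Hypothetical stand-in for the first neural network (keyword lookup)."""
    text = " ".join(sentences).lower()
    if "bright" in text or "dark" in text:
        return "brightness"
    raise ValueError("no supported editing operation found in the text")


def predict_parameters(video: Video, operation: str) -> Dict[str, float]:
    """Hypothetical stand-in for the second neural network.

    The parameter value is derived from the video content, not from the text:
    here, a gain that moves the mean intensity toward an assumed target of 0.5.
    """
    if operation != "brightness":
        raise ValueError(f"unsupported operation: {operation}")
    pixels = [p for frame in video for row in frame for p in row]
    mean_intensity = sum(pixels) / len(pixels)
    return {"gain": 0.5 / mean_intensity if mean_intensity > 0 else 1.0}


def apply_operation(video: Video, operation: str, params: Dict[str, float]) -> Video:
    """Apply the chosen operation with the predicted parameter values."""
    if operation != "brightness":
        raise ValueError(f"unsupported operation: {operation}")
    gain = params["gain"]
    return [[[min(1.0, p * gain) for p in row] for row in frame] for frame in video]


def edit_video(video: Video, sentences: List[str]) -> EditResult:
    operation = map_text_to_operation(sentences)       # stage 1: text -> operation
    params = predict_parameters(video, operation)      # stage 2: video -> parameters
    return EditResult(operation, params, apply_operation(video, operation, params))
```

A call such as `edit_video(video, ["Could you make this a bit brighter?"])` returns the selected operation, the predicted parameters, and the modified frames together, which keeps the two prediction stages separately inspectable.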

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method comprising: receiving an input including a video sequence and text sentences, the text sentences describing a modification to the video sequence; mapping, by a first neural network, content of the text sentences describing the modification to the video sequence to a candidate video editing operation; processing, by a second neural network, the video sequence to predict parameter values for the candidate video editing operation; and generating a modified video sequence by applying the candidate video editing operation with the predicted parameter values to the video sequence.

2. The computer-implemented method of claim 1, wherein mapping the content of the text sentences describing the modification to the video sequence to the candidate video editing operation comprises: mapping the content of the text sentences to a reference sentence; and identifying a video editing operation associated with the reference sentence as the candidate video editing operation.

3. The computer-implemented method of claim 2, wherein mapping the content of the text sentences to the reference sentence comprises: generating, by a sentence transformer, sentence features for the text sentences; calculating cosine similarity values between the sentence features for the text sentences and reference sentence features for reference sentences, wherein each reference sentence of the reference sentences is associated with a video editing operation; and identifying the reference sentence having a highest calculated cosine similarity with the sentence features for the text sentences.

4. The computer-implemented method of claim 1, wherein processing the video sequence to predict the parameter values for the candidate video editing operation comprises: for each frame of the video sequence: generating an RGB feature vector and an optical flow feature vector, concatenating the RGB feature vector and the optical flow feature vector to create a concatenated feature vector, and passing the concatenated feature vector through an editing parameters prediction network to predict the parameter values for the candidate video editing operation.

5. The computer-implemented method of claim 4, wherein the predicted parameter values for the candidate video editing operation include mean parameter values and standard deviation parameter values.

6. The computer-implemented method of claim 1, further comprising: receiving a second input including second text sentences, the second text sentences describing a second modification to trim the video sequence by a first amount of time; detecting shot boundaries within the video sequence; determining that an end time of a shot boundary is within the first amount of time; and trimming a second amount of time from the video sequence starting at the end time of the shot boundary, wherein the second amount of time is different from the first amount of time.

7. The computer-implemented method of claim 1, wherein generating the modified video sequence by applying the candidate video editing operation with the predicted parameter values for the candidate video editing operation to the video sequence comprises: adjusting a brightness parameter in response to mapping the text sentences to a brightness editing operation.

8. The computer-implemented method of claim 1, further comprising: receiving a second input including a second video sequence; processing, by the second neural network, the second video sequence to predict parameter values for one or more video editing operations; and generating a modified second video sequence by applying the one or more video editing operations with the predicted parameter values to the second video sequence.

9. A non-transitory computer-readable storage medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving an input including a video sequence and text sentences, the text sentences describing a modification to the video sequence; mapping, by a first neural network, content of the text sentences describing the modification to the video sequence to a candidate video editing operation; processing, by a second neural network, the video sequence to predict parameter values for the candidate video editing operation; and generating a modified video sequence by applying the candidate video editing operation with the predicted parameter values to the video sequence.

10. The non-transitory computer-readable storage medium of claim 9, wherein to map the content of the text sentences describing the modification to the video sequence to the candidate video editing operation the instructions further cause the processing device to perform operations comprising: mapping the content of the text sentences to a reference sentence; and identifying a video editing operation associated with the reference sentence as the candidate video editing operation.

11. The non-transitory computer-readable storage medium of claim 10, wherein to map the content of the text sentences to the reference sentence the instructions further cause the processing device to perform operations comprising: generating, by a sentence transformer, sentence features for the text sentences; calculating cosine similarity values between the sentence features for the text sentences and reference sentence features for reference sentences, wherein each reference sentence of the reference sentences is associated with a video editing operation; and identifying the reference sentence having a highest calculated cosine similarity with the sentence features for the text sentences.

12. The non-transitory computer-readable storage medium of claim 9, wherein to process the video sequence to predict the parameter values for the candidate video editing operation the instructions further cause the processing device to perform operations comprising: for each frame of the video sequence: generating an RGB feature vector and an optical flow feature vector, concatenating the RGB feature vector and the optical flow feature vector to create a concatenated feature vector, and passing the concatenated feature vector through an editing parameters prediction network to predict the parameter values for the candidate video editing operation.

13. The non-transitory computer-readable storage medium of claim 12, wherein the predicted parameter values for the candidate video editing operation include mean parameter values and standard deviation parameter values.

14. The non-transitory computer-readable storage medium of claim 9, wherein the instructions further cause the processing device to perform operations comprising: receiving a second input including second text sentences, the second text sentences describing a second modification to trim the video sequence by a first amount of time; detecting shot boundaries within the video sequence; determining that an end time of a shot boundary is within the first amount of time; and trimming a second amount of time from the video sequence starting at the end time of the shot boundary, wherein the second amount of time is different from the first amount of time.

15. The non-transitory computer-readable storage medium of claim 9, wherein to generate the modified video sequence by applying the candidate video editing operation with the predicted parameter values for the candidate video editing operation to the video sequence the instructions further cause the processing device to perform operations comprising: adjusting a brightness
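Claims 3 and 11 describe selecting an editing operation by comparing sentence features against reference sentence features with cosine similarity and taking the highest-scoring reference. The following is a minimal sketch of that selection step only. The reference sentences, the operation names, and the `embed` function are all assumptions for illustration: `embed` is a toy bag-of-words counter standing in for the sentence transformer named in the claims, so its scores reflect word overlap rather than learned semantics.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical reference sentences, each associated with one editing operation.
REFERENCE_SENTENCES: Dict[str, str] = {
    "make the video brighter": "brightness",
    "increase the contrast of the video": "contrast",
    "trim the start of the video": "trim",
}


def embed(sentence: str) -> Counter:
    """Toy bag-of-words features (stand-in for a sentence transformer)."""
    return Counter(sentence.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_to_operation(text_sentences: List[str]) -> Tuple[str, str, float]:
    """Return (best reference sentence, its operation, similarity score)."""
    query = embed(" ".join(text_sentences))
    scores = {
        ref: cosine_similarity(query, embed(ref)) for ref in REFERENCE_SENTENCES
    }
    best = max(scores, key=scores.get)  # highest calculated cosine similarity
    return best, REFERENCE_SENTENCES[best], scores[best]
```

For example, `map_to_operation(["please trim the start"])` selects the trim reference sentence because it shares the most weighted tokens with the query. With real sentence-transformer features, the same `max` over similarity scores would apply unchanged.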

Assignees

Adobe Inc

Classifications

  • Creating or editing images; Combining images with text · CPC title

  • Learning methods · CPC title

  • Electronic editing of digitised analogue information signals, e.g. audio or video signals · CPC title

  • H04N7/002 · Primary

    Special television systems not provided for by H04N7/007 - H04N7/18 (still pictures via a television channel H04N1/00098) · CPC title

  • G06N3/045 · Primary

    Combinations of networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12238451B2 cover?
Embodiments are disclosed for predicting, using neural networks, editing operations for application to a video sequence based on processing conversational messages by a video editing system. In particular, in one or more embodiments, the disclosed systems and methods comprise receiving an input including a video sequence and text sentences, the text sentences describing a modification to the vi…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification H04N7/002. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Feb 25, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).