Text-based framework for video object selection

US12266181B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12266181-B2
Application number  US-202117531568-A
Country             US
Kind code           B2
Filing date         Nov 19, 2021
Priority date       Nov 19, 2021
Publication date    Apr 1, 2025
Grant date          Apr 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are disclosed for receiving a user input and an input video comprising multiple frames. The method may include extracting a text feature from the user input. The method may further include extracting a plurality of image features from the frames. The method may further include identifying one or more keyframes from the frames that include the object. The method may further include clustering one or more groups of the one or more keyframes. The method may further include generating a plurality of segmentation masks for each group. The method may further include determining a set of reference masks corresponding to the user input and the object. The method may further include generating a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks. The method may further include propagating the set of fusion masks and outputting a final set of masks.
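The abstract says the fusion masks are generated "by combining the plurality of segmentation masks and the set of reference masks" but does not say how. The sketch below shows one possible combination rule, purely as an assumption: for each reference mask, keep the candidate segmentation mask that overlaps it most (by intersection over union), and fall back to the reference mask when no candidate overlaps enough. The names `iou`, `fuse_mask`, and `min_iou` are hypothetical and do not come from the patent.

```python
# Illustrative only: the patent text on this page does not define the fusion rule.
# Masks are plain nested lists of 0/1 values with identical shapes.

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def fuse_mask(reference_mask, segmentation_masks, min_iou=0.5):
    """Assumed rule: pick the segmentation mask that best matches the reference.

    Returns the best-overlapping candidate when its IoU reaches min_iou,
    otherwise returns the reference mask unchanged.
    """
    best = max(segmentation_masks, key=lambda m: iou(reference_mask, m), default=None)
    if best is not None and iou(reference_mask, best) >= min_iou:
        return best
    return reference_mask


# Toy 2x3 example: one candidate misses the object, one mostly covers it.
reference = [[0, 1, 1],
             [0, 1, 1]]
candidate_miss = [[1, 0, 0],
                  [1, 0, 0]]
candidate_hit = [[0, 1, 1],
                 [0, 1, 0]]

fused = fuse_mask(reference, [candidate_miss, candidate_hit])
print(fused)
```

In this toy case the second candidate shares three of four object pixels with the reference, so it is selected. A real system could equally use a union, a weighted blend, or a learned module. The selection rule here is only one way to read "combining".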

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method comprising: receiving a user input and an input video comprising a plurality of frames; generating a plurality of segmentation masks for the plurality of frames; determining a set of reference masks corresponding to the user input and an object; generating a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks; propagating the set of fusion masks between the plurality of segmentation masks; and outputting a final set of masks for the input video.

2. The method of claim 1, wherein determining the set of reference masks corresponding to the user input and an object further comprises: extracting, using a first machine learning model, a text feature from the user input, wherein the text feature corresponds to the object; extracting, using a second machine learning model, a plurality of image features from the plurality of frames, wherein a selected image feature corresponds to the object; identifying one or more keyframes from the plurality of frames that include the object; and clustering one or more groups of the one or more keyframes that are within a threshold proximity to each other.

3. The method of claim 2, further comprising: ranking the one or more groups based on a size of each group, wherein the size of each group is a number of frames in each group.

4. The method of claim 2, wherein identifying the one or more keyframes from the plurality of frames that include the object comprises computing a similarity score by a trained neural network, and wherein the trained neural network is trained on a plurality of images and text inputs.

5. The method of claim 4, wherein determining the set of reference masks corresponding to the user input and the object comprises: concatenating the selected image feature and the text feature to form a concatenated feature; performing a cross-modal encoding of the concatenated feature; and decoding, by a feature pyramid network, the concatenated feature to form an object mask.

6. The method of claim 2, wherein extracting, using the first machine learning model, the text feature from the user input comprises: parsing the user input into parts of speech; filtering the parts of speech to form an object vocabulary comprising nouns and pronouns; and localizing the plurality of image features using the object vocabulary.

7. The method of claim 1 further comprising presenting the final set of masks and the input video to a user via a graphical user interface.

8. A non-transitory computer-readable storage medium including instructions stored thereon which, when executed by at least one processor, cause the at least one processor to: receive a user input and an input video comprising a plurality of frames; generate a plurality of segmentation masks for the plurality of frames; determine a set of reference masks corresponding to the user input and an object; generate a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks; propagate the set of fusion masks between the plurality of segmentation masks; and output a final set of masks for the input video.

9. The non-transitory computer-readable storage medium of claim 8, wherein the instructions to determine a set of reference masks corresponding to the user input and an object comprise instructions which, when executed by at least one processor, cause the at least one processor to: extract, using a first machine learning model, a text feature from the user input, wherein the text feature corresponds to an object; extract, using a second machine learning model, a plurality of image features from the plurality of frames, wherein a selected image feature corresponds to the object; identify one or more keyframes from the plurality of frames that include the object; and cluster one or more groups of the one or more keyframes that are within a threshold proximity to each other.

10. The non-transitory computer-readable storage medium of claim 9, the instructions further causing the processor to rank the one or more groups based on a size of each group, wherein the size of each group is a number of frames in each group.

11. The non-transitory computer-readable storage medium of claim 9, the instructions further causing the processor to identify the one or more keyframes from the plurality of frames that include the object by computing a similarity score by a trained neural network, and wherein the trained neural network is trained on a plurality of images and text inputs.

12. The non-transitory computer-readable storage medium of claim 11, wherein the instructions to determine the set of reference masks corresponding to the user input and the object comprise instructions which, when executed by at least one processor, cause the at least one processor to: concatenate the selected image feature and the text feature to form a concatenated feature; perform a cross-modal encoding of the concatenated feature; and decode, by a feature pyramid network, the concatenated feature to form an object mask.

13. The non-transitory computer-readable storage medium of claim 9, wherein the instructions to extract, using the first machine learning model, the text feature from the user input comprise instructions which, when executed by at least one processor, cause the at least one processor to: parsing the user input into parts of speech; filtering the parts of speech to form an object vocabulary comprising nouns and pronouns; and localizing the plurality of image features using the object vocabulary.

14. The non-transitory computer-readable storage medium of claim 8, the instructions further causing the processor to present the final set of masks and the input video to a user via a graphical user interface.

15. A system comprising: a processor; and a memory including instructions which, when executed by the processor, cause the system to: receive a user input and an input video comprising a plurality of frames; generate a plurality of segmentation masks for the plurality of frames; determine a set of reference masks corresponding to the user input and an object; generate a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks; propagate the set of fusion masks between the plurality of segmentation masks; and output a final set of masks for the input video.

16. The system of claim 15, wherein the instructions which, when executed by the processor, cause the system to determine a set of reference masks corresponding to the user input and an object, further cause the processor to: extract, using a first machine learning model, a text feature from the user input, wherein the text feature corresponds to an object; extract, using a second machine learning model, a plurality of image features from the plurality of frames, wherein a selected image feature corresponds to the object; identify one or more keyframes from the plurality of frames that include the object; and cluster one or more groups of the one or more keyframes that are within a threshold proximity to each other.

17. The system of claim 16, the instructions further causing the processor to rank the one or more groups based on a size of each group, wherein the size of each group is a number of frames in each group.

18. The system of claim 16, the instructions further causing the processor to identify the one or more keyframes from the plurality of frames that include the object by computing a

Assignees

Adobe Inc

Inventors

Classifications

  • of input or preprocessed data · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Clustering techniques · CPC title

  • Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title

  • Parsing · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12266181B2 cover?
Embodiments are disclosed for receiving a user input and an input video comprising multiple frames. The method may include extracting a text feature from the user input. The method may further include extracting a plurality of image features from the frames. The method may further include identifying one or more keyframes from the frames that include the object. The method may further include c…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06V20/49. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).