Computer-implemented systems and methods for intelligent image analysis using spatio-temporal information
US-2024020835-A1 · Jan 18, 2024 · US
US12266181B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12266181-B2 |
| Application number | US-202117531568-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 19, 2021 |
| Priority date | Nov 19, 2021 |
| Publication date | Apr 1, 2025 |
| Grant date | Apr 1, 2025 |
Official abstract text for this publication.
Embodiments are disclosed for receiving a user input and an input video comprising multiple frames. The method may include extracting a text feature from the user input. The method may further include extracting a plurality of image features from the frames. The method may further include identifying one or more keyframes from the frames that include the object. The method may further include clustering one or more groups of the one or more keyframes. The method may further include generating a plurality of segmentation masks for each group. The method may further include determining a set of reference masks corresponding to the user input and the object. The method may further include generating a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks. The method may further include propagating the set of fusion masks and outputting a final set of masks.
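The keyframe grouping step in the abstract (and the size-based ranking recited in claims 3, 10, and 17) can be illustrated with a minimal sketch. This excerpt does not define "threshold proximity", so the code below assumes it means distance in frame index. The names `cluster_keyframes`, `rank_groups`, and `max_gap` are hypothetical and not taken from the patent.

```python
from typing import List


def cluster_keyframes(keyframes: List[int], max_gap: int) -> List[List[int]]:
    """Group keyframe indices whose neighbours are at most max_gap frames apart.

    Assumption: "threshold proximity" is read as temporal distance between
    frame indices; the patent excerpt does not spell this out.
    """
    groups: List[List[int]] = []
    for idx in sorted(set(keyframes)):
        if groups and idx - groups[-1][-1] <= max_gap:
            # Close enough to the previous keyframe: extend the current group.
            groups[-1].append(idx)
        else:
            # Too far from the previous keyframe (or first one): start a group.
            groups.append([idx])
    return groups


def rank_groups(groups: List[List[int]]) -> List[List[int]]:
    """Order groups by size (number of frames), largest first."""
    return sorted(groups, key=len, reverse=True)


if __name__ == "__main__":
    # Hypothetical keyframe indices where the queried object was detected.
    detected = [40, 3, 4, 5, 42, 90]
    print(rank_groups(cluster_keyframes(detected, max_gap=2)))
```

A single pass over sorted indices is enough here because proximity is one-dimensional. A feature-space notion of proximity would need a different clustering method.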
Opening claim text (preview).
We claim:

1. A computer-implemented method comprising: receiving a user input and an input video comprising a plurality of frames; generating a plurality of segmentation masks for the plurality of frames; determining a set of reference masks corresponding to the user input and an object; generating a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks; propagating the set of fusion masks between the plurality of segmentation masks; and outputting a final set of masks for the input video.

2. The method of claim 1, wherein determining the set of reference masks corresponding to the user input and an object further comprises: extracting, using a first machine learning model, a text feature from the user input, wherein the text feature corresponds to the object; extracting, using a second machine learning model, a plurality of image features from the plurality of frames, wherein a selected image feature corresponds to the object; identifying one or more keyframes from the plurality of frames that include the object; and clustering one or more groups of the one or more keyframes that are within a threshold proximity to each other.

3. The method of claim 2, further comprising: ranking the one or more groups based on a size of each group, wherein the size of each group is a number of frames in each group.

4. The method of claim 2, wherein identifying the one or more keyframes from the plurality of frames that include the object comprises computing a similarity score by a trained neural network, and wherein the trained neural network is trained on a plurality of images and text inputs.

5. The method of claim 4, wherein determining the set of reference masks corresponding to the user input and the object comprises: concatenating the selected image feature and the text feature to form a concatenated feature; performing a cross-modal encoding of the concatenated feature; and decoding, by a feature pyramid network, the concatenated feature to form an object mask.

6. The method of claim 2, wherein extracting, using the first machine learning model, the text feature from the user input comprises: parsing the user input into parts of speech; filtering the parts of speech to form an object vocabulary comprising nouns and pronouns; and localizing the plurality of image features using the object vocabulary.

7. The method of claim 1 further comprising presenting the final set of masks and the input video to a user via a graphical user interface.

8. A non-transitory computer-readable storage medium including instructions stored thereon which, when executed by at least one processor, cause the at least one processor to: receive a user input and an input video comprising a plurality of frames; generate a plurality of segmentation masks for the plurality of frames; determine a set of reference masks corresponding to the user input and an object; generate a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks; propagate the set of fusion masks between the plurality of segmentation masks; and output a final set of masks for the input video.

9. The non-transitory computer-readable storage medium of claim 8, wherein the instructions to determine a set of reference masks corresponding to the user input and an object comprise instructions which, when executed by at least one processor, cause the at least one processor to: extract, using a first machine learning model, a text feature from the user input, wherein the text feature corresponds to an object; extract, using a second machine learning model, a plurality of image features from the plurality of frames, wherein a selected image feature corresponds to the object; identify one or more keyframes from the plurality of frames that include the object; and cluster one or more groups of the one or more keyframes that are within a threshold proximity to each other.

10. The non-transitory computer-readable storage medium of claim 9, the instructions further causing the processor to rank the one or more groups based on a size of each group, wherein the size of each group is a number of frames in each group.

11. The non-transitory computer-readable storage medium of claim 9, the instructions further causing the processor to identify the one or more keyframes from the plurality of frames that include the object by computing a similarity score by a trained neural network, and wherein the trained neural network is trained on a plurality of images and text inputs.

12. The non-transitory computer-readable storage medium of claim 11, wherein the instructions to determine the set of reference masks corresponding to the user input and the object comprise instructions which, when executed by at least one processor, cause the at least one processor to: concatenate the selected image feature and the text feature to form a concatenated feature; perform a cross-modal encoding of the concatenated feature; and decode, by a feature pyramid network, the concatenated feature to form an object mask.

13. The non-transitory computer-readable storage medium of claim 9, wherein the instructions to extract, using the first machine learning model, the text feature from the user input comprise instructions which, when executed by at least one processor, cause the at least one processor to: parsing the user input into parts of speech; filtering the parts of speech to form an object vocabulary comprising nouns and pronouns; and localizing the plurality of image features using the object vocabulary.

14. The non-transitory computer-readable storage medium of claim 8, the instructions further causing the processor to present the final set of masks and the input video to a user via a graphical user interface.

15. A system comprising: a processor; and a memory including instructions which, when executed by the processor, cause the system to: receive a user input and an input video comprising a plurality of frames; generate a plurality of segmentation masks for the plurality of frames; determine a set of reference masks corresponding to the user input and an object; generate a set of fusion masks by combining the plurality of segmentation masks and the set of reference masks; propagate the set of fusion masks between the plurality of segmentation masks; and output a final set of masks for the input video.

16. The system of claim 15, wherein the instructions which, when executed by the processor, cause the system to determine a set of reference masks corresponding to the user input and an object, further cause the processor to: extract, using a first machine learning model, a text feature from the user input, wherein the text feature corresponds to an object; extract, using a second machine learning model, a plurality of image features from the plurality of frames, wherein a selected image feature corresponds to the object; identify one or more keyframes from the plurality of frames that include the object; and cluster one or more groups of the one or more keyframes that are within a threshold proximity to each other.

17. The system of claim 16, the instructions further causing the processor to rank the one or more groups based on a size of each group, wherein the size of each group is a number of frames in each group.

18. The system of claim 16, the instructions further causing the processor to identify the one or more keyframes from the plurality of frames that include the object by computing a
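
Independent claims 1, 8, and 15 recite "combining the plurality of segmentation masks and the set of reference masks" into fusion masks, but this excerpt does not say how the combination is performed. As one possible reading only, the sketch below picks the candidate segmentation mask that best overlaps a reference mask by intersection-over-union (IoU) and falls back to the reference mask when no candidate overlaps enough. The IoU criterion, the `min_iou` threshold, and the names `iou` and `fuse` are assumptions introduced for illustration, not the patented method.

```python
from typing import List, Sequence

# A binary mask as rows of 0/1 pixels (hypothetical representation).
Mask = List[List[int]]


def iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two equally sized binary masks."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0


def fuse(candidates: Sequence[Mask], reference: Mask, min_iou: float = 0.5) -> Mask:
    """Return the candidate with the highest IoU against the reference mask.

    Assumption: if there are no candidates, or the best overlap is below
    min_iou, the reference mask itself is used as the fused result.
    """
    best = max(candidates, key=lambda m: iou(m, reference), default=None)
    if best is None or iou(best, reference) < min_iou:
        return reference
    return best


if __name__ == "__main__":
    reference = [[0, 1, 1],
                 [0, 1, 1]]
    candidates = [
        [[1, 0, 0],
         [1, 0, 0]],  # no overlap with the reference
        [[0, 1, 1],
         [0, 1, 0]],  # overlaps on 3 of 4 pixels in the union
    ]
    print(fuse(candidates, reference))
```

Selecting a whole candidate mask keeps the sketch simple. Pixel-wise combinations such as an intersection or a weighted vote would be equally consistent with the claim language quoted above.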
of input or preprocessed data · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Clustering techniques · CPC title
Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames · CPC title
Parsing · CPC title