Generating action tags for digital videos

US11949964B2 · US · B2

Patent metadata
Publication number: US-11949964-B2
Application number: US-202117470441-A
Country: US
Kind code: B2
Filing date: Sep 9, 2021
Priority date: Apr 16, 2019
Publication date: Apr 2, 2024
Grant date: Apr 2, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and non-transitory computer-readable media are disclosed for automatic tagging of videos. In particular, in one or more embodiments, the disclosed systems generate a set of tagged feature vectors (e.g., tagged feature vectors based on action-rich digital videos) to utilize to generate tags for an input digital video. For instance, the disclosed systems can extract a set of frames for the input digital video and generate feature vectors from the set of frames. In some embodiments, the disclosed systems generate aggregated feature vectors from the feature vectors. Furthermore, the disclosed systems can utilize the feature vectors (or aggregated feature vectors) to identify similar tagged feature vectors from the set of tagged feature vectors. Additionally, the disclosed systems can generate a set of tags for the input digital video by aggregating one or more tags corresponding to identified similar tagged feature vectors.
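The pipeline in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical example and not the patented implementation: the neural network is replaced by precomputed per-frame feature vectors, and the names `average_pool` and `tag_video`, the Euclidean metric, `k=3`, and ranking tags by neighbor vote count are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical types: a feature vector is a list of floats, and a tagged entry
# pairs a feature vector with the labels of the media item it came from.
Vector = list[float]
TaggedEntry = tuple[Vector, list[str]]


def average_pool(vectors: list[Vector]) -> Vector:
    """Combine per-frame feature vectors into one aggregated vector (element-wise mean)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def tag_video(frame_features: list[Vector], tagged_set: list[TaggedEntry], k: int = 3) -> list[str]:
    """Pool frame features, find the k nearest tagged vectors by Euclidean
    distance, and rank their tags by how many neighbors carry each tag."""
    query = average_pool(frame_features)
    neighbors = sorted(tagged_set, key=lambda entry: math.dist(query, entry[0]))[:k]
    votes = Counter(tag for _, tags in neighbors for tag in tags)
    # most_common() orders by vote count; ties keep first-seen order.
    return [tag for tag, _ in votes.most_common()]


# Stand-ins for neural-network outputs (two frames, two feature dimensions).
frames = [[0.9, 0.1], [1.1, -0.1]]
tagged = [
    ([1.0, 0.1], ["run", "jog"]),
    ([0.9, 0.0], ["run"]),
    ([0.0, 1.0], ["swim"]),
    ([1.2, -0.2], ["sprint", "run"]),
]
print(tag_video(frames, tagged))  # ['run', 'jog', 'sprint']
```

The "swim" entry is excluded because it is not among the three nearest tagged vectors to the pooled query `[1.0, 0.0]`; "run" ranks first because all three neighbors carry it.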

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computer system to: extract a plurality of frames from a video; generate, utilizing a neural network, feature vectors for frames of the plurality of frames; combine a subset of the feature vectors to generate an aggregated feature vector; select one or more tagged feature vectors from a set of tagged feature vectors based on distances between the aggregated feature vector and the one or more tagged feature vectors from the set of tagged feature vectors, wherein the set of tagged feature vectors comprise feature vectors generated from particular media content items and tagged with labels that correspond to the particular media content items; generate a set of tags to associate with the video by selecting tags from the one or more tagged feature vectors and aggregating the tags selected from the one or more tagged feature vectors; and tag the video with the set of tags.

2. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computer system to combine the subset of the feature vectors to generate the aggregated feature vector by pooling feature values from the subset of the feature vectors.

3. The non-transitory computer-readable medium of claim 1, further comprising identifying the set of tagged feature vectors from a tagged feature vector data storage comprising pre-tagged feature vectors.

4. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computer system to: select the one or more tagged feature vectors from the set of tagged feature vectors by: determining distance values between the aggregated feature vector and the one or more tagged feature vectors from the set of tagged feature vectors; and selecting the one or more tagged feature vectors based on the one or more tagged feature vectors having distance values that meet a threshold distance value.

5. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computer system to group the plurality of frames into a plurality of groups based on one or more characteristics of the frames of the plurality of frames; wherein subset of the feature vectors comprises feature vectors of the frames in a group of the plurality of groups.

6. The non-transitory computer-readable medium of claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to group the plurality of frames into the plurality of groups based on the one or more characteristics of the frames of the plurality of frames by grouping the frames based on time stamps associated with the frames.

7. The non-transitory computer-readable medium of claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to group the plurality of frames into the plurality of groups based on the one or more characteristics of the frames of the plurality of frames by grouping the frames into delineated scenes within the video.

8. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computer system to identify the set of tagged feature vectors by: identifying a media content item comprising text representing one or more verbs; generating, utilizing the neural network, a tagged feature vector for the media content item; assigning tags to the tagged feature vector by assigning the one or more verbs to the tagged feature vector; and associating the tagged feature vector with the set of tagged feature vectors.

9. The non-transitory computer-readable medium of claim 1, further comprising instructions that, when executed by the at least one processor, cause the computer system to associate the set of tags with a temporal segment of the video comprising the plurality of frames.

10. The non-transitory computer-readable medium of claim 9, further comprising instructions that, when executed by the at least one processor, cause the computer system to: provide graphical user interface displaying the video; provide a timeline for the video in the graphical user interface; and place a tag indicator associated with a tag of the set of tags on the timeline at a position corresponding to the temporal segment of the video.

11. A system comprising: memory comprising a neural network and a set of tagged feature vectors corresponding to a set of media content items, the set of tagged feature vectors comprising feature vectors generated from media content items and tagged with labels that correspond to content of the media content items; and at least one server configured to cause the system to: extract a plurality of frames from a video; generate, utilizing a neural network, feature vectors for frames of the plurality of frames; generate aggregated feature vectors by combining subsets of the feature vectors; determine tags for the aggregated feature vectors by: selecting one or more tagged feature vectors from the set of tagged feature vectors based on distances between the aggregated feature vectors and the one or more tagged feature vectors, wherein the set of tagged feature vectors comprise feature vectors generated from particular media content items and tagged with labels that correspond to the particular media content items; and extracting the tags associated with the one or more tagged feature vectors; and tag the frames of the video associated with the aggregated feature vectors with the extracted tags.

12. The system of claim 11, wherein the at least one server is further configured to cause the system to generate aggregated feature vectors by, for a given aggregated feature vector, utilizing averaging pooling or max pooling to combine feature vectors in a subset of feature vectors.

13. The system of claim 11, wherein selecting the one or more tagged feature vectors from the set of tagged feature vectors based on distances between the aggregated feature vectors and the one or more tagged feature vectors comprises utilizing a k-nearest neighbor algorithm.

14. The system of claim 11, wherein the at least one server is further configured to cause the system to: receive a search request to identify videos associated with an action; identify that the video is tagged with a tag corresponding to the action; and returning the video in response to the search request.

15. The system of claim 11, wherein the at least one server is further configured to cause the system to: cluster the plurality of frames into a plurality of groups, each group of the plurality of groups corresponding to scene from the video; and wherein each subset of feature vectors comprises the feature vectors of the frames of a given group of the plurality of groups.

16. The system of claim 11, wherein the at least one server is further configured to cause the system to generate, utilizing the neural network, feature vectors for frames of the plurality of frames by utilizing an image classification neural network to extract visual characteristics and latent attributes in different levels of abstractions from a frame of the plurality of frames.

17. A computer-implemented method for automatic tagging of videos, the computer-implemented method comprising: extracting
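Several dependent claims add concrete steps: grouping frames by time stamps (claim 6), pooling each group (claims 2 and 12), selecting tagged vectors within a threshold distance (claim 4), attaching tags to temporal segments (claim 9), and answering action searches (claim 14). The sketch below strings these together under loudly stated assumptions: fixed two-second windows stand in for time-stamp grouping, max pooling is used for aggregation, the threshold variant of claim 4 is used instead of the k-nearest-neighbor variant of claim 13, and `Segment`, `tag_segments`, `search_videos`, `window`, and `threshold` are hypothetical names, not anything defined by the patent.

```python
import math
from dataclasses import dataclass, field

Vector = list[float]


@dataclass
class Segment:
    """A temporal segment of a video (in seconds) and the tags attached to it."""
    start: float
    end: float
    tags: set[str] = field(default_factory=set)


def group_by_timestamp(frames: list[tuple[float, Vector]], window: float) -> dict[int, list[Vector]]:
    """Group (timestamp, feature vector) pairs into fixed-length time windows."""
    groups: dict[int, list[Vector]] = {}
    for timestamp, vector in frames:
        groups.setdefault(int(timestamp // window), []).append(vector)
    return groups


def max_pool(vectors: list[Vector]) -> Vector:
    """Element-wise maximum over a group of feature vectors."""
    return [max(column) for column in zip(*vectors)]


def tags_within_threshold(query: Vector, tagged_set, threshold: float) -> set[str]:
    """Union of labels from every tagged vector within `threshold` (Euclidean) of `query`."""
    tags: set[str] = set()
    for vector, labels in tagged_set:
        if math.dist(query, vector) <= threshold:
            tags.update(labels)
    return tags


def tag_segments(frames, tagged_set, window: float = 2.0, threshold: float = 0.5) -> list[Segment]:
    """Pool each time window, then attach the matching tags to that window's segment."""
    segments = []
    for index, vectors in sorted(group_by_timestamp(frames, window).items()):
        pooled = max_pool(vectors)
        tags = tags_within_threshold(pooled, tagged_set, threshold)
        segments.append(Segment(index * window, (index + 1) * window, tags))
    return segments


def search_videos(library: dict[str, list[Segment]], action: str) -> list[str]:
    """Return ids of videos with at least one segment tagged with `action`."""
    return [vid for vid, segs in library.items() if any(action in s.tags for s in segs)]


frames = [(0.0, [1.0, 0.0]), (1.0, [0.8, 0.1]), (2.5, [0.0, 1.0]), (3.0, [0.1, 0.9])]
tagged = [([1.0, 0.0], ["run"]), ([0.0, 1.0], ["swim"]), ([0.7, 0.7], ["dance"])]
segments = tag_segments(frames, tagged)
for s in segments:
    print(s.start, s.end, sorted(s.tags))  # 0.0 2.0 ['run'], then 2.0 4.0 ['swim']
print(search_videos({"clip-a": segments, "clip-b": []}, "swim"))  # ['clip-a']
```

A segment's `start` value is where a tag indicator would sit on the timeline described in claim 10; replacing `max_pool` with an element-wise mean gives the average-pooling option that claim 12 also names.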

Assignees

Adobe Inc

Classifications (CPC titles)

  • Supervised learning

  • Convolutional networks [CNN, ConvNet]

  • specifically related to the content, e.g. biography of the actors in a movie, detailed information about an article seen in a video programme

  • Learning methods

  • Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11949964B2 cover?
It covers automatic tagging of videos. The system extracts frames from an input video, uses a neural network to generate feature vectors for those frames (optionally combining them into aggregated feature vectors), identifies similar vectors in a set of tagged feature vectors (for example, vectors built from action-rich videos), and aggregates the tags of those matches into a set of tags for the video. The full text is in the Abstract section above.
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification H04N21/8133. Mapped technology areas include Electricity.
When was this patent published?
It was published on April 2, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).