Video annotation using deep network architectures
US-9330171-B1 · May 3, 2016 · US
US9858484B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9858484-B2 |
| Application number | US-201414585826-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 30, 2014 |
| Priority date | Dec 30, 2014 |
| Publication date | Jan 2, 2018 |
| Grant date | Jan 2, 2018 |
Official abstract text for this publication.
Systems, methods, and non-transitory computer-readable media can acquire video content for which video feature descriptors are to be determined. The video content can be processed based at least in part on a convolutional neural network including a set of two-dimensional convolutional layers and a set of three-dimensional convolutional layers. One or more outputs can be generated from the convolutional neural network. A plurality of video feature descriptors for the video content can be determined based at least in part on the one or more outputs from the convolutional neural network.
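As a rough illustration of the layer ordering the abstract describes, the sketch below runs a 2D convolution over each frame independently and then a 3D convolution over the stacked results, so that only the second stage mixes information across time. It is a minimal pure-Python sketch, not the patented implementation: the function names (`conv2d`, `conv3d`, `forward`), the use of a single filter per stage, the filter sizes, and the toy input are all assumptions made for illustration.

```python
# Illustrative sketch only. Filter sizes, values, and function names are
# assumptions; the source does not specify them.

def conv2d(frame, kernel):
    """'Valid' 2D filtering (cross-correlation, as is usual in CNNs) of one
    H x W frame with one kh x kw filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [
        [
            sum(frame[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv3d(volume, kernel):
    """'Valid' 3D filtering of a T x H x W volume with one kt x kh x kw filter."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out_t = len(volume) - kt + 1
    out_h = len(volume[0]) - kh + 1
    out_w = len(volume[0][0]) - kw + 1
    return [
        [
            [
                sum(volume[t + c][i + a][j + b] * kernel[c][a][b]
                    for c in range(kt) for a in range(kh) for b in range(kw))
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        for t in range(out_t)
    ]


def forward(frames, kernel_2d, kernel_3d):
    """2D stage on each frame, then 3D stage on the time-ordered stack."""
    # 2D stage: every frame is filtered on its own (spatial dimensions only).
    maps = [conv2d(frame, kernel_2d) for frame in frames]
    # 3D stage: the stacked maps are filtered across time as well as space.
    return conv3d(maps, kernel_3d)


# Toy video: 4 temporally sorted frames of 6 x 6 pixels.
frames = [[[float(t + i + j) for j in range(6)] for i in range(6)]
          for t in range(4)]
kernel_2d = [[1.0 / 9] * 3 for _ in range(3)]                      # 3 x 3 mean
kernel_3d = [[[1.0 / 8] * 2 for _ in range(2)] for _ in range(2)]  # 2 x 2 x 2 mean

out = forward(frames, kernel_2d, kernel_3d)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)
```

With "valid" filtering, the 6 x 6 frames shrink to 4 x 4 after the 2D stage, and the 4-frame stack shrinks to 3 x 3 x 3 after the 3D stage, which is one simple way a representation can be reduced in signal size by a convolutional operation.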
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: acquiring, by a computing system, video content for which video feature descriptors are to be determined, wherein the video content is represented as a plurality of two-dimensional image frames, wherein each of the plurality of two-dimensional image frames extends in a first spatial dimension and a second spatial dimension, wherein the plurality of two-dimensional image frames are temporally sorted, and wherein a third dimension corresponds to a time dimension with respect to which the plurality of two-dimensional image frames are temporally sorted; processing, by the computing system, the video content based at least in part on a convolutional neural network including a set of two-dimensional convolutional layers and a set of three-dimensional convolutional layers, wherein at least a portion of signals outputted by the set of two-dimensional convolutional layers are inputted into the set of three-dimensional convolutional layers, and wherein the set of three-dimensional convolutional layers generate one or more outputs based on the signals outputted by the set of two-dimensional convolutional layers; and determining, by the computing system, based at least in part on the one or more outputs from the set of three-dimensional convolutional layers in the convolutional neural network, a plurality of video feature descriptors for the video content.
2. The computer-implemented method of claim 1, wherein the processing of the video content based at least in part on the convolutional neural network further comprises: inputting a representation of the video content into the set of two-dimensional convolutional layers; applying, within the set of two-dimensional convolutional layers, at least one two-dimensional convolutional operation to the representation of the video content; outputting a first collection of signals from the set of two-dimensional convolutional layers; inputting at least a portion of the first collection of signals into the set of three-dimensional convolutional layers; applying, within the set of three-dimensional convolutional layers, at least one three-dimensional convolutional operation to at least the portion of the first collection of signals; and outputting a second collection of signals from the set of three-dimensional convolutional layers, wherein the one or more outputs from the convolutional neural network are dependent on at least a portion of the second collection of signals.
3. The computer-implemented method of claim 2, wherein the convolutional neural network includes a set of fully-connected layers, wherein at least the portion of the second collection of signals is inputted into the set of fully-connected layers, wherein the set of fully-connected layers outputs a third collection of signals, and wherein the one or more outputs from the convolutional neural network are generated based at least in part on at least a portion of the third collection of signals.
4. The computer-implemented method of claim 2, wherein the at least one two-dimensional convolutional operation utilizes at least one two-dimensional filter to convolve the representation of the video content, and wherein the representation of the video content is reduced in signal size based at least in part on the at least one two-dimensional convolutional operation.
5. The computer-implemented method of claim 2, wherein the at least one three-dimensional convolutional operation utilizes at least one three-dimensional filter to convolve at least the portion of the first collection of signals.
6. The computer-implemented method of claim 1, wherein the set of two-dimensional convolutional layers includes at least five two-dimensional convolutional layers, and wherein the set of three-dimensional convolutional layers includes at least three three-dimensional convolutional layers.
7. The computer-implemented method of claim 1, further comprising: training the convolutional neural network based at least in part on the video content, wherein the video content is associated with one or more labels for at least one of a recognized scene, a recognized object, or a recognized action.
8. The computer-implemented method of claim 7, wherein the training of the convolutional neural network further comprises: determining one or more differences between the one or more labels and the plurality of video feature descriptors; and adjusting one or more weight values of one or more filters associated with the convolutional neural network to minimize the one or more differences, wherein the adjusting of the one or more weight values occurs during a backpropagation through the convolutional neural network.
9. The computer-implemented method of claim 1, wherein the video feature descriptors provide a first set of metrics indicating likelihoods that specified scenes are represented in the video content, a second set of metrics indicating likelihoods that specified objects are represented in the video content, and a third set of metrics indicating likelihoods that specified actions are represented in the video content.
10. A system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to perform: acquiring video content for which video feature descriptors are to be determined, wherein the video content is represented as a plurality of two-dimensional image frames, wherein each of the plurality of two-dimensional image frames extends in a first spatial dimension and a second spatial dimension, wherein the plurality of two-dimensional image frames are temporally sorted, and wherein a third dimension corresponds to a time dimension with respect to which the plurality of two-dimensional image frames are temporally sorted; processing the video content based at least in part on a convolutional neural network including a set of two-dimensional convolutional layers and a set of three-dimensional convolutional layers, wherein at least a portion of signals outputted by the set of two-dimensional convolutional layers are inputted into the set of three-dimensional convolutional layers, and wherein the set of three-dimensional convolutional layers generate one or more outputs based on the signals outputted by the set of two-dimensional convolutional layers; and determining based at least in part on the one or more outputs from the set of three-dimensional convolutional layers in the convolutional neural network, a plurality of video feature descriptors for the video content.
11. The system of claim 10, wherein the video content is represented as a plurality of two-dimensional image frames, wherein each of the plurality of two-dimensional image frames extends in a first spatial dimension and a second spatial dimension, wherein the plurality of two-dimensional image frames is temporally sorted, and wherein a third dimension corresponds to a time dimension with respect to which the plurality of two-dimensional image frames is temporally sorted.
12. The system of claim 10, wherein the processing of the video content based at least in part on the convolutional neural network further comprises: inputting a representation of the video content into the set of two-dimensional convolutional layers; applying, within the set of two-dimensional convolutional layers, at least one two-dimensional convolutional operation to the representation of the video content; outputting a first collection of signals from the set of two-dimensional convolutional layers; inputting at least a portion of the first collection of signals into the set of three-dimensional convolutional layers;
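The training step recited in claim 8 (measure the differences between labels and descriptors, then adjust weights to reduce them) and the likelihood-style descriptors of claim 9 can be sketched on a single output layer. This is a hedged illustration under stated assumptions, not the claimed method: the label names, the sigmoid output, the cross-entropy loss, the learning rate, and the functions `descriptors` and `train_step` are all hypothetical, and only one linear layer is updated here, whereas a full network would propagate the same difference signal back through the 3D and 2D filters.

```python
import math

# Hypothetical labels, one per descriptor family (scene, object, action).
LABEL_NAMES = ["scene:beach", "object:dog", "action:running"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def descriptors(features, weights):
    """One likelihood per label: sigmoid of a weighted sum of the features."""
    return [sigmoid(sum(w * x for w, x in zip(row, features)))
            for row in weights]


def train_step(features, labels, weights, lr=0.5):
    """One gradient-descent update (in place); returns the loss before it."""
    preds = descriptors(features, weights)
    # Binary cross-entropy summed over labels (an assumed choice of loss).
    loss = -sum(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
                for p, y in zip(preds, labels))
    for k, (p, y) in enumerate(zip(preds, labels)):
        diff = p - y  # difference between descriptor and label
        for j, x in enumerate(features):
            # For sigmoid + cross-entropy, d(loss)/d(w[k][j]) = (p - y) * x.
            weights[k][j] -= lr * diff * x
    return loss


# Toy feature vector standing in for the output of the 3D stage.
features = [0.5, -1.0, 2.0]
labels = [1.0, 0.0, 1.0]  # beach present, no dog, running present
weights = [[0.0] * len(features) for _ in LABEL_NAMES]

losses = [train_step(features, labels, weights) for _ in range(50)]
final = descriptors(features, weights)
for name, p in zip(LABEL_NAMES, final):
    print(name, round(p, 3))
```

Each update moves the predicted likelihoods toward their labels, so the loss falls from its starting value over the 50 steps.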
Backpropagation, e.g. using gradient descent · CPC title
Combinations of networks · CPC title
Supervised learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Physics · mapped topic