Systems and methods for determining video feature descriptors based on convolutional neural networks

US9858484B2 · US · B2

Patent metadata
Publication number: US-9858484-B2
Application number: US-201414585826-A
Country: US
Kind code: B2
Filing date: Dec 30, 2014
Priority date: Dec 30, 2014
Publication date: Jan 2, 2018
Grant date: Jan 2, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and non-transitory computer-readable media can acquire video content for which video feature descriptors are to be determined. The video content can be processed based at least in part on a convolutional neural network including a set of two-dimensional convolutional layers and a set of three-dimensional convolutional layers. One or more outputs can be generated from the convolutional neural network. A plurality of video feature descriptors for the video content can be determined based at least in part on the one or more outputs from the convolutional neural network.
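The data flow described in the abstract (per-frame two-dimensional convolution, followed by three-dimensional convolution across the time dimension) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the toy 4-frame 6x6 video, the averaging filters `k2d` and `k3d`, the helper names, and the choice to flatten the final output into a descriptor vector. Real networks use many learned filters, nonlinearities, and multiple layers.

```python
def conv2d_valid(frame, kernel):
    """2D 'valid' cross-correlation of one frame (a list of rows) with a kernel.

    CNN libraries conventionally call this operation "convolution".
    """
    fh, fw = len(frame), len(frame[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(frame[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(fw - kw + 1)
        ]
        for i in range(fh - kh + 1)
    ]


def conv3d_valid(volume, kernel):
    """3D 'valid' cross-correlation over a (time, height, width) volume."""
    vt, vh, vw = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [
        [
            [
                sum(volume[t + c][i + a][j + b] * kernel[c][a][b]
                    for c in range(kt) for a in range(kh) for b in range(kw))
                for j in range(vw - kw + 1)
            ]
            for i in range(vh - kh + 1)
        ]
        for t in range(vt - kt + 1)
    ]


def shape(x):
    """Return the nested-list dimensions as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Hypothetical toy video: 4 temporally sorted frames, each 6x6 pixels.
video = [[[float(t + i + j) for j in range(6)] for i in range(6)]
         for t in range(4)]

k2d = [[1.0 / 9] * 3 for _ in range(3)]                      # 3x3 averaging filter
k3d = [[[1.0 / 8] * 2 for _ in range(2)] for _ in range(2)]  # 2x2x2 averaging filter

# Stage 1: the 2D filter is applied to each frame independently (spatial only).
stage1 = [conv2d_valid(frame, k2d) for frame in video]

# Stage 2: the 3D filter slides over time as well as space, mixing adjacent frames.
stage2 = conv3d_valid(stage1, k3d)

# Assumption: flatten the 3D output into a single vector for downstream use.
descriptor = [v for plane in stage2 for row in plane for v in row]

print(shape(video), shape(stage1), shape(stage2), len(descriptor))
```

With these toy sizes, each 6x6 frame shrinks to 4x4 after the 3x3 filter (an unpadded filter of width k reduces a dimension of size n to n - k + 1), and the 2x2x2 filter then reduces the 4x4x4 stack to 3x3x3. Only the second stage combines information from different frames.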

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: acquiring, by a computing system, video content for which video feature descriptors are to be determined, wherein the video content is represented as a plurality of two-dimensional image frames, wherein each of the plurality of two-dimensional image frames extends in a first spatial dimension and a second spatial dimension, wherein the plurality of two-dimensional image frames are temporally sorted, and wherein a third dimension corresponds to a time dimension with respect to which the plurality of two-dimensional image frames are temporally sorted; processing, by the computing system, the video content based at least in part on a convolutional neural network including a set of two-dimensional convolutional layers and a set of three-dimensional convolutional layers, wherein at least a portion of signals outputted by the set of two-dimensional convolutional layers are inputted into the set of three-dimensional convolutional layers, and wherein the set of three-dimensional convolutional layers generate one or more outputs based on the signals outputted by the set of two-dimensional convolutional layers; and determining, by the computing system, based at least in part on the one or more outputs from the set of three-dimensional convolutional layers in the convolutional neural network, a plurality of video feature descriptors for the video content.

2. The computer-implemented method of claim 1, wherein the processing of the video content based at least in part on the convolutional neural network further comprises: inputting a representation of the video content into the set of two-dimensional convolutional layers; applying, within the set of two-dimensional convolutional layers, at least one two-dimensional convolutional operation to the representation of the video content; outputting a first collection of signals from the set of two-dimensional convolutional layers; inputting at least a portion of the first collection of signals into the set of three-dimensional convolutional layers; applying, within the set of three-dimensional convolutional layers, at least one three-dimensional convolutional operation to at least the portion of the first collection of signals; and outputting a second collection of signals from the set of three-dimensional convolutional layers, wherein the one or more outputs from the convolutional neural network are dependent on at least a portion of the second collection of signals.

3. The computer-implemented method of claim 2, wherein the convolutional neural network includes a set of fully-connected layers, wherein at least the portion of the second collection of signals is inputted into the set of fully-connected layers, wherein the set of fully-connected layers outputs a third collection of signals, and wherein the one or more outputs from the convolutional neural network are generated based at least in part on at least a portion of the third collection of signals.

4. The computer-implemented method of claim 2, wherein the at least one two-dimensional convolutional operation utilizes at least one two-dimensional filter to convolve the representation of the video content, and wherein the representation of the video content is reduced in signal size based at least in part on the at least one two-dimensional convolutional operation.

5. The computer-implemented method of claim 2, wherein the at least one three-dimensional convolutional operation utilizes at least one three-dimensional filter to convolve at least the portion of the first collection of signals.

6. The computer-implemented method of claim 1, wherein the set of two-dimensional convolutional layers includes at least five two-dimensional convolutional layers, and wherein the set of three-dimensional convolutional layers includes at least three three-dimensional convolutional layers.

7. The computer-implemented method of claim 1, further comprising: training the convolutional neural network based at least in part on the video content, wherein the video content is associated with one or more labels for at least one of a recognized scene, a recognized object, or a recognized action.

8. The computer-implemented method of claim 7, wherein the training of the convolutional neural network further comprises: determining one or more differences between the one or more labels and the plurality of video feature descriptors; and adjusting one or more weight values of one or more filters associated with the convolutional neural network to minimize the one or more differences, wherein the adjusting of the one or more weight values occurs during a backpropagation through the convolutional neural network.

9. The computer-implemented method of claim 1, wherein the video feature descriptors provide a first set of metrics indicating likelihoods that specified scenes are represented in the video content, a second set of metrics indicating likelihoods that specified objects are represented in the video content, and a third set of metrics indicating likelihoods that specified actions are represented in the video content.

10. A system comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to perform: acquiring video content for which video feature descriptors are to be determined, wherein the video content is represented as a plurality of two-dimensional image frames, wherein each of the plurality of two-dimensional image frames extends in a first spatial dimension and a second spatial dimension, wherein the plurality of two-dimensional image frames are temporally sorted, and wherein a third dimension corresponds to a time dimension with respect to which the plurality of two-dimensional image frames are temporally sorted; processing the video content based at least in part on a convolutional neural network including a set of two-dimensional convolutional layers and a set of three-dimensional convolutional layers, wherein at least a portion of signals outputted by the set of two-dimensional convolutional layers are inputted into the set of three-dimensional convolutional layers, and wherein the set of three-dimensional convolutional layers generate one or more outputs based on the signals outputted by the set of two-dimensional convolutional layers; and determining based at least in part on the one or more outputs from the set of three-dimensional convolutional layers in the convolutional neural network, a plurality of video feature descriptors for the video content.

11. The system of claim 10, wherein the video content is represented as a plurality of two-dimensional image frames, wherein each of the plurality of two-dimensional image frames extends in a first spatial dimension and a second spatial dimension, wherein the plurality of two-dimensional image frames is temporally sorted, and wherein a third dimension corresponds to a time dimension with respect to which the plurality of two-dimensional image frames is temporally sorted.

12. The system of claim 10, wherein the processing of the video content based at least in part on the convolutional neural network further comprises: inputting a representation of the video content into the set of two-dimensional convolutional layers; applying, within the set of two-dimensional convolutional layers, at least one two-dimensional convolutional operation to the representation of the video content; outputting a first collection of signals from the set of two-dimensional convolutional layers; inputting at least a portion of the first collection of signals into the set of three-dimensional convolutional layers;
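Claims 8 and 9 describe comparing labels against the descriptors and adjusting weights to minimize the differences, with the descriptors expressed as likelihoods for scenes, objects, and actions. The sketch below shows one way such an update could look for a single output layer. The claims do not name a loss function, activation, or learning rate, so the squared-error loss, the sigmoid, `lr=0.5`, the feature values, and all function names are assumptions chosen for illustration. A full network would propagate the same chain-rule gradients back through the three-dimensional and two-dimensional filters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """One likelihood per label: sigmoid of a weighted sum of the features."""
    return [sigmoid(sum(w * f for w, f in zip(row, features)))
            for row in weights]


def loss(weights, features, labels):
    """Squared-error 'difference' between labels and predicted likelihoods."""
    return sum(0.5 * (p - y) ** 2
               for p, y in zip(predict(weights, features), labels))


def train_step(weights, features, labels, lr=0.5):
    """One gradient-descent update of the output-layer weights."""
    preds = predict(weights, features)
    updated = []
    for row, p, y in zip(weights, preds, labels):
        # Chain rule: d/dw [0.5 * (p - y)^2] with p = sigmoid(w . f)
        # equals (p - y) * p * (1 - p) * f for each feature f.
        scale = (p - y) * p * (1.0 - p)
        updated.append([w - lr * scale * f for w, f in zip(row, features)])
    return updated


# Hypothetical features, standing in for the output of the convolutional layers.
features = [0.5, -1.0, 2.0]
# Hypothetical labels: scene present, object absent, action present.
labels = [1.0, 0.0, 1.0]
weights = [[0.0, 0.0, 0.0] for _ in labels]

before = loss(weights, features, labels)
for _ in range(200):
    weights = train_step(weights, features, labels)
after = loss(weights, features, labels)

scene_p, object_p, action_p = predict(weights, features)
print(f"loss before={before:.4f} after={after:.4f}")
print(f"scene={scene_p:.3f} object={object_p:.3f} action={action_p:.3f}")
```

Each update moves the predicted likelihoods toward their labels, so the loss after training is lower than the initial value of 0.375 (three labels, each starting at a prediction of 0.5).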

Assignees

Facebook Inc

Inventors

Classifications

  • G06N3/084 · Primary

    Backpropagation, e.g. using gradient descent · CPC title

  • Combinations of networks · CPC title

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9858484B2 cover?
Systems, methods, and non-transitory computer-readable media can acquire video content for which video feature descriptors are to be determined. The video content can be processed based at least in part on a convolutional neural network including a set of two-dimensional convolutional layers and a set of three-dimensional convolutional layers. One or more outputs can be generated from the convo…
Who is the assignee on this patent?
Facebook Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/084. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 2, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).