Visual Annotations for Objects
US-2015220504-A1 · Aug 6, 2015 · US
US9836853B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-9836853-B1 |
| Application number | US-201615256874-A |
| Country | US |
| Kind code | B1 |
| Filing date | Sep 6, 2016 |
| Priority date | Sep 6, 2016 |
| Publication date | Dec 5, 2017 |
| Grant date | Dec 5, 2017 |
Official abstract text for this publication.
A three-dimensional convolutional neural network may include a preliminary layer group, one or more intermediate layer groups, a final layer group, and/or other layers/layer groups. The preliminary layer group may include an input layer, a preliminary three-dimensional padding layer, a preliminary three-dimensional convolution layer, a preliminary activation layer, a preliminary normalization layer, and a preliminary downsampling layer. One or more intermediate layer groups may include an intermediate three-dimensional squeeze layer, a first intermediate normalization layer, an intermediate three-dimensional padding layer, a first intermediate three-dimensional expand layer, a second intermediate three-dimensional expand layer, an intermediate concatenation layer, a second intermediate normalization layer, an intermediate activation layer, and an intermediate combination layer. The final layer group may include a final dropout layer, a final three-dimensional convolution layer, a final activation layer, a final normalization layer, a final three-dimensional downsampling layer, and a final flatten layer.
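The abstract names the layers in each group but gives no sizes. As a rough illustration only, the sketch below tracks how the shape of a video segment map might change through a preliminary layer group (pad, convolve, downsample) and checks the channel counts of an intermediate layer group (squeeze, two expands, concatenate, combine). The kernel size, pooling size, 112x112 resolution, filter counts, and all function names are hypothetical. Treating the combination layer as an element-wise sum is likewise an assumption, since the abstract does not say how the combination is performed.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def preliminary_group_shape(frames, height, width, filters, kernel=3, pool=2):
    """Shape (frames, height, width, channels) after pad -> conv -> pool.

    Activation and normalization leave the shape unchanged, so they are
    not modelled here. All sizes are assumed values, not from the patent.
    """
    pad = kernel // 2
    # Padding layer: grow each of the three axes by `pad` on both sides.
    padded = (frames + 2 * pad, height + 2 * pad, width + 2 * pad)
    # 3D convolution with stride 1 over the padded map.
    convolved = tuple(conv_output_size(s, kernel) for s in padded)
    # Downsampling modelled as non-overlapping pooling (assumption).
    pooled = tuple(conv_output_size(s, pool, stride=pool) for s in convolved)
    return pooled + (filters,)


def intermediate_group_channels(in_channels, squeeze, expand_a, expand_b):
    """Channel count after squeeze -> two expands -> concatenation.

    Assumes the combination layer is an element-wise sum with the group's
    input, which requires matching channel counts.
    """
    if squeeze >= in_channels:
        raise ValueError("squeeze layer should reduce the channel count")
    concatenated = expand_a + expand_b
    if concatenated != in_channels:
        raise ValueError("element-wise combination needs matching channels")
    return concatenated


if __name__ == "__main__":
    # Sixteen frames per segment appears in claim 7; the rest is assumed.
    print(preliminary_group_shape(16, 112, 112, 64))
    print(intermediate_group_channels(64, 16, 32, 32))
```

Under these assumed sizes, padding by one voxel keeps the stride-1 convolution from shrinking the map, so only the downsampling layer reduces the frame, height, and width axes.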
Opening claim text (preview).
What is claimed is:

1. A three-dimensional convolutional neural network system for video highlight detection, the system comprising: one or more physical processors configured by machine-readable instructions to:
access video content, the video content having a duration;
segment the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
input the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments: accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels, increases the dimensionality of the video segment map; convolves the video segment map to produce a first set of feature maps; applies a first activating function to the first set of feature maps; normalizes the first set of feature maps; and downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments: receives a first output from a layer preceding the individual intermediate layer group: convolves the first output to reduce a number of channels of the first output; normalizes the first output; increases the dimensionality of the first output; convolves the first output to produce a second set of feature maps; convolves the first output to produce a third set of feature maps; concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps; normalizes the set of concatenated feature maps; applies a second activating function to the set of concatenated feature maps; and combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments: receives a second output from a layer preceding the final layer group; reduces an overfitting from the second output; convolves the second output to produce a fourth set of feature maps; applies a third activating function to the fourth set of feature maps; normalizes the fourth set of feature maps; downsamples the fourth set of feature maps; and converts the fourth set of feature maps into a spatiotemporal feature vector;
input the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors; and
determine a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors.

2. The system of claim 1, wherein individual predicted spatiotemporal feature vectors corresponding to the individual video segments characterizes a prediction of a video segment following the individual video segments within the duration.

3. The system of claim 1, wherein individual predicted spatiotemporal feature vectors for the individual video segments characterizes a prediction of a video segment preceding the individual video segments within the duration.

4. The system of claim 2, wherein: the first set of spatiotemporal feature vectors includes a first spatiotemporal feature vector corresponding to the first video segment and a second spatiotemporal feature vector corresponding to the second video segment; the first set of predicted spatiotemporal feature vectors includes a first predicted spatiotemporal feature vector determined based on the first spatiotemporal feature vector, the first predicted spatiotemporal feature vector characterizing a prediction of the second video segment; and the comparison of the one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with the one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors includes a comparison of the second spatiotemporal feature vector with the first predicted spatiotemporal feature vector.

5. The system of claim 1, wherein the presence of the highlight moment within the video content is determined based on a difference between the one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors and the one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors meeting or being below a threshold.

6. The system of claim 1, wherein the one or more physical processors are further configured by machine-readable instructions to input two or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors into a categorization layer, the categorization layer determining a category for the video content based on the two or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors.

7. The system of claim 1, wherein the first number of video frames includes sixteen video frames.

8. The system of claim 1, wherein the one or more physical processors are further configured by machine-readable instructions to: segment the video content into a second set of video segments, individual video segments within the second set of video segments including a second number of video frames, the second number of video frames being different from the first number of video frames; input the second set of video segments into a second three-dimensional convolutional neural network, the second three-dimensional convolutional neural network outputting a second set of spatiotemporal feature vectors corresponding to the second set of video segments; input the second set of spatiotemporal feature vectors into the long short-term memory network, the long short-term memory network determining a second set of predicted spatiotemporal feature vectors based on the second set of spatiotemporal feature vectors; and determine the presence of the highlight moment within the video content further based on a comparison of one or more spatiotemporal feature vectors of the second set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the second set of predicted spatiotemporal feature vectors.

9. The system of claim 1, wherein the first three-dimensional convolutional neural network is initialized with pre-trained weights from a trained two-dimensional convolutional neural network, the pre-trained weights from the trained two-dimensional convolutional neural network being stacked along a time dimension.

10. The system of claim 1, wherein the long short-term memory network is trained with second video content including highlights.

11. A method for using a three-dimensional convolutional neural network for video highlight detection, the method comprising: accessing video content, the video content having a duration; segmenting the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video
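The segmentation step of claim 1 and the comparison described in claims 4 and 5 can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation. The function names are hypothetical, the feature vectors are made-up toy values standing in for the outputs of the three-dimensional network and the long short-term memory network, Euclidean distance is an assumed choice for the "difference", and dropping a trailing partial segment is also an assumption. Following claim 5, a segment is flagged when the difference meets or falls below the threshold.

```python
import math


def segment_frames(frames, frames_per_segment=16):
    """Split frames into consecutive fixed-length segments.

    Sixteen frames per segment follows claim 7. Discarding a trailing
    partial segment is an assumption made for this sketch.
    """
    usable = len(frames) - len(frames) % frames_per_segment
    return [
        frames[i:i + frames_per_segment]
        for i in range(0, usable, frames_per_segment)
    ]


def euclidean_distance(a, b):
    """Assumed difference measure between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_highlights(features, predicted, threshold):
    """Return indices of segments flagged as highlight moments.

    features[i]  : feature vector of segment i (stand-in for 3D CNN output)
    predicted[i] : prediction of segment i + 1 made from segment i
                   (stand-in for LSTM output, as in claims 2 and 4)
    """
    highlights = []
    for i in range(len(features) - 1):
        # Claim 4: compare the next segment's vector with its prediction.
        difference = euclidean_distance(features[i + 1], predicted[i])
        # Claim 5: difference meeting or being below the threshold.
        if difference <= threshold:
            highlights.append(i + 1)
    return highlights


if __name__ == "__main__":
    segments = segment_frames(list(range(40)))
    print(len(segments))  # 40 frames give two full 16-frame segments

    features = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    predicted = [[1.0, 1.2], [2.0, 2.0]]
    print(find_highlights(features, predicted, threshold=0.5))
```

In the toy data, segment 1 lands close to its prediction and is flagged, while segment 2 is far from its prediction and is not. The direction of the test fits claim 10, where the long short-term memory network is trained with video content that includes highlights.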
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
Learning methods · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN] · CPC title