What technology area does this patent fall under?

Primary CPC classification G06V10/82. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 5, 2017 (B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Three-dimensional convolutional neural networks for video highlight detection

US9836853B1 · US · B1

Patent metadata
Field	Value
Publication number	US-9836853-B1
Application number	US-201615256874-A
Country	US
Kind code	B1
Filing date	Sep 6, 2016
Priority date	Sep 6, 2016
Publication date	Dec 5, 2017
Grant date	Dec 5, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A three-dimensional convolutional neural network may include a preliminary layer group, one or more intermediate layer groups, a final layer group, and/or other layers/layer groups. The preliminary layer group may include an input layer, a preliminary three-dimensional padding layer, a preliminary three-dimensional convolution layer, a preliminary activation layer, a preliminary normalization layer, and a preliminary downsampling layer. One or more intermediate layer groups may include an intermediate three-dimensional squeeze layer, a first intermediate normalization layer, an intermediate three-dimensional padding layer, a first intermediate three-dimensional expand layer, a second intermediate three-dimensional expand layer, an intermediate concatenation layer, a second intermediate normalization layer, an intermediate activation layer, and an intermediate combination layer. The final layer group may include a final dropout layer, a final three-dimensional convolution layer, a final activation layer, a final normalization layer, a final three-dimensional downsampling layer, and a final flatten layer.
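The layer ordering in the abstract can be sketched as plain data. This is a minimal illustration, not the patented implementation: the layer names paraphrase the abstract, the helper `layer_sequence` and its parameter are hypothetical, and the abstract itself allows for "other layers/layer groups" that are not modeled here.

```python
# Layer names paraphrase the abstract; the identifiers below are hypothetical.
PRELIMINARY = [
    "input",
    "3d_padding",
    "3d_convolution",
    "activation",
    "normalization",
    "downsampling",
]

# Per claim 1, both expand layers convolve the same squeezed output,
# and their results are joined by the concatenation layer.
INTERMEDIATE = [
    "3d_squeeze",
    "normalization_1",
    "3d_padding",
    "3d_expand_1",
    "3d_expand_2",
    "concatenation",
    "normalization_2",
    "activation",
    "combination",
]

FINAL = [
    "dropout",
    "3d_convolution",
    "activation",
    "normalization",
    "3d_downsampling",
    "flatten",
]


def layer_sequence(num_intermediate_groups=1):
    """Return the layer names in the order listed by the abstract.

    The abstract speaks of "one or more" intermediate layer groups, so at
    least one group is required here.
    """
    if num_intermediate_groups < 1:
        raise ValueError("at least one intermediate layer group is expected")
    layers = [f"preliminary/{name}" for name in PRELIMINARY]
    for index in range(1, num_intermediate_groups + 1):
        layers += [f"intermediate_{index}/{name}" for name in INTERMEDIATE]
    layers += [f"final/{name}" for name in FINAL]
    return layers


if __name__ == "__main__":
    for name in layer_sequence(num_intermediate_groups=2):
        print(name)
```

With two intermediate groups the sketch lists 30 layers (6 preliminary, 2 × 9 intermediate, 6 final). Kernel sizes, channel counts, and the specific activation and normalization functions are not given in the abstract, so they are left out rather than guessed.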

First claim

Opening claim text (preview).

What is claimed is:

1. A three-dimensional convolutional neural network system for video highlight detection, the system comprising: one or more physical processors configured by machine-readable instructions to:
access video content, the video content having a duration;
segment the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
input the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments: accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels, increases the dimensionality of the video segment map; convolves the video segment map to produce a first set of feature maps; applies a first activating function to the first set of feature maps; normalizes the first set of feature maps; and downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments: receives a first output from a layer preceding the individual intermediate layer group: convolves the first output to reduce a number of channels of the first output; normalizes the first output; increases the dimensionality of the first output; convolves the first output to produce a second set of feature maps; convolves the first output to produce a third set of feature maps; concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps; normalizes the set of concatenated feature maps; applies a second activating function to the set of concatenated feature maps; and combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments: receives a second output from a layer preceding the final layer group; reduces an overfitting from the second output; convolves the second output to produce a fourth set of feature maps; applies a third activating function to the fourth set of feature maps; normalizes the fourth set of feature maps; downsamples the fourth set of feature maps; and converts the fourth set of feature maps into a spatiotemporal feature vector;
input the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors; and
determine a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors.

2. The system of claim 1, wherein individual predicted spatiotemporal feature vectors corresponding to the individual video segments characterizes a prediction of a video segment following the individual video segments within the duration.

3. The system of claim 1, wherein individual predicted spatiotemporal feature vectors for the individual video segments characterizes a prediction of a video segment preceding the individual video segments within the duration.

4. The system of claim 2, wherein: the first set of spatiotemporal feature vectors includes a first spatiotemporal feature vector corresponding to the first video segment and a second spatiotemporal feature vector corresponding to the second video segment; the first set of predicted spatiotemporal feature vectors includes a first predicted spatiotemporal feature vector determined based on the first spatiotemporal feature vector, the first predicted spatiotemporal feature vector characterizing a prediction of the second video segment; and the comparison of the one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with the one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors includes a comparison of the second spatiotemporal feature vector with the first predicted spatiotemporal feature vector.

5. The system of claim 1, wherein the presence of the highlight moment within the video content is determined based on a difference between the one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors and the one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors meeting or being below a threshold.

6. The system of claim 1, wherein the one or more physical processors are further configured by machine-readable instructions to input two or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors into a categorization layer, the categorization layer determining a category for the video content based on the two or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors.

7. The system of claim 1, wherein the first number of video frames includes sixteen video frames.

8. The system of claim 1, wherein the one or more physical processors are further configured by machine-readable instructions to: segment the video content into a second set of video segments, individual video segments within the second set of video segments including a second number of video frames, the second number of video frames being different from the first number of video frames; input the second set of video segments into a second three-dimensional convolutional neural network, the second three-dimensional convolutional neural network outputting a second set of spatiotemporal feature vectors corresponding to the second set of video segments; input the second set of spatiotemporal feature vectors into the long short-term memory network, the long short-term memory network determining a second set of predicted spatiotemporal feature vectors based on the second set of spatiotemporal feature vectors; and determine the presence of the highlight moment within the video content further based on a comparison of one or more spatiotemporal feature vectors of the second set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the second set of predicted spatiotemporal feature vectors.

9. The system of claim 1, wherein the first three-dimensional convolutional neural network is initialized with pre-trained weights from a trained two-dimensional convolutional neural network, the pre-trained weights from the trained two-dimensional convolutional neural network being stacked along a time dimension.

10. The system of claim 1, wherein the long short-term memory network is trained with second video content including highlights.

11. A method for using a three-dimensional convolutional neural network for video highlight detection, the method comprising: accessing video content, the video content having a duration; segmenting the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video
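The segmentation and comparison steps recited in claims 1, 2, 4, 5, and 7 can be illustrated with a small sketch. It does not implement the three-dimensional convolutional neural network or the long short-term memory network; the feature vectors are plain lists standing in for their outputs. The function names, the Euclidean distance, and the choice to drop a trailing partial segment are assumptions made for illustration and are not stated in the claims.

```python
import math


def segment_frames(frames, frames_per_segment=16):
    """Split a frame sequence into consecutive fixed-length segments.

    Sixteen frames per segment follows claim 7. Assumption: trailing frames
    that do not fill a whole segment are dropped.
    """
    if frames_per_segment < 1:
        raise ValueError("frames_per_segment must be positive")
    usable = len(frames) - len(frames) % frames_per_segment
    return [
        frames[start:start + frames_per_segment]
        for start in range(0, usable, frames_per_segment)
    ]


def euclidean_distance(a, b):
    """Difference measure between two feature vectors (an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def highlight_flags(features, predicted, threshold):
    """Compare each segment's features with the prediction made for it.

    features[i]  -- stand-in for the feature vector of segment i
    predicted[i] -- stand-in for the prediction of segment i + 1, derived
                    from segment i (the arrangement in claims 2 and 4)
    Returns one flag per segment from the second onward. A flag is True when
    the difference meets or is below the threshold, as worded in claim 5.
    """
    flags = []
    for actual_next, prediction in zip(features[1:], predicted):
        flags.append(euclidean_distance(actual_next, prediction) <= threshold)
    return flags


if __name__ == "__main__":
    segments = segment_frames(list(range(40)))
    print(len(segments), [len(s) for s in segments])

    features = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
    predicted = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
    print(highlight_flags(features, predicted, threshold=0.5))
```

In this toy run, the 40 frames yield two 16-frame segments, and the flags come out as `[True, False]`: the second segment matches its prediction exactly, while the third differs from its prediction by more than the threshold.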

Assignees

GoPro Inc

Inventors

Médioni Tom

Classifications

G06V10/82 · Primary
using neural networks · CPC title
G06V10/764
using classification, e.g. of video objects · CPC title
G06N3/08
Learning methods · CPC title
G06N3/044
Recurrent networks, e.g. Hopfield networks · CPC title
G06F18/24143
Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN] · CPC title

Patent family

Related publications grouped by family.

View patent family 60452100

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9836853B1 cover?: A three-dimensional convolutional neural network may include a preliminary layer group, one or more intermediate layer groups, a final layer group, and/or other layers/layer groups. The preliminary layer group may include an input layer, a preliminary three-dimensional padding layer, a preliminary three-dimensional convolution layer, a preliminary activation layer, a preliminary normalization l…
Who is the assignee on this patent?: GoPro Inc