Video annotation using deep network architectures

US9330171B1 · US · B1

Patent metadata
Publication number: US-9330171-B1
Application number: US-201414161146-A
Country: US
Kind code: B1
Filing date: Jan 22, 2014
Priority date: Oct 17, 2013
Publication date: May 3, 2016
Grant date: May 3, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes receiving, by a processing device of a content sharing platform, a video content, selecting at least one video frame from the video content, subsampling the at least one video frame to generate a first representation of the at least one video frame, selecting a sub-region of the at least one video frame to generate a second representation of the at least one video frame, and applying a convolutional neuron network to the first and second representations of the at least one video frame to generate an annotation for the video content.
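
The two representations named in the abstract can be illustrated with a minimal sketch. It assumes a single grayscale frame stored as nested lists, stride-based subsampling, and a centered sub-region; the abstract specifies none of these choices, and the function names are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame as rows of pixel values (assumption)


def subsample(frame: Frame, factor: int) -> Frame:
    """Keep every `factor`-th pixel in both directions: full area, lower resolution."""
    return [row[::factor] for row in frame[::factor]]


def center_crop(frame: Frame, height: int, width: int) -> Frame:
    """Cut a centered sub-region out of the frame at its original resolution."""
    top = (len(frame) - height) // 2
    left = (len(frame[0]) - width) // 2
    return [row[left:left + width] for row in frame[top:top + height]]


def two_representations(frame: Frame, factor: int = 2) -> Tuple[Frame, Frame]:
    """Return (first, second) representations with matching pixel dimensions."""
    first = subsample(frame, factor)                       # whole frame, coarser
    second = center_crop(frame, len(first), len(first[0]))  # smaller area, full detail
    return first, second


# Toy 8x8 frame whose pixel value encodes its position (row * 8 + column).
frame = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
first, second = two_representations(frame)
```

Both outputs here are 4x4, so the same network layers could accept either one. Matching the two sizes is a convenience of this sketch, not something the abstract requires.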

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    receiving, by a processing device of a content sharing platform, a video content;
    selecting at least one video frame from the video content, wherein the at least one video frame covers a spatial area at a first resolution;
    subsampling the at least one video frame to generate a first representation of the at least one video frame, wherein the first representation is at a second resolution that is lower than the first resolution;
    selecting, at the first resolution, a sub-region of the at least one video frame to generate a second representation of the at least one video frame, wherein the sub-region covers a spatial area that is smaller than the spatial area covered by the at least one video frame; and
    executing a convolutional neuron network, using the first representation as a first input of the convolutional neuron network and using the second representation as a second input of the convolutional neuron network, to generate an annotation for the video content.

2. The method of claim 1, wherein the second representation is a fovea representation that is at a same spatial sampling rate as the at least one video frame.

3. The method of claim 1, wherein the at least one video frame is a single frame.

4. The method of claim 1, wherein the at least one video frame includes one of two consecutive video frames or at least two non-consecutive video frames.

5. The method of claim 1, wherein the convolutional neuron network includes at least one convolution layer, at least one pooling layer, and a connected neuron network.

6. The method of claim 5, wherein a first convolution layer and a first pooling layer are applied to a first number of video frames, and a second convolution layer and a second pooling layer are applied to a second number of video frames, and wherein the first number is different from the second number.

7. The method of claim 5, wherein an earlier layer of the convolutional neuron network is applied to a higher number of video frames than a later layer of the convolutional neuron network.

8. The method of claim 1, further comprising making the video content searchable according to the annotation.

9. A non-transitory machine-readable storage medium storing instructions which, when executed, cause a processing device to perform operations comprising:
    receiving a video content;
    selecting at least one video frame from the video content, wherein the at least one video frame covers a spatial area at a first resolution;
    subsampling the at least one video frame to generate a first representation of the at least one video frame, wherein the first representation is at a second resolution that is lower than the first resolution;
    selecting, at the first resolution, a sub-region of the at least one video frame to generate a second representation of the at least one video frame, wherein the sub-region covers a spatial area that is smaller than the spatial area covered by the at least one video frame; and
    executing a convolutional neuron network, using the first representation as a first input of the convolutional neuron network and using second representation as a second input of the convolutional neuron network, to generate an annotation for the video content.

10. The machine-readable storage medium of claim 9, wherein the second representation is a fovea representation that is at a same spatial sampling rate as the at least one video frame.

11. The machine-readable storage medium of claim 9, wherein the at least one video frame is a single frame.

12. The machined-readable storage medium of claim 9, wherein the at least one video frame includes one of at least two consecutive video frames or at least two non-consecutive video frames.

13. The machine-readable storage medium of claim 9, wherein the convolutional neuron network includes at least one convolution layer, at least one pooling layer, and a connected neuron network.

14. The machine-readable storage medium of claim 11, wherein a first convolution layer and a first pooling layer are applied to a first number of video frames, and a second convolution layer and a second pooling layer are applied to a second number of video frames, and wherein the first number is different from the second number.

15. A system comprising:
    a memory; and
    a processor, operatively coupled to the memory, to:
    receive a video content;
    select at least one video frame from the video content, wherein the at least one video frame covers a spatial area at a first resolution;
    subsample the at least one video frame to generate a first representation of the at least one video frame, wherein the first representation is at a second resolution that is lower than the first resolution;
    select, at the first resolution, a sub-region of the at least one video frame to generate a second representation of the at least one video frame, wherein the sub-region covers a spatial area that is smaller than the spatial area covered by the at least one video frame; and
    execute a convolutional neuron, using the first representation as a first input of the convolutional neuron network and using the second representation as a second input of the convolutional neuron network, to generate an annotation for the video content.

16. The system of claim 15, wherein the second representation is a fovea representation that is at a same spatial sampling rate as the at least one video frame.

17. The system of claim 15, wherein the convolutional neuron network includes at least one convolution layer, at least one pooling layer, and a connected neuron network.

18. The user device of claim 17, wherein a first convolution layer and a first pooling layer are applied to a first number of video frames, and a second convolution layer and a second pooling layer are applied to a second number of video frames, and wherein the first number is different from the second number.
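
The layer structure recited in claims 5, 13, and 17 (a convolution layer, a pooling layer, and a connected network) fed by two inputs can be sketched as a toy forward pass. This is an illustration under stated assumptions, not the patented implementation: the kernel and weights are hand-set rather than trained, the two streams share one kernel and are merged by concatenation, the annotation is the highest-scoring label, and all names and labels are hypothetical.

```python
from typing import List

Image = List[List[float]]


def conv2d_valid(image: Image, kernel: Image) -> Image:
    """Convolution layer: valid 2-D cross-correlation followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            max(0.0, sum(image[i + di][j + dj] * kernel[di][dj]
                         for di in range(kh) for dj in range(kw)))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(image: Image, size: int = 2) -> Image:
    """Pooling layer: non-overlapping max over size x size windows."""
    return [
        [
            max(image[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(image[0]) - size + 1, size)
        ]
        for i in range(0, len(image) - size + 1, size)
    ]


def stream_features(image: Image, kernel: Image) -> List[float]:
    """Run one input through convolution and pooling, then flatten."""
    pooled = max_pool(conv2d_valid(image, kernel))
    return [value for row in pooled for value in row]


def annotate(first: Image, second: Image, kernel: Image,
             weights: List[List[float]], labels: List[str]) -> str:
    """Connected layer over both streams; concatenation is an assumption."""
    features = stream_features(first, kernel) + stream_features(second, kernel)
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return labels[scores.index(max(scores))]


# Hand-set toy values (not trained parameters).
first = [[1.0] * 4 for _ in range(4)]    # stands in for the low-resolution input
second = [[0.0] * 4 for _ in range(4)]   # stands in for the sub-region input
kernel = [[1.0] * 3 for _ in range(3)]
weights = [[1.0, 0.0], [0.0, 1.0]]       # one row of weights per label
labels = ["label_a", "label_b"]          # hypothetical annotation labels
annotation = annotate(first, second, kernel, weights, labels)
```

With 4x4 inputs and a 3x3 kernel, each stream reduces to a single pooled value, so the connected layer sees two features. A practical network would stack several such layers and learn the kernel and weights from labeled video.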

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9330171B1 cover?
A method includes receiving, by a processing device of a content sharing platform, a video content, selecting at least one video frame from the video content, subsampling the at least one video frame to generate a first representation of the at least one video frame, selecting a sub-region of the at least one video frame to generate a second representation of the at least one video frame, and a…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G06F17/30784. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 3, 2016 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).