Method for Implementing a High-Level Image Representation for Image Analysis
US-2017220864-A1 · Aug 3, 2017 · US
US10068138B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10068138-B2 |
| Application number | US-201615189996-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 22, 2016 |
| Priority date | Sep 17, 2015 |
| Publication date | Sep 4, 2018 |
| Grant date | Sep 4, 2018 |
Official abstract text for this publication.
Devices, systems, and methods for computer recognition of action in video obtain frame-level feature sets of visual features that were extracted from respective frames of a video, wherein the respective frame-level feature set of a frame includes the respective visual features that were extracted from the frame; generate first-level feature sets, wherein each first-level feature set is generated by pooling the visual features from two or more frame-level feature sets, and wherein each first-level feature set includes pooled features; and generate second-level feature sets, wherein each second-level feature set is generated by pooling the pooled features in two or more first-level feature sets, wherein each second-level feature set includes pooled features.
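The pooling hierarchy described in the abstract can be illustrated with a minimal sketch. It assumes each frame-level feature set is a fixed-length list of floats and that temporally adjacent sets are pooled element-wise with a sliding window. The names `pool_group`, `pool_level`, `build_pyramid`, `window`, and `stride` are hypothetical and do not come from the patent, and the window size and stride are illustrative choices rather than values the document specifies.

```python
from typing import Callable, List, Sequence

FeatureSet = List[float]  # one frame-level or pooled feature vector


def mean(values: Sequence[float]) -> float:
    """Average pooling reducer."""
    return sum(values) / len(values)


def pool_group(group: Sequence[FeatureSet],
               reduce: Callable[[Sequence[float]], float] = max) -> FeatureSet:
    """Pool several equal-length feature sets element-wise into one set."""
    return [reduce(values) for values in zip(*group)]


def pool_level(level: Sequence[FeatureSet], window: int = 2, stride: int = 1,
               reduce: Callable[[Sequence[float]], float] = max) -> List[FeatureSet]:
    """Pool temporally adjacent feature sets; output keeps temporal order."""
    if len(level) < window:
        return []
    return [pool_group(level[i:i + window], reduce)
            for i in range(0, len(level) - window + 1, stride)]


def build_pyramid(frame_level: Sequence[FeatureSet], num_pooled_levels: int = 2,
                  window: int = 2, stride: int = 1,
                  reduce: Callable[[Sequence[float]], float] = max
                  ) -> List[List[FeatureSet]]:
    """Return [frame level, first pooled level, second pooled level, ...]."""
    levels = [list(frame_level)]
    for _ in range(num_pooled_levels):
        pooled = pool_level(levels[-1], window, stride, reduce)
        if not pooled:  # too few sets left to pool again
            break
        levels.append(pooled)
    return levels


# Four frames, two visual features per frame (made-up toy values).
frames = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0], [2.0, 6.0]]
levels = build_pyramid(frames)                      # maximum pooling
avg_levels = build_pyramid(frames, reduce=mean)     # average pooling
print(levels[1])  # first-level feature sets
print(levels[2])  # second-level feature sets
```

With a stride of 1, neighbouring groups overlap, so each group still contains at least one feature set that the previous group lacks. Each additional level summarizes a longer span of frames, which is one way to read the different temporal scales mentioned in the claims.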
Opening claim text (preview).
What is claimed is:
1. A device comprising: one or more computer-readable media; and one or more processors that are coupled to the computer-readable media and that are configured to cause the device to obtain frame-level feature sets of visual features that were extracted from respective frames of a video, wherein the respective frame-level feature set of a frame includes the respective visual features that were extracted from the frame, generate first-pooled-level feature sets, wherein each first-pooled-level feature set is generated by pooling the visual features from two or more frame-level feature sets, wherein each first-pooled-level feature set includes pooled features, and wherein a pooled feature in the first-pooled-level feature sets is generated by pooling two or more visual features, generate second-pooled-level feature sets, wherein each second-pooled-level feature set is generated by pooling the pooled features in two or more first-pooled-level feature sets, wherein each second-pooled-level feature set includes pooled features, and wherein a pooled feature in the second-pooled-level feature sets is generated by pooling two or more pooled features in the first-pooled-level feature sets, obtain trajectory features that were extracted from the video; fuse the trajectory features with at least some of the pooled features in the first-pooled-level feature sets and the pooled features in the second-pooled-level feature sets, thereby generating fused features; train classifiers based on the fused features; obtain a test video; and classify the test video using the trained classifiers.
2. The device of claim 1, wherein the frames of the video are arranged in a temporal order, and wherein the frame-level feature sets, the first-pooled-level feature sets, and the second-pooled-level feature sets maintain the temporal order.
3. The device of claim 2, wherein pooling the visual features from two or more frame-level feature sets includes pooling the respective frame-level feature sets of frames that are adjacent to each other in the temporal order.
4. The device of claim 1, wherein each first-pooled-level feature set is generated by pooling the visual features from only two frame-level feature sets.
5. The device of claim 1, wherein each first-pooled-level feature set is generated by pooling the visual features from three or more frame-level feature sets.
6. The device of claim 1, wherein the one or more processors are further configured to cause the device to generate third-pooled-level feature sets, wherein each third-pooled-level feature set is generated by pooling the pooled features in two or more second-pooled-level feature sets, wherein each third-pooled-level feature sets includes pooled features.
7. The device of claim 1, wherein the first-pooled-level feature sets describe the frames of the video in a first temporal scale, the second-pooled-level feature sets describe the frames of the video in a second temporal scale, and the first temporal scale is different from the second temporal scale.
8. The device of claim 7, where the frame-level feature sets describe the frames of the video in a third temporal scale that is different from both the first temporal scale and the second temporal scale.
9. The device of claim 1, wherein the pooling of two or more visual features uses minimum pooling, maximum pooling, or average pooling.
10. A method comprising: obtaining frame-level feature sets of visual features that were extracted from respective frames of a video, wherein the respective frame-level feature set of each frame includes the respective visual features that were extracted from the frame; pooling the visual features from a first group of two or more frame-level feature sets, thereby generating a first first-level feature set, wherein the first first-level feature set includes pooled features, and wherein at least some of the pooled features in the first first-level feature set were each generated by pooling two or more respective visual features; pooling the visual features from a second group of two or more frame-level feature sets, thereby generating a second first-level feature set, wherein the second first-level feature set includes pooled features, wherein the second group of two or more frame-level feature sets includes a least one feature set that is not included in the first group of two or more frame-level feature sets, and wherein at least some of the pooled features in the second first-level feature set were each generated by pooling two or more respective visual features; pooling the pooled features in the first first-level feature set and the second first-level feature set, thereby generating a first second-level feature set, wherein the first second-level feature set includes pooled features that were pooled from the pooled features in first first-level feature set and from the pooled features in the second first-level feature set, and wherein at least some of the pooled features in the first second-level feature set were each generated by pooling at least one respective pooled feature in the first first-level feature set with at least one respective pooled feature in the second first-level feature set; training first classifiers based on one or more of the pooled features in the first first-level feature set, on the pooled features in the second first-level feature set, and on the pooled features in the first second-level feature set; obtaining trajectory features that were extracted from the video; training second classifiers based on the trajectory features; generating combined classifiers based on the first classifiers and the second classifiers; obtaining a second video; and classifying the second video using the combined classifiers.
11. The method of claim 10, wherein the pooling is average pooling or max pooling.
12. The method of claim 10, wherein training the first classifiers is further based on the visual features in the frame-level feature sets.
13. One or more computer-readable storage media storing instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations comprising: obtaining frame-level feature sets of visual features that were extracted from respective frames of a video, wherein the respective frame-level feature set of each frame includes the respective visual features that were extracted from the frame; pooling the visual features from a first group of two or more frame-level feature sets, thereby generating a first lower-pooled-level feature set, wherein the first lower-pooled-level feature set includes pooled features that were aggregated from the respective visual features of different frames, and wherein at least some of the pooled features in the first lower-pooled-level feature set were each generated by pooling two or more visual features into a single pooled feature; pooling the visual features from a second group of two or more frame-level feature sets, thereby generating a second lower-pooled-level feature set, wherein the second lower-pooled-level feature set includes pooled features that were aggregated from the respective visual features of different frames, and wherein at least some of the pooled features in the second lower-pooled-level feature set were each generated by pooling two or more visual features into a single pooled feature; pooling the pooled features in the first lower-pooled-level feature set and the second lower-pooled-level feature set, thereby generating a higher-pooled-level feature set, wherein the higher-pooled-level feature set includes pooled features that were aggregated from t
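Claims 1 and 10 bring in trajectory features in two different ways: claim 1 fuses them with pooled features before training, while claim 10 trains separate classifiers and then combines them. The sketch below contrasts the two under loudly stated assumptions. Fusion is modelled as plain concatenation, the classifier is a toy nearest-centroid stand-in, and classifier combination is a sum of per-label scores. None of these choices is stated in the claims, and every name and data value here is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Model = Dict[str, Vector]  # action label -> centroid


def fuse(pooled: Sequence[float], trajectory: Sequence[float]) -> Vector:
    """Feature-level fusion by concatenation (an assumed fusion rule)."""
    return list(pooled) + list(trajectory)


def train_centroids(samples: Sequence[Tuple[Sequence[float], str]]) -> Model:
    """Toy stand-in classifier: one mean vector per action label."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def scores(model: Model, vec: Sequence[float]) -> Dict[str, float]:
    """Negative squared distance to each centroid; higher means closer."""
    return {label: -sum((a - b) ** 2 for a, b in zip(centroid, vec))
            for label, centroid in model.items()}


def classify(model: Model, vec: Sequence[float]) -> str:
    """Claim 1 style: a single classifier over fused features."""
    s = scores(model, vec)
    return max(s, key=s.get)


def classify_combined(pooled_model: Model, pooled: Sequence[float],
                      traj_model: Model, trajectory: Sequence[float]) -> str:
    """Claim 10 style: combine two classifiers by summing their scores."""
    sp, st = scores(pooled_model, pooled), scores(traj_model, trajectory)
    combined = {label: sp[label] + st[label] for label in sp}
    return max(combined, key=combined.get)


# Made-up training data: (pooled features, trajectory features, label).
training = [
    ([1.0, 0.0], [0.9], "wave"), ([0.8, 0.2], [1.1], "wave"),
    ([0.0, 1.0], [3.0], "run"), ([0.2, 0.8], [2.8], "run"),
]
fused_model = train_centroids([(fuse(p, t), y) for p, t, y in training])
pooled_model = train_centroids([(p, y) for p, _, y in training])
traj_model = train_centroids([(t, y) for _, t, y in training])

test_pooled, test_traj = [0.1, 0.9], [3.1]
print(classify(fused_model, fuse(test_pooled, test_traj)))
print(classify_combined(pooled_model, test_pooled, traj_model, test_traj))
```

In practice the pooled features would come from the first-level and second-level feature sets, and the toy classifier would be replaced by whatever classifier a real system trains.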
using metadata automatically derived from the content · CPC title
Physics · mapped topic
Detecting features for summarising video content · CPC title