End-to-end visual recognition system and methods

US9418317B2 · US · B2

Patent metadata
Publication number: US-9418317-B2
Application number: US-201414245159-A
Country: US
Kind code: B2
Filing date: Apr 4, 2014
Priority date: Jul 8, 2010
Publication date: Aug 16, 2016
Grant date: Aug 16, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

We describe an end-to-end visual recognition system, where “end-to-end” refers to the ability of the system of performing all aspects of the system, from the construction of “maps” of scenes, or “models” of objects from training data, to the determination of the class, identity, location and other inferred parameters from test data. Our visual recognition system is capable of operating on a mobile hand-held device, such as a mobile phone, tablet or other portable device equipped with sensing and computing power. Our system employs a video based feature descriptor, and we characterize its invariance and discriminative properties. Feature selection and tracking are performed in real-time, and used to train a template-based classifier during a capture phase prompted by the user. During normal operation, the system scores objects in the field of view based on their ranking.
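The train-then-rank flow described in the abstract can be illustrated with a minimal sketch. The abstract does not say how templates are compared or scored, so the cosine similarity, the `TemplateClassifier` class, and its `train` and `rank` methods are all assumptions made for illustration, not the patented method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class TemplateClassifier:
    """Hypothetical template store: each label keeps the descriptors captured for it."""

    def __init__(self):
        self.templates = {}  # label -> list of descriptor vectors

    def train(self, label, descriptor):
        # Capture phase: remember one more descriptor for the user-selected object.
        self.templates.setdefault(label, []).append(list(descriptor))

    def rank(self, descriptor):
        # Normal operation: score each label by its best-matching template,
        # then return (label, score) pairs from best to worst.
        scored = [
            (label, max(cosine(descriptor, t) for t in ts))
            for label, ts in self.templates.items()
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


clf = TemplateClassifier()
clf.train("mug", [1.0, 0.0, 0.0])
clf.train("book", [0.0, 1.0, 0.0])
ranked = clf.rank([0.9, 0.1, 0.0])
print(ranked[0][0])  # label of the best-scoring template
```

Taking the best match per label keeps the sketch short; averaging the scores over a label's templates would be an equally plausible choice.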

First claim

Opening claim text (preview).

What is claimed is:

1. A visual recognition apparatus for identifying objects captured in a video stream having a captured time period, the apparatus comprising: a hardware processor; and programming in a non-transitory computer readable medium and executable on the hardware processor for: capturing a video stream on an electronic device having an image sensor, said video stream comprising a plurality of temporally adjacent images; enabling a user of the electronic device to select, from said video stream, a target object or scene for training; associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points; and temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor; wherein said programming ranks image features according to their structural stability margin; and wherein said structural stability margin comprises a maximum norm of the nuisance that does not cause a singularity in the detection mechanism.

2. The apparatus recited in claim 1, wherein said temporal aggregating of statistics is performed by computing a mean, or median, or mode, or sample histogram of a contrast-invariant function of the image in said frames.

3. The apparatus recited in claim 1, wherein said programming performs steps comprising: spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive; exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.

4. The apparatus recited in claim 1, wherein said programming performs steps comprising: selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream; and performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales.

5. The apparatus recited in claim 4, wherein said plurality of features comprises a plurality of feature points.

6. The apparatus recited in claim 1, wherein said programming includes a canonization mechanism which does not rely on a co-variant detector.

7. The apparatus recited in claim 1, wherein said programming canonizes rotation in response to a gravity sensor signal.

8. The apparatus recited in claim 4, wherein said programming performs steps comprising: computing a co-variant region that is proximate to a feature point of said feature; computing a contrast invariant feature; and performing a temporal aggregation operation of a number of statistics computed on each image associated with the plurality of video frames over a time period.

9. The apparatus recited in claim 8, wherein the temporal aggregation operation comprises aggregating the contrast invariant feature at each video frame during the time period at the corresponding scale of a feature point of the feature.

10. A visual recognition method for identifying objects captured in a video stream having a captured time period, the method comprising: capturing a video stream on an electronic device having an image sensor, said video stream comprising a plurality of temporally adjacent image; enabling a user of the electronic device to select, from said video stream, a target object or scene for training; associating each frame in an image with a corresponding frame in temporally adjacent images, or in images taken from nearby vantage points; temporally aggregating statistics computed at one or more collections of temporally corresponding frames, into a descriptor; and ranking image features an according to their structural stability margin; wherein said structural stability margin comprises a maximum norm of the nuisance that does not cause a singularity in the detection mechanism; and wherein said method is performed by executing programming on at least one hardware processor, said programming residing on a non-transitory medium readable by the hardware processor.

11. The method recited in claim 10, wherein said aggregation is performed by computing a mean, or median, or mode, or sample histogram of a contrast-invariant function of the image in said frames.

12. The method recited in claim 10, further comprising: spatially aggregating such statistics into a representation that is insensitive to nuisance factor and distinctive; exploiting such a representation within a classification scheme to enable the detection, localization, recognition and categorization of objects and scenes in video; and displaying the result of the classification scheme by overlaying information on the live video stream, optionally localized and overlaid on the object of interest.

13. The method recited in claim 10, further comprising: selecting a plurality of features corresponding to translational, similarity, affine or more general reference frames from the video stream for objects in a field of view of the video stream; and performing such a selection at a plurality of scales, and using topological consistency across scale as a criterion for propagating said general reference frames across different scales.
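Claims 2 and 11 describe temporal aggregation as a mean, median, mode, or sample histogram of a contrast-invariant function of the image across corresponding frames. The sketch below shows one way this could look. It assumes gradient orientation as the contrast-invariant function (orientation is unchanged when intensities are scaled by a positive gain and shifted by an offset) and assumes the tracked pixel location in each frame is already known. The function names, the 8-bin histogram, and the central-difference gradient are illustrative choices, not details taken from the patent.

```python
import math


def gradient_orientation(img, r, c):
    """Orientation of the central-difference gradient at (r, c), in [0, 2*pi).

    img is a 2D list of intensities; (r, c) must not lie on the border.
    A zero gradient yields orientation 0.0.
    """
    gx = img[r][c + 1] - img[r][c - 1]
    gy = img[r + 1][c] - img[r - 1][c]
    return math.atan2(gy, gx) % (2 * math.pi)


def temporal_histogram(frames, track, bins=8):
    """Aggregate orientations at tracked locations over time into a normalized histogram.

    frames: list of 2D intensity arrays (one per video frame).
    track:  list of (row, col) positions of the same feature in each frame.
    """
    hist = [0.0] * bins
    for img, (r, c) in zip(frames, track):
        theta = gradient_orientation(img, r, c)
        # The trailing modulo guards against theta rounding up to exactly 2*pi.
        hist[int(theta / (2 * math.pi) * bins) % bins] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# A horizontal intensity ramp, and the same ramp under a gain/offset contrast change.
frame = [[float(c) for c in range(3)] for _ in range(3)]
brighter = [[3.0 * v + 10.0 for v in row] for row in frame]
descriptor = temporal_histogram([frame, brighter], [(1, 1), (1, 1)])
print(descriptor)
```

Both frames vote into the same orientation bin even though their contrast differs, which is the property the claims rely on when pooling statistics over time. Replacing the histogram with `statistics.median` over the per-frame orientations would give the median variant named in the same claims, with some care needed for angle wrap-around.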

Assignees

Inventors

Classifications

  • Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries · CPC title

  • Rule-based classification · CPC title

  • G06V10/462 · Primary

    Salient features, e.g. scale invariant feature transforms [SIFT] · CPC title

  • Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Univ California
What technology area does this patent fall under?
Primary CPC classification G06V10/462. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 16, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).