Methods and software for detecting objects in an image using a contextual multiscale fast region-based convolutional neural network

US10354159B2 · US · B2

Patent metadata
Publication number: US-10354159-B2
Application number: US-201715697015-A
Country: US
Kind code: B2
Filing date: Sep 6, 2017
Priority date: Sep 6, 2016
Publication date: Jul 16, 2019
Grant date: Jul 16, 2019

Abstract

Official abstract text for this publication.

Methods of detecting an object in an image using a convolutional neural-network-based architecture that processes multiple feature maps of differing scales from differing convolution layers within a convolutional network to create a regional-proposal bounding box. The bounding box is projected back to the feature maps of the individual convolution layers to obtain a set of regions of interest (ROIs) and a corresponding set of context regions that provide additional context for the ROIs. These ROIs and context regions are processed to create a confidence score representing a confidence that the object detected in the bounding box is the desired object. These processes allow the method to utilize deep features encoded in both the global and the local representation for object regions, allowing the method to robustly deal with challenges in the problem of object detection. Software for executing the disclosed methods within an object-detection system is also disclosed.
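The abstract describes combining feature maps of differing scales, and the claims spell this out as pooling, normalizing, concatenating, and dimensionally reducing. The following is a minimal pure-Python sketch of that sequence on toy data, not the patented implementation. It assumes feature maps stored as nested lists (height x width x channels), 2x2 max pooling, the standard Euclidean (L2) norm per pixel with an optional per-channel scale `gamma`, and a per-pixel linear projection as the reduction step. The function names, the toy values, and the projection weights are all hypothetical.

```python
import math

def max_pool_2x2(fmap):
    """Downsample an H x W x C map by taking the max over each 2x2 window."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[max(fmap[2 * i + di][2 * j + dj][k]
                  for di in (0, 1) for dj in (0, 1))
              for k in range(c)]
             for j in range(w // 2)]
            for i in range(h // 2)]

def l2_normalize(fmap, gamma=None, eps=1e-12):
    """Scale each pixel's channel vector to unit L2 length, then apply gamma."""
    c = len(fmap[0][0])
    gamma = gamma if gamma is not None else [1.0] * c
    out = []
    for row in fmap:
        out_row = []
        for pixel in row:
            norm = math.sqrt(sum(v * v for v in pixel)) + eps
            out_row.append([g * v / norm for g, v in zip(gamma, pixel)])
        out.append(out_row)
    return out

def concat_channels(maps):
    """Concatenate same-sized maps along the channel axis."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum((m[i][j] for m in maps), []) for j in range(w)]
            for i in range(h)]

def reduce_channels(fmap, weights):
    """Assumption: a per-pixel linear projection (a 1x1 convolution without
    bias) stands in for the 'dimensionally reducing' step."""
    return [[[sum(wk * v for wk, v in zip(w_row, pixel)) for w_row in weights]
             for pixel in row]
            for row in fmap]

# Toy inputs (hypothetical): a 4x4x2 earlier-layer map and a 2x2x3 later-layer map.
early = [[[float(i + j), 1.0] for j in range(4)] for i in range(4)]
late = [[[3.0, 4.0, 0.0] for _ in range(2)] for _ in range(2)]

pooled_early = max_pool_2x2(early)                 # 2x2x2, same grid as `late`
normed = [l2_normalize(pooled_early), l2_normalize(late)]
fused = concat_channels(normed)                    # 2x2x5
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],              # hypothetical 5 -> 2 projection
           [0.0, 0.0, 0.0, 1.0, 0.0]]
reduced = reduce_channels(fused, weights)          # 2x2x2
```

Normalizing before concatenation keeps a layer with large activations from dominating the fused vector. That is the usual motivation for this step, stated here as general background rather than as a quotation from the patent.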

First claim

Opening claim text (preview).

What is claimed is:

1. A method of processing an image to detect the presence of one or more objects of a desired classification in the image, the method being performed in an object-detection system and comprising:
    receiving the image and storing the image in computer memory;
    sequentially convolving the image in a series of at least two convolution layers to create a corresponding series of feature maps of differing scales;
    pooling at least one of the feature maps to create a corresponding at least one pooled feature map;
    normalizing, relative to one another, the at least one pooled feature map and each of the feature maps not pooled to create a series of normalized feature maps;
    concatenating the series of normalized feature maps together with one another to create a concatenated feature map;
    dimensionally reducing the concatenated feature map to create a dimensionally reduced feature map;
    processing the dimensionally reduced feature map in a first set of fully connected layers to create a proposal comprising a bounding box corresponding to a suspected object of the desired classification in the image and an objectness score for the suspected object, wherein the first set of fully connected layers has been trained on the desired classification;
    if the objectness score exceeds a predetermined threshold, then projecting the bounding box back to each of the at least two feature maps to identify a region of interest in each of the at least two feature maps;
    identify a context region for each region of interest;
    pooling each of the regions of interest to create a corresponding pooled region of interest;
    pooling each of the context regions to create a corresponding pooled context region;
    normalizing, relative to one another, the pooled regions of interest to create a set of normalized regions of interest;
    normalizing, relative to one another, the pooled context regions to create a set of normalized context regions;
    concatenating the normalized regions of interest with one another to create a concatenated region of interest;
    concatenating the normalized context regions with one another to create a concatenated context region;
    dimensionally reducing the concatenated region of interest to create a dimensionally reduced region of interest;
    dimensionally reducing the concatenated context region to create a dimensionally reduced context region;
    processing the dimensionally reduced region of interest and the dimensionally reduced context region in a second set of fully connected layers to generate a determined classification for the region of interest, wherein the second set of fully connected layers is trained on the desired classification; and
    if the determined classification corresponds to the desired classification, then annotating the image with an identification of the bounding box and storing the image and the identification in the computer memory.

2. The method according to claim 1, wherein the normalizing of the at least one pooled feature map and each of the feature maps not pooled is performed using an L2 normalization.

3. The method according to claim 2, wherein the normalization is performed within each pixel and each of the at least two feature maps is treated independently as follows:

    $$\hat{x} = \frac{x}{\|x\|_2}, \qquad \|x\|_2 = \left( \sum_{i=1}^{d} |x_i| \right)^{1/2}$$

    wherein $x$ and $\hat{x}$ stand for a corresponding original pixel vector and a corresponding normalized pixel vector, respectively, and $d$ stands for a number of channels in each feature map tensor.

4. The method according to claim 3, further comprising training the object detection system, wherein during the training, scaling factors $\gamma_i$ are updated to readjust the scale of the normalized features according to:

    $$y_i = \gamma_i \hat{x}_i$$

    wherein $\gamma_i$ stands for the re-scaled feature value.

5. The method according to claim 1, wherein the processing of the convolved region of interest to generate a determined classification includes using a softmax function.

6. The method according to claim 1, wherein the desired classification is a human face.

7. The method according to claim 6, wherein each context region is located based on likelihood of containing at least a portion of a human body.

8. The method according to claim 7, wherein the human body is assumed to have a fixed spatial relation the human face.

9. The method according to claim 8, wherein the fixed spatial relation has the human body vertical.

10. The method according to claim 9, wherein the fixed spatial relation is represented by a set of parameters as follows:

    $$t_x = (x_b - x_f) / w_f$$
    $$t_y = (y_b - y_f) / h_f$$
    $$t_w = \log(w_b / w_f)$$
    $$t_h = \log(h_b / h_f)$$

    wherein $x_{(*)}$, $y_{(*)}$, $w_{(*)}$, and $h_{(*)}$ denote the two coordinates of the box center, width, and height respectively, $b$ and $f$ stand for the human body and the human face, respectively, and $t_x$, $t_y$, $t_w$, and $t_h$ are the parameters.

11. The method according to claim 1, wherein the annotating of the image to identify the bounding box includes adding a visual depiction of the bounding box to the image.

12. A computer-readable storage medium containing computer-executable instructions for performing a method of processing an image to detect the presence of one or more objects of a desired classification in the image, the method for being performed in an object-detection system executing the computer-executable instructions and comprising:
    receiving the image and storing the image in computer memory;
    sequentially convolving the image in a series of at least two convolution layers to create a corresponding series of feature maps of differing scales;
    pooling at least one of the feature maps to create a corresponding at least one pooled feature map;
    normalizing, relative to one another, the at least one pooled feature map and each of the feature maps not pooled to create a series of normalized feature maps;
    concatenating the series of normalized feature maps together with one another to create a concatenated feature map;
    dimensionally reducing the concatenated feature map to create a dimensionally reduced feature map;
    processing the dimensionally reduced feature map in a first set of fully connected layers to create a proposal comprising a bounding box corres…
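Claim 10 expresses the body-to-face spatial relation as four parameters computed from box centers, widths, and heights. The sketch below evaluates those four formulas for one pair of boxes and, as an added assumption not recited in the claim, inverts them to place a body (context) box from a face box. The function names and the box values are hypothetical, and boxes are assumed to be (center x, center y, width, height) tuples.

```python
import math

def encode_body_offset(face, body):
    """Compute (t_x, t_y, t_w, t_h) relating a body box to a face box.

    Both boxes are (center_x, center_y, width, height) tuples.
    """
    xf, yf, wf, hf = face
    xb, yb, wb, hb = body
    return ((xb - xf) / wf,
            (yb - yf) / hf,
            math.log(wb / wf),
            math.log(hb / hf))

def decode_body_box(face, t):
    """Assumption: algebraic inverse of the encoding, used here to place a
    body box from a face box and a fixed parameter set."""
    xf, yf, wf, hf = face
    tx, ty, tw, th = t
    return (xf + tx * wf,
            yf + ty * hf,
            wf * math.exp(tw),
            hf * math.exp(th))

# Hypothetical boxes: the body is centered below the face and is larger.
face = (100.0, 80.0, 40.0, 50.0)
body = (100.0, 230.0, 120.0, 250.0)

t = encode_body_offset(face, body)
recovered = decode_body_box(face, t)
```

Dividing the center offsets by the face size and taking logarithms of the size ratios makes the parameters independent of how large the face appears in the image, so one fixed parameter set can be reused across face scales.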

Assignees

Univ Carnegie Mellon

Inventors

Classifications

  • using neural networks · CPC title

  • using classification, e.g. of video objects · CPC title

  • G06V40/161 · Primary

    Detection; Localisation; Normalisation · CPC title

  • Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title

  • Combinations of networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10354159B2 cover?
Methods of detecting an object in an image using a convolutional neural-network-based architecture that processes multiple feature maps of differing scales from differing convolution layers within a convolutional network to create a regional-proposal bounding box. The bounding box is projected back to the feature maps of the individual convolution layers to obtain a set of regions of interest (…
Who is the assignee on this patent?
Univ Carnegie Mellon
What technology area does this patent fall under?
Primary CPC classification G06V40/161. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 16, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).