Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
US-2018211099-A1 · Jul 26, 2018 · US
US10354159B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10354159-B2 |
| Application number | US-201715697015-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 6, 2017 |
| Priority date | Sep 6, 2016 |
| Publication date | Jul 16, 2019 |
| Grant date | Jul 16, 2019 |
Official abstract text for this publication.
Methods of detecting an object in an image using a convolutional neural-network-based architecture that processes multiple feature maps of differing scales from differing convolution layers within a convolutional network to create a regional-proposal bounding box. The bounding box is projected back to the feature maps of the individual convolution layers to obtain a set of regions of interest (ROIs) and a corresponding set of context regions that provide additional context for the ROIs. These ROIs and context regions are processed to create a confidence score representing a confidence that the object detected in the bounding box is the desired object. These processes allow the method to utilize deep features encoded in both the global and the local representation for object regions, allowing the method to robustly deal with challenges in the problem of object detection. Software for executing the disclosed methods within an object-detection system is also disclosed.
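The step of projecting a bounding box back to the feature maps of individual convolution layers can be illustrated with a minimal sketch. It assumes each layer is described only by its total downsampling stride; the stride values, the `project_box` and `project_to_layers` names, and the outward rounding are illustrative assumptions, not details taken from the patent.

```python
import math
from typing import List, Tuple

# (x_min, y_min, x_max, y_max): pixels in the image, cells on a feature map.
Box = Tuple[float, float, float, float]
CellBox = Tuple[int, int, int, int]


def project_box(box: Box, stride: int) -> CellBox:
    """Map an image-space box to cell coordinates on one feature map.

    `stride` is the assumed total downsampling factor of that layer.
    The box is rounded outward so the ROI covers every cell it touches.
    """
    x0, y0, x1, y1 = box
    return (
        math.floor(x0 / stride),
        math.floor(y0 / stride),
        math.ceil(x1 / stride),
        math.ceil(y1 / stride),
    )


def project_to_layers(box: Box, strides: List[int]) -> List[CellBox]:
    """Project one proposal onto several feature maps of differing scales."""
    return [project_box(box, s) for s in strides]


# Hypothetical proposal and hypothetical strides for three convolution layers.
proposal = (64.0, 32.0, 128.0, 96.0)
rois = project_to_layers(proposal, strides=[4, 8, 16])
```

Each entry of `rois` is the same proposal expressed in the coordinate grid of one layer, which is what lets ROIs from layers of differing scales be pooled to a common size before they are normalized and concatenated.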
Opening claim text (preview).
What is claimed is:

1. A method of processing an image to detect the presence of one or more objects of a desired classification in the image, the method being performed in an object-detection system and comprising: receiving the image and storing the image in computer memory; sequentially convolving the image in a series of at least two convolution layers to create a corresponding series of feature maps of differing scales; pooling at least one of the feature maps to create a corresponding at least one pooled feature map; normalizing, relative to one another, the at least one pooled feature map and each of the feature maps not pooled to create a series of normalized feature maps; concatenating the series of normalized feature maps together with one another to create a concatenated feature map; dimensionally reducing the concatenated feature map to create a dimensionally reduced feature map; processing the dimensionally reduced feature map in a first set of fully connected layers to create a proposal comprising a bounding box corresponding to a suspected object of the desired classification in the image and an objectness score for the suspected object, wherein the first set of fully connected layers has been trained on the desired classification; if the objectness score exceeds a predetermined threshold, then projecting the bounding box back to each of the at least two feature maps to identify a region of interest in each of the at least two feature maps; identify a context region for each region of interest; pooling each of the regions of interest to create a corresponding pooled region of interest; pooling each of the context regions to create a corresponding pooled context region; normalizing, relative to one another, the pooled regions of interest to create a set of normalized regions of interest; normalizing, relative to one another, the pooled context regions to create a set of normalized context regions; concatenating the normalized regions of interest with one another to create a concatenated region of interest; concatenating the normalized context regions with one another to create a concatenated context region; dimensionally reducing the concatenated region of interest to create a dimensionally reduced region of interest; dimensionally reducing the concatenated context region to create a dimensionally reduced context region; processing the dimensionally reduced region of interest and the dimensionally reduced context region in a second set of fully connected layers to generate a determined classification for the region of interest, wherein the second set of fully connected layers is trained on the desired classification; and if the determined classification corresponds to the desired classification, then annotating the image with an identification of the bounding box and storing the image and the identification in the computer memory.

2. The method according to claim 1, wherein the normalizing of the at least one pooled feature map and each of the feature maps not pooled is performed using an L2 normalization.

3. The method according to claim 2, wherein the normalization is performed within each pixel and each of the at least two feature maps is treated independently as follows: $\hat{x} = \frac{x}{\|x\|_2}$, $\|x\|_2 = \left(\sum_{i=1}^{d} x_i\right)^{1/2}$, wherein $x$ and $\hat{x}$ stand for a corresponding original pixel vector and a corresponding normalized pixel vector, respectively, and $d$ stands for a number of channels in each feature map tensor.

4. The method according to claim 3, further comprising training the object detection system, wherein during the training, scaling factors $\gamma_i$ are updated to readjust the scale of the normalized features according to: $y_i = \gamma_i \hat{x}_i$, wherein $\gamma_i$ stands for the re-scaled feature value.

5. The method according to claim 1, wherein the processing of the convolved region of interest to generate a determined classification includes using a softmax function.

6. The method according to claim 1, wherein the desired classification is a human face.

7. The method according to claim 6, wherein each context region is located based on likelihood of containing at least a portion of a human body.

8. The method according to claim 7, wherein the human body is assumed to have a fixed spatial relation the human face.

9. The method according to claim 8, wherein the fixed spatial relation has the human body vertical.

10. The method according to claim 9, wherein the fixed spatial relation is represented by a set of parameters as follows: $t_x = (x_b - x_f)/w_f$, $t_y = (y_b - y_f)/h_f$, $t_w = \log(w_b/w_f)$, $t_h = \log(h_b/h_f)$, wherein $x_{(*)}$, $y_{(*)}$, $w_{(*)}$, and $h_{(*)}$ denote the two coordinates of the box center, width, and height respectively, $b$ and $f$ stand for the human body and the human face, respectively, and $t_x$, $t_y$, $t_w$, and $t_h$ are the parameters.

11. The method according to claim 1, wherein the annotating of the image to identify the bounding box includes adding a visual depiction of the bounding box to the image.

12. A computer-readable storage medium containing computer-executable instructions for performing a method of processing an image to detect the presence of one or more objects of a desired classification in the image, the method for being performed in an object-detection system executing the computer-executable instructions and comprising: receiving the image and storing the image in computer memory; sequentially convolving the image in a series of at least two convolution layers to create a corresponding series of feature maps of differing scales; pooling at least one of the feature maps to create a corresponding at least one pooled feature map; normalizing, relative to one another, the at least one pooled feature map and each of the feature maps not pooled to create a series of normalized feature maps; concatenating the series of normalized feature maps together with one another to create a concatenated feature map; dimensionally reducing the concatenated feature map to create a dimensionally reduced feature map; processing the dimensionally reduced feature map in a first set of fully connected layers to create a proposal comprising a bounding box corres
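The formulas recited in claims 3, 4, and 10 can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the normalization formula as printed here shows no exponent on the channel values, so the code assumes the conventional L2 norm (square root of the sum of squared channel values), which is what claim 2 names. All function names are hypothetical, as is the small `eps` guard against division by zero, and `decode_body_from_face` is simply the algebraic inverse of the claim 10 parameters rather than something the claims recite.

```python
import math
from typing import List, Sequence, Tuple

# (center_x, center_y, width, height), matching the symbols in claim 10.
Box = Tuple[float, float, float, float]


def l2_normalize(pixel: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Normalize one pixel's channel vector by its L2 norm (assumed standard form)."""
    norm = math.sqrt(sum(v * v for v in pixel))
    return [v / max(norm, eps) for v in pixel]


def rescale(x_hat: Sequence[float], gamma: Sequence[float]) -> List[float]:
    """Apply per-channel scaling factors: y_i = gamma_i * x_hat_i."""
    return [g * v for g, v in zip(gamma, x_hat)]


def encode_body_relative_to_face(body: Box, face: Box) -> Box:
    """Compute (t_x, t_y, t_w, t_h) from a body box and a face box."""
    xb, yb, wb, hb = body
    xf, yf, wf, hf = face
    return (
        (xb - xf) / wf,
        (yb - yf) / hf,
        math.log(wb / wf),
        math.log(hb / hf),
    )


def decode_body_from_face(face: Box, t: Box) -> Box:
    """Invert the encoding: place a body (context) box from a face box."""
    xf, yf, wf, hf = face
    tx, ty, tw, th = t
    return (xf + tx * wf, yf + ty * hf, wf * math.exp(tw), hf * math.exp(th))


# Hypothetical numbers chosen only to exercise the functions.
x_hat = l2_normalize([3.0, 4.0])
y = rescale(x_hat, gamma=[10.0, 10.0])

face = (100.0, 50.0, 40.0, 40.0)
body = (100.0, 130.0, 120.0, 200.0)
t = encode_body_relative_to_face(body, face)
recovered = decode_body_from_face(face, t)
```

Because the parameters are expressed relative to the face box's width and height, one fixed set of `t` values can place a context region for faces of any size, which fits the claims' assumption of a fixed spatial relation between face and body.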
using neural networks · CPC title
using classification, e.g. of video objects · CPC title
Detection; Localisation; Normalisation · CPC title
Determination of region of interest [ROI] or a volume of interest [VOI] · CPC title
Combinations of networks · CPC title