Method and apparatus for neural network training and construction and method and apparatus for object detection
US-2018032840-A1 · Feb 1, 2018 · US
US10424064B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10424064-B2 |
| Application number | US-201615296845-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 18, 2016 |
| Priority date | Oct 18, 2016 |
| Publication date | Sep 24, 2019 |
| Grant date | Sep 24, 2019 |
Official abstract text for this publication.
Certain aspects involve semantic segmentation of objects in a digital visual medium by determining a score for each pixel of the digital visual medium that is representative of a likelihood that each pixel corresponds to the objects associated with bounding boxes within the digital visual medium. An instance-level label that yields a label for each of the pixels of the digital visual medium corresponding to the objects is determined based, in part, on a collective probability map including the score for each pixel of the digital visual medium. In some aspects, the score for each pixel corresponding to each bounding box is determined by a prediction model trained by a neural network.
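The flow described in the abstract (per-box pixel scores, a collective probability map, then one instance-level label per pixel) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the function names, the `background_threshold` value, and the use of a plain per-pixel argmax. The claims describe weighting each box's probability map by its class score and, in one dependent claim, a conditional random field; the argmax below is a simplified stand-in for that labeling step, not the patented method.

```python
from typing import List

ProbMap = List[List[float]]  # prob_map[y][x] = likelihood that the pixel belongs to one box's object


def collective_probability_map(prob_maps: List[ProbMap],
                               class_scores: List[float]) -> List[List[List[float]]]:
    """Weight each box's probability map by its class score.

    Returns collective[y][x][k], the weighted score of pixel (x, y) for box k.
    """
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    return [
        [[score * pm[y][x] for pm, score in zip(prob_maps, class_scores)]
         for x in range(width)]
        for y in range(height)
    ]


def instance_labels(collective: List[List[List[float]]],
                    background_threshold: float = 0.5) -> List[List[int]]:
    """Assign each pixel the 1-based index of its best-scoring box, or 0 for background.

    Assumption: a per-pixel argmax with a fixed threshold replaces the
    conditional random field mentioned in the claims.
    """
    labels = []
    for row in collective:
        out_row = []
        for scores in row:
            best = max(range(len(scores)), key=scores.__getitem__)
            out_row.append(best + 1 if scores[best] >= background_threshold else 0)
        labels.append(out_row)
    return labels


# Two overlapping boxes of the same class on a 1 x 4 image (made-up scores).
box_a = [[0.9, 0.8, 0.3, 0.1]]
box_b = [[0.1, 0.6, 0.9, 0.2]]
label_map = instance_labels(collective_probability_map([box_a, box_b], [1.0, 0.8]))
print(label_map)
```

Because both boxes receive distinct labels even when they belong to the same class, the output separates the two instances, which is the distinction the instance-level label is meant to carry.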
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for semantic segmentation of one or more objects in a digital visual medium, comprising: accessing, by a processing device, a set of bounding boxes potentially corresponding to a set of target objects within the digital visual medium; for each of the set of bounding boxes, determining, by the processing device, a pixel score for each pixel of the digital visual medium corresponding to the set of bounding boxes, the pixel score being representative of a likelihood that each pixel corresponds to the set of target objects associated with the set of bounding boxes; determining, by the processing device and for each pixel of the digital visual medium, an instance-level label that distinguishes a first set of pixels corresponding to a first object from a second set of pixels corresponding to a second object of a same class as the first object, each instance-level label determined based, at least in part, on a collective probability map including the pixel score for each pixel; and applying, by the processing device, at least some of the determined instance-level labels to at least some of the pixels of the digital visual medium.
2. The computer-implemented method of claim 1, wherein determining the pixel score comprises employing a prediction model trained by a neural network.
3. The computer-implemented method of claim 2, wherein the method further comprises training the neural network, said training comprising: receiving, by the processing device, a training visual medium having a first bounding box corresponding to a training target object within the training visual medium; generating, by the processing device and based on the first bounding box, a plurality of bounding boxes corresponding to the training target object within the training visual medium, the first bounding box and the plurality of bounding boxes together forming a training set of bounding boxes; generating, by the processing device, a plurality of distance maps, each distance map in the plurality of distance maps corresponding to a respective bounding box of the training set of bounding boxes; concatenating, by the processing device, the training visual medium with each distance map in the plurality of distance maps to generate a plurality of training pairs; and training, by the processing device and based on at least one training pair of the plurality of training pairs, the neural network to segment pixels of the training visual medium corresponding to the training target object.
4. The computer-implemented method of claim 3, wherein the neural network is a convolutional encoder-decoder network including: a convolutional encoder network having one or more convolutional layers for training filters to recognize one or more features of the one or more target objects, and one or more pooling layers for manipulating a spatial size of the at least one training pair; and a convolutional decoder network having one or more deconvolutional layers and one or more unpooling layers for reconstructing details of the digital visual medium, wherein training the neural network based on the at least one training pair includes inputting the at least one training pair to the convolutional encoder network and the convolutional decoder network to generate a binary instance mask corresponding to the training target object.
5. The computer-implemented method of claim 1, wherein the set of bounding boxes is received based on an object detection algorithm, wherein receiving the set of bounding boxes includes receiving class scores associated with the set of bounding boxes.
6. The computer-implemented method of claim 1, wherein the set of bounding boxes is received based on an object detection algorithm, wherein class scores corresponding to the set of bounding boxes are received based on a classification algorithm.
7. The computer-implemented method of claim 1, wherein the collective probability map is generated based on a plurality of probability maps for each bounding box of the set of bounding boxes, wherein each probability map of the plurality of probability maps is weighted based on class scores corresponding to each bounding box.
8. The computer-implemented method of claim 1, wherein determining the instance-level label includes using probabilities of the collective probability map to identify a compatibility between adjacent pixels corresponding to at least one of the set of target objects, the compatibility being identified using a conditional random field model.
9. A computing system for semantic segmentation of one or more objects in a digital visual medium, the computing system comprising: means for storing a plurality of digital media, the digital media including a digital visual medium having a bounding box set, the bounding box set including at least a first bounding box potentially corresponding to a target object within the digital visual medium and a second bounding box potentially corresponding to a second target object within the digital visual medium; and means for determining, for each bounding box in the bounding box set, a pixel score for each pixel of the digital visual medium corresponding to each bounding box of the bounding box set, the pixel score being representative of a likelihood that each pixel corresponds to the target object associated with the at least one bounding box, said means being communicatively coupled to the means for storing the plurality of digital media; means for determining, for each pixel of the digital visual medium, an instance-level label that distinguishes a first set of pixels corresponding to the first bounding box from a second set of pixels corresponding to the second bounding box, each instance-level label determined based, at least in part, on a collective probability map including the pixel score for each pixel; and means for assigning at least some of the determined instance-level labels to at least some of the pixels in the digital visual medium.
10. The computing system of claim 9, wherein the means for determining the pixel score includes a neural network and a prediction model trained by the neural network.
11. The computing system of claim 10, further comprising a means for training the neural network by performing operations comprising: generating, based on a training visual medium having a training target object and a first bounding box corresponding to the training target object, a plurality of bounding boxes corresponding to the training target object, the first bounding box and the plurality of bounding boxes together forming a training set of bounding boxes; generating a plurality of distance maps, each distance map in the plurality of distance maps corresponding to a respective bounding box of the training set of bounding boxes; concatenating the training visual medium with each distance map in the plurality of distance maps to generate a plurality of training pairs; and training, based on at least one training pair of the plurality of training pairs, the neural network to segment pixels of the training visual medium corresponding to the training target object.
12. The computing system of claim 11, wherein the neural network is a convolutional encoder-decoder network including: a convolutional encoder network having one or more convolutional layers for training filters to recognize one or more features of the target object and one or more pooling layers for manipulating a spatial size of the at least one training pair; and a convolutional decoder network having one or more deconvolutional layers and one or more unpooling layers for reconstructing details of the digital visual medium.
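The training-data steps recited in the claims (derive extra bounding boxes from a first box, build one distance map per box, concatenate each map with the training image) can be sketched as follows. This is an illustrative reading under stated assumptions, not the patent's implementation: the claims do not say how the additional boxes are generated or how distance is measured, so the random corner jitter, the distance-to-nearest-box-edge definition, the `max_dist` cap, and all function names below are hypothetical.

```python
import math
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), inclusive pixel coordinates
Image = List[List[List[float]]]   # image[y][x] = list of channel values


def jitter_boxes(box: Box, count: int, max_shift: int,
                 width: int, height: int, seed: int = 0) -> List[Box]:
    """Assumption: extra boxes come from randomly shifting each corner coordinate."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(count):
        x0, y0, x1, y1 = (c + rng.randint(-max_shift, max_shift) for c in box)
        # Clamp to the image and keep the corners ordered.
        x0, x1 = sorted(min(max(v, 0), width - 1) for v in (x0, x1))
        y0, y1 = sorted(min(max(v, 0), height - 1) for v in (y0, y1))
        boxes.append((x0, y0, x1, y1))
    return boxes


def distance_map(box: Box, width: int, height: int,
                 max_dist: float = 255.0) -> List[List[float]]:
    """Assumption: each pixel stores its capped distance to the nearest box edge."""
    x0, y0, x1, y1 = box
    dmap = []
    for y in range(height):
        row = []
        for x in range(width):
            dx = max(x0 - x, 0, x - x1)
            dy = max(y0 - y, 0, y - y1)
            if dx or dy:   # pixel lies outside the box
                dist = math.hypot(dx, dy)
            else:          # pixel lies inside the box or on its edge
                dist = float(min(x - x0, x1 - x, y - y0, y1 - y))
            row.append(min(dist, max_dist))
        dmap.append(row)
    return dmap


def make_training_pairs(image: Image, first_box: Box,
                        count: int = 3, max_shift: int = 2) -> List[Image]:
    """Concatenate the image with one distance map per box as an extra channel."""
    height, width = len(image), len(image[0])
    boxes = [first_box] + jitter_boxes(first_box, count, max_shift, width, height)
    pairs = []
    for box in boxes:
        dmap = distance_map(box, width, height)
        pairs.append([[image[y][x] + [dmap[y][x]] for x in range(width)]
                      for y in range(height)])
    return pairs


# A blank 6 x 5 RGB image with one annotated box (made-up values).
blank = [[[0.0, 0.0, 0.0] for _ in range(6)] for _ in range(5)]
pairs = make_training_pairs(blank, (1, 1, 4, 3), count=3, max_shift=1)
print(len(pairs), len(pairs[0][0][0]))
```

Each resulting pair has one more channel than the input image, so a network trained on these pairs receives the box location as part of its input rather than as a separate argument.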