Augmenting Layer-Based Object Detection With Deep Convolutional Neural Networks
US-2016180195-A1 · Jun 23, 2016 · US
US10915731B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10915731-B2 |
| Application number | US-201816228517-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 20, 2018 |
| Priority date | Jun 24, 2016 |
| Publication date | Feb 9, 2021 |
| Grant date | Feb 9, 2021 |
Official abstract text for this publication.
Certain examples described herein enable semantically-labelled representations of a three-dimensional (3D) space to be generated from video data. In described examples, a 3D representation is a surface element or ‘surfel’ representation, where the geometry of the space is modelled using a plurality of surfaces that are defined within a 3D co-ordinate system. Object-label probability values for spatial elements of frames of video data may be determined using a two-dimensional image classifier. Surface elements that correspond to the spatial elements are identified based on a projection of the surface element representation using an estimated pose for a frame. Object-label probability values for the surface elements are then updated based on the object-label probability values for corresponding spatial elements. This results in a semantically-labelled 3D surface element representation of objects present in the video data. This data enables computer vision and/or robotic applications to make better use of the 3D representation.
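The update step summarised in the abstract can be illustrated with a minimal sketch. The abstract states only that surface element probabilities are updated "based on" the probabilities of corresponding spatial elements; the multiplicative (recursive Bayesian) rule with renormalisation used below is an assumption chosen for illustration, as are the function names `update_surfel_probs` and `fuse_frame` and the dictionary-based data layout.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


def update_surfel_probs(surfel_probs: List[float],
                        pixel_probs: List[float]) -> List[float]:
    """Fuse one classifier observation into a surfel's label distribution.

    Assumed rule (not taken from the patent text): multiply the stored
    distribution by the per-pixel classifier output, then renormalise.
    """
    if len(surfel_probs) != len(pixel_probs):
        raise ValueError("label sets must have the same length")
    fused = [p * q for p, q in zip(surfel_probs, pixel_probs)]
    total = sum(fused)
    if total == 0.0:
        # Degenerate observation: keep the previous distribution unchanged.
        return list(surfel_probs)
    return [f / total for f in fused]


def fuse_frame(surfels: Dict[int, List[float]],
               correspondence: Dict[Pixel, int],
               pixel_maps: Dict[Pixel, List[float]]) -> None:
    """Update every surfel that has a corresponding pixel in this frame.

    `correspondence` maps a pixel to a surfel id; it is assumed to have been
    produced by projecting the surfel representation with the frame's pose.
    """
    for pixel, surfel_id in correspondence.items():
        surfels[surfel_id] = update_surfel_probs(surfels[surfel_id],
                                                 pixel_maps[pixel])


# Hypothetical two-label example: index 0 = "chair", index 1 = "table".
surfels = {0: [0.5, 0.5], 1: [0.5, 0.5]}
correspondence = {(3, 4): 0}              # only surfel 0 is visible
pixel_maps = {(3, 4): [0.8, 0.2]}         # classifier output at that pixel

fuse_frame(surfels, correspondence, pixel_maps)   # first observation
fuse_frame(surfels, correspondence, pixel_maps)   # repeated observation
```

Under this assumed rule, repeated consistent observations sharpen the stored distribution for surfel 0, while surfel 1, which has no corresponding pixel, keeps its initial uniform distribution.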
Opening claim text (preview).
What is claimed is:
1. A method for detecting objects in video data, comprising: determining object-label probability values for spatial elements of frames of video data using a two-dimensional image classifier, wherein the object-label probability value for each of the spatial elements indicates a probability that the respective spatial element is an observation of a particular object; identifying surface elements in a three-dimensional surface element representation of a space observed in the frames of video data that correspond to the spatial elements, wherein a correspondence between a spatial element and a surface element is determined based on a projection of the surface element representation using an estimated pose for a frame; and updating object-label probability values for the surface elements based on the object-label probability values for corresponding spatial elements to provide a semantically-labelled three-dimensional surface element representation of objects present in the video data, wherein the object-label probability value for each of the surface elements indicates a probability that the respective surface element represents the particular object.
2. The method of claim 1, wherein, during processing of said video data, the method comprises: detecting a loop closure event and applying a spatial deformation to the surface element representation, the spatial deformation modifying three-dimensional positions of surface elements in the surface element representation, wherein the spatial deformation modifies the correspondence between spatial elements and surface elements of the surface element representation such that, after the spatial deformation, object-label probability values for a first surface element are updated using object-label probability values for spatial elements that previously corresponded to a second surface element.
3. The method of claim 1, comprising: processing the frames of video data without a pose graph to generate the three-dimensional surface element representation, including, on a frame-by-frame basis: comparing a rendered frame generated using the three-dimensional surface element representation with a video data frame from the frames of video data to determine a pose of a capture device for the video data frame; and updating the three-dimensional surface element representation using the pose and image data from the video data frame.
4. The method of claim 3, wherein: a subset of the frames of video data used to generate the three-dimensional surface element representation are input to the two-dimensional image classifier.
5. The method of claim 1, wherein the frames of video data comprise at least one of colour data, depth data and normal data; and wherein the two-dimensional image classifier is configured to compute object-label probability values based on at least one of colour data, depth data and normal data for a frame.
6. The method of claim 1, wherein the two-dimensional image classifier comprises a convolutional neural network.
7. The method of claim 6, wherein the convolutional neural network is configured to output the object-label probability values as a set of pixel maps for each frame of video data, each pixel map in the set corresponding to a different object label in a set of available object labels.
8. The method of claim 6, wherein the two-dimensional image classifier comprises a deconvolutional neural network communicatively coupled to the output of the convolutional neural network.
9. The method of claim 1, comprising, after the updating of the object-label probability values for the surface elements: regularising the object-label probability values for the surface elements.
10. The method of claim 9, wherein regularising comprises: applying a conditional random field to the object-label probability values for surface elements in the surface element representation.
11. The method of claim 9, wherein regularising the object-label probability values comprises: regularising the object-label probability values assigned to surface elements based on one or more of: surface element positions, surface element colours, and surface element normals.
12. The method of claim 1, comprising: replacing a set of one or more surface elements with a three-dimensional object definition based on the object-label probability values assigned to said surface elements.
13. The method of claim 1, comprising: annotating surface elements of a three-dimensional surface element representation of a space with object-labels to provide an annotated representation; generating annotated frames of video data from the annotated representation based on a projection of the annotated representation, the projection using an estimated pose for each annotated frame, each annotated frame comprising spatial elements with assigned object-labels; and training the two-dimensional image classifier using the annotated frames of video data.
14. The method of claim 1, comprising: obtaining a first frame of video data corresponding to an observation of a first portion of an object; generating an image map for the first frame of video data using the two-dimensional image classifier, said image map indicating the presence of the first portion of the object in an area of the first frame; and determining that a surface element does not project onto the area in the first frame and as such not updating object-label probability values for the surface element based image map values in said area; wherein following detection of a loop closure event the method comprises: modifying a three-dimensional position of the surface element; obtaining a second frame of video data corresponding to a repeated observation of the first portion of the object; generating an image map for the second frame of video data using the two-dimensional image classifier, said image map indicating the presence of the first portion of the object in an area of the second frame; determining that the modified first surface element does project onto the area of the second frame following the loop closure event; and updating object-label probability values for the surface element based on the image map for the second frame of video data, wherein the object-label probability values for the surface element include fused object predictions for the surface element from multiple viewpoints.
15. Apparatus for detecting objects in video data comprising: an image-classifier interface to receive two-dimensional object-label probability distributions for spatial elements of individual frames of video data, wherein the object-label probability distribution for each of the spatial elements includes a set of object-label probability values, and each of the set of object-label probability values indicates a probability that the respective spatial element is an observation of a different respective object; a correspondence interface to receive data indicating, for a given frame of video data, a correspondence between spatial elements within the given frame and surface elements in a three-dimensional surface element representation, said correspondence being determined based on a projection of the surface element representation using an estimated pose for the given frame; and a semantic augmenter to iteratively update object-label probability values assigned to individual surface elements in the three-dimensional surface element representation, wherein the semantic augmenter is configured to use, for a given frame of video data, the data received by the correspondence interface to apply the two-dimensional obj
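The correspondence recited in claims 1 and 15, "determined based on a projection of the surface element representation using an estimated pose", can be sketched as follows. The claims do not fix a camera model, so this sketch assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a world-to-camera pose given as a rotation matrix and translation, and a nearest-depth test to resolve occlusion. The function name `project_surfels` and all parameter names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[int, int]


def project_surfels(positions: Dict[int, Vec3],
                    rotation: Sequence[Sequence[float]],
                    translation: Vec3,
                    fx: float, fy: float, cx: float, cy: float,
                    width: int, height: int) -> Dict[Pixel, int]:
    """Map each visible pixel to the id of the nearest surfel projecting onto it.

    Assumptions (not taken from the claims): pinhole camera model,
    world-to-camera pose (rotation, translation), nearest-depth occlusion test.
    """
    nearest: Dict[Pixel, Tuple[float, int]] = {}
    for surfel_id, p in positions.items():
        # Transform the surfel position from world to camera coordinates.
        x, y, z = (
            sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
            for r in range(3)
        )
        if z <= 0.0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image
        # Keep only the surfel closest to the camera at each pixel.
        if (u, v) not in nearest or z < nearest[(u, v)][0]:
            nearest[(u, v)] = (z, surfel_id)
    return {pixel: surfel_id for pixel, (_, surfel_id) in nearest.items()}


# Hypothetical example: identity pose, 64x48 image.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
positions = {
    0: (0.0, 0.0, 2.0),   # on the optical axis, far
    1: (0.0, 0.0, 1.0),   # on the optical axis, near (occludes surfel 0)
    2: (0.1, 0.0, 1.0),   # offset to the right
    3: (0.0, 0.0, -1.0),  # behind the camera
}
correspondence = project_surfels(positions, identity, (0.0, 0.0, 0.0),
                                 fx=100.0, fy=100.0, cx=32.0, cy=24.0,
                                 width=64, height=48)
```

The resulting pixel-to-surfel mapping is the kind of data a correspondence interface could pass on, so that classifier output at a pixel is applied only to the surface element visible at that pixel.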
Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title
Three-dimensional [3D] objects · CPC title
by matching two-dimensional images to three-dimensional objects · CPC title
characterised by the process organisation or structure, e.g. boosting cascade · CPC title
based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate · CPC title