Three dimensional bounding box estimation from two dimensional images
US-2019340432-A1 · Nov 7, 2019 · US
US10970518B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10970518-B1 |
| Application number | US-201816188879-A |
| Country | US |
| Kind code | B1 |
| Filing date | Nov 13, 2018 |
| Priority date | Nov 14, 2017 |
| Publication date | Apr 6, 2021 |
| Grant date | Apr 6, 2021 |
Official abstract text for this publication.
A voxel feature learning network receives a raw point cloud and converts the point cloud into a sparse 4D tensor comprising three-dimensional coordinates (e.g. X, Y, and Z) for each voxel of a plurality of voxels and a fourth voxel feature dimension for each non-empty voxel. In some embodiments, convolutional mid layers further transform the 4D tensor into a high-dimensional volumetric representation of the point cloud. In some embodiments, a region proposal network identifies 3D bounding boxes of objects in the point cloud based on the high-dimensional volumetric representation. In some embodiments, the feature learning network and the region proposal network are trained end-to-end using training data comprising known ground truth bounding boxes, without requiring human intervention.
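The sparse representation described in the abstract can be illustrated with a minimal sketch. It assumes a uniform voxel grid anchored at the origin and stores entries only for non-empty voxels, keyed by integer voxel coordinates. The names `group_points_into_voxels`, `centroid_feature`, and `to_sparse_tensor` are hypothetical, and the centroid is only a placeholder for the voxel feature, which the patent learns with a network.

```python
import math
from collections import defaultdict


def group_points_into_voxels(points, voxel_size):
    """Group (x, y, z) points by the integer index of the voxel containing them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        index = (
            math.floor(x / voxel_size[0]),
            math.floor(y / voxel_size[1]),
            math.floor(z / voxel_size[2]),
        )
        voxels[index].append((x, y, z))
    return dict(voxels)


def centroid_feature(points):
    """Placeholder voxel feature (assumption): the mean of the voxel's points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))


def to_sparse_tensor(points, voxel_size, feature_fn=centroid_feature):
    """Map each non-empty voxel index (three spatial dimensions) to a
    feature vector (the fourth dimension). Empty voxels get no entry."""
    voxels = group_points_into_voxels(points, voxel_size)
    return {index: feature_fn(members) for index, members in voxels.items()}


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.3, 0.2, 0.1), (1.2, 0.1, 0.1)]
    for index, feature in sorted(to_sparse_tensor(cloud, (0.5, 0.5, 0.5)).items()):
        print(index, feature)
```

A dictionary keyed by voxel index is one simple way to keep the tensor sparse; a production system would more likely use a coordinate list plus a dense feature buffer.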
Opening claim text (preview).
What is claimed is:
1. A system, comprising one or more processors and a memory coupled to the one or more processors, wherein the memory comprises program instructions configured to: implement, via the one or more processors, a voxel feature learning network, wherein the voxel feature learning network is configured to: receive a point cloud comprising a plurality of points located in a three-dimensional space; group respective sets of the points of the point cloud into respective voxels, wherein the respective points are grouped into the respective voxels based on locations of the respective points in the three-dimensional space and locations of the voxels in the three-dimensional space, wherein each voxel corresponds to a volume segment of the three-dimensional space; determine, for each of one or more of the respective voxels, a plurality of point-wise concatenated features from the respective points included in the respective voxel, wherein to determine the point-wise concatenated features for a given one of the respective voxels, the program instructions are configured to: identify a plurality of point-wise determined features; determine a locally aggregated feature from the identified plurality of point-wise determined features via element-wise max-pooling across the plurality of point-wise determined features; and augment, based on the locally aggregated feature, the respective ones of the point-wise features for the points to form respective point-wise concatenated features; determine, for each of one or more of the respective voxels, a voxel feature, wherein the voxel feature is determined based on the plurality of point-wise concatenated features determined from the respective points included in the voxel; and provide a four-dimensional (4D) tensor representation of the point cloud comprising the determined voxel features.
2. The system of claim 1, further comprising: one or more Lidar sensors configured to capture the plurality of points that make up the point cloud, wherein the point cloud is received by the voxel feature learning network as a raw point cloud captured by the one or more Lidar sensors.
3. The system of claim 1, wherein the one or more processors and the memory, or one or more additional processors and an additional memory, include program instructions configured to implement: one or more convolutional middle layers configured to process the 4D tensor into a high-dimensional volumetric representation of the point cloud; and a region proposal network configured to generate a three dimensional (3D) object detection output determined based at least in part on the high-dimensional volumetric representation of the point cloud.
4. The system of claim 3, wherein the voxel feature learning network comprises a first fully-connected neural network, and wherein the region proposal network comprises an additional fully-connected neural network.
5. The system of claim 4, wherein the first fully-connected neural network is trained to identify voxel features based on training data, wherein the training data comprises ground truth 3D bounding boxes corresponding to objects included in a training point cloud, and wherein errors between the ground truth 3D bounding boxes and 3D bounding boxes identified by the region proposal network are used to train the first fully connected neural network and the additional fully connected neural network.
6. The system of claim 5, wherein the program instructions are configured to cause the system to augment the training data, wherein to augment the training data, the program instructions cause the one or more processors to: apply a perturbation to each ground truth 3D bounding box and independently apply a perturbation to the points of the training point cloud that are included in the respective ground truth 3D bounding boxes; apply a global scaling to each ground truth 3D bounding box; or apply a global rotation to each ground truth 3D bounding box and to the training point cloud, wherein the global rotation simulates a vehicle making a turn.
7. The system of claim 1, wherein, for each of the one or more voxels, to determine the voxel feature based on the point-wise concatenated features identified from the respective points included in the voxel, the voxel feature learning network is configured to: transform the plurality of concatenated point-wise features, via a fully-connected neural network, into the voxel feature, wherein the fully connected neural network comprises a linear layer, a batch normalization layer, and/or a rectified linear unit (ReLU) layer.
8. The system of claim 7, wherein element-wise max-pooling is applied to transform the plurality of concatenated point-wise features into the voxel feature.
9. The system of claim 1, wherein to group the respective sets of the points of the point cloud into the respective voxels, the voxel feature learning network is configured to: for a voxel for which a set of respective points in three-dimensional space corresponding to the voxel is less than a threshold number of points, assign all of the respective points in three-dimensional space corresponding to the voxel to the voxel; and for another voxel for which another set of respective points in three-dimensional space corresponding to the other voxel is greater than the threshold number of points, randomly sample the respective points in three-dimensional space corresponding to the other voxel for inclusion in the other voxel, such that the number of points included in the other voxel is less than or equal to the threshold number of points.
10. The system of claim 1, wherein the voxel feature learning network is implemented via a plurality of processors, and wherein the plurality of processors are configured to determine, in parallel, respective voxel features for a plurality of respective voxels.
11. A computer implemented method, comprising: performing by one or more computers: subdividing a three-dimensional (3D) space of a point cloud into equally spaced voxels, wherein the point cloud includes information regarding one or more points within a 3D coordinate system, and wherein each of the one or more points resides in one of the voxels; grouping the points of the point cloud according the particular respective voxels in which they reside; determining, for each of one or more of the respective voxels, a plurality of point-wise concatenated features based on the selected points residing in the respective voxel, wherein determining the point-wise concatenated features for a given voxel of the respective voxels comprises: identifying a plurality of point-wise determined features; determining a locally aggregated feature from the identified plurality of point-wise determined features via element-wise max-pooling across the plurality of point-wise determined features; and augmenting, based on the locally aggregated feature, the respective ones of the point-wise features for the points to form respective point-wise concatenated features; determining voxel features based on the point-wise concatenated features determined for the respective one or more voxels; and representing the determined voxel features as a sparse 4D tensor.
12. The computer implemented method of claim 11, wherein said determining the plurality of point-wise concatenated features further comprises: computing a local mean as a centroid of points within the given voxel; augmenting each point residing within the given voxel with an offset relative to the computed centroid; transforming each augmented point into a feature space; encoding, based on the augmented points transformed into the feature space, t
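The per-voxel steps recited in claims 1, 8, 9, and 12 (capped random sampling, centroid offsets, a point-wise transform, element-wise max-pooling, and concatenation) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: all function names are hypothetical, the weights are supplied by the caller rather than learned, points carry only x, y, z, and the batch normalization layer mentioned in claim 7 is omitted.

```python
import random


def sample_points(points, max_points, rng):
    """Keep every point if the voxel holds at most max_points,
    otherwise draw max_points of them at random (claim 9)."""
    if len(points) <= max_points:
        return list(points)
    return rng.sample(list(points), max_points)


def augment_with_centroid_offsets(points):
    """Append each point's offset from the voxel centroid (claim 12)."""
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    return [list(p) + [p[d] - centroid[d] for d in range(3)] for p in points]


def linear_relu(vectors, weights, biases):
    """Point-wise linear layer followed by ReLU (batch norm omitted here)."""
    return [
        [max(0.0, sum(w * x for w, x in zip(row, v)) + b)
         for row, b in zip(weights, biases)]
        for v in vectors
    ]


def elementwise_max(vectors):
    """Element-wise max-pooling across a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def concatenate_with_aggregate(vectors, weights, biases):
    """Compute point-wise features, pool them into a locally aggregated
    feature, and concatenate that aggregate onto every point-wise feature."""
    pointwise = linear_relu(vectors, weights, biases)
    aggregated = elementwise_max(pointwise)
    return [feature + aggregated for feature in pointwise]


def voxel_feature(points, weights, biases, max_points, rng):
    """Single voxel feature: sample, augment, encode, then max-pool (claim 8)."""
    sampled = sample_points(points, max_points, rng)
    augmented = augment_with_centroid_offsets(sampled)
    concatenated = concatenate_with_aggregate(augmented, weights, biases)
    return elementwise_max(concatenated)
```

Because each voxel is processed independently of the others, a function like `voxel_feature` could be evaluated for many voxels in parallel, which is the arrangement claim 10 describes.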
Organisation of the process, e.g. bagging or boosting · CPC title
Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads · CPC title
using feature-based methods · CPC title
using neural networks · CPC title
by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis · CPC title