Automated Building Information Determination Using Inter-Image Analysis Of Multiple Building Images
US-2023206393-A1 · Jun 29, 2023 · US
US12266190B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12266190-B2 |
| Application number | US-202217884356-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 9, 2022 |
| Priority date | Aug 9, 2022 |
| Publication date | Apr 1, 2025 |
| Grant date | Apr 1, 2025 |
Official abstract text for this publication.
The described aspects and implementations enable efficient detection and classification of objects with machine learning models that deploy a bird's-eye view representation and are trained using depth ground truth data. In one implementation, disclosed are systems and techniques that include obtaining images and generating, using a first neural network (NN), feature vectors (FVs) and depth distributions for pixels of the images, wherein the first NN is trained using training images and depth ground truth data for the training images. The techniques further include obtaining a feature tensor (FT) in view of the FVs and the depth distributions, and processing the obtained FTs, using a second NN, to identify one or more objects depicted in the images.
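The abstract does not say how a feature tensor is formed "in view of" a feature vector and a depth distribution. One common reading, used in lift-style bird's-eye-view models, is a per-pixel outer product of the two. The sketch below illustrates that reading only; the function names, the array shapes, and the softmax over depth bins are assumptions for illustration, not details taken from the patent.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def lift_pixels(feature_vectors, depth_logits):
    """Combine per-pixel features with per-pixel depth distributions.

    feature_vectors: (H, W, C) array, one C-dimensional FV per pixel.
    depth_logits:    (H, W, D) array, unnormalized scores over D depth bins.
    Returns an (H, W, D, C) array: for each pixel, the outer product of its
    depth distribution and its FV (an assumed construction of the FT).
    """
    depth_probs = softmax(depth_logits, axis=-1)
    return depth_probs[..., :, None] * feature_vectors[..., None, :]


# Hypothetical sizes: a 4x6 image, 8 feature channels, 5 depth bins.
rng = np.random.default_rng(0)
fv = rng.normal(size=(4, 6, 8))
logits = rng.normal(size=(4, 6, 5))
ft = lift_pixels(fv, logits)
print(ft.shape)
```

Because each depth distribution sums to one, summing the resulting tensor over the depth axis recovers the original feature vectors, which makes a convenient sanity check for this construction.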
Opening claim text (preview).
What is claimed is:

1. A method comprising: obtaining one or more perspective camera images of an environment; generating, using a first neural network (NN), for each pixel of a set of pixels of the one or more perspective camera images, a feature vector (FV), and a depth distribution for a portion of the environment imaged by a corresponding pixel, wherein the first NN is trained using a plurality of training images and a depth ground truth data for the plurality of training images; obtaining, for each pixel of the set of pixels, a feature tensor (FT) in view of (i) the FV for a respective pixel and (ii) the depth distribution for the respective pixel; and processing the obtained FTs, using a second NN, to identify one or more objects in the environment.

2. The method of claim 1, wherein processing the obtained FTs comprises: obtaining a combined FT using the FTs for the set of pixels; mapping the combined FT to a ground surface to obtain a projected FT; and using the second NN to process the projected FT.

3. The method of claim 2, wherein mapping the combined FT to the ground surface comprises: transforming the combined FT to a set of coordinates associated with the ground surface; and aggregating elements of the combined FT in a vertical direction to obtain the projected FT.

4. The method of claim 2, wherein the one or more perspective camera images are associated with a first time, the method further comprising: obtaining one or more additional perspective camera images associated with at least a second time; generating, using the one or more additional perspective camera images, an additional projected FT; and performing a concurrent processing of the projected FT and the additional projected FT.

5. The method of claim 4, wherein the concurrent processing is performed by an aggregation NN comprising one or more convolutional kernels configured to aggregate elements of the projected FT with elements of the additional projected FT.

6. The method of claim 1, wherein the second NN comprises: a first classification head configured to output semantic segmentation for the one or more objects in the environment; and at least one second classification head configured to output geometric information associated with locations of the one or more objects in the environment.

7. The method of claim 1, wherein the depth ground truth data comprises a depth estimate for at least a subset of pixels of the plurality of training images, wherein the depth estimate is output by a first NN of a teacher model.

8. The method of claim 7, wherein the second NN is trained using outputs of a second NN of the teacher model.

9. The method of claim 1, wherein the FT for each pixel of the set of pixels is output by a first subnetwork of the first NN, wherein the depth distribution for each pixel of the set of pixels is output by a second subnetwork of the first NN, and wherein the second subnetwork is trained, using the depth ground truth data, prior to training of the first subnetwork.

10. The method of claim 1, wherein the depth ground truth data comprises lidar-determined distances to one or more objects in at least a subset of the plurality of training images.

11. A method of training a student model, the method comprising: obtaining a training image; processing, using a first neural network (NN) of the student model, the training image to generate a plurality of feature vectors (FVs), and a plurality of depth distributions, wherein each FV of the plurality of FVs and each depth distribution of the plurality of depth distributions are associated with a respective pixel of a plurality of pixels of the training image; obtaining a plurality of ground truth FVs generated by a first NN of a teacher model, wherein each ground truth FV of the plurality of ground truth FVs is associated with a respective pixel of the plurality of pixels of the training image; obtaining a plurality of ground truth depth indicators, wherein each ground truth depth indicator of the plurality of ground truth depth indicators is associated with a respective pixel of at least a subset of the plurality of pixels of the training image; and adjusting parameters of the first NN of the student model based on a comparison of the plurality of FVs with the plurality of ground truth FVs, and a comparison of the plurality of depth distributions with the plurality of ground truth depth indicators.

12. The method of claim 11, further comprising: obtaining a plurality of feature tensors (FTs), wherein each FT of the plurality of FTs is obtained using a respective FV of the plurality of FVs and a respective depth distribution of the plurality of depth distributions; obtaining a combined FT using the plurality of FTs; mapping the combined FT to a ground surface to obtain a projected FT; processing the projected FT, using a second NN of the student model, to identify one or more objects in the training image; obtaining one or more ground truth objects identified by a second NN of the teacher model in the training image; and adjusting parameters of the second NN of the student model based on a comparison of the one or more objects identified by the second NN of the student model with the one or more objects identified by the second NN of the teacher model.

13. The method of claim 11, wherein each of the plurality of ground truth depth indicators comprises at least one of (i) a depth distribution obtained by the first NN of the teacher model for the respective pixel, or (ii) a distance, obtained by a range-sensing device, to a portion of an environment imaged by the respective pixel.

14. A system comprising: a memory; and a processing device communicatively coupled to the memory, the processing device configured to: obtain one or more perspective camera images of an environment; generate, using a first neural network (NN), for each pixel of a set of pixels of the one or more perspective camera images, a feature vector (FV), and a depth distribution for a portion of the environment imaged by a corresponding pixel, wherein the first NN is trained using a plurality of training images and a depth ground truth data for the plurality of training images; obtain, for each pixel of the set of pixels, a feature tensor (FT) in view of (i) the FV for a respective pixel and (ii) the depth distribution for the respective pixel; and process the obtained FTs, using a second NN, to identify one or more objects in the environment.

15. The system of claim 14, wherein to process the obtained FTs, the processing device is to: obtain a combined FT using the FTs for the set of pixels; map the combined FT to a ground surface to obtain a projected FT; and use the second NN to process the projected FT.

16. The system of claim 15, wherein to map the combined FT to the ground surface, the processing device is to: transform the combined FT to a set of coordinates associated with the ground surface; and aggregate elements of the combined FT in a vertical direction to obtain the projected FT.

17. The system of claim 15, wherein the one or more perspective camera images are associated with a first time, and wherein the processing device is further to: obtain one or more additional perspective camera images associated with at least a second time; generate, using the one or more additional perspective camera images, an additional projected FT; and perform a concurrent processing of the projected FT and the additional projected FT, wherein the concurrent processing is performed by an aggregation
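Claims 2, 3, 15, and 16 describe mapping the combined FT to a ground surface by transforming it to ground-associated coordinates and aggregating elements in the vertical direction. A minimal sketch of one way to do this follows. It assumes the FT elements have already been placed at 3D points expressed in a ground-aligned frame, and it uses summation as the aggregation; the claims do not specify the aggregation operator, the grid layout, or any of the names used here.

```python
import numpy as np


def project_to_ground(points_xyz, features, x_range, y_range, cell_size):
    """Pool 3D-located features onto a 2D ground grid.

    points_xyz: (N, 3) positions in a ground-aligned frame (z is vertical).
    features:   (N, C) feature rows, one per point.
    x_range, y_range: (min, max) extents of the grid, same units as the points.
    cell_size:  edge length of one square grid cell.
    Returns an (nx, ny, C) grid; points sharing an (x, y) cell are summed,
    so everything stacked along z collapses into the same cell.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell_size))
    ny = int(round((y_range[1] - y_range[0]) / cell_size))

    # Cell indices from x and y only; z is ignored, which is the vertical collapse.
    ix = np.floor((points_xyz[:, 0] - x_range[0]) / cell_size).astype(int)
    iy = np.floor((points_xyz[:, 1] - y_range[0]) / cell_size).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    grid = np.zeros((nx, ny, features.shape[1]))
    # np.add.at accumulates correctly when several points hit the same cell.
    np.add.at(grid, (ix[inside], iy[inside]), features[inside])
    return grid


# Hypothetical data: two points stacked vertically in one cell, one point in
# the neighboring cell, and one point outside the grid that is discarded.
points = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.5, 2.0],
                   [1.5, 0.5, 1.0],
                   [9.0, 9.0, 0.0]])
feats = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 3.0],
                  [5.0, 5.0]])
bev = project_to_ground(points, feats, x_range=(0.0, 2.0), y_range=(0.0, 1.0), cell_size=1.0)
print(bev.shape)
```

Summation keeps the pooling independent of point order; a mean or max over each cell would be an equally plausible choice under the claim language.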
by performing operations within image blocks or by using histograms · CPC title
Extraction of image or video features · CPC title
Matching; Classification · CPC title
using pattern recognition or machine learning (optical pattern recognition or electronic computations therefor G06V10/88) · CPC title
using neural networks · CPC title