What technology area does this patent fall under?

Primary CPC classification G06V20/58. Mapped technology areas include Physics.

When was this patent published?

Publication date Apr 1, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Object identification in bird's-eye view reference frame with explicit depth estimation co-training

US12266190B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12266190-B2
Application number	US-202217884356-A
Country	US
Kind code	B2
Filing date	Aug 9, 2022
Priority date	Aug 9, 2022
Publication date	Apr 1, 2025
Grant date	Apr 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The described aspects and implementations enable efficient detection and classification of objects with machine learning models that deploy a bird's-eye view representation and are trained using depth ground truth data. In one implementation, disclosed are system and techniques that include obtaining images, generating, using a first neural network (NN), feature vectors (FVs) and depth distributions pixels of images, wherein the first NN is trained using training images and a depth ground truth data for the training images. The techniques further include obtaining a feature tensor (FT) in view of the FVs and the depth distributions, and processing the obtained FTs, using a second NN, to identify one or more objects depicted in the images.
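The abstract says the feature tensor is obtained "in view of" the feature vectors and the depth distributions, but does not name the operation. One common choice in bird's-eye-view models is a per-pixel outer product, and the sketch below assumes that choice purely for illustration. The names `softmax` and `lift_pixel`, the three-channel feature vector, and the four depth bins are hypothetical and do not come from the patent.

```python
import math

def softmax(scores):
    """Turn raw scores over depth bins into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(fv, depth_scores):
    """Assumed outer product: depth distribution (D bins) x FV (C channels) -> D x C."""
    probs = softmax(depth_scores)
    return [[p * f for f in fv] for p in probs]

# One pixel with a 3-channel FV and 4 hypothetical depth bins.
ft = lift_pixel([1.0, -2.0, 0.5], [0.1, 2.0, 0.3, -1.0])
print(len(ft), len(ft[0]))  # 4 3
```

Under this assumption, each row of `ft` is the pixel's feature vector scaled by the probability that the imaged surface lies in that depth bin, so summing the rows recovers the original feature vector.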

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining one or more perspective camera images of an environment; generating, using a first neural network (NN), for each pixel of a set of pixels of the one or more perspective camera images, a feature vector (FV), and a depth distribution for a portion of the environment imaged by a corresponding pixel, wherein the first NN is trained using a plurality of training images and a depth ground truth data for the plurality of training images; obtaining, for each pixel of the set of pixels, a feature tensor (FT) in view of (i) the FV for a respective pixel and (ii) the depth distribution for the respective pixel; and processing the obtained FTs, using a second NN, to identify one or more objects in the environment.

2. The method of claim 1, wherein processing the obtained FTs comprises: obtaining a combined FT using the FTs for the set of pixels; mapping the combined FT to a ground surface to obtain a projected FT; and using the second NN to process the projected FT.

3. The method of claim 2, wherein mapping the combined FT to the ground surface comprises: transforming the combined FT to a set of coordinates associated with the ground surface; and aggregating elements of the combined FT in a vertical direction to obtain the projected FT.

4. The method of claim 2, wherein the one or more perspective camera images are associated with a first time, the method further comprising: obtaining one or more additional perspective camera images associated with at least a second time; generating, using the one or more additional perspective camera images, an additional projected FT; and performing a concurrent processing of the projected FT and the additional projected FT.

5. The method of claim 4, wherein the concurrent processing is performed by an aggregation NN comprising one or more convolutional kernels configured to aggregate elements of the projected FT with elements of the additional projected FT.

6. The method of claim 1, wherein the second NN comprises: a first classification head configured to output semantic segmentation for the one or more objects in the environment; and at least one second classification head configured to output geometric information associated with locations of the one or more objects in the environment.

7. The method of claim 1, wherein the depth ground truth data comprises a depth estimate for at least a subset of pixels of the plurality of training images, wherein the depth estimate is output by a first NN of a teacher model.

8. The method of claim 7, wherein the second NN is trained using outputs of a second NN of the teacher model.

9. The method of claim 1, wherein the FT for each pixel of the set of pixels is output by a first subnetwork of the first NN, wherein the depth distribution for each pixel of the set of pixels is output by a second subnetwork of the first NN, and wherein the second subnetwork is trained, using the depth ground truth data, prior to training of the first subnetwork.

10. The method of claim 1, wherein the depth ground truth data comprises lidar-determined distances to one or more objects in at least a subset of the plurality of training images.

11. A method of training a student model, the method comprising: obtaining a training image; processing, using a first neural network (NN) of the student model, the training image to generate a plurality of feature vectors (FVs), and a plurality of depth distributions, wherein each FV of the plurality of FVs and each depth distribution of the plurality of depth distributions are associated with a respective pixel of a plurality of pixels of the training image; obtaining a plurality of ground truth FVs generated by a first NN of a teacher model, wherein each ground truth FV of the plurality of ground truth FVs is associated with a respective pixel of the plurality of pixels of the training image; obtaining a plurality of ground truth depth indicators, wherein each ground truth depth indicator of the plurality of ground truth depth indicators is associated with a respective pixel of at least a subset of the plurality of pixels of the training image; and adjusting parameters of the first NN of the student model based on a comparison of the plurality of FVs with the plurality of ground truth FVs, and a comparison of the plurality of depth distributions with the plurality of ground truth depth indicators.

12. The method of claim 11, further comprising: obtaining a plurality of feature tensors (FTs), wherein each FT of the plurality of FTs is obtained using a respective FV of the plurality of FVs and a respective depth distribution of the plurality of depth distributions; obtaining a combined FT using the plurality of FTs; mapping the combined FT to a ground surface to obtain a projected FT; processing the projected FT, using a second NN of the student model, to identify one or more objects in the training image; obtaining one or more ground truth objects identified by a second NN of the teacher model in the training image; and adjusting parameters of the second NN of the student model based on a comparison of the one or more objects identified by the second NN of the student model with the one or more objects identified by the second NN of the teacher model.

13. The method of claim 11, wherein each of the plurality of ground truth depth indicators comprises at least one of (i) a depth distribution obtained by the first NN of the teacher model for the respective pixel, or (ii) a distance, obtained by a range-sensing device, to a portion of an environment imaged by the respective pixel.

14. A system comprising: a memory; and a processing device communicative coupled to the memory, the processing device configured to: obtain one or more perspective camera images of an environment; generate, using a first neural network (NN), for each pixel of a set of pixels of the one or more perspective camera images, a feature vector (FV), and a depth distribution for a portion of the environment imaged by a corresponding pixel, wherein the first NN is trained using a plurality of training images and a depth ground truth data for the plurality of training images; obtain, for each pixel of the set of pixels, a feature tensor (FT) in view of (i) the FV for a respective pixel and (ii) the depth distribution for the respective pixel; and process the obtained FTs, using a second NN, to identify one or more objects in the environment.

15. The system of claim 14, wherein to process the obtained FTs, the processing device is to: obtain a combined FT using the FTs for the set of pixels; map the combined FT to a ground surface to obtain a projected FT; and use the second NN to process the projected FT.

16. The system of claim 15, wherein to map the combined FT to the ground surface, the processing device is to: transform the combined FT to a set of coordinates associated with the ground surface; and aggregate elements of the combined FT in a vertical direction to obtain the projected FT.

17. The system of claim 15, wherein the one or more perspective camera images are associated with a first time, and wherein the processing device is further to: obtain one or more additional perspective camera images associated with at least a second time; generate, using the one or more additional perspective camera images, an additional projected FT; and perform a concurrent processing of the projected FT and the additional projected FT, wherein the concurrent processing is performed by an aggregation
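
Claims 3 and 16 describe mapping the combined FT to ground-surface coordinates and aggregating its elements in the vertical direction. The sketch below shows one simple way this could look, assuming the aggregation is a sum and that each element already has ground-aligned (x, y, z) coordinates. The claims specify neither detail, and the function name `project_to_ground`, the `cell_size` parameter, and the sample points are hypothetical.

```python
import math

def project_to_ground(points, cell_size=1.0):
    """Sum feature vectors of all 3D points that fall in the same ground cell.

    points: iterable of ((x, y, z), feature_vector) in ground-aligned coordinates.
    Returns {(ix, iy): summed_feature_vector}; the height z is aggregated away.
    """
    grid = {}
    for (x, y, _z), feat in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        acc = grid.setdefault(cell, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
    return grid

points = [
    ((0.2, 0.7, 0.0), [1.0, 0.0]),
    ((0.4, 0.1, 1.5), [0.5, 2.0]),   # same ground cell, different height
    ((3.9, -0.2, 0.3), [0.0, 1.0]),
]
bev = project_to_ground(points)
print(bev[(0, 0)])   # [1.5, 2.0]
print(bev[(3, -1)])  # [0.0, 1.0]
```

The first two points differ only in height and land in the same cell, so their features are summed. That is the sense in which this sketch collapses the vertical direction into a bird's-eye-view grid.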

Assignees

Waymo LLC

Inventors

Classifications

Code	CPC title
G06V30/18086	by performing operations within image blocks or by using histograms
G06V10/40	Extraction of image or video features
G06V20/698	Matching; Classification
G06V10/70	using pattern recognition or machine learning (optical pattern recognition or electronic computations therefor G06V10/88)
G06T3/4046	using neural networks

Patent family

Related publications grouped by family.

View patent family 87847954

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12266190B2 cover?

The described aspects and implementations enable efficient detection and classification of objects with machine learning models that deploy a bird's-eye view representation and are trained using depth ground truth data. In one implementation, disclosed are system and techniques that include obtaining images, generating, using a first neural network (NN), feature vectors (FVs) and depth distribu…

Who is the assignee on this patent?

Waymo LLC

Related patents

Automated Building Information Determination Using Inter-Image Analysis Of Multiple Building Images

InSeGAN: A Generative Approach to Instance Segmentation in Depth Images

Systems and Methods for End-to-End Trajectory Prediction Using Radar, Lidar, and Maps

Deep neural network for segmentation of road scenes and animate object instances for autonomous driving applications

Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
