Searching an autonomous vehicle sensor data repository based on context embedding
US-2022164350-A1 · May 26, 2022 · US
US12497079B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12497079-B2 |
| Application number | US-202318335915-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 15, 2023 |
| Priority date | Jun 15, 2022 |
| Publication date | Dec 16, 2025 |
| Grant date | Dec 16, 2025 |
Official abstract text for this publication.
Methods, systems, and apparatus for generating trajectory predictions for one or more target agents. In one aspect, a system comprises one or more computers configured to obtain scene context data characterizing a scene in an environment at a current time point, where the scene includes multiple agents that include a target agent and one or more context agents, and the scene context data includes respective context data for each of multiple different modalities of context data. The one or more computers then generate an encoded representation of the scene in the environment that includes one or more embeddings and process the encoded representation of the scene context data using a decoder neural network to generate a trajectory prediction output for the target agent that predicts a future trajectory of the target after the current time point.
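The encoding step summarized in the abstract (per-modality sequences, concatenation, attention over the combined sequence) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption for illustration, not taken from the patent: the modality names and shapes, the random projection weights, the shared width `SHARED_DIM`, and the use of a single attention head with identity query/key/value maps. It is not the patented implementation, and the decoder is omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): queries, keys, and values are the inputs
    themselves, with no learned projections.
    """
    d = len(seq[0])
    transposed = [list(col) for col in zip(*seq)]
    scores = matmul(seq, transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, seq)

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for real context data or trained weights."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def encode_scene(modalities, weights):
    """Project each modality to the shared width, concatenate, then attend."""
    combined = []
    for name, context in modalities.items():
        # Each modality has its own feature size; the projection makes every
        # input element the same width so the sequences can be concatenated.
        combined.extend(matmul(context, weights[name]))
    # One attention pass over the combined sequence lets every element attend
    # to elements from every modality.
    return self_attention(combined)

rng = random.Random(0)
SHARED_DIM = 8  # assumed shared dimensionality
modalities = {  # hypothetical shapes: [num_elements, feature_dim]
    "target_history": random_matrix(5, 4, rng),
    "context_agents": random_matrix(6, 4, rng),
    "road_graph": random_matrix(7, 3, rng),
    "traffic_signals": random_matrix(2, 2, rng),
}
weights = {
    name: random_matrix(len(data[0]), SHARED_DIM, rng)
    for name, data in modalities.items()
}
embeddings = encode_scene(modalities, weights)
print(len(embeddings), len(embeddings[0]))  # 20 8
```

The output has one embedding per element of the combined sequence (5 + 6 + 7 + 2 = 20), each of width `SHARED_DIM`. A real encoder would stack several such blocks with learned projections; a decoder network would then map these embeddings to a trajectory prediction.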
Opening claim text (preview).
What is claimed is:
1. A method performed by one or more computers, the method comprising: obtaining scene context data characterizing a scene in an environment at a current time point, wherein the scene includes a plurality of agents comprising a target agent and one or more context agents, and the scene context data comprises respective context data for each of multiple different modalities of context data, wherein the scene context data comprises data generated from data captured by one or more sensors of an autonomous vehicle, and wherein the target agent in the set is an agent in a vicinity of the autonomous vehicle in the environment; generating an encoded representation of the scene in the environment that comprises one or more embeddings, comprising: generating, for each of the multiple different modalities, a respective sequence of input elements for the modality from the context data for the modality; generating a combined sequence by concatenating the respective sequences of input elements for each of the different modalities; and processing the combined sequence using an attention-based encoder neural network to generate the one or more embeddings, wherein the attention-based encoder neural network comprises at least one cross-modal attention layer block that attends over input elements corresponding to each of the multiple different modalities; processing the encoded representation of the scene context data using a decoder neural network to generate a trajectory prediction output for the target agent that predicts a future trajectory of the target agent after the current time point; and providing at least one of the trajectory prediction output for the target agent or data derived from the trajectory prediction output to a planning system to control navigation of the autonomous vehicle.
2. The method of claim 1, wherein the trajectory prediction output defines a probability distribution over possible future trajectories of the target agent after the current time point.
3. The method of claim 1, wherein the trajectory prediction output is generated on-board the autonomous vehicle.
4. The method of claim 1, wherein the scene context data comprises target agent history context data characterizing current and previous states of the target agent.
5. The method of claim 1, wherein the scene context data comprises context agent history context data characterizing current and previous states of each of the one or more context agents.
6. The method of claim 1, wherein the scene context data comprises road graph context data characterizing road features in the scene.
7. The method of claim 1, wherein the scene context data comprises traffic signal context data characterizing at least respective current states of one or more traffic signals in the scene.
8. The method of claim 1, wherein generating, for each of the multiple different modalities, a respective sequence of input elements for the modality from the context data for the modality comprises, for each of the modalities: generating an initial sequence of input elements for the modality from the context data for the modality; and processing the initial sequence using an attention neural network that is specific to the modality to generate the sequence of input elements.
9. The method of claim 1, wherein generating, for each of the multiple different modalities, a respective sequence of input elements for the modality from the context data for the modality comprises, for each of the modalities: projecting the context data for the modality into a sequence of input elements that each have a dimensionality that is shared across the modalities.
10. The method of claim 9, wherein projecting the context data for the modality into a sequence of input elements that each have a dimensionality that is shared across the modalities comprises: projecting the context data for the modality into a sequence of input elements that each have a dimensionality that is shared across the modalities without applying attention over the context data.
11. The method of claim 9, wherein generating, for each of the multiple different modalities, a respective sequence of input elements for the modality from the context data for the modality comprises, for each of the modalities: applying positional embedding to each of the input elements.
12. The method of claim 11, wherein the context data for each modality is represented as a tensor having a feature dimension, and wherein projecting the context data comprises projecting the feature dimension to have the shared dimensionality.
13. The method of claim 1, wherein each input element corresponds to a respective time point along a temporal dimension, and wherein the attention-based encoder neural network comprises one or more temporal cross-modal attention layer blocks that self-attend over input elements corresponding to each of the multiple different modalities along the temporal dimension.
14. The method of claim 13, wherein, for each index along the temporal dimension, each temporal cross-modal attention layer block updates the input elements having the index by attending over the input elements having the index.
15. The method of claim 14, wherein each input element corresponds to a respective spatial entity along a spatial dimension and wherein the attention-based encoder neural network comprises one or more spatial attention layer blocks that self-attend over input elements along the spatial dimension.
16. The method of claim 15, wherein, for each index along the spatial dimension, each spatial cross-modal attention layer block updates the input elements having the index by attending over the input elements having the index.
17. The method of claim 13, wherein the encoded representation of the scene in the environment that comprises a respective embedding for each input element in the combined sequence.
18. The method of claim 1, wherein the attention-based encoder neural network also receives as input a set of learned queries and comprises: (i) one or more self-attention layer blocks that update the learned queries by applying self-attention over the learned queries, and (ii) one or more cross-attention cross-modal layer blocks that update the learned queries by applying cross-attention between the learned queries and the combined sequence.
19. The method of claim 18, wherein the encoded representation of the scene in the environment comprises a respective embedding for each learned query.
20. The method of claim 1, further comprising: controlling, by the planning system of the autonomous vehicle, the autonomous vehicle to navigate in the environment based on (i) the trajectory prediction output for the target agent, (ii) data derived from the trajectory prediction output, or (iii) both.
21. A system comprising: one or more computers; and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: obtaining scene context data characterizing a scene in an environment at a current time point, wherein the scene includes a plurality of agents comprising a target agent and one or more context agents, and the scene context data comprises respective context data for each of multiple different modalities of context data, wherein the scene context data comprises data generated from data captured by one or more sensors of an a
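The learned-query variant of the encoder described in claims 18 and 19 can be sketched in the same spirit. This is a rough illustration under stated assumptions, not the claimed implementation: the block ordering (cross-attention followed by self-attention), the residual additions, the number of blocks, the single attention head without learned projections, and the random stand-ins for the learned queries and the combined sequence are all hypothetical choices.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same sequence."""
    d = len(queries[0])
    transposed = [list(col) for col in zip(*keys_values)]
    scores = matmul(queries, transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, keys_values)

def add(a, b):
    """Element-wise sum, used here as an assumed residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def latent_query_encoder(learned_queries, combined_sequence, num_blocks=2):
    """Update a small set of queries from a long combined sequence."""
    latents = learned_queries
    for _ in range(num_blocks):
        # Cross-attention: each query reads from every element of the
        # combined (all-modality) sequence.
        latents = add(latents, attend(latents, combined_sequence))
        # Self-attention: the queries exchange information with each other.
        latents = add(latents, attend(latents, latents))
    return latents

rng = random.Random(0)
dim = 8  # assumed shared dimensionality
# Hypothetical stand-ins: 20 combined input elements, 4 learned queries.
combined_sequence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(20)]
learned_queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]

scene_embeddings = latent_query_encoder(learned_queries, combined_sequence)
print(len(scene_embeddings), len(scene_embeddings[0]))  # 4 8
```

The encoded representation here has one embedding per learned query, as in claim 19, so its size is fixed by the number of queries rather than by the length of the combined sequence.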
Traffic conditions · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Historical data · CPC title
Road conditions · CPC title
Position · CPC title