Knowledge-driven scene priors for semantic audio-visual embodied navigation

US12586393B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12586393-B2
Application number  US-202318222090-A
Country             US
Kind code           B2
Filing date         Jul 14, 2023
Priority date       Jul 14, 2023
Publication date    Mar 24, 2026
Grant date          Mar 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method of controlling navigation of a device in an environment using machine learning (ML) models includes receiving visual and audio observation data of the environment as sensed by the device, determining classification scores for objects and regions in the environment based on the visual and audio observation data, encoding visual information based on the classification scores, determining audio-semantic feature embeddings based at least in part on the classification scores, the audio-semantic feature embeddings indicating spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment, and determining and outputting, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment.
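The data flow in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the toy vocabulary, the edge list, the classifier scores, the random untrained weights, and the helper names (`neighbors`, `propagate`, `project`, `state_representation`) are all hypothetical. A single mean-aggregation message-passing step stands in for the graph encoder network named in claims 2 and 3, and concatenation stands in for whatever fusion the policy network actually uses.

```python
import math
import random

# Hypothetical vocabulary of objects and regions (illustrative only).
NODES = ["sofa", "tv", "sink", "living_room", "kitchen"]

# Hypothetical knowledge-graph edges: pairs of related nodes.
EDGES = [("sofa", "tv"), ("sofa", "living_room"),
         ("tv", "living_room"), ("sink", "kitchen")]


def neighbors(nodes, edges):
    """Map each node to itself plus its graph neighbors (self-loop included)."""
    nbrs = {n: {n} for n in nodes}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return nbrs


def propagate(scores, nodes, edges):
    """One message-passing step: average each node's score with its neighbors'."""
    nbrs = neighbors(nodes, edges)
    return {n: sum(scores[m] for m in nbrs[n]) / len(nbrs[n]) for n in nodes}


def project(values, weights):
    """Linear layer followed by tanh; weights has one row per output unit."""
    return [math.tanh(sum(w * v for w, v in zip(row, values))) for row in weights]


def state_representation(visual_code, audio_semantic):
    """Fuse the two streams by concatenation (one simple choice among many)."""
    return visual_code + audio_semantic


random.seed(0)
dim = 4
# Untrained random weights; a real system would learn these.
w_audio = [[random.gauss(0, 1) for _ in NODES] for _ in range(dim)]
w_visual = [[random.gauss(0, 1) for _ in NODES] for _ in range(dim)]

# Made-up classifier outputs for one time step.
audio_scores = dict(zip(NODES, [0.1, 0.7, 0.05, 0.6, 0.1]))
visual_scores = dict(zip(NODES, [0.8, 0.6, 0.0, 0.9, 0.05]))

smoothed = propagate(audio_scores, NODES, EDGES)
audio_semantic = project([smoothed[n] for n in NODES], w_audio)
visual_code = project([visual_scores[n] for n in NODES], w_visual)
state = state_representation(visual_code, audio_semantic)
print(len(state))  # 8
```

In this toy graph, propagation lets a confident "tv" score raise the scores of the related "sofa" and "living_room" nodes, which is the kind of object-region prior the abstract describes.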

First claim

Opening claim text (preview).

What is claimed is:

1. A method of controlling navigation of a device in an environment using machine learning (ML) models, the method comprising, using one or more processing devices: receiving visual and audio observation data of the environment as sensed by the device; determining classification scores for objects and regions in the environment based on the visual and audio observation data; encoding visual information based on the classification scores; determining (i) a distance and direction of a sounding object from the device based on the audio observation data and (ii) a direct-to-reverberant ratio (DRR) of an impulse sounding response between the sounding object and the device; determining audio-semantic feature embeddings based at least in part on the classification scores, the determined distance and direction of the sounding object from the device, and the DRR, wherein the audio-semantic feature embeddings indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment; determining and outputting, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment; and controlling operation of the device based on the state representation.

2. The method of claim 1, wherein determining the audio-semantic feature embeddings includes determining the audio-semantic feature embeddings using a first graph encoder network.

3. The method of claim 2, wherein the first graph encoder network determines the audio-semantic feature embeddings using a first knowledge graph, and wherein vertices in the first knowledge graph correspond to objects or regions in the environment and edges between respective pairs of vertices correspond to relationships between the respective pairs of vertices.

4. The method of claim 3, further comprising determining visual-semantic feature embeddings based at least in part on the classification scores, wherein determining the visual-semantic feature embeddings further indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment.

5. The method of claim 4, wherein determining the visual-semantic feature embeddings includes determining the visual-semantic feature embeddings using a second graph encoder network.

6. The method of claim 5, further comprising encoding the visual information based on an output of the second graph encoder network and an output of an audio encoder.

7. The method of claim 1, wherein determining the classification scores for objects and regions in the environment based on the visual and audio observation data includes (i) determining visual classification scores using a pre-trained vision model and (ii) determining audio classification scores using a pre-trained audio model.

8. The method of claim 1, further comprising providing, during training of the ML model, visual and audio data corresponding to (i) previously seen indoor environments and previously heard sounds, (ii) previously seen indoor environments and unheard sounds, (iii) unseen houses and previously heard sounds, and (iv) unseen houses and unheard sounds.

9. A system for controlling navigation of a device in an environment using machine learning (ML) models, the system comprising: sensors configured to receive visual and audio observation data of the environment; a vision network configured to determine visual classification scores for objects and regions in the environment based on the visual observation data; a location predictor configured to determine (i) a distance and direction of a sounding object from the device based on the audio observation data and (ii) a direct-to-reverberant ratio (DRR) of an impulse sounding response between the sounding object and the device; an audio network configured to (i) determine audio classification scores for objects and regions in the environment based on the audio observation data and (ii) determine audio-semantic feature embeddings based at least in part on the visual classification scores, the determined distance and direction of the sounding object from the device, and the DRR, wherein the audio-semantic feature embeddings indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment; a policy network configured to (i) encode visual information based on the visual classification scores and (ii) determine and output, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment; and one or more processing devices configured to control operation of the device based on the state representation.

10. The system of claim 9, wherein the audio network includes a first graph encoder network configured to determine the audio-semantic feature embeddings using a first knowledge graph, and wherein vertices in the first knowledge graph correspond to objects or regions in the environment and edges between respective pairs of vertices correspond to relationships between the respective pairs of vertices.

11. The system of claim 10, wherein the vision network includes a second graph encoder network configured to determine visual-semantic feature embeddings based at least in part on the visual classification scores, and wherein the policy network includes an encoder configured to encode the visual information based on an output of the second graph encoder network and an output of an audio encoder.

12. The system of claim 9, wherein the vision network and the audio network are configured to respectively implement a pre-trained vision model and a pre-trained audio model to determine the visual and audio classification scores, and wherein the pre-trained vision model and the pre-trained audio model are configured to receive, during training, visual and audio data corresponding to (i) previously seen indoor environments and previously heard sounds, (ii) previously seen indoor environments and unheard sounds, (iii) unseen houses and previously heard sounds, and (iv) unseen houses and unheard sounds.

13. A computing device configured to control navigation of a device in an environment using machine learning (ML) models, the computing device including a processing device configured to execute instructions stored in memory to: receive visual and audio observation data of the environment as sensed by the device; determine classification scores for objects and regions in the environment based on the visual and audio observation data; encode visual information based on the classification scores; determine (i) a distance and direction of a sounding object from the device based on the audio observation data and (ii) a direct-to-reverberant ratio (DRR) of an impulse sounding response between the sounding object and the device; determine audio-semantic feature embeddings based at least in part on the classification scores, the determined distance and direction of the sounding object from the device, and the DRR, wherein the audio-semantic feature embeddings indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment; determine and output, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment; and control
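The independent claims rely on a direct-to-reverberant ratio (DRR) but the claim text shown here does not say how it is computed. The sketch below uses a common textbook definition, given here as an assumption: the energy within a short window around the strongest peak of the impulse response, divided by the energy after that window, expressed in decibels. The function name `direct_to_reverberant_ratio`, the 2.5 ms window, and the synthetic impulse response are all illustrative choices, not details from the patent.

```python
import math


def direct_to_reverberant_ratio(impulse_response, sample_rate, window_ms=2.5):
    """Return the DRR in dB of a room impulse response given as a list of samples.

    The direct part is taken as the samples within +/- window_ms of the
    strongest peak; everything after that window counts as reverberation.
    """
    peak = max(range(len(impulse_response)), key=lambda i: abs(impulse_response[i]))
    half = int(round(window_ms * 1e-3 * sample_rate))
    start = max(0, peak - half)
    end = min(len(impulse_response), peak + half + 1)
    direct = sum(x * x for x in impulse_response[start:end])
    reverberant = sum(x * x for x in impulse_response[end:])
    if reverberant == 0.0:
        return math.inf  # no reverberant energy, e.g. an anechoic response
    return 10.0 * math.log10(direct / reverberant)


# Synthetic example: a unit direct impulse followed by a decaying tail.
sample_rate = 16000
ir = [0.0] * 1600
ir[100] = 1.0
for n in range(200, 1600):
    sign = 1.0 if n % 2 == 0 else -1.0
    ir[n] = 0.05 * sign * math.exp(-(n - 200) / 300.0)

drr_db = direct_to_reverberant_ratio(ir, sample_rate)
print(round(drr_db, 1))
```

A higher DRR means the direct path dominates, which generally corresponds to a nearby source with a clear line of sight; a lower DRR suggests a distant or occluded source. That is one plausible reason the claims feed the DRR into the audio-semantic embeddings alongside distance and direction.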

Assignees

Robert Bosch GmbH; Carnegie Mellon University

Inventors

Classifications

  • the classifiers operating on different input data, e.g. multi-modal recognition · CPC title

  • Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title

  • Semantic analysis · CPC title

  • using neural networks · CPC title

  • G06V20/70 · Primary

    Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12586393B2 cover?
A method of controlling navigation of a device in an environment using machine learning (ML) models includes receiving visual and audio observation data of the environment as sensed by the device, determining classification scores for objects and regions in the environment based on the visual and audio observation data, encoding visual information based on the classification scores, determining…
Who is the assignee on this patent?
Robert Bosch GmbH and Carnegie Mellon University.
What technology area does this patent fall under?
Primary CPC classification G06V10/7715. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 24, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).