Video scene describer
US-2025246176-A1 · Jul 31, 2025 · US
US12586393B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12586393-B2 |
| Application number | US-202318222090-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 14, 2023 |
| Priority date | Jul 14, 2023 |
| Publication date | Mar 24, 2026 |
| Grant date | Mar 24, 2026 |
Official abstract text for this publication.
A method of controlling navigation of a device in an environment using machine learning (ML) models includes receiving visual and audio observation data of the environment as sensed by the device, determining classification scores for objects and regions in the environment based on the visual and audio observation data, encoding visual information based on the classification scores, determining audio-semantic feature embeddings based at least in part on the classification scores, the audio-semantic feature embeddings indicating spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment, and determining and outputting, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment.
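As an illustration only, the flow described in the abstract can be sketched in a few lines. The sketch assumes a tiny hand-written knowledge graph, made-up classification scores, and a single round of mean-aggregation message passing standing in for a graph encoder. The function names (`normalize_adjacency`, `propagate`, `state_representation`), the vertices, and every number are hypothetical and are not taken from the patent.

```python
# Hypothetical sketch: the graph, scores, and function names are illustrative
# assumptions, not the networks described in the patent.

def normalize_adjacency(edges, n):
    """Build a row-normalized adjacency matrix with self-loops."""
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    return [[v / sum(row) for v in row] for row in adj]


def propagate(adj, scores):
    """One round of mean-aggregation message passing over the graph."""
    return [sum(a * s for a, s in zip(row, scores)) for row in adj]


def state_representation(visual_encoding, audio_semantic):
    """Concatenate the two feature vectors into a single state vector."""
    return list(visual_encoding) + list(audio_semantic)


# Vertices are objects or regions; edges are relationships between them.
VERTICES = ["kitchen", "kettle", "bedroom"]
EDGES = [(0, 1)]  # assumed relation: a kettle is usually found in a kitchen

audio_scores = [0.0, 0.9, 0.1]  # assumed audio classification score per vertex
visual_encoding = [0.2, 0.7]    # assumed output of a visual encoder

adj = normalize_adjacency(EDGES, len(VERTICES))
audio_semantic = propagate(adj, audio_scores)
state = state_representation(visual_encoding, audio_semantic)
```

In this toy setup the strong "kettle" score spreads to the connected "kitchen" vertex, which is the kind of object-region relationship the embeddings are said to indicate. A real system would use learned graph encoder weights rather than plain averaging.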
Opening claim text (preview).
What is claimed is:
1. A method of controlling navigation of a device in an environment using machine learning (ML) models, the method comprising, using one or more processing devices: receiving visual and audio observation data of the environment as sensed by the device; determining classification scores for objects and regions in the environment based on the visual and audio observation data; encoding visual information based on the classification scores; determining (i) a distance and direction of a sounding object from the device based on the audio observation data and (ii) a direct-to-reverberant ratio (DRR) of an impulse sounding response between the sounding object and the device; determining audio-semantic feature embeddings based at least in part on the classification scores, the determined distance and direction of the sounding object from the device, and the DRR, wherein the audio-semantic feature embeddings indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment; determining and outputting, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment; and controlling operation of the device based on the state representation.
2. The method of claim 1, wherein determining the audio-semantic feature embeddings includes determining the audio-semantic feature embeddings using a first graph encoder network.
3. The method of claim 2, wherein the first graph encoder network determines the audio-semantic feature embeddings using a first knowledge graph, and wherein vertices in the first knowledge graph correspond to objects or regions in the environment and edges between respective pairs of vertices correspond to relationships between the respective pairs of vertices.
4. The method of claim 3, further comprising determining visual-semantic feature embeddings based at least in part on the classification scores, wherein determining the visual-semantic feature embeddings further indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment.
5. The method of claim 4, wherein determining the visual-semantic feature embeddings includes determining the visual-semantic feature embeddings using a second graph encoder network.
6. The method of claim 5, further comprising encoding the visual information based on an output of the second graph encoder network and an output of an audio encoder.
7. The method of claim 1, wherein determining the classification scores for objects and regions in the environment based on the visual and audio observation data includes (i) determining visual classification scores using a pre-trained vision model and (ii) determining audio classification scores using a pre-trained audio model.
8. The method of claim 1, further comprising providing, during training of the ML model, visual and audio data corresponding to (i) previously seen indoor environments and previously heard sounds, (ii) previously seen indoor environments and unheard sounds, (iii) unseen houses and previously heard sounds, and (iv) unseen houses and unheard sounds.
9. A system for controlling navigation of a device in an environment using machine learning (ML) models, the system comprising: sensors configured to receive visual and audio observation data of the environment; a vision network configured to determine visual classification scores for objects and regions in the environment based on the visual observation data; a location predictor configured to determine (i) a distance and direction of a sounding object from the device based on the audio observation data and (ii) a direct-to-reverberant ratio (DRR) of an impulse sounding response between the sounding object and the device; an audio network configured to (i) determine audio classification scores for objects and regions in the environment based on the audio observation data and (ii) determine audio-semantic feature embeddings based at least in part on the visual classification scores, the determined distance and direction of the sounding object from the device, and the DRR, wherein the audio-semantic feature embeddings indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment; a policy network configured to (i) encode visual information based on the visual classification scores and (ii) determine and output, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment; and one or more processing devices configured to control operation of the device based on the state representation.
10. The system of claim 9, wherein the audio network includes a first graph encoder network configured to determine the audio-semantic feature embeddings using a first knowledge graph, and wherein vertices in the first knowledge graph correspond to objects or regions in the environment and edges between respective pairs of vertices correspond to relationships between the respective pairs of vertices.
11. The system of claim 10, wherein the vision network includes a second graph encoder network configured to determine visual-semantic feature embeddings based at least in part on the visual classification scores, and wherein the policy network includes an encoder configured to encode the visual information based on an output of the second graph encoder network and an output of an audio encoder.
12. The system of claim 9, wherein the vision network and the audio network are configured to respectively implement a pre-trained vision model and a pre-trained audio model to determine the visual and audio classification scores, and wherein the pre-trained vision model and the pre-trained audio model are configured to receive, during training, visual and audio data corresponding to (i) previously seen indoor environments and previously heard sounds, (ii) previously seen indoor environments and unheard sounds, (iii) unseen houses and previously heard sounds, and (iv) unseen houses and unheard sounds.
13. A computing device configured to control navigation of a device in an environment using machine learning (ML) models, the computing device including a processing device configured to execute instructions stored in memory to: receive visual and audio observation data of the environment as sensed by the device; determine classification scores for objects and regions in the environment based on the visual and audio observation data; encode visual information based on the classification scores; determine (i) a distance and direction of a sounding object from the device based on the audio observation data and (ii) a direct-to-reverberant ratio (DRR) of an impulse sounding response between the sounding object and the device; determine audio-semantic feature embeddings based at least in part on the classification scores, the determined distance and direction of the sounding object from the device, and the DRR, wherein the audio-semantic feature embeddings indicate spatial relationships between objects in the environment, between regions in the environment, and between objects and regions in the environment; determine and output, based on the encoded visual information and the audio-semantic feature embeddings, a state representation corresponding to a state of the device within the environment; and control
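The claims recite a direct-to-reverberant ratio (DRR) but this page does not say how it is computed. The sketch below uses one conventional definition as an assumption: the energy in a short window around the strongest sample of a room impulse response, divided by the energy of everything after that window, expressed in decibels. The function name, the 2.5 ms window, and the toy impulse response are hypothetical choices, not details from the patent.

```python
import math


def direct_to_reverberant_ratio(impulse_response, sample_rate, window_ms=2.5):
    """Return the DRR in dB: direct-path energy over reverberant energy.

    Assumption: the direct path is a window of +/- window_ms around the
    largest-magnitude sample; all samples after that window are reverberant.
    """
    if not impulse_response:
        raise ValueError("impulse_response must not be empty")
    # Index of the strongest sample, taken as the direct-path arrival.
    peak = max(range(len(impulse_response)), key=lambda i: abs(impulse_response[i]))
    half = int(round(window_ms * 1e-3 * sample_rate))
    start = max(0, peak - half)
    end = min(len(impulse_response), peak + half + 1)
    direct = sum(x * x for x in impulse_response[start:end])
    reverberant = sum(x * x for x in impulse_response[end:])
    if reverberant == 0.0:
        return math.inf  # no reverberant tail at all
    return 10.0 * math.log10(direct / reverberant)


# Toy impulse response: silence, a unit direct spike, silence, then a weak tail.
toy_response = [0.0] * 5 + [1.0] + [0.0] * 5 + [0.1] * 10
drr_db = direct_to_reverberant_ratio(toy_response, sample_rate=2000)
```

For the toy response the direct energy is 1.0 and the tail energy is about 0.1, so the result is close to 10 dB. A higher DRR generally suggests a nearer source with less room reflection, which is why it can complement distance and direction estimates.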
the classifiers operating on different input data, e.g. multi-modal recognition · CPC title
Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods · CPC title
Semantic analysis · CPC title
using neural networks · CPC title
Labelling scene content, e.g. deriving syntactic or semantic representations · CPC title