Interpreting discrete tasks from complex instructions for robotic systems and applications

US12487581B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12487581-B2
Application number  US-202217697566-A
Country             US
Kind code           B2
Filing date         Mar 17, 2022
Priority date       Mar 17, 2022
Publication date    Dec 2, 2025
Grant date          Dec 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Approaches provide for performance of a complex (e.g., compound) task that may involve multiple discrete tasks not obvious from an instruction to perform the complex task. A set of conditions for an environment can be determined using captured image data, and the instruction analyzed to determine a set of final conditions to exist in the environment after performance of the instruction. These initial and end conditions are used to determine a sequence of discrete tasks to be performed to cause a robot or automated device to perform the instruction. This can involve use of a symbolic or visual planner in at least some embodiments, as well as a search of possible sequences of actions available for the robot or automated device. A robot can be caused to perform the sequence of discrete tasks, and feedback provided such that the sequence of tasks can be modified as appropriate.
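The planning step described in the abstract (going from initial conditions to end conditions through a sequence of discrete tasks) can be illustrated with a minimal sketch. This is not the patent's implementation: it assumes conditions are plain string symbols, models each performable action with STRIPS-style preconditions and effects, and uses a breadth-first search. The `Action` type, the `plan` function, and the cup-and-sink scene are all hypothetical names chosen for illustration.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    """A hypothetical performable action with preconditions and effects."""
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    del_effects: FrozenSet[str]


def plan(current: FrozenSet[str], expected: FrozenSet[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search from the current conditions to any state
    that satisfies every expected condition. Returns None if no
    sequence of the given actions reaches such a state."""
    queue = deque([(current, [])])
    visited = {current}
    while queue:
        state, steps = queue.popleft()
        if expected <= state:  # all expected conditions hold
            return steps
        for action in actions:
            if action.preconditions <= state:  # action is applicable
                next_state = (state - action.del_effects) | action.add_effects
                if next_state not in visited:
                    visited.add(next_state)
                    queue.append((next_state, steps + [action.name]))
    return None


# Assumed scene: a cup on a table that should end up in a sink.
ACTIONS = [
    Action("pick_cup",
           frozenset({"cup_on_table", "hand_empty"}),
           frozenset({"holding_cup"}),
           frozenset({"cup_on_table", "hand_empty"})),
    Action("place_cup_in_sink",
           frozenset({"holding_cup"}),
           frozenset({"cup_in_sink", "hand_empty"}),
           frozenset({"holding_cup"})),
]

current_conditions = frozenset({"cup_on_table", "hand_empty"})
expected_conditions = frozenset({"cup_in_sink"})
sequence = plan(current_conditions, expected_conditions, ACTIONS)
print(sequence)  # ['pick_cup', 'place_cup_in_sink']
```

In this sketch the instruction itself never mentions picking up the cup; the intermediate task falls out of the search because it is the only way to satisfy the precondition of the final action.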

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: receiving audio data corresponding to a spoken request to perform a compound task, the compound task involving an unspecified plurality of discrete tasks to be performed; analyzing the audio data to generate a textual representation of the spoken request; obtaining image data representing a current state of an environment in which the compound task is to be performed; analyzing the image data to obtain a set of current conditions for the current state of the environment; obtaining a set of expected conditions of the environment by processing the textual representation of the spoken request together with one or more features extracted from the image data representing the current state, wherein the processing includes using natural language understanding on the textual representation of the spoken request and on the one or more features extracted from the image data; determining, based at least in part upon a set of performable actions, a sequence of discrete tasks to be performed to transition from the set of current conditions to the set of expected conditions of the environment; and causing instructions for the sequence of discrete tasks to be executed to perform the compound task.

2. The method of claim 1, wherein the compound task is to be performed using a robotic device, and wherein the set of performable actions is determined at least in part using a type of the robotic device.

3. The method of claim 1, further comprising: determining the sequence of discrete tasks using a tree-based search, wherein at least one branch of the tree includes a subset of performable actions selected based at least in part upon a respective subset of conditions being satisfied.

4. The method of claim 3, wherein the sequence is one of a plurality of candidate sequences selected to minimize a cost of performance of the compound task.

5. The method of claim 1, wherein the set of current conditions is determined based upon a set of segmentation masks generated for objects detected in the environment using the image data.

6. The method of claim 1, wherein the set of current conditions is determined based at least in part on identifying a set of objects in the environment represented using the image data.

7. The method of claim 1, further comprising: monitoring the state of the environment during performance of the compound task; and adjusting the sequence of discrete tasks based at least in part upon a change in the environment.

8. The method of claim 1, further comprising: encoding the image data into a latent space to be provided as input to a neural network, wherein the analyzing of the image data comprises analyzing the image data using the neural network to obtain at least one current condition of the set of current conditions for the current state of the environment.

9. The method of claim 1, further comprising: generating a set of symbols representative of the set of current conditions and the set of expected conditions of the environment, wherein the sequence of discrete tasks is determined using the set of symbols.

10. A system, comprising: a language model to convert a spoken instruction into a textual representation of a compound task; an image model to determine a set of visual features corresponding to a current state represented in an image of an environment in which the compound task is to be performed; a natural language understanding module to determine a set of correlations between both the textual representation of the spoken request and the set of visual features corresponding to a current state, the set of correlations being determined using natural language understanding on the textual representation of the spoken request and on the set of visual features; a task planner to determine a sequence of discrete tasks to be performed for the compound task, based on both the set of correlations and a set of performable actions; and an execution module to cause instructions for the sequence of discrete tasks to be performed.

11. The system of claim 10, wherein the compound task is to be performed using a robotic device, and wherein the sequence of discrete tasks is determined at least in part by a type corresponding to the robotic device.

12. The system of claim 10, wherein the sequence of discrete tasks is determined using a tree-based search, wherein at least one branch of the tree includes a subset of performable actions selected based at least in part upon a respective subset of conditions being satisfied.

13. The system of claim 12, wherein the sequence is one of a plurality of candidate sequences selected to minimize a cost of performance of the compound task.

14. The system of claim 10, wherein the set of visual features is determined based at least in part on a set of segmentation masks generated for objects detected in the environment using image data from the image of the environment.

15. The system of claim 10, further comprising: a performance monitor to monitor the state of the environment, during performance of the compound task, and adjust the sequence of discrete tasks based at least in part upon a change in the environment.

16. The system of claim 10, wherein the system comprises at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for rendering graphical output; a system for generating synthetic data; a system for generating multi-dimensional assets using a collaborative content creation platform; a system for performing deep learning operations; a system implemented using an edge device; a system incorporating one or more Virtual Machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.

17. A non-transitory computer-readable storage medium including instructions that, if executed by one or more processors, cause the one or more processors to: analyze audio data to generate a textual representation of a spoken request, the spoken request corresponding to a performance of a compound task represented using the audio data; analyze image data to obtain a set of current conditions for a current state of an environment; obtain a set of expected conditions of the environment by processing the textual representation of the spoken request together with one or more features extracted from the image data representing the current state, wherein the processing includes using natural language understanding on the textual representation of the spoken request and on the one or more features extracted from the image data; determine, based at least in part upon a set of performable actions, a sequence of discrete tasks to be performed to transition from the set of current conditions to the set of expected conditions of the environment; and cause instructions for the sequence of discrete tasks to be executed to perform the compound task.

18. The non-transitory computer-readable storage medium of claim 17, wherein the instructions, if executed, further cause the one or more processors to: determine the set of performable actions based upon a type of robotic device to perform the compound task.

19. The non-transitory computer-readable storage medium of claim 17, wherein the instructions, if executed, further cause the one or more processors to: determine
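Claims 3 and 4 describe a tree-based search in which a branch is only expanded with actions whose conditions are satisfied, and in which one of several candidate sequences is selected to minimize cost. One way this could look, as a sketch only, is a uniform-cost search over symbolic states. The claims do not name a specific algorithm or cost measure, so the choice of uniform-cost search, the `cheapest_sequence` function, the action table, and the cost values below are all assumptions made for illustration.

```python
import heapq
from itertools import count
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical action table: name -> (preconditions, added, removed, cost).
ActionSpec = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str], float]


def cheapest_sequence(current: FrozenSet[str], expected: FrozenSet[str],
                      actions: Dict[str, ActionSpec]
                      ) -> Optional[Tuple[float, List[str]]]:
    """Uniform-cost search: expand the cheapest partial sequence first,
    so the first state satisfying the expected conditions is reached by
    a minimum-cost sequence (assuming non-negative costs)."""
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(0.0, next(tie), current, [])]
    best = {current: 0.0}
    while frontier:
        cost, _, state, steps = heapq.heappop(frontier)
        if expected <= state:
            return cost, steps
        if cost > best.get(state, float("inf")):
            continue  # stale entry; a cheaper path to this state exists
        for name, (pre, add, rem, step_cost) in actions.items():
            if not pre <= state:
                continue  # branch only on actions whose conditions hold
            nxt = (state - rem) | add
            new_cost = cost + step_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier,
                               (new_cost, next(tie), nxt, steps + [name]))
    return None


f = frozenset
# Assumed example: two candidate routes to the kitchen with different costs.
ACTIONS: Dict[str, ActionSpec] = {
    "move_direct":     (f({"at_door"}), f({"at_kitchen"}), f({"at_door"}), 5.0),
    "move_to_hall":    (f({"at_door"}), f({"at_hall"}),    f({"at_door"}), 2.0),
    "hall_to_kitchen": (f({"at_hall"}), f({"at_kitchen"}), f({"at_hall"}), 2.0),
}

result = cheapest_sequence(f({"at_door"}), f({"at_kitchen"}), ACTIONS)
print(result)  # (4.0, ['move_to_hall', 'hall_to_kitchen'])
```

Here the two-step route is preferred over the single direct action because its total assumed cost is lower. A monitoring loop in the spirit of claim 7 could call the same function again with freshly observed conditions whenever the environment changes.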

Assignees

Nvidia Corp

Inventors

Classifications

  • B25J9/1661 · Primary

    characterised by task planning, object-oriented languages · CPC title

  • Robot · CPC title

  • by means of sensing devices, e.g. viewing or touching devices · CPC title

  • Naturally compliant robot arm · CPC title

  • by means of an audio-responsive input (audible safety signals B25J19/061) · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12487581B2 cover?
Approaches provide for performance of a complex (e.g., compound) task that may involve multiple discrete tasks not obvious from an instruction to perform the complex task. A set of conditions for an environment can be determined using captured image data, and the instruction analyzed to determine a set of final conditions to exist in the environment after performance of the instruction. These i…
Who is the assignee on this patent?
Nvidia Corp
What technology area does this patent fall under?
Primary CPC classification B25J9/1661. Mapped technology areas include Operations & Transport.
When was this patent published?
Publication date Dec 2, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).