Robotic control using deep learning
US-2021252698-A1 · Aug 19, 2021 · US
US12487581B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12487581-B2 |
| Application number | US-202217697566-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 17, 2022 |
| Priority date | Mar 17, 2022 |
| Publication date | Dec 2, 2025 |
| Grant date | Dec 2, 2025 |
Official abstract text for this publication.
Approaches provide for performance of a complex (e.g., compound) task that may involve multiple discrete tasks not obvious from an instruction to perform the complex task. A set of conditions for an environment can be determined using captured image data, and the instruction analyzed to determine a set of final conditions to exist in the environment after performance of the instruction. These initial and end conditions are used to determine a sequence of discrete tasks to be performed to cause a robot or automated device to perform the instruction. This can involve use of a symbolic or visual planner in at least some embodiments, as well as a search of possible sequences of actions available for the robot or automated device. A robot can be caused to perform the sequence of discrete tasks, and feedback provided such that the sequence of tasks can be modified as appropriate.
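The planning step in the abstract (derive initial conditions, derive end conditions, then search for a sequence of discrete tasks that connects them) can be illustrated with a small symbolic planner. The following is a minimal sketch, not the patent's implementation: it assumes conditions are plain strings and that each performable action lists preconditions, added conditions, and removed conditions (a STRIPS-style encoding chosen here for illustration). The `Action` type, the `plan` function, and the cup-and-cabinet scenario are all hypothetical.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    """Hypothetical performable action in a STRIPS-style encoding."""
    name: str
    preconditions: FrozenSet[str]  # conditions that must hold before acting
    add: FrozenSet[str]            # conditions made true by the action
    remove: FrozenSet[str]         # conditions made false by the action


def plan(initial: Iterable[str], goal: Iterable[str],
         actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search from the current conditions to the expected ones.

    Returns the shortest list of action names, or None if no sequence exists.
    """
    start, target = frozenset(initial), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if target <= state:  # every expected condition holds
            return steps
        for action in actions:
            if action.preconditions <= state:
                nxt = (state - action.remove) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None


# Illustrative scenario for an instruction such as "put the cup in the cabinet".
ACTIONS = [
    Action("open_cabinet", frozenset({"cabinet_closed", "hand_empty"}),
           frozenset({"cabinet_open"}), frozenset({"cabinet_closed"})),
    Action("pick_cup", frozenset({"cup_on_table", "hand_empty"}),
           frozenset({"holding_cup"}), frozenset({"cup_on_table", "hand_empty"})),
    Action("place_cup_in_cabinet", frozenset({"holding_cup", "cabinet_open"}),
           frozenset({"cup_in_cabinet", "hand_empty"}), frozenset({"holding_cup"})),
    Action("close_cabinet", frozenset({"cabinet_open", "hand_empty"}),
           frozenset({"cabinet_closed"}), frozenset({"cabinet_open"})),
]
INITIAL = {"cabinet_closed", "cup_on_table", "hand_empty"}  # from image analysis
GOAL = {"cup_in_cabinet", "cabinet_closed"}                 # from the instruction

steps = plan(INITIAL, GOAL, ACTIONS)
print(steps)
```

In this sketch the instruction never mentions opening or closing the cabinet; those discrete tasks appear only because the search needs them to reach the end conditions, which is the sense in which the abstract describes tasks "not obvious from an instruction".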
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: receiving audio data corresponding to a spoken request to perform a compound task, the compound task involving an unspecified plurality of discrete tasks to be performed; analyzing the audio data to generate a textual representation of the spoken request; obtaining image data representing a current state of an environment in which the compound task is to be performed; analyzing the image data to obtain a set of current conditions for the current state of the environment; obtaining a set of expected conditions of the environment by processing the textual representation of the spoken request together with one or more features extracted from the image data representing the current state, wherein the processing includes using natural language understanding on the textual representation of the spoken request and on the one or more features extracted from the image data; determining, based at least in part upon a set of performable actions, a sequence of discrete tasks to be performed to transition from the set of current conditions to the set of expected conditions of the environment; and causing instructions for the sequence of discrete tasks to be executed to perform the compound task.
2. The method of claim 1, wherein the compound task is to be performed using a robotic device, and wherein the set of performable actions is determined at least in part using a type of the robotic device.
3. The method of claim 1, further comprising: determining the sequence of discrete tasks using a tree-based search, wherein at least one branch of the tree includes a subset of performable actions selected based at least in part upon a respective subset of conditions being satisfied.
4. The method of claim 3, wherein the sequence is one of a plurality of candidate sequences selected to minimize a cost of performance of the compound task.
5. The method of claim 1, wherein the set of current conditions is determined based upon a set of segmentation masks generated for objects detected in the environment using the image data.
6. The method of claim 1, wherein the set of current conditions is determined based at least in part on identifying a set of objects in the environment represented using the image data.
7. The method of claim 1, further comprising: monitoring the state of the environment during performance of the compound task; and adjusting the sequence of discrete tasks based at least in part upon a change in the environment.
8. The method of claim 1, further comprising: encoding the image data into a latent space to be provided as input to a neural network, wherein the analyzing of the image data comprises analyzing the image data using the neural network to obtain at least one current condition of the set of current conditions for the current state of the environment.
9. The method of claim 1, further comprising: generating a set of symbols representative of the set of current conditions and the set of expected conditions of the environment, wherein the sequence of discrete tasks is determined using the set of symbols.
10. A system, comprising: a language model to convert a spoken instruction into a textual representation of a compound task; an image model to determine a set of visual features corresponding to a current state represented in an image of an environment in which the compound task is to be performed; a natural language understanding module to determine a set of correlations between both the textual representation of the spoken request and the set of visual features corresponding to a current state, the set of correlations being determined using natural language understanding on the textual representation of the spoken request and on the set of visual features; a task planner to determine a sequence of discrete tasks to be performed for the compound task, based on both the set of correlations and a set of performable actions; and an execution module to cause instructions for the sequence of discrete tasks to be performed.
11. The system of claim 10, wherein the compound task is to be performed using a robotic device, and wherein the sequence of discrete tasks is determined at least in part by a type corresponding to the robotic device.
12. The system of claim 10, wherein the sequence of discrete tasks is determined using a tree-based search, wherein at least one branch of the tree includes a subset of performable actions selected based at least in part upon a respective subset of conditions being satisfied.
13. The system of claim 12, wherein the sequence is one of a plurality of candidate sequences selected to minimize a cost of performance of the compound task.
14. The system of claim 10, wherein the set of visual features is determined based at least in part on a set of segmentation masks generated for objects detected in the environment using image data from the image of the environment.
15. The system of claim 10, further comprising: a performance monitor to monitor the state of the environment, during performance of the compound task, and adjust the sequence of discrete tasks based at least in part upon a change in the environment.
16. The system of claim 10, wherein the system comprises at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for rendering graphical output; a system for generating synthetic data; a system for generating multi-dimensional assets using a collaborative content creation platform; a system for performing deep learning operations; a system implemented using an edge device; a system incorporating one or more Virtual Machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
17. A non-transitory computer-readable storage medium including instructions that, if executed by one or more processors, cause the one or more processors to: analyze audio data to generate a textual representation of a spoken request, the spoken request corresponding to a performance of a compound task represented using the audio data; analyze image data to obtain a set of current conditions for a current state of an environment; obtain a set of expected conditions of the environment by processing the textual representation of the spoken request together with one or more features extracted from the image data representing the current state, wherein the processing includes using natural language understanding on the textual representation of the spoken request and on the one or more features extracted from the image data; determine, based at least in part upon a set of performable actions, a sequence of discrete tasks to be performed to transition from the set of current conditions to the set of expected conditions of the environment; and cause instructions for the sequence of discrete tasks to be executed to perform the compound task.
18. The non-transitory computer-readable storage medium of claim 17, wherein the instructions, if executed, further cause the one or more processors to: determine the set of performable actions based upon a type of robotic device to perform the compound task.
19. The non-transitory computer-readable storage medium of claim 17, wherein the instructions, if executed, further cause the one or more processors to: determine
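Claims 3, 4, 12, and 13 describe a tree-based search in which branches contain only actions whose conditions are satisfied, with the final sequence chosen among candidates to minimize a cost of performance. One way to picture this is a uniform-cost search over condition sets. The sketch below is an illustration under stated assumptions, not the claimed method: the claims do not name a specific search algorithm, and the `CostedAction` type, the `cheapest_plan` function, the per-action `cost` values, and the block-and-bin scenario are all hypothetical.

```python
import heapq
from itertools import count
from typing import FrozenSet, Iterable, List, NamedTuple, Optional, Tuple


class CostedAction(NamedTuple):
    """Hypothetical performable action with an assumed scalar cost."""
    name: str
    preconditions: FrozenSet[str]
    add: FrozenSet[str]
    remove: FrozenSet[str]
    cost: float


def cheapest_plan(initial: Iterable[str], goal: Iterable[str],
                  actions: List[CostedAction]
                  ) -> Optional[Tuple[float, List[str]]]:
    """Uniform-cost search: expand the cheapest partial sequence first."""
    start, target = frozenset(initial), frozenset(goal)
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(0.0, next(tie), start, [])]
    best = {start: 0.0}
    while frontier:
        cost, _, state, steps = heapq.heappop(frontier)
        if target <= state:
            return cost, steps
        if cost > best.get(state, float("inf")):
            continue  # a cheaper route to this state was already expanded
        for action in actions:
            # Only branch on actions whose preconditions are satisfied.
            if action.preconditions <= state:
                nxt = (state - action.remove) | action.add
                new_cost = cost + action.cost
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(
                        frontier,
                        (new_cost, next(tie), nxt, steps + [action.name]))
    return None


# Two candidate sequences reach the goal; the search returns the cheaper one.
BLOCK_ACTIONS = [
    CostedAction("lift_and_carry", frozenset({"block_on_table"}),
                 frozenset({"block_in_bin"}), frozenset({"block_on_table"}), 5.0),
    CostedAction("slide_to_edge", frozenset({"block_on_table"}),
                 frozenset({"block_at_edge"}), frozenset({"block_on_table"}), 1.0),
    CostedAction("tip_into_bin", frozenset({"block_at_edge"}),
                 frozenset({"block_in_bin"}), frozenset({"block_at_edge"}), 2.0),
]

result = cheapest_plan({"block_on_table"}, {"block_in_bin"}, BLOCK_ACTIONS)
print(result)
```

The monitoring step in claims 7 and 15 could be approximated in this sketch by calling `cheapest_plan` again with freshly observed conditions whenever the environment changes during execution.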
characterised by task planning, object-oriented languages · CPC title
Robot · CPC title
by means of sensing devices, e.g. viewing or touching devices · CPC title
Naturally compliant robot arm · CPC title
by means of an audio-responsive input (audible safety signals B25J19/061) · CPC title