Training a deep neural network model to generate rich object-centric embeddings of robotic vision data
US-2021334599-A1 · Oct 28, 2021 · US
US11731271B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11731271-B2 |
| Application number | US-202016916343-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 30, 2020 |
| Priority date | Jun 30, 2020 |
| Publication date | Aug 22, 2023 |
| Grant date | Aug 22, 2023 |
Official abstract text for this publication.
Traditionally, robots may learn to perform tasks by observation in clean or sterile environments. However, robots are unable to accurately learn tasks by observation in real environments (e.g., cluttered, noisy, chaotic environments). Methods and systems are provided for teaching robots to learn tasks in real environments based on input (e.g., verbal or textual cues). In particular, a verbal-based Focus-of-Attention (FOA) model receives input, parses the input to recognize at least a task and a target object name. This information is used to spatio-temporally filter a demonstration of the task to allow the robot to focus on the target object and movements associated with the target object within a real environment. In this way, using the verbal-based FOA, a robot is able to recognize “where and when” to pay attention to the demonstration of the task, thereby enabling the robot to learn the task by observation in a real environment.
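The parsing step mentioned in the abstract (recognizing at least a task and a target object name from a verbal or textual cue) can be illustrated with a minimal keyword-matching sketch. This is not the parser described in the patent: the vocabularies `TASK_VERBS`, `KNOWN_OBJECTS`, and `ATTRIBUTES`, the `ParsedCue` type, and the `parse_cue` function are all hypothetical names chosen for illustration, and a real system would likely rely on speech recognition and a more capable language parser.

```python
from typing import NamedTuple, Optional

# Hypothetical vocabularies, chosen only for illustration (not from the patent).
TASK_VERBS = {"grasp", "pick", "release", "place", "open", "close"}
KNOWN_OBJECTS = {"cup", "bottle", "box", "drawer"}
ATTRIBUTES = {"red", "blue", "green", "small", "large"}


class ParsedCue(NamedTuple):
    task: Optional[str]
    target_object: Optional[str]
    attribute: Optional[str]


def parse_cue(text: str) -> ParsedCue:
    """Pick the first known task verb, object name, and attribute in the cue."""
    # Lowercase each whitespace-separated token and strip trailing punctuation.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    task = next((t for t in tokens if t in TASK_VERBS), None)
    target = next((t for t in tokens if t in KNOWN_OBJECTS), None)
    attribute = next((t for t in tokens if t in ATTRIBUTES), None)
    return ParsedCue(task, target, attribute)


cue = parse_cue("Grasp the red cup.")
```

Here `cue.task` and `cue.target_object` would feed the later filtering stages, while `cue.attribute` could help disambiguate between several objects that share a name.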
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for teaching a robot a task in a cluttered environment, comprising: receiving an input; parsing the input to identify a task and a target object name; receiving a set of time-series images; detecting a plurality of objects within the set of time-series images, wherein the set of time-series images depicts a demonstration of the task associated with a target object; based on the target object name, identifying the target object among the plurality of objects within the set of time-series images; generating a spatially filtered set of time-series images by spatially filtering the set of time-series images based on the target object; identifying a timing of at least one physical human movement for performing the task associated with the target object within the spatially filtered set of time-series images; generating a spatio-temporal filtered set of time-series images by temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement; and evaluating the spatio-temporal filtered set of time-series images to isolate one or more skill parameters associated with performing the task.

2. The method of claim 1, wherein the set of time-series images are RGB-D images.

3. The method of claim 1, wherein spatially filtering the set of time-series images based on the target object further comprises spatially filtering the set of time-series images to identify one or more voxels associated with the target object.

4. The method of claim 1, further comprising: parsing the input to identify an object attribute; and based on the target object name and the object attribute, identifying the target object within the set of time-series images.

5. The method of claim 1, wherein temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement further comprises temporally filtering the spatially filtered set of time-series images to identify one or more voxels associated with times in which a human hand approaches or leaves the target object.

6. The method of claim 1, wherein the at least one physical human movement is associated with one of a grasp task or a release task.

7. The method of claim 1, wherein the task is a sequence of tasks.

8. The method of claim 1, further comprising: encoding at least the one or more skill parameters as a task model.

9. The method of claim 8, further comprising: decoding the task model to calculate one or more motor commands corresponding to at least the one or more skill parameters for performing the task by a robot.

10. A system comprising: at least one processor; and at least one memory communicatively coupled to the at least one processor and having computer-executable instructions stored thereon, the computer-executable instructions when executed by the at least one processor causing the system to: receive a verbal cue; parse the verbal cue to identify a task and a target object name; receive a set of time-series images; detect a plurality of objects within the set of time-series images, wherein the set of time-series images depicts a demonstration of the task associated with a target object; based on the target object name, identify the target object from among the plurality of objects within the set of time-series images; generate a spatially filtered set of time-series images by spatially filtering the set of time-series images based on the target object; identify a timing of at least one physical human movement for performing the task associated with the target object within the spatially filtered set of time-series images; generate a spatio-temporal filtered set of time-series images by temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement; and evaluate the spatio-temporal filtered set of time-series images to identify one or more skill parameters associated with performing the task.

11. The system of claim 10, wherein the set of time-series images are RGB-D images.

12. The system of claim 10, wherein spatially filtering the set of time-series images based on the target object further comprises spatially filtering the set of time-series images to identify one or more voxels associated with the target object.

13. The system of claim 10, wherein temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement further comprises temporally filtering the set of time-series images to identify one or more voxels associated with times in which a human hand approaches or leaves the target object.

14. The system of claim 10, wherein the at least one physical human movement is associated with one of a grasp task or a release task.

15. A computer-readable storage medium having computer-executable instructions stored thereon, the computer-executable instructions when executed by a processor causing a computer system to: receive an input; parse the input to identify a task and a target object name; receive a set of time-series images; detect a plurality of objects within the set of time-series images, wherein the set of time-series images depicts a demonstration of the task associated with a target object; based on the target object name, identify the target object among the plurality of objects within the set of time-series images; generate a spatially filtered set of time-series images by spatially filtering the set of time-series images based on the target object; identify a timing of at least one physical human movement for performing the task associated with the target object within the spatially filtered set of time-series images; generate a spatio-temporal filtered set of time-series images by temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement; evaluate the spatio-temporal filtered set of time-series images to identify one or more skill parameters associated with performing the task; and encode at least the one or more skill parameters as a task model.

16. The computer-readable storage medium of claim 15, wherein the set of time-series images are RGB-D images.

17. The computer-readable storage medium of claim 15, wherein spatially filtering the set of time-series images based on the target object further comprises spatially filtering the set of time-series images to identify one or more voxels associated with the target object.

18. The computer-readable storage medium of claim 15, wherein temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement further comprises temporally filtering the set of time-series images to identify one or more voxels associated with times in which a human hand approaches or leaves the target object.

19. The computer-readable storage medium of claim 15, wherein the at least one physical human movement is associated with one of a grasp task or a release task.
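The spatial-then-temporal filtering order recited in the independent claims can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: each frame is reduced to 2D object centres and a single hand position instead of RGB-D voxels, and the `Frame` type, the `spatial_filter` and `temporal_filter` functions, and the `near` distance threshold are hypothetical. The temporal step keeps only the frames at which the hand starts or stops being near the target, a rough stand-in for "times in which a human hand approaches or leaves the target object".

```python
from dataclasses import dataclass
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Frame:
    index: int
    objects: Dict[str, Point]  # detected object name -> centre (assumed 2D)
    hand: Point                # detected hand position (assumed 2D)


def spatial_filter(frames: List[Frame], target_name: str) -> List[Frame]:
    """Keep only the target object in each frame; drop frames that lack it."""
    return [
        Frame(f.index, {target_name: f.objects[target_name]}, f.hand)
        for f in frames
        if target_name in f.objects
    ]


def temporal_filter(frames: List[Frame], target_name: str,
                    near: float = 1.0) -> List[Frame]:
    """Keep frames where the hand begins or ceases to be near the target."""
    kept: List[Frame] = []
    was_near = None  # unknown before the first frame
    for f in frames:
        is_near = dist(f.hand, f.objects[target_name]) <= near
        if was_near is not None and is_near != was_near:
            kept.append(f)  # approach (far -> near) or leave (near -> far)
        was_near = is_near
    return kept


def spatio_temporal_filter(frames: List[Frame], target_name: str,
                           near: float = 1.0) -> List[Frame]:
    """Apply the spatial filter first, then the temporal filter."""
    return temporal_filter(spatial_filter(frames, target_name), target_name, near)
```

Running the temporal step on the output of the spatial step mirrors the claim wording, in which the timing of the human movement is identified within the spatially filtered images. The surviving frames are the ones a later stage would evaluate for skill parameters.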
characterised by task planning, object-oriented languages · CPC title
Hardware, e.g. neural networks, fuzzy logic, interfaces, processor · CPC title
by means of an audio-responsive input (audible safety signals B25J19/061) · CPC title
Analysis of motion (motion estimation for coding, decoding, compressing or decompressing digital video signals H04N19/43, H04N19/51) · CPC title
Scenes; Scene-specific elements (control of digital cameras H04N23/60) · CPC title