Verbal-based focus-of-attention task model encoder

US11731271B2 · US · B2

Patent metadata
Publication number: US-11731271-B2
Application number: US-202016916343-A
Country: US
Kind code: B2
Filing date: Jun 30, 2020
Priority date: Jun 30, 2020
Publication date: Aug 22, 2023
Grant date: Aug 22, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Traditionally, robots may learn to perform tasks by observation in clean or sterile environments. However, robots are unable to accurately learn tasks by observation in real environments (e.g., cluttered, noisy, chaotic environments). Methods and systems are provided for teaching robots to learn tasks in real environments based on input (e.g., verbal or textual cues). In particular, a verbal-based Focus-of-Attention (FOA) model receives input, parses the input to recognize at least a task and a target object name. This information is used to spatio-temporally filter a demonstration of the task to allow the robot to focus on the target object and movements associated with the target object within a real environment. In this way, using the verbal-based FOA, a robot is able to recognize “where and when” to pay attention to the demonstration of the task, thereby enabling the robot to learn the task by observation in a real environment.
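
As a rough illustration of the parsing step the abstract describes, the sketch below pulls a task and a target object name (plus an optional object attribute) out of a textual cue. The patent does not specify how parsing is done; the keyword matching, the vocabularies, and the names `parse_cue` and `ParsedCue` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies; the patent does not enumerate any of these.
TASK_VERBS = {"grasp", "release", "pick", "place", "open", "close"}
OBJECT_NAMES = {"cup", "bottle", "box", "door", "drawer"}
ATTRIBUTES = {"red", "green", "blue", "large", "small"}


@dataclass
class ParsedCue:
    task: Optional[str]
    target_object: Optional[str]
    attribute: Optional[str]


def parse_cue(text: str) -> ParsedCue:
    """Find the first known task verb, object name, and attribute in the cue."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    task = next((t for t in tokens if t in TASK_VERBS), None)
    target = next((t for t in tokens if t in OBJECT_NAMES), None)
    attribute = next((t for t in tokens if t in ATTRIBUTES), None)
    return ParsedCue(task, target, attribute)


cue = parse_cue("Grasp the red cup.")
print(cue)
```

A spoken cue would first need speech-to-text, which is outside this sketch; anything the keyword lookup does not recognize is returned as `None`.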

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for teaching a robot a task in a cluttered environment, comprising: receiving an input; parsing the input to identify a task and a target object name; receiving a set of time-series images; detecting a plurality of objects within the set of time-series images, wherein the set of time-series images depicts a demonstration of the task associated with a target object; based on the target object name, identifying the target object among the plurality of objects within the set of time-series images; generating a spatially filtered set of time-series images by spatially filtering the set of time-series images based on the target object; identifying a timing of at least one physical human movement for performing the task associated with the target object within the spatially filtered set of time-series images; generating a spatio-temporal filtered set of time-series images by temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement; and evaluating the spatio-temporal filtered set of time-series images to isolate one or more skill parameters associated with performing the task.

2. The method of claim 1, wherein the set of time-series images are RGB-D images.

3. The method of claim 1, wherein spatially filtering the set of time-series images based on the target object further comprises spatially filtering the set of time-series images to identify one or more voxels associated with the target object.

4. The method of claim 1, further comprising: parsing the input to identify an object attribute; and based on the target object name and the object attribute, identifying the target object within the set of time-series images.

5. The method of claim 1, wherein temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement further comprises temporally filtering the spatially filtered set of time-series images to identify one or more voxels associated with times in which a human hand approaches or leaves the target object.

6. The method of claim 1, wherein the at least one physical human movement is associated with one of a grasp task or a release task.

7. The method of claim 1, wherein the task is a sequence of tasks.

8. The method of claim 1, further comprising: encoding at least the one or more skill parameters as a task model.

9. The method of claim 8, further comprising: decoding the task model to calculate one or more motor commands corresponding to at least the one or more skill parameters for performing the task by a robot.

10. A system comprising: at least one processor; and at least one memory communicatively coupled to the at least one processor and having computer-executable instructions stored thereon, the computer-executable instructions when executed by the at least one processor causing the system to: receive a verbal cue; parse the verbal cue to identify a task and a target object name; receive a set of time-series images; detect a plurality of objects within the set of time-series images, wherein the set of time-series images depicts a demonstration of the task associated with a target object; based on the target object name, identify the target object from among the plurality of objects within the set of time-series images; generate a spatially filtered set of time-series images by spatially filtering the set of time-series images based on the target object; identify a timing of at least one physical human movement for performing the task associated with the target object within the spatially filtered set of time-series images; generate a spatio-temporal filtered set of time-series images by temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement; and evaluate the spatio-temporal filtered set of time-series images to identify one or more skill parameters associated with performing the task.

11. The system of claim 10, wherein the set of time-series images are RGB-D images.

12. The system of claim 10, wherein spatially filtering the set of time-series images based on the target object further comprises spatially filtering the set of time-series images to identify one or more voxels associated with the target object.

13. The system of claim 10, wherein temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement further comprises temporally filtering the set of time-series images to identify one or more voxels associated with times in which a human hand approaches or leaves the target object.

14. The system of claim 10, wherein the at least one physical human movement is associated with one of a grasp task or a release task.

15. A computer-readable storage medium having computer-executable instructions stored thereon, the computer-executable instructions when executed by a processor causing a computer system to: receive an input; parse the input to identify a task and a target object name; receive a set of time-series images; detect a plurality of objects within the set of time-series images, wherein the set of time-series images depicts a demonstration of the task associated with a target object; based on the target object name, identify the target object among the plurality of objects within the set of time-series images; generate a spatially filtered set of time-series images by spatially filtering the set of time-series images based on the target object; identify a timing of at least one physical human movement for performing the task associated with the target object within the spatially filtered set of time-series images; generate a spatio-temporal filtered set of time-series images by temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement; evaluate the spatio-temporal filtered set of time-series images to identify one or more skill parameters associated with performing the task; and encode at least the one or more skill parameters as a task model.

16. The computer-readable storage medium of claim 15, wherein the set of time-series images are RGB-D images.

17. The computer-readable storage medium of claim 15, wherein spatially filtering the set of time-series images based on the target object further comprises spatially filtering the set of time-series images to identify one or more voxels associated with the target object.

18. The computer-readable storage medium of claim 15, wherein temporally filtering the spatially filtered set of time-series images based on the timing of the at least one physical human movement further comprises temporally filtering the set of time-series images to identify one or more voxels associated with times in which a human hand approaches or leaves the target object.

19. The computer-readable storage medium of claim 15, wherein the at least one physical human movement is associated with one of a grasp task or a release task.
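
The spatial and temporal filtering steps recited in the claims can be pictured with a small toy example. This is only a sketch under strong simplifying assumptions, not the patented implementation: each frame is reduced to named 3D object centroids and a hand position (standing in for object detection and voxel data from RGB-D images), and "approaches or leaves" is approximated by the hand crossing a distance threshold. The names `Frame`, `FilteredFrame`, `spatial_filter`, `temporal_filter`, and the `near` threshold are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Frame:
    t: float                   # timestamp in seconds
    objects: Dict[str, Point]  # detected object name -> 3D centroid (assumed input)
    hand: Point                # 3D position of the demonstrator's hand (assumed input)


@dataclass
class FilteredFrame:
    t: float
    target: Point
    hand: Point


def spatial_filter(frames: List[Frame], target_name: str) -> List[FilteredFrame]:
    """Drop every object except the named target; skip frames where it is absent."""
    return [
        FilteredFrame(f.t, f.objects[target_name], f.hand)
        for f in frames
        if target_name in f.objects
    ]


def temporal_filter(frames: List[FilteredFrame], near: float = 0.1) -> List[FilteredFrame]:
    """Keep frames where the hand crosses the `near` distance to the target,
    i.e. the moments it arrives at or departs from the object."""
    kept = []
    for prev, cur in zip(frames, frames[1:]):
        was_near = math.dist(prev.hand, prev.target) <= near
        is_near = math.dist(cur.hand, cur.target) <= near
        if was_near != is_near:
            kept.append(cur)
    return kept


cup, box = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
frames = [
    Frame(0.0, {"cup": cup, "box": box}, (0.50, 0.0, 0.0)),
    Frame(0.1, {"cup": cup, "box": box}, (0.30, 0.0, 0.0)),
    Frame(0.2, {"cup": cup, "box": box}, (0.05, 0.0, 0.0)),  # hand arrives
    Frame(0.3, {"cup": cup}, (0.02, 0.0, 0.0)),
    Frame(0.4, {"cup": cup, "box": box}, (0.40, 0.0, 0.0)),  # hand leaves
]

spatial = spatial_filter(frames, "cup")
events = temporal_filter(spatial)
print([e.t for e in events])
```

In this toy data the two retained frames mark the hand arriving at and leaving the cup. A later stage could evaluate such frames to derive skill parameters, as the final step of each independent claim describes.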

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • B25J9/1661 (Primary)

    characterised by task planning, object-oriented languages · CPC title

  • Hardware, e.g. neural networks, fuzzy logic, interfaces, processor · CPC title

  • by means of an audio-responsive input (audible safety signals B25J19/061) · CPC title

  • Analysis of motion (motion estimation for coding, decoding, compressing or decompressing digital video signals H04N19/43, H04N19/51) · CPC title

  • Scenes; Scene-specific elements (control of digital cameras H04N23/60) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11731271B2 cover?
It covers methods and systems for teaching robots to learn tasks by observation in real environments (e.g., cluttered, noisy, chaotic environments) based on input such as verbal or textual cues. A verbal-based Focus-of-Attention (FOA) model parses the input to recognize at least a task and a target object name, then uses that information to spatio-temporally filter a demonstration of the task so the robot can focus on the target object and the movements associated with it.
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification B25J9/1661. Mapped technology areas include Operations & Transport.
When was this patent published?
Publication date: Aug 22, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).