System for user interactions with an autonomous mobile device
US-11531343-B1 · Dec 20, 2022 · US
US11654573B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11654573-B2 |
| Application number | US-202117168076-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 4, 2021 |
| Priority date | May 29, 2020 |
| Publication date | May 23, 2023 |
| Grant date | May 23, 2023 |
Official abstract text for this publication.
The disclosure generally relates to methods and systems for enabling human robot interaction by cognition sharing, which includes gesture and audio. Conventional techniques that use gestures and speech require extra hardware setup and are limited to navigation in structured outdoor driving environments. The present disclosure provides methods and systems that solve the technical problem of enabling the human robot interaction with a two-step approach that transfers the cognitive load from the human to the robot. In the first step, an accurate shared perspective associated with the task is determined by computing relative frame transformations based on an understanding of the navigational gestures of the subject. The shared perspective is then transformed to the robot in the field view of the robot. In the second step, the transformed shared perspective is given to a language grounding technique to accurately determine a final goal associated with the task.
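The first step of the abstract, turning a pointing gesture into a goal in the robot's own frame of reference, can be illustrated with a small 2-D sketch. Everything named here (`Pose2D`, `robot_to_map`, `intermediate_goal`, and the parameters `subject_bearing`, `depth`, `pointing_azimuth`, `standoff`) is a hypothetical stand-in chosen for illustration; the patent does not publish this code, and the exact transformations it uses may differ.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Pose2D:
    x: float      # metres, map frame
    y: float      # metres, map frame
    theta: float  # heading in radians, map frame


def wrap(angle: float) -> float:
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def robot_to_map(robot: Pose2D, x_r: float, y_r: float) -> Tuple[float, float]:
    """Rigid 2-D transform of a point from the robot frame to the map frame."""
    c, s = math.cos(robot.theta), math.sin(robot.theta)
    return robot.x + c * x_r - s * y_r, robot.y + s * x_r + c * y_r


def intermediate_goal(robot: Pose2D, subject_bearing: float, depth: float,
                      pointing_azimuth: float, standoff: float) -> Pose2D:
    """Assumed geometry: place the goal `standoff` metres from the subject
    along the pointing direction, then express it in the map frame.

    `depth` plays the role of the scaling factor (robot-to-subject distance);
    both angles are measured in the robot frame.
    """
    # Subject position in the robot frame.
    sx = depth * math.cos(subject_bearing)
    sy = depth * math.sin(subject_bearing)
    # Step away from the subject along the pointing direction.
    gx = sx + standoff * math.cos(pointing_azimuth)
    gy = sy + standoff * math.sin(pointing_azimuth)
    mx, my = robot_to_map(robot, gx, gy)
    return Pose2D(mx, my, wrap(robot.theta + pointing_azimuth))


if __name__ == "__main__":
    robot = Pose2D(1.0, 2.0, math.pi / 2)
    print(intermediate_goal(robot, 0.0, 2.0, -math.pi / 2, 1.0))
```

With the robot at (1, 2) facing along the map's +y axis, a subject 2 m straight ahead pointing to the robot's right, and a 1 m standoff, the goal lands at roughly (2, 4) with a heading near 0. A real system would also check the resulting cell against the occupancy map before handing the pose to a planner.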
Opening claim text (preview).
What is claimed is:
1. A processor-implemented method for enabling a human robot interaction (HRI) by sharing cognition between a subject and a robot, the method comprising the steps of: acquiring, via one or more hardware processors, a visual feed of the subject and an audio feed of the subject, wherein the visual feed and the audio feed are acquired during a directional assistance provided by the subject to the robot, to perform a task, the visual feed comprises one or more visual scenes of the subject while providing the directional assistance, and the audio feed comprises one or more natural language instructions of the subject while providing the directional assistance; estimating, via the one or more hardware processors, a pointing direction of the subject in a field view of the robot, based on a body skeleton based gesture of the subject and a head pose based gaze direction of the subject, using the one or more visual scenes present in the visual feed; estimating, via the one or more hardware processors, an intermediate goal pose for the robot, based on the estimated pointing direction, at a predefined distance from a position of the subject, using a pre-built 2-dimensional (2-D) occupancy map and a scaling factor, wherein the intermediate goal pose comprises a set of coordinates defining a position and an orientation, and wherein the scaling factor is a depth distance between the robot and the subject; generating, via the one or more hardware processors, a trajectory for the robot to reach the estimated intermediate goal pose from a present robot pose, based on a present odometry of the robot, using the pre-built 2-D occupancy map and a robot operating system (ROS) movebase planner, to navigate the robot to reach the estimated intermediate goal pose based on the generated trajectory, wherein the ROS movebase planner is a standard framework for robot navigation; acquiring, via the one or more hardware processors, an intermediate goal image, from a present perspective of the robot after reaching the estimated intermediate goal pose; predicting, via the one or more hardware processors, a matching region associated with the task, based on one or more language features and the intermediate goal image, using a zero-shot single-stage network (ZSGNet) based language grounding technique, wherein the one or more language features are obtained from the audio feed of the subject; and navigating the robot, via the one or more hardware processors, to a final goal point to perform the task, using the ROS movebase planner, wherein the final goal point is determined based on the predicted matching region using the pre-built 2-D occupancy map.
2. The method of claim 1, wherein the body skeleton based gesture of the subject is estimated by: predicting a 3-dimensional (3-D) pose of the subject, from the one or more visual scenes present in the visual feed, using a 3-D pose prediction technique; and estimating the body skeleton based gesture of the subject as a first azimuthal angle calculated between a spine and a limb of the subject, based on the predicted 3-D pose.
3. The method of claim 2, wherein the head pose based gaze direction of the subject is estimated as a second azimuthal angle, from the one or more visual scenes present in the visual feed, using a gaze direction prediction technique.
4. The method of claim 1, wherein the visual feed of the subject and the intermediate image are acquired through a monocular camera installed in the robot and the audio feed of the subject is acquired through one or more microphones installed in the robot.
5. The method of claim 1, wherein the zero-shot single-stage network (ZSGNet) is trained with an indoor dataset having a plurality of images of an environment along with object annotations and region descriptions.
6. A system for enabling a human robot interaction (HRI) by sharing cognition between a subject and a robot, the system comprising: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: acquire a visual feed of the subject and an audio feed of the subject, wherein the visual feed and the audio feed are acquired during a directional assistance provided by the subject to the robot, to perform a task, the visual feed comprises one or more visual scenes of the subject while providing the directional assistance, and the audio feed comprises one or more natural language instructions of the subject while providing the directional assistance; estimate a pointing direction of the subject in a field view of the robot, based on a body skeleton based gesture of the subject and a head pose based gaze direction of the subject, using the one or more visual scenes present in the visual feed; estimate an intermediate goal pose for the robot, based on the estimated pointing direction, at a predefined distance from a position of the subject, using a pre-built 2-dimensional (2-D) occupancy map and a scaling factor, wherein the intermediate goal pose comprises a set of coordinates defining a position and an orientation, and wherein the scaling factor is a depth distance between the robot and the subject; generate a trajectory for the robot to reach the estimated intermediate goal pose from a present robot pose, based on a present odometry of the robot, using the pre-built 2-D occupancy map and a robot operating system (ROS) movebase planner, to navigate the robot to reach the estimated intermediate goal pose based on the generated trajectory, wherein the ROS movebase planner is a standard framework for robot navigation; acquire an intermediate goal image, from a present perspective of the robot after reaching the estimated intermediate goal pose; predict a matching region associated with the task, based on one or more language features and the intermediate goal image, using a zero-shot single-stage network (ZSGNet) based language grounding technique, wherein the one or more language features are obtained from the audio feed of the subject; and navigate the robot to a final goal point to perform the task, using the ROS movebase planner, wherein the final goal point is determined based on the predicted matching region using the pre-built 2-D occupancy map.
7. The system of claim 6, wherein the one or more hardware processors are further configured to estimate the body skeleton based gesture of the subject, by: predicting a 3-dimensional (3-D) pose of the subject, from the one or more visual scenes present in the visual feed, using a 3-D pose predicting technique; and estimating the body skeleton based gesture of the subject as a first azimuthal angle calculated between a spine and a limb of the subject, based on the predicted 3-D pose.
8. The system of claim 7, wherein the one or more hardware processors are further configured to estimate the head pose based gaze direction of the subject, as a second azimuthal angle, from the one or more visual scenes present in the visual feed, using a gaze direction prediction technique.
9. The system of claim 6, wherein the visual feed of the subject and the intermediate image are acquired through a monocular camera installed in the robot and the audio feed of the subject is acquired through one or more microphones installed in the robot.
10. The system of claim 6, wherein the zero-shot single-stage network (ZSGNet) is trained with an indoor dataset having a plurality of images of an environment along with object annotations and region descriptions.
11. A computer program product comprising a non-transitory computer readable medium having a computer re
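Claims 2, 3, 7 and 8 describe the gesture as a first azimuthal angle between a spine and a limb and the gaze as a second azimuthal angle, without giving formulas. The sketch below shows one possible reading under stated assumptions: the limb direction is projected onto the plane orthogonal to the spine and measured against a reference axis, and the two angles are fused with a circular mean. The joint names, the `reference` axis, and the fusion rule are all hypothetical choices for illustration and are not taken from the patent.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Sequence[float], b: Sequence[float]) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a: Sequence[float]) -> Vec3:
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def reject(v: Sequence[float], axis: Vec3) -> Vec3:
    """Component of v orthogonal to the unit vector `axis`."""
    k = dot(v, axis)
    return (v[0] - k * axis[0], v[1] - k * axis[1], v[2] - k * axis[2])


def limb_azimuth(pelvis: Vec3, neck: Vec3, shoulder: Vec3, wrist: Vec3,
                 reference: Vec3) -> float:
    """Signed angle (radians) of the limb about the spine, from `reference`.

    Assumption: 'azimuthal angle between a spine and a limb' is read as the
    angle of the limb's projection in the plane orthogonal to the spine.
    """
    spine = unit(sub(neck, pelvis))
    limb_p = reject(sub(wrist, shoulder), spine)
    ref_p = reject(reference, spine)
    return math.atan2(dot(spine, cross(ref_p, limb_p)), dot(ref_p, limb_p))


def fuse_azimuths(gesture: float, gaze: float) -> float:
    """Hypothetical fusion rule: circular mean of the two azimuthal angles."""
    return math.atan2(math.sin(gesture) + math.sin(gaze),
                      math.cos(gesture) + math.cos(gaze))


if __name__ == "__main__":
    az = limb_azimuth((0, 0, 0), (0, 0, 1), (0, 0, 1), (0, 1, 0.5), (1, 0, 0))
    print(az, fuse_azimuths(az, 0.0))
```

For an upright spine along z, an arm extended along +y and a reference axis along +x, the gesture azimuth is a quarter turn; fusing it with a gaze azimuth of zero gives an eighth of a turn. A circular mean is used instead of a plain average so that angles near the wrap-around at ±π combine sensibly.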
Hardware, e.g. neural networks, fuzzy logic, interfaces, processor · CPC title
with position, velocity or acceleration sensors · CPC title
Controls for manipulators (programme controls B25J9/16) · CPC title
Vision controlled systems · CPC title
characterised by programming language · CPC title