Methods and systems for enabling human robot interaction by sharing cognition

US11654573B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11654573-B2
Application number: US-202117168076-A
Country: US
Kind code: B2
Filing date: Feb 4, 2021
Priority date: May 29, 2020
Publication date: May 23, 2023
Grant date: May 23, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure generally relates to methods and systems for enabling human robot interaction by cognition sharing which includes gesture and audio. Conventional techniques that use the gestures and the speech, require extra hardware setup and are limited to navigation in structured outdoor driving environments. The present disclosure herein provides methods and systems that solves the technical problem of enabling the human robot interaction with a two-step approach by transferring the cognitive load from the human to the robot. An accurate shared perspective associated with the task is determined in the first step by computing relative frame transformations based on understanding of navigational gestures of the subject. Then, the shared perspective transformed to the robot in the field view of the robot. The transformed shared perspective is then given to a language grounding technique in the second step, to accurately determine a final goal associated with the task.
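The first step in the abstract relies on relative frame transformations. The page does not give the formulas, so the Python sketch below is only an illustration under stated assumptions: it works in the plane (2-D), and the names `Pose2D` and `compose` are hypothetical, not taken from the patent. It chains two planar transforms so that a subject pose observed in the robot's frame can be expressed in the map frame.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    """Planar pose: position (x, y) in metres, heading theta in radians."""
    x: float
    y: float
    theta: float


def compose(parent_from_child: Pose2D, child_from_inner: Pose2D) -> Pose2D:
    """Chain two planar frame transformations (SE(2) composition).

    Hypothetical helper: returns the pose of the inner frame expressed
    in the parent frame.
    """
    c = math.cos(parent_from_child.theta)
    s = math.sin(parent_from_child.theta)
    # Rotate the inner position into the parent frame, then translate.
    x = parent_from_child.x + c * child_from_inner.x - s * child_from_inner.y
    y = parent_from_child.y + s * child_from_inner.x + c * child_from_inner.y
    # Add the headings and wrap the result into (-pi, pi].
    total = parent_from_child.theta + child_from_inner.theta
    theta = math.atan2(math.sin(total), math.cos(total))
    return Pose2D(x, y, theta)


# Assumed example values: robot pose in the map, subject pose in the robot frame.
robot_in_map = Pose2D(1.0, 2.0, math.pi / 2)
subject_in_robot = Pose2D(3.0, 0.0, math.pi)  # 3 m ahead, facing the robot
subject_in_map = compose(robot_in_map, subject_in_robot)
print(subject_in_map)
```

With these assumed values the subject lands at roughly (1, 5) in the map frame. A real system would more likely use a full 3-D transform library; the planar form is used here only to keep the idea visible.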

First claim

Opening claim text (preview).

What is claimed is:

1. A processor-implemented method for enabling a human robot interaction (HRI) by sharing cognition between a subject and a robot, the method comprising the steps of: acquiring, via one or more hardware processors, a visual feed of the subject and an audio feed of the subject, wherein the visual feed and the audio feed are acquired during a directional assistance provided by the subject to the robot, to perform a task, the visual feed comprises one or more visual scenes of the subject while providing the directional assistance, and the audio feed comprises one or more natural language instructions of the subject while providing the directional assistance; estimating, via the one or more hardware processors, a pointing direction of the subject in a field view of the robot, based on a body skeleton based gesture of the subject and a head pose based gaze direction of the subject, using the one or more visual scenes present in the visual feed; estimating, via the one or more hardware processors, an intermediate goal pose for the robot, based on the estimated pointing direction, at a predefined distance from a position of the subject, using a pre-built 2-dimensional (2-D) occupancy map and a scaling factor, wherein the intermediate goal pose comprises a set of coordinates defining a position and an orientation, and wherein the scaling factor is a depth distance between the robot and the subject; generating, via the one or more hardware processors, a trajectory for the robot to reach the estimated intermediate goal pose from a present robot pose, based on a present odometry of the robot, using the pre-built 2-D occupancy map and a robot operating system (ROS) movebase planner, to navigate the robot to reach the estimated intermediate goal pose based on the generated trajectory, wherein the ROS movebase planner is a standard framework for robot navigation; acquiring, via the one or more hardware processors, an intermediate goal image, from a present perspective of the robot after reaching the estimated intermediate goal pose; predicting, via the one or more hardware processors, a matching region associated with the task, based on one or more language features and the intermediate goal image, using a zero-shot single-stage network (ZSGNet) based language grounding technique, wherein the one or more language features are obtained from the audio feed of the subject; and navigating the robot, via the one or more hardware processors, to a final goal point to perform the task, using the ROS movebase planner, wherein the final goal point is determined based on the predicted matching region using the pre-built 2-D occupancy map.

2. Wherein the body skeleton based gesture of the subject is estimated by: predicting a 3-dimensional (3-D) pose of the subject, from the one or more visual scenes present in the visual feed, using a 3-D pose prediction technique; and estimating the body skeleton based gesture of the subject as a first azimuthal angle calculated between a spine and a limb of the subject, based on the predicted 3-D pose.

3. The method of claim 2, wherein the head pose based gaze direction of the subject is estimated as a second azimuthal angle, from the one or more visual scenes present in the visual feed, using a gaze direction prediction technique.

4. The method of claim 1, wherein the visual feed of the subject and the intermediate image are acquired through a monocular camera installed in the robot and the audio feed of the subject is acquired through one or more microphones installed in the robot.

5. The method of claim 1, wherein the zero-shot single-stage network (ZSGNet) is trained with an indoor dataset having a plurality of images of an environment along with object annotations and region descriptions.

6. A system for enabling a human robot interaction (HRI) by sharing cognition between a subject and a robot, the system comprising: a memory storing instructions; one or more Input/Output (I/O) interfaces; and one or more hardware processors coupled to the memory via the one or more I/O interfaces, wherein the one or more hardware processors are configured by the instructions to: acquire a visual feed of the subject and an audio feed of the subject, wherein the visual feed and the audio feed are acquired during a directional assistance provided by the subject to the robot, to perform a task, the visual feed comprises one or more visual scenes of the subject while providing the directional assistance, and the audio feed comprises one or more natural language instructions of the subject while providing the directional assistance; estimate a pointing direction of the subject in a field view of the robot, based on a body skeleton based gesture of the subject and a head pose based gaze direction of the subject, using the one or more visual scenes present in the visual feed; estimate an intermediate goal pose for the robot, based on the estimated pointing direction, at a predefined distance from a position of the subject, using a pre-built 2-dimensional (2-D) occupancy map and a scaling factor, wherein the intermediate goal pose comprises a set of coordinates defining a position and an orientation, and wherein the scaling factor is a depth distance between the robot and the subject; generate a trajectory for the robot to reach the estimated intermediate goal pose from a present robot pose, based on a present odometry of the robot, using the pre-built 2-D occupancy map and a (robot operating system) ROS movebase planner, to navigate the robot to reach the estimated intermediate goal pose based on the generated trajectory, wherein the ROS movebase planner is a standard framework for robot navigation; acquire an intermediate goal image, from a present perspective of the robot after reaching the estimated intermediate goal pose; predict a matching region associated with the task, based on one or more language features and the intermediate goal image, using a zero-shot single-stage network (ZSGNet) based language grounding technique, wherein the one or more language features are obtained from the audio feed of the subject; and navigate the robot to a final goal point to perform the task, using the ROS movebase planner, wherein the final goal point is determined based on the predicted matching region using the pre-built 2-D occupancy map.

7. The system of claim 6, wherein the one or more hardware processors are further configured to estimate the body skeleton based gesture of the subject, by: predicting a 3-dimensional (3-D) pose of the subject, from the one or more visual scenes present in the visual feed, using a 3-D pose predicting technique; and estimating the body skeleton based gesture of the subject as a first azimuthal angle calculated between a spine and a limb of the subject, based on the predicted 3-D pose.

8. The system of claim 7, wherein the one or more hardware processors are further configured to estimate the head pose based gaze direction of the subject, as a second azimuthal angle, from the one or more visual scenes present in the visual feed, using a gaze direction prediction technique.

9. The system of claim 6, wherein the visual feed of the subject and the intermediate image are acquired through a monocular camera installed in the robot and the audio feed of the subject is acquired through one or more microphones installed in the robot.

10. The system of claim 6, wherein the zero-shot single-stage network (ZSGNet) is trained with an indoor dataset having a plurality of images of an environment along with object annotations and region descriptions.

11. A computer program product comprising a non-transitory computer readable medium having a computer re
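Claim 1 places the intermediate goal pose at a predefined distance from the subject along the pointing direction, using the occupancy map and the robot-to-subject depth. The claim does not say how these pieces are combined, so the Python sketch below is one possible reading, not the patented method. Its assumptions: the depth is used to locate the subject in the map, the map is a list of rows with 0 for free and 1 for occupied, and the goal is pulled back toward the subject until it falls in a free cell. The names `is_free` and `intermediate_goal` are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Grid = List[List[int]]  # assumed encoding: 0 = free cell, 1 = occupied cell


def is_free(grid: Grid, resolution: float, x: float, y: float) -> bool:
    """Return True if map point (x, y), in metres, lies in a free grid cell."""
    col = math.floor(x / resolution)
    row = math.floor(y / resolution)
    if row < 0 or col < 0 or row >= len(grid) or col >= len(grid[0]):
        return False  # points outside the map are treated as not traversable
    return grid[row][col] == 0


def intermediate_goal(
    grid: Grid,
    resolution: float,
    robot_xy: Tuple[float, float],
    bearing_to_subject: float,
    depth: float,
    pointing_angle: float,
    distance: float,
    step: float = 0.25,
) -> Optional[Tuple[float, float, float]]:
    """Return (x, y, heading) for an intermediate goal, or None if none is free.

    Assumption: angles are in radians in the map frame, and `depth` is the
    robot-to-subject distance used to place the subject in the map.
    """
    # Locate the subject from the robot position, bearing, and depth.
    sx = robot_xy[0] + depth * math.cos(bearing_to_subject)
    sy = robot_xy[1] + depth * math.sin(bearing_to_subject)
    # Start at the predefined distance and back off toward the subject
    # until the candidate goal lands in a free cell.
    d = distance
    while d > 0:
        gx = sx + d * math.cos(pointing_angle)
        gy = sy + d * math.sin(pointing_angle)
        if is_free(grid, resolution, gx, gy):
            return (gx, gy, pointing_angle)
        d -= step
    return None


# Assumed toy map: 10 x 10 cells of 1 m, with one obstacle at row 2, column 6.
toy_map = [[0] * 10 for _ in range(10)]
toy_map[2][6] = 1
print(intermediate_goal(toy_map, 1.0, (0.5, 2.5), 0.0, 2.0, 0.0, 4.0, step=0.5))
```

In a ROS setup the resulting pose would then be handed to the navigation planner named in the claim; that step is left out here because it needs a running ROS environment.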

Assignees

Tata Consultancy Services Ltd

Inventors

Classifications

  • Hardware, e.g. neural networks, fuzzy logic, interfaces, processor · CPC title

  • with position, velocity or acceleration sensors · CPC title

  • Controls for manipulators (programme controls B25J9/16) · CPC title

  • B25J9/1697 · Primary

    Vision controlled systems · CPC title

  • characterised by programming language · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11654573B2 cover?
The disclosure generally relates to methods and systems for enabling human robot interaction by cognition sharing which includes gesture and audio. Conventional techniques that use the gestures and the speech, require extra hardware setup and are limited to navigation in structured outdoor driving environments. The present disclosure herein provides methods and systems that solves the technical…
Who is the assignee on this patent?
Tata Consultancy Services Ltd
What technology area does this patent fall under?
Primary CPC classification B25J9/1697. Mapped technology areas include Operations & Transport.
When was this patent published?
Publication date: May 23, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).