Multi-modal sensor based process tracking and guidance

US12469402B2 · US · B2

Patent metadata

Publication number: US-12469402-B2
Application number: US-202117377152-A
Country: US
Kind code: B2
Filing date: Jul 15, 2021
Priority date: Jul 15, 2021
Publication date: Nov 11, 2025
Grant date: Nov 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Examples are disclosed that relate to computer-based tracking of a process performed by a user. In one example, multi-modal sensor information is received via a plurality of sensors. A world state of a real-world physical environment and a user state in the real-world physical environment are tracked based on the multi-modal sensor information. A process being performed by the user within a working domain is recognized based on the world state and the user state. A current step in the process is detected based on the world state and the user state. Domain-specific instructions directing the user how to perform an expected action are presented via a user interface device. A user action is detected based on the world state and the user state. Based on the user action differing from the expected action, domain-specific guidance to perform the expected action is presented via the user interface device.
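The control flow in the abstract (present an instruction, detect the user action, compare it with the expected action, then either advance or present guidance) can be illustrated with a minimal sketch. The `Step` type, the `advance` function, and the action labels are hypothetical names chosen for illustration; the patent does not specify an implementation, and in the disclosed system the detected action would come from sensor-based tracking rather than a string.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One step of a process within a working domain (hypothetical structure)."""
    instruction: str      # domain-specific instruction shown to the user
    expected_action: str  # action label that completes this step
    guidance: str         # corrective guidance shown when the action differs


def advance(steps: List[Step], current: int, detected_action: str) -> Tuple[int, str]:
    """Return (next step index, message to present) for one detected user action."""
    step = steps[current]
    if detected_action == step.expected_action:
        nxt = current + 1
        if nxt < len(steps):
            # Action matched: present the instruction for the next step.
            return nxt, steps[nxt].instruction
        return nxt, "Process complete."
    # Action differed from the expected action: stay on this step, present guidance.
    return current, step.guidance


# Assumed example domain: checking engine oil.
steps = [
    Step("Open the hood.", "open_hood", "The hood release is under the dashboard."),
    Step("Pull out the dipstick.", "pull_dipstick", "The dipstick has a yellow handle."),
]
```

Calling `advance(steps, 0, "open_hood")` moves to the second step and returns its instruction, while any other action label keeps the index at 0 and returns the first step's guidance.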

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for tracking performance of a process by a user, the method performed by a computing device and comprising:

    receiving, via a plurality of sensors, multi-modal sensor information including image data and audio data;

    tracking a world state of a real-world physical environment including a plurality of objects based on the multi-modal sensor information by performing object recognition to identify the plurality of objects based at least on the multi-modal sensor information, and by performing object localization and mapping to determine a plurality of positions of the plurality of objects in the environment based at least on the multi-modal sensor information;

    tracking a user state in the real-world physical environment at least by tracking a pose of the user based on the multi-modal sensor information;

    generating a multi-modal synchronized state that aligns the world state and the user state in a common coordinate system that is aligned with a frame of reference of the user in a data structure that includes a feature vector that characterizes the world state and the user state in the multi-modal synchronized state;

    recognizing a process being performed by the user via one or more trained machine learning models that are configured to receive the multi-modal synchronized state including the feature vector as input and determine the process being performed by the user based on the multi-modal synchronized state, the process comprising a series of steps;

    selecting a working domain for the process being performed by the user, wherein the working domain includes a digital twin model that is a virtual representation that serves as a digital counterpart of the plurality of objects in the environment;

    aligning the digital twin model with the plurality of objects in the environment;

    detecting a current step in the process based on the multi-modal synchronized state;

    presenting, via a display of the computing device, one or more domain-specific instructions including virtually rendered cues that are aligned relative to the digital twin model of the plurality of objects in the environment, wherein the domain-specific instructions direct the user how to perform an expected action to complete the current step in the process;

    detecting a user action based on the multi-modal synchronized state; and

    based on the user action differing from the expected action for the current step in the process, presenting, via the display of the computing device, a clarifying question to the user, receiving, via an input subsystem of the computing device, a user response to the clarifying question, selecting domain-specific guidance including additional virtually rendered cues that are aligned relative to the digital twin model of the plurality of objects in the environment based on the user response to the clarifying question, and presenting, via the display of the computing device, the domain-specific guidance including the additional virtually rendered cues directing the user to perform the expected action.

2. The method of claim 1, further comprising: based on the user action matching the expected action for the current step in the process, presenting, via the display, one or more additional domain-specific instructions directing the user how to perform a next expected action to complete a next step in the process.

3. The method of claim 1, wherein the process includes user manipulation of a real-world object, wherein recognizing the process comprises recognizing the real-world object, and selecting the working domain from a plurality of different working domains corresponding to a plurality of different real-world objects based on the recognized real-world object.

4. The method of claim 3, wherein the digital twin model comprises metadata defining a plurality of parts of the real-world object, locations of the plurality of parts, and descriptions of the plurality of parts, and wherein the one or more domain-specific instructions and the domain-specific guidance comprises presenting, via the display, metadata of the digital twin model corresponding to a part of the real-world object involved in the current step in the process.

5. The method of claim 4, further comprising: determining that the user action differs from the expected action for the current step in the process based on the multi-modal synchronized state indicating that the user is interacting with a different part of the digital twin model than an expected part for the current step in the process.

6. The method of claim 3, wherein the display comprises near-eye display of an augmented-reality device, and wherein the domain-specific guidance comprises visually presenting, via the near-eye display, one or more of a virtual label indicating the part of the real-world object involved in the current step in the process and a virtual movement affordance indicating how to manipulate the part of the real-world object involved in the current step in the process.

7. The method of claim 1, wherein the one or more domain-specific instructions and the domain-specific guidance are provided within the frame of reference of the user based on the multi-modal synchronized state.

8. The method of claim 1, wherein tracking the user state comprises tracking one or more of a user head pose, one or more of the user's hand poses, and user speech.

9. The method of claim 1, wherein the user action is detected via an action-recognition machine-learning model previously trained on multi-modal sensor information corresponding to users performing different domain-specific user actions associated with the series of steps in the process.

10. A computing system comprising: a plurality of sensors; a display; a processor; and a storage device holding instructions executable by the processor to:

    receive, via the plurality of sensors, multi-modal sensor information including image data and audio data;

    track a world state of a real-world physical environment including a plurality of objects based on the multi-modal sensor information by performing object recognition to identify the plurality of objects based at least on the multi-modal sensor information, and by performing object localization and mapping to determine a plurality of positions of the plurality of objects in the environment based at least on the multi-modal sensor information;

    track a user state in the real-world physical environment at least by tracking a pose of a user based on the multi-modal sensor information;

    generate a multi-modal synchronized state that aligns the world state and the user state in a common coordinate system that is aligned with a frame of reference of the user in a data structure that includes a feature vector that characterizes the world state and the user state in the multi-modal synchronized state;

    recognize a process being performed by the user via one or more trained machine learning models that are configured to receive the multi-modal synchronized state including the feature vector as input and determine the process being performed by the user based on the multi-modal synchronized state, the process comprising a series of steps;

    select a working domain for the process being performed by the user, wherein the working domain includes a digital twin model, the digital twin model comprising a virtual representation and serving as a digital counterpart of the plurality of objects in the environment;

    align the digital twin model with the plurality of objects in the environment;

    detect a current step in the process based on the multi-modal synchronized state;

    present, via the display, one or more domain-specific instructions including virtu
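
Claims 4 and 5 describe a digital twin whose metadata lists parts, their locations, and their descriptions, and a check for whether the user is interacting with a different part than the one expected for the current step. The sketch below is one way to illustrate that check, under loud assumptions: the `Part` type, the nearest-part heuristic based on hand position, and the example coordinates are all hypothetical and are not taken from the patent, which does not state how the interacting part is determined. Positions are assumed to already be expressed in the common coordinate system.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Part:
    """Digital twin metadata for one part (hypothetical structure)."""
    name: str
    location: Point   # position in the common coordinate system
    description: str


def nearest_part(parts: Sequence[Part], hand_position: Point) -> Part:
    """Assumed heuristic: the part closest to the tracked hand is the one in use."""
    return min(parts, key=lambda p: math.dist(p.location, hand_position))


def check_interaction(
    parts: Sequence[Part], hand_position: Point, expected_part: str
) -> Optional[str]:
    """Return None if the expected part is in use, else a guidance message."""
    touched = nearest_part(parts, hand_position)
    if touched.name == expected_part:
        return None
    return f"You are at the {touched.name}; this step uses the {expected_part}."


# Assumed example twin with two parts.
parts = [
    Part("dipstick", (0.0, 0.0, 0.0), "Measures the oil level."),
    Part("oil cap", (1.0, 0.0, 0.0), "Covers the oil filler opening."),
]
```

With the hand at `(0.9, 0.0, 0.0)` and `"dipstick"` expected, `check_interaction` reports that the user is at the oil cap; with the hand at `(0.1, 0.0, 0.0)` it returns `None`.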

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • Mixed reality (object pose determination, tracking or camera calibration for mixed reality G06T7/00)

  • Neural networks

  • Audio in a user interface, e.g. using voice commands for navigating, audio feedback

  • Gesture based interaction, e.g. based on a set of recognized hand gestures (interaction based on gestures traced on a digitiser G06F3/04883)

  • G06F3/011 (Primary)

    Arrangements for interaction with the human body, e.g. for user immersion in virtual reality (blind teaching G09B21/00)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12469402B2 cover?
Examples are disclosed that relate to computer-based tracking of a process performed by a user. In one example, multi-modal sensor information is received via a plurality of sensors. A world state of a real-world physical environment and a user state in the real-world physical environment are tracked based on the multi-modal sensor information. A process being performed by the user within a wor…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F3/011. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 11, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).