US12469402B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12469402-B2 |
| Application number | US-202117377152-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 15, 2021 |
| Priority date | Jul 15, 2021 |
| Publication date | Nov 11, 2025 |
| Grant date | Nov 11, 2025 |
Official abstract text for this publication.
Examples are disclosed that relate to computer-based tracking of a process performed by a user. In one example, multi-modal sensor information is received via a plurality of sensors. A world state of a real-world physical environment and a user state in the real-world physical environment are tracked based on the multi-modal sensor information. A process being performed by the user within a working domain is recognized based on the world state and the user state. A current step in the process is detected based on the world state and the user state. Domain-specific instructions directing the user how to perform an expected action are presented via a user interface device. A user action is detected based on the world state and the user state. Based on the user action differing from the expected action, domain-specific guidance to perform the expected action is presented via the user interface device.
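The instruct, observe, and correct loop described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the `Step` and `ProcessTracker` names, the string action labels, and the simple step counter are all hypothetical stand-ins. In particular, the abstract detects the current step and the user action from the tracked world state and user state, whereas this sketch assumes an upstream recognizer has already reduced each observation to a single action label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One step of a process (hypothetical structure, for illustration only)."""
    name: str
    expected_action: str  # label an upstream action recognizer would emit
    instruction: str      # shown when the step becomes current
    guidance: str         # shown when the user action differs from the expected one


class ProcessTracker:
    """Tracks the current step and chooses what to present to the user."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)
        self.index = 0  # assumption: steps are completed strictly in order

    @property
    def current_step(self) -> Optional[Step]:
        if self.index < len(self.steps):
            return self.steps[self.index]
        return None

    def observe(self, user_action: str) -> str:
        """Compare a detected user action with the expected action."""
        step = self.current_step
        if step is None:
            return "Process complete."
        if user_action == step.expected_action:
            # Match: advance and present the next step's instruction.
            self.index += 1
            nxt = self.current_step
            return nxt.instruction if nxt is not None else "Process complete."
        # Mismatch: stay on the same step and present corrective guidance.
        return step.guidance


if __name__ == "__main__":
    tracker = ProcessTracker([
        Step("open", "open_panel", "Open the access panel.",
             "Locate the panel latch and open it."),
        Step("swap", "replace_filter", "Replace the filter.",
             "Pull the old filter out first."),
    ])
    print(tracker.current_step.instruction)
    print(tracker.observe("replace_filter"))  # wrong action for the first step
    print(tracker.observe("open_panel"))      # expected action
```

A mismatch leaves the tracker on the same step, so guidance can be repeated until the expected action is observed. The clarifying-question exchange recited in claim 1 is omitted here for brevity.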
Opening claim text (preview).
The invention claimed is:

1. A method for tracking performance of a process by a user, the method performed by a computing device and comprising:
receiving, via a plurality of sensors, multi-modal sensor information including image data and audio data;
tracking a world state of a real-world physical environment including a plurality of objects based on the multi-modal sensor information by performing object recognition to identify the plurality of objects based at least on the multi-modal sensor information, and by performing object localization and mapping to determine a plurality of positions of the plurality of objects in the environment based at least on the multi-modal sensor information;
tracking a user state in the real-world physical environment at least by tracking a pose of the user based on the multi-modal sensor information;
generating a multi-modal synchronized state that aligns the world state and the user state in a common coordinate system that is aligned with a frame of reference of the user in a data structure that includes a feature vector that characterizes the world state and the user state in the multi-modal synchronized state;
recognizing a process being performed by the user via one or more trained machine learning models that are configured to receive the multi-modal synchronized state including the feature vector as input and determine the process being performed by the user based on the multi-modal synchronized state, the process comprising a series of steps;
selecting a working domain for the process being performed by the user, wherein the working domain includes a digital twin model that is a virtual representation that serves as a digital counterpart of the plurality of objects in the environment;
aligning the digital twin model with the plurality of objects in the environment;
detecting a current step in the process based on the multi-modal synchronized state;
presenting, via a display of the computing device, one or more domain-specific instructions including virtually rendered cues that are aligned relative to the digital twin model of the plurality of objects in the environment, wherein the domain-specific instructions direct the user how to perform an expected action to complete the current step in the process;
detecting a user action based on the multi-modal synchronized state; and
based on the user action differing from the expected action for the current step in the process, presenting, via the display of the computing device, a clarifying question to the user, receiving, via an input subsystem of the computing device, a user response to the clarifying question, selecting domain-specific guidance including additional virtually rendered cues that are aligned relative to the digital twin model of the plurality of objects in the environment based on the user response to the clarifying question, and presenting, via the display of the computing device, the domain-specific guidance including the additional virtually rendered cues directing the user to perform the expected action.

2. The method of claim 1, further comprising: based on the user action matching the expected action for the current step in the process, presenting, via the display, one or more additional domain-specific instructions directing the user how to perform a next expected action to complete a next step in the process.

3. The method of claim 1, wherein the process includes user manipulation of a real-world object, wherein recognizing the process comprises recognizing the real-world object, and selecting the working domain from a plurality of different working domains corresponding to a plurality of different real-world objects based on the recognized real-world object.

4. The method of claim 3, wherein the digital twin model comprises metadata defining a plurality of parts of the real-world object, locations of the plurality of parts, and descriptions of the plurality of parts, and wherein the one or more domain-specific instructions and the domain-specific guidance comprises presenting, via the display, metadata of the digital twin model corresponding to a part of the real-world object involved in the current step in the process.

5. The method of claim 4, further comprising: determining that the user action differs from the expected action for the current step in the process based on the multi-modal synchronized state indicating that the user is interacting with a different part of the digital twin model than an expected part for the current step in the process.

6. The method of claim 3, wherein the display comprises near-eye display of an augmented-reality device, and wherein the domain-specific guidance comprises visually presenting, via the near-eye display, one or more of a virtual label indicating the part of the real-world object involved in the current step in the process and a virtual movement affordance indicating how to manipulate the part of the real-world object involved in the current step in the process.

7. The method of claim 1, wherein the one or more domain-specific instructions and the domain-specific guidance are provided within the frame of reference of the user based on the multi-modal synchronized state.

8. The method of claim 1, wherein tracking the user state comprises tracking one or more of a user head pose, one or more of the user's hand poses, and user speech.

9. The method of claim 1, wherein the user action is detected via an action-recognition machine-learning model previously trained on multi-modal sensor information corresponding to users performing different domain-specific user actions associated with the series of steps in the process.

10. A computing system comprising:
a plurality of sensors;
a display;
a processor; and
a storage device holding instructions executable by the processor to:
receive, via the plurality of sensors, multi-modal sensor information including image data and audio data;
track a world state of a real-world physical environment including a plurality of objects based on the multi-modal sensor information by performing object recognition to identify the plurality of objects based at least on the multi-modal sensor information, and by performing object localization and mapping to determine a plurality of positions of the plurality of objects in the environment based at least on the multi-modal sensor information;
track a user state in the real-world physical environment at least by tracking a pose of a user based on the multi-modal sensor information;
generate a multi-modal synchronized state that aligns the world state and the user state in a common coordinate system that is aligned with a frame of reference of the user in a data structure that includes a feature vector that characterizes the world state and the user state in the multi-modal synchronized state;
recognize a process being performed by the user via one or more trained machine learning models that are configured to receive the multi-modal synchronized state including the feature vector as input and determine the process being performed by the user based on the multi-modal synchronized state, the process comprising a series of steps;
select a working domain for the process being performed by the user, wherein the working domain includes a digital twin model, the digital twin model comprising a virtual representation and serving as a digital counterpart of the plurality of objects in the environment;
align the digital twin model with the plurality of objects in the environment;
detect a current step in the process based on the multi-modal synchronized state;
present, via the display, one or more domain-specific instructions including virtu
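
The "multi-modal synchronized state" recited in claims 1 and 10 aligns the world state and the user state in a coordinate system tied to the user's frame of reference and carries a feature vector. The following is a rough sketch of one way such a data structure could be assembled, under strong simplifying assumptions that are not taken from the patent: the user pose is reduced to a head position plus a yaw angle about the vertical axis, the user state is a single hand position, and the feature vector is a plain concatenation of coordinates. All names (`UserPose`, `SynchronizedState`, `to_user_frame`, `synchronize`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class UserPose:
    position: Vec3  # head position in world coordinates
    yaw: float      # heading about the vertical (z) axis, in radians


def to_user_frame(point: Vec3, pose: UserPose) -> Vec3:
    """Express a world-space point relative to the user's position and heading.

    The user's x axis points forward; only yaw is handled (simplifying assumption).
    """
    dx = point[0] - pose.position[0]
    dy = point[1] - pose.position[1]
    dz = point[2] - pose.position[2]
    c, s = math.cos(-pose.yaw), math.sin(-pose.yaw)
    return (c * dx - s * dy, s * dx + c * dy, dz)


@dataclass
class SynchronizedState:
    timestamp: float
    objects: Dict[str, Vec3]      # object label -> position in the user's frame
    hand: Vec3                    # hand position in the user's frame
    feature_vector: List[float]   # flat numeric summary for downstream models


def synchronize(timestamp: float, world_objects: Dict[str, Vec3],
                hand_world: Vec3, pose: UserPose) -> SynchronizedState:
    """Re-express world and user observations in one user-aligned frame."""
    # Sort labels so the feature vector layout is stable across frames.
    objects = {label: to_user_frame(world_objects[label], pose)
               for label in sorted(world_objects)}
    hand = to_user_frame(hand_world, pose)
    features = [c for p in objects.values() for c in p] + list(hand)
    return SynchronizedState(timestamp, objects, hand, features)


if __name__ == "__main__":
    pose = UserPose(position=(1.0, 0.0, 0.0), yaw=math.pi / 2)
    state = synchronize(0.5, {"valve": (1.0, 2.0, 0.0)}, (1.0, 1.0, 0.5), pose)
    print(state.objects["valve"], state.feature_vector)
```

Expressing both object positions and the hand position in the same user-aligned frame means a downstream model sees "the valve is two units straight ahead" regardless of where the user stands in the room. A real system would use full 6-DoF poses and richer features than this sketch assumes.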
Mixed reality (object pose determination, tracking or camera calibration for mixed reality G06T7/00) · CPC title
Neural networks · CPC title
Audio in a user interface, e.g. using voice commands for navigating, audio feedback · CPC title
Gesture based interaction, e.g. based on a set of recognized hand gestures (interaction based on gestures traced on a digitiser G06F3/04883) · CPC title
Arrangements for interaction with the human body, e.g. for user immersion in virtual reality (blind teaching G09B21/00) · CPC title