Training a generative artificial intelligence / machine learning model to recognize applications, screens, and user interface elements using computer vision

US12469272B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12469272-B2
Application number: US-202318355877-A
Country: US
Kind code: B2
Filing date: Jul 20, 2023
Priority date: Oct 14, 2020
Publication date: Nov 11, 2025
Grant date: Nov 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for training a generative artificial intelligence (AI)/machine learning (ML) model to recognize applications, screens, and UI elements using computer vision (CV) and to recognize user interactions with the applications, screens, and UI elements are disclosed. Optical character recognition (OCR) may also be used to assist in training the generative AI/ML model. Training of the generative AI/ML model may be performed without other system inputs such as system-level information (e.g., key presses, mouse clicks, locations, operating system operations, etc.) or application-level information (e.g., information from an application programming interface (API) from a software application executing on a computing system), or the training of the generative AI/ML model may be supplemented by other information, such as browser history, heat maps, file information, currently running applications and locations, system level and/or application-level information, etc.
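
As a rough illustration of the two training setups the abstract describes (screen captures alone versus screen captures supplemented by other information), the sketch below models one recorded sample in Python. All names here (`TrainingSample`, `is_vision_only`, `split_by_inputs`, and the field names) are hypothetical and do not come from the patent; the sketch only shows how optional supplementary inputs could sit beside a frame.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingSample:
    """One recorded frame plus optional supplementary information (hypothetical layout)."""
    frame_png: bytes                     # encoded screenshot or video frame
    timestamp_ms: int                    # capture time in milliseconds
    # System-level information; empty when training from the display alone.
    key_presses: list[str] = field(default_factory=list)
    mouse_clicks: list[tuple[int, int]] = field(default_factory=list)
    # Other supplementary information mentioned in the abstract.
    browser_history: list[str] = field(default_factory=list)
    active_application: Optional[str] = None

    def is_vision_only(self) -> bool:
        """True when the sample carries nothing beyond the frame itself."""
        return not (self.key_presses or self.mouse_clicks
                    or self.browser_history or self.active_application)


def split_by_inputs(samples):
    """Partition samples into vision-only and supplemented groups."""
    vision_only = [s for s in samples if s.is_vision_only()]
    supplemented = [s for s in samples if not s.is_vision_only()]
    return vision_only, supplemented
```

A training pipeline could then feed the first group to a purely CV-based stage and use the second group wherever the extra signals are available. How the patent's system actually stores or routes this data is not stated on this page.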

First claim

Opening claim text (preview).

The invention claimed is:

1. A system, comprising: one or more user computing systems comprising respective recorder processes; and a server configured to train a generative artificial intelligence (AI)/machine learning (ML) model to recognize applications, screens, and user interface (UI) elements using computer vision (CV) and to recognize user interactions with the applications, screens, and UI elements, wherein the respective recorder processes are configured to: record screenshots or video frames of a display associated with the respective user computing system and other information, and send the recorded screenshots or video frames, and the other information, to storage accessible by the server, and the server is configured to: initially train the generative AI/ML model to recognize the applications, screens, and UI elements that are present in the recorded screenshots or video frames using the recorded screenshots or video frames and the other information, and after the generative AI/ML model can recognize the applications, screens, and UI elements in the recorded screenshots or video frames with a confidence, train the generative AI/ML model to recognize individual user interactions with the UI elements.

2. The system of claim 1, wherein the individual user interactions comprise button presses, entry of single characters or character sequences, selection of active UI elements, menu selections, screen changes, voice inputs, gestures, providing biometric information, haptic interactions, or a combination thereof.

3. The system of claim 1, wherein the training of the generative AI/ML model to recognize the individual user interactions with the UI elements comprises comparing two or more consecutive screenshots or video frames and determining that a typed character appeared from one screenshot to another, a button was pressed, or a menu selection occurred.

4. The system of claim 1, wherein the other information comprises a web browser history, one or more heat maps, key presses, mouse clicks, locations of mouse clicks and/or graphical elements on the display that a user is interacting with, locations where the user was looking on the display, time stamps associated with the screenshots or video frames, text that the user entered, content that the user scrolled past, a time that the user stopped on a part of content shown in the display, what application the user is interacting with, voice inputs, gestures, emotion information, biometrics, information pertaining to periods of no user activity, haptic information, multi-touch input information, or a combination thereof.

5. The system of claim 1, wherein the one or more user computing systems or the server are configured to generate one or more heat maps, the other information comprising the one or more heat maps, and the one or more heat maps comprise a frequency that a user used applications, a frequency that the user interacted with components of the applications, locations of the components in the applications, content of the applications and/or components, or a combination thereof.

6. The system of claim 5, wherein the one or more user computing systems or the server are configured to derive the one or more heat maps from display analysis comprising detection of typed and/or pasted text, caret tracking, active element detection, or a combination thereof.

7. The system of claim 1, wherein the respective recorder processes are implemented as feedback loop processes that continuously or periodically compare a current screenshot or video frame to a previous screenshot or video frame and identify one or more locations where changes between the current screenshot or video frame and the previous screenshot or video frame occurred.

8. The system of claim 7, wherein the respective recorder processes are further configured to: perform optical character recognition (OCR) on the one or more locations where the changes occurred; compare results of the OCR to content of a keyboard queue to determine whether a match exists; and when a match exists, link text associated with the match to a respective location.

9. The system of claim 1, further comprising: an automation box operably connected to a user computing system of the one or more user computing systems, the automation box configured to: receive input from one or more user input devices, associate time stamps with the input, and send the time stamped input to storage accessible by the server, wherein the server is configured to use the time stamped input for the initial training of the generative AI/ML model.

10. The system of claim 1, wherein server is configured to perform the initial training of the generative AI/ML model without a priori knowledge of the applications, screens, and UI elements in the screenshots or video frames.

11. The system of claim 1, wherein the generative AI model is a large language model (LLM), a generative adversarial network (GAN), a variational autoencoder (VAE), or a transformer.

12. A non-transitory computer-readable medium storing a computer program configured to train a generative artificial intelligence (AI)/machine learning (ML) model to recognize applications, screens, and user interface (UI) elements using computer vision (CV) and/or to recognize user interactions with the applications, screens, and UI elements, the computer program configured to cause at least one processor to: access recorded screenshots or video frames of displays associated with one or more computing systems and access other information associated with the one or more computing systems; and initially train the generative AI/ML model to recognize the applications, screens, and UI elements that are present in the recorded screenshots or video frames using the recorded screenshots or video frames and the other information, wherein the initial training of the generative AI/ML model is performed without a priori knowledge of the applications, screens, and UI elements in the screenshots or video frames.

13. The non-transitory computer-readable medium of claim 12, wherein after the generative AI/ML model can recognize the applications, screens, and UI elements in the recorded screenshots or video frames with a confidence, the computer program is further configured to cause the at least one processor to: train the generative AI/ML model to recognize individual user interactions with the UI elements.

14. The non-transitory computer-readable medium of claim 13, wherein the training of the generative AI/ML model to recognize the individual user interactions with the UI elements comprises comparing two or more consecutive screenshots or video frames and determining that a typed character appeared from one to another, a button was pressed, or a menu selection occurred.

15. The non-transitory computer-readable medium of claim 13, wherein the individual user interactions comprise button presses, entry of single characters or character sequences, selection of active UI elements, menu selections, screen changes, voice inputs, gestures, providing biometric information, haptic interactions, or a combination thereof.

16. The non-transitory computer-readable medium of claim 12, wherein the other information comprises a web browser history, one or more heat maps, key presses, mouse clicks, locations of mouse clicks and/or graphical elements on the display that a user is interacting with, locations where the user was looking on the display, time stamps associated with the screenshots or video frames, text that the user entered, con
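
The recorder loop described in claims 7 and 8 (compare the current frame with the previous one, locate the change, run OCR there, and match the result against a keyboard queue) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the frame representation, the function names `changed_box` and `link_typed_text`, the pluggable `ocr` callable, and the substring matching rule are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Optional

Frame = list[list[int]]           # grayscale pixels, row-major (assumed representation)
Box = tuple[int, int, int, int]   # (top, left, bottom, right), inclusive


def changed_box(prev: Frame, curr: Frame) -> Optional[Box]:
    """Return the bounding box of pixels that differ between two same-sized frames."""
    rows = [r for r in range(len(curr)) if prev[r] != curr[r]]
    if not rows:
        return None  # nothing changed between the two frames
    cols = [c for r in rows
            for c in range(len(curr[r])) if prev[r][c] != curr[r][c]]
    return (rows[0], min(cols), rows[-1], max(cols))


def link_typed_text(prev: Frame, curr: Frame,
                    ocr: Callable[[Frame, Box], str],
                    keyboard_queue: deque[str]) -> Optional[tuple[str, Box]]:
    """Link OCR text found in the changed region to its location when it
    matches recent keystrokes. `ocr` is a caller-supplied stand-in for a real
    OCR engine; substring matching is an assumed rule, not one from the claims."""
    box = changed_box(prev, curr)
    if box is None:
        return None
    text = ocr(curr, box)
    typed = "".join(keyboard_queue)
    if text and text in typed:
        return text, box
    return None
```

In use, a recorder would call `link_typed_text` on each new frame with a bounded `deque` of recent key presses, and keep the returned text and box as a labeled example of typed input at that screen location.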

Assignees

UiPath Inc

Inventors

Classifications

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

  • Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16)

  • Recognising information on displays, dials, clocks

  • Image or video pattern matching; Proximity measures in feature spaces

  • Character recognition

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12469272B2 cover?
Techniques for training a generative artificial intelligence (AI)/machine learning (ML) model to recognize applications, screens, and UI elements using computer vision (CV) and to recognize user interactions with the applications, screens, and UI elements are disclosed. Optical character recognition (OCR) may also be used to assist in training the generative AI/ML model. Training of the generat…
Who is the assignee on this patent?
UiPath Inc
What technology area does this patent fall under?
Primary CPC classification G06V10/82. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 11, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).