Utilizing artificial intelligence and machine learning models to reverse engineer an application from application artifacts
US-2021263733-A1 · Aug 26, 2021 · US
US12469272B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12469272-B2 |
| Application number | US-202318355877-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 20, 2023 |
| Priority date | Oct 14, 2020 |
| Publication date | Nov 11, 2025 |
| Grant date | Nov 11, 2025 |
Official abstract text for this publication.
Techniques for training a generative artificial intelligence (AI)/machine learning (ML) model to recognize applications, screens, and UI elements using computer vision (CV) and to recognize user interactions with the applications, screens, and UI elements are disclosed. Optical character recognition (OCR) may also be used to assist in training the generative AI/ML model. Training of the generative AI/ML model may be performed without other system inputs such as system-level information (e.g., key presses, mouse clicks, locations, operating system operations, etc.) or application-level information (e.g., information from an application programming interface (API) from a software application executing on a computing system), or the training of the generative AI/ML model may be supplemented by other information, such as browser history, heat maps, file information, currently running applications and locations, system level and/or application-level information, etc.
Opening claim text (preview).
The invention claimed is:
1. A system, comprising: one or more user computing systems comprising respective recorder processes; and a server configured to train a generative artificial intelligence (AI)/machine learning (ML) model to recognize applications, screens, and user interface (UI) elements using computer vision (CV) and to recognize user interactions with the applications, screens, and UI elements, wherein the respective recorder processes are configured to: record screenshots or video frames of a display associated with the respective user computing system and other information, and send the recorded screenshots or video frames, and the other information, to storage accessible by the server, and the server is configured to: initially train the generative AI/ML model to recognize the applications, screens, and UI elements that are present in the recorded screenshots or video frames using the recorded screenshots or video frames and the other information, and after the generative AI/ML model can recognize the applications, screens, and UI elements in the recorded screenshots or video frames with a confidence, train the generative AI/ML model to recognize individual user interactions with the UI elements.
2. The system of claim 1, wherein the individual user interactions comprise button presses, entry of single characters or character sequences, selection of active UI elements, menu selections, screen changes, voice inputs, gestures, providing biometric information, haptic interactions, or a combination thereof.
3. The system of claim 1, wherein the training of the generative AI/ML model to recognize the individual user interactions with the UI elements comprises comparing two or more consecutive screenshots or video frames and determining that a typed character appeared from one screenshot to another, a button was pressed, or a menu selection occurred.
4. The system of claim 1, wherein the other information comprises a web browser history, one or more heat maps, key presses, mouse clicks, locations of mouse clicks and/or graphical elements on the display that a user is interacting with, locations where the user was looking on the display, time stamps associated with the screenshots or video frames, text that the user entered, content that the user scrolled past, a time that the user stopped on a part of content shown in the display, what application the user is interacting with, voice inputs, gestures, emotion information, biometrics, information pertaining to periods of no user activity, haptic information, multi-touch input information, or a combination thereof.
5. The system of claim 1, wherein the one or more user computing systems or the server are configured to generate one or more heat maps, the other information comprising the one or more heat maps, and the one or more heat maps comprise a frequency that a user used applications, a frequency that the user interacted with components of the applications, locations of the components in the applications, content of the applications and/or components, or a combination thereof.
6. The system of claim 5, wherein the one or more user computing systems or the server are configured to derive the one or more heat maps from display analysis comprising detection of typed and/or pasted text, caret tracking, active element detection, or a combination thereof.
7. The system of claim 1, wherein the respective recorder processes are implemented as feedback loop processes that continuously or periodically compare a current screenshot or video frame to a previous screenshot or video frame and identify one or more locations where changes between the current screenshot or video frame and the previous screenshot or video frame occurred.
8. The system of claim 7, wherein the respective recorder processes are further configured to: perform optical character recognition (OCR) on the one or more locations where the changes occurred; compare results of the OCR to content of a keyboard queue to determine whether a match exists; and when a match exists, link text associated with the match to a respective location.
9. The system of claim 1, further comprising: an automation box operably connected to a user computing system of the one or more user computing systems, the automation box configured to: receive input from one or more user input devices, associate time stamps with the input, and send the time stamped input to storage accessible by the server, wherein the server is configured to use the time stamped input for the initial training of the generative AI/ML model.
10. The system of claim 1, wherein server is configured to perform the initial training of the generative AI/ML model without a priori knowledge of the applications, screens, and UI elements in the screenshots or video frames.
11. The system of claim 1, wherein the generative AI model is a large language model (LLM), a generative adversarial network (GAN), a variational autoencoder (VAE), or a transformer.
12. A non-transitory computer-readable medium storing a computer program configured to train a generative artificial intelligence (AI)/machine learning (ML) model to recognize applications, screens, and user interface (UI) elements using computer vision (CV) and/or to recognize user interactions with the applications, screens, and UI elements, the computer program configured to cause at least one processor to: access recorded screenshots or video frames of displays associated with one or more computing systems and access other information associated with the one or more computing systems; and initially train the generative AI/ML model to recognize the applications, screens, and UI elements that are present in the recorded screenshots or video frames using the recorded screenshots or video frames and the other information, wherein the initial training of the generative AI/ML model is performed without a priori knowledge of the applications, screens, and UI elements in the screenshots or video frames.
13. The non-transitory computer-readable medium of claim 12, wherein after the generative AI/ML model can recognize the applications, screens, and UI elements in the recorded screenshots or video frames with a confidence, the computer program is further configured to cause the at least one processor to: train the generative AI/ML model to recognize individual user interactions with the UI elements.
14. The non-transitory computer-readable medium of claim 13, wherein the training of the generative AI/ML model to recognize the individual user interactions with the UI elements comprises comparing two or more consecutive screenshots or video frames and determining that a typed character appeared from one to another, a button was pressed, or a menu selection occurred.
15. The non-transitory computer-readable medium of claim 13, wherein the individual user interactions comprise button presses, entry of single characters or character sequences, selection of active UI elements, menu selections, screen changes, voice inputs, gestures, providing biometric information, haptic interactions, or a combination thereof.
16. The non-transitory computer-readable medium of claim 12, wherein the other information comprises a web browser history, one or more heat maps, key presses, mouse clicks, locations of mouse clicks and/or graphical elements on the display that a user is interacting with, locations where the user was looking on the display, time stamps associated with the screenshots or video frames, text that the user entered, con
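
Illustrative sketch (not part of the patent text). Claims 7 and 8 describe a recorder that compares the current frame to the previous one, identifies where changes occurred, runs OCR on those locations, and links matching text from a keyboard queue to each location. The Python below is a minimal sketch of that loop under several assumptions that do not come from the claims: frames are grayscale pixel grids held as nested lists, changes are located per fixed-size tile, the keyboard queue holds one character per entry, a "match" means the OCR text appears in the queued characters, and `ocr` is a hypothetical callable standing in for a real OCR engine. All function names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Frame = List[List[int]]          # assumed: grayscale pixels, row-major
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def changed_tiles(prev: Frame, curr: Frame, tile: int = 8) -> List[Box]:
    """Return boxes of fixed-size tiles whose pixels differ between two frames.

    Assumes both frames have the same dimensions.
    """
    height, width = len(curr), len(curr[0])
    boxes: List[Box] = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            # A tile counts as changed if any row segment inside it differs.
            if any(
                prev[y][left:right] != curr[y][left:right]
                for y in range(top, bottom)
            ):
                boxes.append((left, top, right, bottom))
    return boxes


def link_text_to_locations(
    prev: Frame,
    curr: Frame,
    keyboard_queue: Deque[str],
    ocr: Callable[[Frame, Box], str],
    tile: int = 8,
) -> Dict[Box, str]:
    """Link OCR text to changed locations when it matches the keyboard queue.

    `ocr` is a hypothetical stand-in: it receives the current frame and a box
    and returns the recognized text for that region.
    """
    typed = "".join(keyboard_queue)  # assumed: one character per queue entry
    links: Dict[Box, str] = {}
    for box in changed_tiles(prev, curr, tile):
        text = ocr(curr, box)
        # Assumed matching rule: the OCR text occurs in the queued characters.
        if text and text in typed:
            links[box] = text
    return links
```

A caller would keep the previous frame, invoke `link_text_to_locations` on each new capture, and then replace the previous frame with the current one, which gives the continuous or periodic comparison that claim 7 recites. The tile size trades localization precision against the number of OCR calls; the claims do not specify either.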
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Movements or behaviour, e.g. gesture recognition (recognition of facial expressions G06V40/16) · CPC title
Recognising information on displays, dials, clocks · CPC title
Image or video pattern matching; Proximity measures in feature spaces · CPC title
Character recognition · CPC title