Machine-learned models for user interface prediction, generation, and interaction understanding

US12197930B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12197930-B2
Application number  US-202318466267-A
Country             US
Kind code           B2
Filing date         Sep 13, 2023
Priority date       Jun 1, 2021
Publication date    Jan 14, 2025
Grant date          Jan 14, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Generally, the present disclosure is directed to user interface understanding. More particularly, the present disclosure relates to training and utilization of machine-learned models for user interface prediction and/or generation. A machine-learned interface prediction model can be pre-trained using a variety of pre-training tasks for eventual downstream task training and utilization (e.g., interface prediction, interface generation, etc.).

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for training and utilization of machine-learned models for user interface interaction understanding, comprising: obtaining, by a computing system comprising one or more computing devices, interface data that comprises a sequence of two or more user interfaces obtained through performance of one or more user interactions which result in generation of the sequence of two or more user interfaces, wherein, for each user interface in the sequence of two or more user interfaces, the interface data comprises one or more interface images depicting the user interface; determining, by the computing system, a plurality of intermediate embeddings based at least in part on the interface data; processing, by the computing system, the plurality of intermediate embeddings with a machine-learned interface prediction model to obtain one or more user interface embeddings; and performing, by the computing system, fine-tuning on the machine-learned interface prediction model based on the one or more user interface embeddings.
2. The computer-implemented method of claim 1, wherein the interface data further comprises data descriptive of one or more link components, wherein the one or more link components comprise elements of the user interfaces which are the targets of the user interactions.
3. The computer-implemented method of claim 2, wherein the data descriptive of one or more link components comprises images of the one or more link components.
4. The computer-implemented method of claim 1, wherein the plurality of intermediate embeddings comprises one or more image embeddings, one or more textual embeddings, and one or more positional embeddings.
5. The computer-implemented method of claim 1, wherein the plurality of intermediate embeddings comprises one or more vision embeddings.
6. The computer-implemented method of claim 5, wherein determining the plurality of intermediate embeddings based at least in part on the interface data comprises: fine-tuning, by the computing system, a neural network to detect user interface components in the one or more interface images; and determining, by the computing system, the one or more vision embeddings based on the detected user interface components.
7. The computer-implemented method of claim 6, wherein determining the one or more vision embeddings based on the detected user interface components comprises: cropping, by the computing system, the one or more user interface images; providing, by the computing system, the cropped one or more user interface images to a vision encoder; and receiving, by the computing system, one or more flattened feature maps as the one or more vision embeddings from the vision encoder, the one or more flattened feature maps being generated by the vision encoder based on the one or more cropped user interface images.
8. The computer-implemented method of claim 1, wherein performing fine-tuning on the machine-learned interface prediction model based on the one or more user interface embeddings comprises: formatting, by the computing system, a downstream input into one or more segments; evaluating, by the computing system, a task-specific loss function for the machine-learned interface prediction model; and adjusting, by the computing system, one or more parameters of the machine-learned interface prediction model based at least in part on the task-specific loss function.
9. The computer-implemented method of claim 8, wherein the downstream input is a single user interface input, and wherein the interface data for one of the two or more user interfaces is left empty.
10. The computer-implemented method of claim 8, wherein a specific task associated with the task-specific loss function is natural language input, and wherein the downstream input is a language input comprising a text segment of a first user interface of the two or more user interfaces as a text token and the interface image depicting the first user interface as a corresponding vision token.
11. The computer-implemented method of claim 1, wherein the interface data further comprises structural data that is indicative of one or more positions of each of a plurality of interface elements included in the user interface, and wherein the structural data for each user interface comprises view hierarchy data associated with the user interface.
12. A computing system, comprising: one or more processors; and one or more tangible, non-transitory computer readable media storing computer-readable instructions that when executed by one or more processors cause the one or more processors to perform operations, the operations comprising: obtaining interface data that comprises a sequence of two or more user interfaces obtained through performance of one or more user interactions which result in generation of the sequence of two or more user interfaces, wherein, for each user interface in the sequence of two or more user interfaces, the interface data comprises one or more interface images depicting the user interface; determining a plurality of intermediate embeddings based at least in part on the interface data; processing the plurality of intermediate embeddings with a fine-tuned machine-learned interface prediction model to obtain one or more user interface embeddings; performing a prediction task based at least in part on the one or more user interface embeddings to obtain a prediction output; and fine-tuning the machine-learned interface prediction model using a task-specific loss function based on the prediction task.
13. The computing system of claim 12, wherein the prediction task is similar user interface component retrieval, and wherein performing the prediction task comprises: selecting a candidate user interface component based on a given component associated with at least one user interface embedding of the one or more user-interface embeddings.
14. The computing system of claim 13, wherein performing the prediction task comprises: generating a component level embedding for the most similar user interface component using the machine-learned interface prediction model; determining a dot product of the at least one user interface embedding and the component level embedding; and determining a similarity score for the candidate user interface component based on the dot product.
15. The computing system of claim 12, wherein the prediction task is expression component retrieval.
16. The computing system of claim 15, wherein performing the prediction task comprises: receiving, as input, a referring expression and an image of a user interface currently displayed; and selecting, from components of the user interface detected in the image of the user interface, a component referred to by the referring expression as the prediction output.
17. The computing system of claim 12, wherein the prediction task is icon classification.
18. The computing system of claim 17, wherein performing the prediction task comprises: obtaining a first user interface embedding from the one or more user interface embeddings, the first user interface embedding for a user interface component at a first position in at least one user interface of the two or more user interfaces; using the first user interface embedding as a contextual embedding for the user interface component; and classifying an icon for the user interface component based on the contextual embedding.
19. The computing system of claim 12, wherein the prediction task is application type class
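
Illustrative sketch (not part of the patent text). Claims 13 and 14 describe scoring a candidate user interface component by taking the dot product of a user interface embedding and a component-level embedding. The Python below is a minimal sketch of that scoring step only. The embedding values, the component names, and the helper names `dot` and `rank_candidates` are hypothetical; in the claimed system the embeddings would come from the machine-learned interface prediction model, which is not shown here.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_candidates(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each candidate by its dot product with the query embedding
    and return (name, score) pairs, highest score first."""
    scored = [(name, dot(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, made up for illustration only.
query = [0.9, 0.1, 0.0]
candidates = {
    "back_button": [0.8, 0.2, 0.1],
    "search_field": [0.1, 0.9, 0.3],
    "menu_icon": [0.5, 0.5, 0.0],
}

ranking = rank_candidates(query, candidates)
best_name, best_score = ranking[0]
print(best_name, round(best_score, 2))
```

With these made-up vectors, the candidate whose embedding points most nearly in the same direction as the query receives the highest score. A raw dot product also grows with vector magnitude, so whether embeddings are normalized first is a design choice the claim text shown here does not specify.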

Assignees

Inventors

Classifications

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Supervised learning · CPC title

  • Combinations of networks · CPC title

  • nonlinear criteria, e.g. embedding a manifold in a Euclidean space · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12197930B2 cover?
Generally, the present disclosure is directed to user interface understanding. More particularly, the present disclosure relates to training and utilization of machine-learned models for user interface prediction and/or generation. A machine-learned interface prediction model can be pre-trained using a variety of pre-training tasks for eventual downstream task training and utilization (e.g., in…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F9/451. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 14, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).