Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification G06T5/77. Mapped technology areas include Physics.

When was this patent published?

Publication date Dec 2, 2025 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Training masked autoencoders for image inpainting

US12488430B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12488430-B2
Application number	US-202218000285-A
Country	US
Kind code	B2
Filing date	May 19, 2022
Priority date	May 19, 2022
Publication date	Dec 2, 2025
Grant date	Dec 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure herein describes training an encoder network to inpaint images with masked portions. A primary encoding process is used to encode a visible portion of a masked input image into encoded token data. The encoded token data is then decoded into both pixel regression output and feature prediction output, wherein both outputs include inpainted image data associated with the masked portion of the masked input image. A pixel regression loss is determined using the pixel regression output and pixel data of an unmasked version of the masked input image. A feature prediction loss is determined using the feature prediction output and ground truth encoding output of the unmasked version of the masked input image. The primary encoding process is then trained using the pixel regression loss and the feature prediction loss, whereby the primary encoding process is trained to encode structural features of input images into encoded token data.
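The two-loss objective described in the abstract can be illustrated with a minimal sketch. The abstract does not say how each loss is computed or how the two are combined, so the mean squared error, the `feat_weight` parameter, and the function names below are assumptions made for illustration only, not the patented method.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(pixel_pred, pixel_target, feat_pred, feat_target, feat_weight=1.0):
    """Combine a pixel regression loss and a feature prediction loss.

    pixel_pred / pixel_target: inpainted pixels vs. pixels of the unmasked image.
    feat_pred / feat_target: predicted features vs. ground truth encoding output.
    feat_weight is a hypothetical balancing factor (assumption, not from the source).
    """
    pixel_loss = mse(pixel_pred, pixel_target)
    feat_loss = mse(feat_pred, feat_target)
    total = pixel_loss + feat_weight * feat_loss
    return total, pixel_loss, feat_loss
```

In this sketch the total would drive the update of the primary encoding process, while the abstract leaves open how the individual losses are used elsewhere.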

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: at least one processor; and at least one memory comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the at least one processor to: encode, using a primary encoding process, a visible portion of a masked input image into encoded token data, wherein the masked input image includes the visible portion and a masked portion; decode the encoded token data into pixel regression output, the pixel regression output including inpainted image pixel data associated with the masked portion of the masked input image; decode the encoded token data into feature predictor output, the feature predictor output including inpainted image feature data associated with the masked portion of the masked input image; apply a ground truth momentum encoding process to an unmasked version of the masked input image; train the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output, whereby the primary encoding process is trained to encode structural features of input images into encoded token data; and update the ground truth momentum encoding process based on changes made to the primary encoding process.

2. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: determine a pixel regression loss using the pixel regression output and pixel data of an unmasked version of the masked input image; determine a feature prediction loss using the feature prediction output and ground truth encoding output of the ground truth momentum encoding process applied to the unmasked version of the masked input image; and update parameters of the ground truth momentum encoding process based on an exponential moving average (EMA) of parameters of the trained primary encoding process.

3. The system of claim 2, wherein the pixel regression output is decoded from the encoded token data by a pixel regressor, and the feature prediction output is decoded from the encoded token data by a feature predictor; and wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: train the pixel regressor using the determined pixel regression loss; and train the feature predictor using the determined feature prediction loss; wherein training the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output includes training the primary encoding process using the determined pixel regression loss and the determined feature prediction loss.

4. The system of claim 3, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: obtain low-level feature data based on the visible portion of the masked input image from the primary encoding process; provide the obtained low-level feature data to the pixel regressor, wherein the provided low-level feature data is used for decoding the encoded token data into the pixel regression output; obtain high-level feature data based on the visible portion of the masked input image from the primary encoding process; and provide the obtained high-level feature data to the feature predictor, wherein the provided high-level feature data is used for decoding the encoded token data into the feature prediction output.

5. The system of claim 4, wherein the low-level feature data is obtained from a portion of the primary encoding process prior to a transformation subprocess of the primary encoding process; wherein the low-level feature data is provided to each block of the pixel regressor; wherein the high-level feature data is obtained from a portion of the primary encoding process after a transformation subprocess of the primary encoding process; and wherein the high-level feature data is provided to each block of the feature predictor.

6. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: receive a second masked input image; and generate an inpainted output image from the received second masked input image using the trained primary encoding process and at least one decoding process.

7. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: receive an unmasked version of the masked input image; divide the received unmasked version of the masked input image into a set of non-overlapping patches; and apply a mask to a first subset of the set of non-overlapping patches, wherein the first subset of patches is a set of masked patches and a second subset of the set of non-overlapping patches is a set of visible patches; wherein the masked portion of the masked input image includes the set of masked patches, and the visible portion of the masked input image includes the set of visible patches; and wherein the encoded token data includes an encoded token for each visible patch of the set of visible patches.

8. A computerized method comprising: encoding, by a processor, using a primary encoding process, a visible portion of a masked input image into encoded token data, wherein the masked input image includes the visible portion and a masked portion; decoding, by the processor, the encoded token data into pixel regression output, the pixel regression output including inpainted image pixel data associated with the masked portion of the masked input image; decoding, by the processor, the encoded token data into feature predictor output, the feature predictor output including inpainted image feature data associated with the masked portion of the masked input image; applying a ground truth momentum encoding process to an unmasked version of the masked input image; training, by the processor, the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output, whereby the primary encoding process is trained to encode structural features of input images into encoded token data; and updating the ground truth momentum encoding process based on changes made to the primary encoding process.

9. The computerized method of claim 8, further comprising: determining a pixel regression loss using the pixel regression output and pixel data of an unmasked version of the masked input image; determining a feature prediction loss using the feature prediction output and ground truth encoding output of the ground truth momentum encoding process applied to the unmasked version of the masked input image; and updating, by the processor, parameters of the ground truth momentum encoding process based on an exponential moving average (EMA) of parameters of the trained primary encoding process.

10. The computerized method of claim 9, wherein the pixel regression output is decoded from the encoded token data by a pixel regressor, and the feature prediction output is decoded from the encoded token data by a feature predictor; and the computerized method further comprising: training, by the processor, the pixel regressor using the determined pixel regression loss; and training, by the processor, the feature predictor
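Two steps named in the claims, dividing an image into non-overlapping patches with a masked subset and updating the ground truth momentum encoding process by an exponential moving average, can be sketched as follows. This is a hedged illustration only: the random choice of masked patches, the `mask_ratio` and `momentum` values, and all function names are assumptions that the claim text does not specify.

```python
import random


def split_into_patches(height, width, patch):
    """Return top-left (row, col) corners of non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]


def choose_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into visible and masked subsets.

    Uniform random selection and mask_ratio are assumptions for illustration.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)


def ema_update(target_params, online_params, momentum=0.99):
    """Move momentum-encoder parameters toward the primary encoder's parameters.

    Each target value becomes momentum * target + (1 - momentum) * online.
    The default momentum of 0.99 is a hypothetical value.
    """
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]
```

Under this sketch only the visible patch indices would be passed to the primary encoding process, one token per visible patch, while `ema_update` would be called after each training step of that process.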

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G06T2207/20081
Training; Learning · CPC title
G06T2207/20021
Dividing image into blocks, subimages or windows · CPC title
G06T5/77 (Primary)
Retouching; Inpainting; Scratch removal · CPC title
G06V10/766
using regression, e.g. by projecting features on hyperplanes · CPC title
G06V10/774
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

Patent family

Related publications grouped by family.

View patent family 82100291

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12488430B2 cover?: The disclosure herein describes training an encoder network to inpaint images with masked portions. A primary encoding process is used to encode a visible portion of a masked input image into encoded token data. The encoded token data is then decoded into both pixel regression output and feature prediction output, wherein both outputs include inpainted image data associated with the masked port…

Related patents

Data augmentation using machine translation capabilities of language models

Unsupervised style and color cues for transformer-based image generation

Diverse image inpainting using contrastive learning

Method and system for high-resolution image inpainting

Weakly-supervised spatial context networks to recognize features within an image
