Training masked autoencoders for image inpainting

US12488430B2 · US · B2

Patent metadata
Publication number: US-12488430-B2
Application number: US-202218000285-A
Country: US
Kind code: B2
Filing date: May 19, 2022
Priority date: May 19, 2022
Publication date: Dec 2, 2025
Grant date: Dec 2, 2025

Abstract

Official abstract text for this publication.

The disclosure herein describes training an encoder network to inpaint images with masked portions. A primary encoding process is used to encode a visible portion of a masked input image into encoded token data. The encoded token data is then decoded into both pixel regression output and feature prediction output, wherein both outputs include inpainted image data associated with the masked portion of the masked input image. A pixel regression loss is determined using the pixel regression output and pixel data of an unmasked version of the masked input image. A feature prediction loss is determined using the feature prediction output and ground truth encoding output of the unmasked version of the masked input image. The primary encoding process is then trained using the pixel regression loss and the feature prediction loss, whereby the primary encoding process is trained to encode structural features of input images into encoded token data.
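The two losses described in the abstract can be illustrated with a small sketch. It is not the patented implementation: the use of mean squared error for both terms, the restriction of each loss to masked patches, the weighted sum, and every name here (`mse`, `masked_loss`, `combined_loss`, `feat_weight`) are assumptions made for illustration only.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def masked_loss(pred_patches, target_patches, masked_ids):
    """Average per-patch MSE over the masked patch indices only (assumption)."""
    losses = [mse(pred_patches[i], target_patches[i]) for i in masked_ids]
    return sum(losses) / len(losses)


def combined_loss(pixel_pred, pixels, feat_pred, gt_feats, masked_ids,
                  feat_weight=1.0):
    """Combine the two training signals.

    pixel_pred / pixels: per-patch pixel values (prediction vs. unmasked image).
    feat_pred / gt_feats: per-patch features (prediction vs. ground truth
    encoding of the unmasked image).
    feat_weight is a hypothetical balancing factor, not taken from the source.
    """
    pixel_loss = masked_loss(pixel_pred, pixels, masked_ids)
    feature_loss = masked_loss(feat_pred, gt_feats, masked_ids)
    total = pixel_loss + feat_weight * feature_loss
    return total, pixel_loss, feature_loss
```

In a real training loop the total would be backpropagated through both decoders and the primary encoder; this sketch only shows how the two terms could be formed and combined.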

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: at least one processor; and at least one memory comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the at least one processor to: encode, using a primary encoding process, a visible portion of a masked input image into encoded token data, wherein the masked input image includes the visible portion and a masked portion; decode the encoded token data into pixel regression output, the pixel regression output including inpainted image pixel data associated with the masked portion of the masked input image; decode the encoded token data into feature predictor output, the feature predictor output including inpainted image feature data associated with the masked portion of the masked input image; apply a ground truth momentum encoding process to an unmasked version of the masked input image; train the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output, whereby the primary encoding process is trained to encode structural features of input images into encoded token data; and update the ground truth momentum encoding process based on changes made to the primary encoding process.

2. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: determine a pixel regression loss using the pixel regression output and pixel data of an unmasked version of the masked input image; determine a feature prediction loss using the feature prediction output and ground truth encoding output of the ground truth momentum encoding process applied to the unmasked version of the masked input image; and update parameters of the ground truth momentum encoding process based on an exponential moving average (EMA) of parameters of the trained primary encoding process.

3. The system of claim 2, wherein the pixel regression output is decoded from the encoded token data by a pixel regressor, and the feature prediction output is decoded from the encoded token data by a feature predictor; and wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: train the pixel regressor using the determined pixel regression loss; and train the feature predictor using the determined feature prediction loss; wherein training the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output includes training the primary encoding process using the determined pixel regression loss and the determined feature prediction loss.

4. The system of claim 3, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: obtain low-level feature data based on the visible portion of the masked input image from the primary encoding process; provide the obtained low-level feature data to the pixel regressor, wherein the provided low-level feature data is used for decoding the encoded token data into the pixel regression output; obtain high-level feature data based on the visible portion of the masked input image from the primary encoding process; and provide the obtained high-level feature data to the feature predictor, wherein the provided high-level feature data is used for decoding the encoded token data into the feature prediction output.

5. The system of claim 4, wherein the low-level feature data is obtained from a portion of the primary encoding process prior to a transformation subprocess of the primary encoding process; wherein the low-level feature data is provided to each block of the pixel regressor; wherein the high-level feature data is obtained from a portion of the primary encoding process after a transformation subprocess of the primary encoding process; and wherein the high-level feature data is provided to each block of the feature predictor.

6. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: receive a second masked input image; and generate an inpainted output image from the received second masked input image using the trained primary encoding process and at least one decoding process.

7. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: receive an unmasked version of the masked input image; divide the received unmasked version of the masked input image into a set of non-overlapping patches; and apply a mask to a first subset of the set of non-overlapping patches, wherein the first subset of patches is a set of masked patches and a second subset of the set of non-overlapping patches is a set of visible patches; wherein the masked portion of the masked input image includes the set of masked patches, and the visible portion of the masked input image includes the set of visible patches; and wherein the encoded token data includes an encoded token for each visible patch of the set of visible patches.

8. A computerized method comprising: encoding, by a processor, using a primary encoding process, a visible portion of a masked input image into encoded token data, wherein the masked input image includes the visible portion and a masked portion; decoding, by the processor, the encoded token data into pixel regression output, the pixel regression output including inpainted image pixel data associated with the masked portion of the masked input image; decoding, by the processor, the encoded token data into feature predictor output, the feature predictor output including inpainted image feature data associated with the masked portion of the masked input image; applying a ground truth momentum encoding process to an unmasked version of the masked input image; training, by the processor, the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output, whereby the primary encoding process is trained to encode structural features of input images into encoded token data; and updating the ground truth momentum encoding process based on changes made to the primary encoding process.

9. The computerized method of claim 8, further comprising: determining a pixel regression loss using the pixel regression output and pixel data of an unmasked version of the masked input image; determining a feature prediction loss using the feature prediction output and ground truth encoding output of the ground truth momentum encoding process applied to the unmasked version of the masked input image; and updating, by the processor, parameters of the ground truth momentum encoding process based on an exponential moving average (EMA) of parameters of the trained primary encoding process.

10. The computerized method of claim 9, wherein the pixel regression output is decoded from the encoded token data by a pixel regressor, and the feature prediction output is decoded from the encoded token data by a feature predictor; and the computerized method further comprising: training, by the processor, the pixel regressor using the determined pixel regression loss; and training, by the processor, the feature predictor
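Two mechanical steps recited in the claims, dividing an image into non-overlapping patches with a masked subset (claim 7) and updating the momentum encoder from an exponential moving average of the primary encoder's parameters (claims 2 and 9), might be sketched as follows. This is an illustrative sketch only: uniform random mask selection, the `mask_ratio` and `decay` parameters, the dictionary representation of parameters, and all function names are assumptions that do not come from the claim text.

```python
import random


def split_into_patches(image, patch):
    """Split a 2-D list (H x W) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, each flattened to a list.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [
        [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


def choose_mask(num_patches, mask_ratio, rng):
    """Pick masked patch indices uniformly at random (assumed strategy).

    Returns (masked_ids, visible_ids); only visible patches would be encoded.
    """
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return masked, visible


def ema_update(momentum_params, primary_params, decay=0.99):
    """In-place EMA update of the momentum encoder's parameters.

    Parameters are modeled as a name -> float mapping for simplicity;
    `decay` is a hypothetical hyperparameter, not taken from the source.
    """
    for name, value in primary_params.items():
        momentum_params[name] = (
            decay * momentum_params[name] + (1.0 - decay) * value
        )
```

With a 4 x 4 image and `patch=2`, `split_into_patches` yields four tiles; `choose_mask(4, 0.75, random.Random(0))` then marks three of them as masked and leaves one visible.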

Classifications

  • Training; Learning

  • Dividing image into blocks, subimages or windows

  • G06T5/77 (primary): Retouching; Inpainting; Scratch removal

  • Using regression, e.g. by projecting features on hyperplanes

  • Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06T5/77. Mapped technology areas include Physics.
When was this patent published?
Published Dec 2, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.