US12488430B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12488430-B2 |
| Application number | US-202218000285-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 19, 2022 |
| Priority date | May 19, 2022 |
| Publication date | Dec 2, 2025 |
| Grant date | Dec 2, 2025 |
Official abstract text for this publication.
The disclosure herein describes training an encoder network to inpaint images with masked portions. A primary encoding process is used to encode a visible portion of a masked input image into encoded token data. The encoded token data is then decoded into both pixel regression output and feature prediction output, wherein both outputs include inpainted image data associated with the masked portion of the masked input image. A pixel regression loss is determined using the pixel regression output and pixel data of an unmasked version of the masked input image. A feature prediction loss is determined using the feature prediction output and ground truth encoding output of the unmasked version of the masked input image. The primary encoding process is then trained using the pixel regression loss and the feature prediction loss, whereby the primary encoding process is trained to encode structural features of input images into encoded token data.
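The two training signals described in the abstract can be illustrated with a small sketch. This is a minimal example under stated assumptions, not the patented implementation: the abstract does not say which distance function either loss uses or how the two are combined, so mean squared error, the simple weighted sum, and the names `mse`, `combined_loss`, and `feat_weight` are all hypothetical choices made for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(pixel_pred, pixel_target, feat_pred, feat_target, feat_weight=1.0):
    """Return (total, pixel_loss, feat_loss) for one training example.

    pixel_pred / pixel_target: inpainted pixels vs. pixels of the unmasked image.
    feat_pred / feat_target: predicted features vs. the ground truth encoding output.
    MSE and the weighted sum are assumptions; the source does not specify them.
    """
    pixel_loss = mse(pixel_pred, pixel_target)
    feat_loss = mse(feat_pred, feat_target)
    total = pixel_loss + feat_weight * feat_loss
    return total, pixel_loss, feat_loss


if __name__ == "__main__":
    total, pixel_loss, feat_loss = combined_loss(
        pixel_pred=[1.0, 2.0], pixel_target=[1.0, 4.0],
        feat_pred=[0.0], feat_target=[1.0],
    )
    print(total, pixel_loss, feat_loss)
```

In a real training loop the total would be backpropagated through both decoders into the primary encoder, which is how both losses come to shape the encoded token data.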
Opening claim text (preview).
What is claimed is:
1. A system comprising: at least one processor; and at least one memory comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the at least one processor to: encode, using a primary encoding process, a visible portion of a masked input image into encoded token data, wherein the masked input image includes the visible portion and a masked portion; decode the encoded token data into pixel regression output, the pixel regression output including inpainted image pixel data associated with the masked portion of the masked input image; decode the encoded token data into feature predictor output, the feature predictor output including inpainted image feature data associated with the masked portion of the masked input image; apply a ground truth momentum encoding process to an unmasked version of the masked input image; train the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output, whereby the primary encoding process is trained to encode structural features of input images into encoded token data; and update the ground truth momentum encoding process based on changes made to the primary encoding process.
2. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: determine a pixel regression loss using the pixel regression output and pixel data of an unmasked version of the masked input image; determine a feature prediction loss using the feature prediction output and ground truth encoding output of the ground truth momentum encoding process applied to the unmasked version of the masked input image; and update parameters of the ground truth momentum encoding process based on an exponential moving average (EMA) of parameters of the trained primary encoding process.
3. The system of claim 2, wherein the pixel regression output is decoded from the encoded token data by a pixel regressor, and the feature prediction output is decoded from the encoded token data by a feature predictor; and wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: train the pixel regressor using the determined pixel regression loss; and train the feature predictor using the determined feature prediction loss; wherein training the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output includes training the primary encoding process using the determined pixel regression loss and the determined feature prediction loss.
4. The system of claim 3, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: obtain low-level feature data based on the visible portion of the masked input image from the primary encoding process; provide the obtained low-level feature data to the pixel regressor, wherein the provided low-level feature data is used for decoding the encoded token data into the pixel regression output; obtain high-level feature data based on the visible portion of the masked input image from the primary encoding process; and provide the obtained high-level feature data to the feature predictor, wherein the provided high-level feature data is used for decoding the encoded token data into the feature prediction output.
5. The system of claim 4, wherein the low-level feature data is obtained from a portion of the primary encoding process prior to a transformation subprocess of the primary encoding process; wherein the low-level feature data is provided to each block of the pixel regressor; wherein the high-level feature data is obtained from a portion of the primary encoding process after a transformation subprocess of the primary encoding process; and wherein the high-level feature data is provided to each block of the feature predictor.
6. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: receive a second masked input image; and generate an inpainted output image from the received second masked input image using the trained primary encoding process and at least one decoding process.
7. The system of claim 1, wherein the at least one memory and the computer program code is configured to, with the at least one processor, further cause the at least one processor to: receive an unmasked version of the masked input image; divide the received unmasked version of the masked input image into a set of non-overlapping patches; and apply a mask to a first subset of the set of non-overlapping patches, wherein the first subset of patches is a set of masked patches and a second subset of the set of non-overlapping patches is a set of visible patches; wherein the masked portion of the masked input image includes the set of masked patches, and the visible portion of the masked input image includes the set of visible patches; and wherein the encoded token data includes an encoded token for each visible patch of the set of visible patches.
8. A computerized method comprising: encoding, by a processor, using a primary encoding process, a visible portion of a masked input image into encoded token data, wherein the masked input image includes the visible portion and a masked portion; decoding, by the processor, the encoded token data into pixel regression output, the pixel regression output including inpainted image pixel data associated with the masked portion of the masked input image; decoding, by the processor, the encoded token data into feature predictor output, the feature predictor output including inpainted image feature data associated with the masked portion of the masked input image; applying a ground truth momentum encoding process to an unmasked version of the masked input image; training, by the processor, the primary encoding process using the inpainted image pixel data of the pixel regression output and the inpainted image feature data of the feature predictor output, whereby the primary encoding process is trained to encode structural features of input images into encoded token data; and updating the ground truth momentum encoding process based on changes made to the primary encoding process.
9. The computerized method of claim 8, further comprising: determining a pixel regression loss using the pixel regression output and pixel data of an unmasked version of the masked input image; determining a feature prediction loss using the feature prediction output and ground truth encoding output of the ground truth momentum encoding process applied to the unmasked version of the masked input image; and updating, by the processor, parameters of the ground truth momentum encoding process based on an exponential moving average (EMA) of parameters of the trained primary encoding process.
10. The computerized method of claim 9, wherein the pixel regression output is decoded from the encoded token data by a pixel regressor, and the feature prediction output is decoded from the encoded token data by a feature predictor; and the computerized method further comprising: training, by the processor, the pixel regressor using the determined pixel regression loss; and training, by the processor, the feature predictor
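Two steps recited in the claims lend themselves to a short illustration: dividing an image into non-overlapping patches and masking a subset (claim 7), and updating the momentum encoder's parameters as an exponential moving average of the primary encoder's parameters (claims 2 and 9). The following is a sketch under assumptions rather than the claimed implementation: the claims do not state how masked patches are chosen, what fraction is masked, or what decay rate the EMA uses, so random selection, `mask_ratio`, `decay`, and all function names here are hypothetical.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


def choose_mask(num_patches, mask_ratio, seed=0):
    """Return (masked_indices, visible_indices); random choice is an assumption."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return sorted(masked), visible


def ema_update(momentum_params, primary_params, decay=0.99):
    """Move each momentum parameter toward the matching primary parameter.

    new = decay * momentum + (1 - decay) * primary; the decay value is hypothetical.
    """
    return [decay * m + (1.0 - decay) * p
            for m, p in zip(momentum_params, primary_params)]


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    patches = split_into_patches(image, patch=2)
    masked, visible = choose_mask(len(patches), mask_ratio=0.5)
    # Only visible patches would be passed to the primary encoder,
    # giving one encoded token per visible patch.
    print(len(patches), masked, visible)
    print(ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9))
```

Because the momentum encoder is only ever updated through the moving average, its output changes slowly and can serve as a relatively stable target for the feature prediction loss.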
Training; Learning · CPC title
Dividing image into blocks, subimages or windows · CPC title
Retouching; Inpainting; Scratch removal · CPC title
using regression, e.g. by projecting features on hyperplanes · CPC title
Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title