Unified pretraining framework for document understanding

US12333845B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12333845-B2
Application number: US-202117528061-A
Country: US
Kind code: B2
Filing date: Nov 16, 2021
Priority date: Nov 16, 2021
Publication date: Jun 17, 2025
Grant date: Jun 17, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The technology described includes methods for pretraining a document encoder model based on multimodal self cross-attention. One method includes receiving image data that encodes a set of pretraining documents. A set of sentences is extracted from the image data. A bounding box for each sentence is generated. For each sentence, a set of predicted features is generated by using an encoder machine-learning model. The encoder model performs cross-attention between a set of masked-textual features for the sentence and a set of masked-visual features for the sentence. The set of masked-textual features is based on a masking function and the sentence. The set of masked-visual features is based on the masking function and the corresponding bounding box. A document-encoder model is pretrained based on the set of predicted features for each sentence and pretraining tasks. The pretraining tasks includes masked sentence modeling, visual contrastive learning, or visual-language alignment.
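The masking and cross-attention steps summarized in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: it assumes plain scaled dot-product attention with no learned projections, a zero vector as the mask value, and hand-written toy feature vectors. The names `mask_features` and `cross_attention` are hypothetical.

```python
import math
import random

def mask_features(features, mask_prob, rng, mask_value=0.0):
    """Stochastically replace whole feature vectors with a constant mask vector.

    Returns the masked copy and a list of flags marking which rows were hidden.
    The zero mask value is an assumption made for this sketch.
    """
    masked, flags = [], []
    for vec in features:
        hide = rng.random() < mask_prob
        flags.append(hide)
        masked.append([mask_value] * len(vec) if hide else list(vec))
    return masked, flags

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from the other."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

# Toy stand-ins for per-sentence textual and visual features (3 sentences, dim 4).
text_features = [[0.2, 0.1, 0.0, 0.5], [0.9, 0.3, 0.4, 0.1], [0.0, 0.7, 0.2, 0.6]]
visual_features = [[0.5, 0.5, 0.1, 0.0], [0.1, 0.8, 0.3, 0.2], [0.4, 0.0, 0.9, 0.7]]

rng = random.Random(0)
masked_text, text_flags = mask_features(text_features, 0.3, rng)
masked_visual, visual_flags = mask_features(visual_features, 0.3, rng)

# Text attends over visual regions, and visual regions attend over text.
predicted_text = cross_attention(masked_text, masked_visual, masked_visual)
predicted_visual = cross_attention(masked_visual, masked_text, masked_text)
```

A pretraining objective such as masked sentence modeling would then compare the predicted features at the masked positions with the unmasked originals; that loss is omitted here.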

First claim

Opening claim text (preview).

What is claimed:

1. A non-transitory computer-readable storage medium having instructions stored thereon, which, when executed by a processor of a computing device cause the processor to perform actions comprising: receiving image data that encodes a document; extracting, from the image data, a set of sentences and a corresponding bounding box for each sentence of the set of sentences; for each sentence of the set of sentences, generating a set of predicted features using an encoder machine learning (ML) model that performs cross-attention between a set of masked-textual features for the sentence and a set of masked-visual features for the sentence, wherein the set of masked-textual features are based on a masking function and the sentence and the set of masked-visual features are based on the masking function and the corresponding bounding box for the sentence; and pretraining a document-encoder ML model based on the set of predicted features for each sentence of the set of sentences and one or more pretraining tasks including visual-language alignment to enforce alignment between text and image regions and jointly pretraining, in association with pretraining the document-encoder ML, an image encoder that derives visual features for semantic regions, wherein at least one visual feature comprises a table, a font size, a style, or a figure.

2. The computer-readable storage medium of claim 1, wherein the actions further comprise: for each sentence of the set of sentences, generating a textual embedding by using a sentence encoder model and a corresponding visual embedding by using a convolution model and a portion of the image data associated with the corresponding bounding box; and for each sentence of the set of sentences, generating the set of predicted features further based on the textual embedding for the sentence and the corresponding visual embedding.

3. The computer-readable storage medium of claim 2, wherein the actions further comprise: for each sentence of the set of sentences, generating the set of masked-textual features and the set of masked-visual features by using the masking function, the textual embedding for the sentence, and the corresponding visual embedding.

4. The computer-readable storage medium of claim 2, wherein generating a textual embedding for a sentence of the set of sentences comprises: generating a sentence embedding for the sentence by using the sentence encoding model and a multiset of tokens included in the sentence; generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; and generating the textual embedding for the sentence by using a combination of the sentence embedding and the position embedding for the bounding box.

5. The computer-readable storage medium of claim 2, wherein generating a corresponding visual embedding for a sentence of the set of sentences comprises: generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; generating a region-of-interest (ROI) embedding for the corresponding bounding box by using the convolution model and the portion of the image data associated with the corresponding bounding box; and generating the corresponding visual embedding for the sentence based on a combination of the ROI embedding and the position embedding for the bounding box.

6. The computer-readable storage medium of claim 5, wherein the actions further comprise: for each sentence of the set of sentences, generating the set of predicted features based further on the position embedding for the bounding box.

7. The computer-readable storage medium of claim 2, wherein the actions further comprise: for each sentence of the set of sentences, generating a corresponding set of visual representations by using a vector quantization method to discretize the corresponding visual embedding; and for each sentence of the set of sentences, generating the set of masked-visual features by applying the visual mask on the corresponding set of visual representations.

8. The one or more computer-readable storage media of claim 2, wherein generating the set of masked-textual features and the set of masked-visual features is further based on the masking function stochastically masking the textual embedding for the sentence and the corresponding visual embedding.

9. The one or more computer-readable storage media of claim 1, wherein the one or more pretraining tasks includes at least one of masked sentence modeling or a visual contrastive learning.

10. A method comprising: receiving image data that encodes a document; extracting, from the image data, a set of sentences and a corresponding bounding box for each sentence of the set of sentences; for each sentence of the set of sentences, generating a textual embedding by using a sentence encoder model and a corresponding visual embedding by using a convolution model and a portion of the image data associated with the corresponding bounding box; for each sentence of the set of sentences, generating a set of masked-textual features and a set of masked-visual features by using a masking function, the textual embedding for the sentence, and the corresponding visual embedding; for each sentence of the set of sentences, generating a set of predicted features by using an encoder machine learning (ML) model that performs cross-attention between the set of masked-textual features and the set of masked-visual features for the sentence; and pretraining a document-encoder ML model based on the set of predicted features for each sentence of the set of sentences and one or more pretraining tasks including masked sentence modeling, visual contrastive learning, and visual-language alignment to enforce alignment between text and image regions, and jointly pretraining, in association with pretraining the document-encoder ML, the convolution model that generates visual embeddings for semantic regions.

11. The method of claim 10, wherein generating a textual embedding for a sentence of the set of sentences comprises: generating a sentence embedding for the sentence by using the sentence encoding model and a multiset of tokens included in the sentence; generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; and generating the textual embedding for the sentence based on a combination of the sentence embedding and the position embedding for the bounding box.

12. The method of claim 10, wherein generating a corresponding visual embedding for a sentence of the set of sentences comprises: generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; generating a region-of-interest (ROI) embedding for the corresponding bounding box by using the convolution model and the portion of the image data associated with the corresponding bounding box; and generating the corresponding visual embedding for the sentence based on a combination of the ROI embedding and the position embedding for the bounding box.

13. The method of claim 12, wherein the actions further comprise: for each sentence of the set of sentences, generating the set of predicted features based further on the position embedding for the bounding box.

14. The method of claim 10, wherein the actions further comprise: for each sentence of the set of sentences, generating a corresponding set of visual representations by using a vector quantization method to disc
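Claims 4, 5, and 7 describe building each embedding from a content part plus a position embedding for the bounding box, and discretizing the visual embedding by vector quantization. The sketch below illustrates one way this could look. It rests on several assumptions not stated in the claims: the "combination" is an element-wise sum, the position embedding is just the normalized box coordinates tiled to the embedding size, and quantization is a nearest-neighbor lookup in a fixed codebook. All function names and numeric values are hypothetical.

```python
def position_embedding(box, page_width, page_height, dim):
    """Hypothetical position embedding: normalized (x0, y0, x1, y1) tiled to `dim`."""
    x0, y0, x1, y1 = box
    coords = [x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height]
    return [coords[i % 4] for i in range(dim)]

def combine(content_embedding, pos_embedding):
    """Element-wise sum; one possible reading of 'a combination' in the claims."""
    return [c + p for c, p in zip(content_embedding, pos_embedding)]

def quantize(vector, codebook):
    """Vector quantization: index of the nearest codebook entry (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

# Toy inputs. In the claimed system these would come from a sentence encoder
# model and a convolution model applied to the image region inside the box.
sentence_embedding = [0.2, -0.1, 0.4, 0.0]
roi_embedding = [0.6, 0.3, -0.2, 0.1]
box = (50, 100, 550, 130)  # x0, y0, x1, y1 in pixels on a 600 x 800 page

pos = position_embedding(box, page_width=600, page_height=800, dim=4)
textual_embedding = combine(sentence_embedding, pos)
visual_embedding = combine(roi_embedding, pos)

# A tiny fixed codebook; a real one would be learned during pretraining.
codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.5, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
visual_code = quantize(visual_embedding, codebook)
```

Because both embeddings share the same position embedding, the text and image region for a sentence stay tied to the same location on the page, which is what an alignment objective between text and image regions relies on.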

Assignees

Adobe Inc

Inventors

Classifications

  • of extracted features · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Machine learning · CPC title

  • G06F40/30 (Primary)

    Semantic analysis · CPC title

  • Neural networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12333845B2 cover?
The technology described includes methods for pretraining a document encoder model based on multimodal self cross-attention. One method includes receiving image data that encodes a set of pretraining documents. A set of sentences is extracted from the image data. A bounding box for each sentence is generated. For each sentence, a set of predicted features is generated by using an encoder machin…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 17, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).