US12333845B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12333845-B2 |
| Application number | US-202117528061-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 16, 2021 |
| Priority date | Nov 16, 2021 |
| Publication date | Jun 17, 2025 |
| Grant date | Jun 17, 2025 |
Official abstract text for this publication.
The technology described includes methods for pretraining a document encoder model based on multimodal self cross-attention. One method includes receiving image data that encodes a set of pretraining documents. A set of sentences is extracted from the image data. A bounding box for each sentence is generated. For each sentence, a set of predicted features is generated by using an encoder machine-learning model. The encoder model performs cross-attention between a set of masked-textual features for the sentence and a set of masked-visual features for the sentence. The set of masked-textual features is based on a masking function and the sentence. The set of masked-visual features is based on the masking function and the corresponding bounding box. A document-encoder model is pretrained based on the set of predicted features for each sentence and pretraining tasks. The pretraining tasks include masked sentence modeling, visual contrastive learning, or visual-language alignment.
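The masking and cross-attention steps named in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The helper names `mask_features` and `cross_attention` are hypothetical, masking is assumed to replace a feature vector with zeros, and attention is assumed to be single-head scaled dot-product attention without learned projections. The feature values are made-up toy data.

```python
import math
import random


def mask_features(features, mask_prob, rng, mask_vector=None):
    """Stochastically replace feature vectors with a mask vector (zeros by default).

    Assumption: the abstract only says a masking function is applied; replacing
    whole vectors with zeros is one simple choice among many.
    """
    dim = len(features[0])
    mask_vector = mask_vector or [0.0] * dim
    masked, flags = [], []
    for vec in features:
        hide = rng.random() < mask_prob
        masked.append(list(mask_vector) if hide else list(vec))
        flags.append(hide)
    return masked, flags


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query vector attends over the vectors of the other modality."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(len(q))]
        )
    return outputs


# Toy per-sentence features: one textual and one visual vector per sentence.
rng = random.Random(0)
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual_feats = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

masked_text, text_flags = mask_features(text_feats, 0.3, rng)
masked_visual, visual_flags = mask_features(visual_feats, 0.3, rng)

# Attention in both directions: text queries over visual keys, and the reverse.
text_to_visual = cross_attention(masked_text, masked_visual)
visual_to_text = cross_attention(masked_visual, masked_text)
```

In a pretraining setup, the outputs for masked positions would be compared against the unmasked features by a task such as masked sentence modeling. That loss is left out here because the abstract does not specify it.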
Opening claim text (preview).
What is claimed:
1. A non-transitory computer-readable storage medium having instructions stored thereon, which, when executed by a processor of a computing device cause the processor to perform actions comprising: receiving image data that encodes a document; extracting, from the image data, a set of sentences and a corresponding bounding box for each sentence of the set of sentences; for each sentence of the set of sentences, generating a set of predicted features using an encoder machine learning (ML) model that performs cross-attention between a set of masked-textual features for the sentence and a set of masked-visual features for the sentence, wherein the set of masked-textual features are based on a masking function and the sentence and the set of masked-visual features are based on the masking function and the corresponding bounding box for the sentence; and pretraining a document-encoder ML model based on the set of predicted features for each sentence of the set of sentences and one or more pretraining tasks including visual-language alignment to enforce alignment between text and image regions and jointly pretraining, in association with pretraining the document-encoder ML, an image encoder that derives visual features for semantic regions, wherein at least one visual feature comprises a table, a font size, a style, or a figure.
2. The computer-readable storage medium of claim 1, wherein the actions further comprise: for each sentence of the set of sentences, generating a textual embedding by using a sentence encoder model and a corresponding visual embedding by using a convolution model and a portion of the image data associated with the corresponding bounding box; and for each sentence of the set of sentences, generating the set of predicted features further based on the textual embedding for the sentence and the corresponding visual embedding.
3. The computer-readable storage medium of claim 2, wherein the actions further comprise: for each sentence of the set of sentences, generating the set of masked-textual features and the set of masked-visual features by using the masking function, the textual embedding for the sentence, and the corresponding visual embedding.
4. The computer-readable storage medium of claim 2, wherein generating a textual embedding for a sentence of the set of sentences comprises: generating a sentence embedding for the sentence by using the sentence encoding model and a multiset of tokens included in the sentence; generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; and generating the textual embedding for the sentence by using a combination of the sentence embedding and the position embedding for the bounding box.
5. The computer-readable storage medium of claim 2, wherein generating a corresponding visual embedding for a sentence of the set of sentences comprises: generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; generating a region-of-interest (ROI) embedding for the corresponding bounding box by using the convolution model and the portion of the image data associated with the corresponding bounding box; and generating the corresponding visual embedding for the sentence based on a combination of the ROI embedding and the position embedding for the bounding box.
6. The computer-readable storage medium of claim 5, wherein the actions further comprise: for each sentence of the set of sentences, generating the set of predicted features based further on the position embedding for the bounding box.
7. The computer-readable storage medium of claim 2, wherein the actions further comprise: for each sentence of the set of sentences, generating a corresponding set of visual representations by using a vector quantization method to discretize the corresponding visual embedding; and for each sentence of the set of sentences, generating the set of masked-visual features by applying the visual mask on the corresponding set of visual representations.
8. The one or more computer-readable storage media of claim 2, wherein generating the set of masked-textual features and the set of masked-visual features is further based on the masking function stochastically masking the textual embedding for the sentence and the corresponding visual embedding.
9. The one or more computer-readable storage media of claim 1, wherein the one or more pretraining tasks includes at least one of masked sentence modeling or a visual contrastive learning.
10. A method comprising: receiving image data that encodes a document; extracting, from the image data, a set of sentences and a corresponding bounding box for each sentence of the set of sentences; for each sentence of the set of sentences, generating a textual embedding by using a sentence encoder model and a corresponding visual embedding by using a convolution model and a portion of the image data associated with the corresponding bounding box; for each sentence of the set of sentences, generating a set of masked-textual features and a set of masked-visual features by using a masking function, the textual embedding for the sentence, and the corresponding visual embedding; for each sentence of the set of sentences, generating a set of predicted features by using an encoder machine learning (ML) model that performs cross-attention between the set of masked-textual features and the set of masked-visual features for the sentence; and pretraining a document-encoder ML model based on the set of predicted features for each sentence of the set of sentences and one or more pretraining tasks including masked sentence modeling, visual contrastive learning, and visual-language alignment to enforce alignment between text and image regions, and jointly pretraining, in association with pretraining the document-encoder ML, the convolution model that generates visual embeddings for semantic regions.
11. The method of claim 10, wherein generating a textual embedding for a sentence of the set of sentences comprises: generating a sentence embedding for the sentence by using the sentence encoding model and a multiset of tokens included in the sentence; generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; and generating the textual embedding for the sentence based on a combination of the sentence embedding and the position embedding for the bounding box.
12. The method of claim 10, wherein generating a corresponding visual embedding for a sentence of the set of sentences comprises: generating a position embedding for the corresponding bounding box based on a position, within the document, of the corresponding bounding box; generating a region-of-interest (ROI) embedding for the corresponding bounding box by using the convolution model and the portion of the image data associated with the corresponding bounding box; and generating the corresponding visual embedding for the sentence based on a combination of the ROI embedding and the position embedding for the bounding box.
13. The method of claim 12, wherein the actions further comprise: for each sentence of the set of sentences, generating the set of predicted features based further on the position embedding for the bounding box.
14. The method of claim 10, wherein the actions further comprise: for each sentence of the set of sentences, generating a corresponding set of visual representations by using a vector quantization method to disc
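Claims 4, 5, and 7 describe combining a content embedding with a bounding-box position embedding and discretizing a visual embedding by vector quantization. The sketch below is only one possible reading, under stated assumptions: the function names are hypothetical, the position embedding is assumed to be a sinusoidal encoding of normalized box coordinates, the "combination" is assumed to be element-wise addition, and quantization is assumed to be a nearest-neighbor lookup in a fixed codebook. The claims do not specify any of these choices.

```python
import math


def bbox_position_embedding(bbox, page_width, page_height, dim):
    """Sinusoidal embedding of a box (x0, y0, x1, y1) normalized by page size.

    Assumption: the claims only require an embedding based on the box position;
    the sinusoidal form and the page normalization are illustrative choices.
    """
    if dim % 8 != 0:
        raise ValueError("dim must be a multiple of 8 (4 coordinates x sin/cos pairs)")
    x0, y0, x1, y1 = bbox
    coords = [x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height]
    per_coord = dim // 4
    embedding = []
    for c in coords:
        for i in range(per_coord // 2):
            freq = 1.0 / (10000 ** (2 * i / per_coord))
            embedding.extend([math.sin(c * freq), math.cos(c * freq)])
    return embedding


def combine(content_embedding, position_embedding):
    """Element-wise sum (assumed form of the claimed 'combination')."""
    return [c + p for c, p in zip(content_embedding, position_embedding)]


def quantize(embedding, codebook):
    """Return (index, codeword) of the codebook entry nearest in squared distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(embedding, codebook[i]))
    return index, codebook[index]


# Toy usage with made-up values: an 8-dimensional ROI embedding for one sentence box.
position = bbox_position_embedding((0, 0, 100, 50), page_width=200, page_height=100, dim=8)
roi_embedding = [0.1] * 8
visual_embedding = combine(roi_embedding, position)

# A tiny hand-written codebook; a real one would be learned.
codebook = [[0.0] * 8, [1.0] * 8]
code_index, codeword = quantize(visual_embedding, codebook)
```

The discrete `code_index` gives the masking step a finite set of visual representations to hide and predict, which is one reason quantization is often paired with masked modeling.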
of extracted features · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Machine learning · CPC title
Semantic analysis · CPC title
Neural networks · CPC title