Semantic representation of text in document

US12374141B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12374141-B2
Application number: US-202017926996-A
Country: US
Kind code: B2
Filing date: Jun 12, 2020
Priority date: Jun 12, 2020
Publication date: Jul 29, 2025
Grant date: Jul 29, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

There is provided a solution for semantic representation of text in a document. In this solution, textual information comprising a sequence of text elements (220) and layout information (230) of the text element are determined from a document. The layout information (230) indicates a spatial arrangement of the plurality of text elements (220) presented within the document. Based at least in part on the plurality of text elements (220) and the layout information (230), respective semantic feature representations (180) of the plurality of text elements (220) are generated. By jointly using both the textual information and the layout information (230), rich semantics of the text elements (220) in the document can be effectively captured in the feature representations.
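To make the idea of jointly using text and layout concrete, here is a minimal sketch of one way the two signals could be merged into a single input vector per text element. The abstract does not say how they are combined; summing a word embedding with embeddings of the bounding-box coordinates is an assumption borrowed from common layout-aware language models. All names here (`TextElement`, `embed`, `DIM`, `GRID`) are hypothetical, the tables are random stand-ins for learned weights, and the output is only the input to a model, not the final semantic feature representation.

```python
import random
from dataclasses import dataclass

DIM = 8      # hypothetical embedding width
GRID = 1000  # hypothetical grid: coordinates are assumed normalized to 0..GRID


@dataclass
class TextElement:
    text: str
    box: tuple  # (x0, y0, x1, y1), integers in 0..GRID


def make_table(rows, seed):
    """Random stand-in for a learned embedding table (rows x DIM)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(rows)]


def embed(elements, vocab):
    """Return one vector per element: word embedding plus four position embeddings."""
    word_table = make_table(len(vocab), seed=0)
    x_table = make_table(GRID + 1, seed=1)  # shared by x0 and x1
    y_table = make_table(GRID + 1, seed=2)  # shared by y0 and y1
    vectors = []
    for el in elements:
        x0, y0, x1, y1 = el.box
        parts = [
            word_table[vocab[el.text]],
            x_table[x0],
            y_table[y0],
            x_table[x1],
            y_table[y1],
        ]
        # Element-wise sum, so layout shifts the representation of the same word.
        vectors.append([sum(values) for values in zip(*parts)])
    return vectors
```

With this sketch, the same word placed at two different positions on the page yields two different vectors, which is the property the abstract attributes to using layout information alongside text.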

First claim

Opening claim text (preview).

What is claimed is:

1. A device for determining a semantic representation of text in a document, comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform acts comprising: determining textual information presented in the document, the textual information comprising a plurality of text elements; determining layout information indicating a spatial arrangement of the plurality of text elements presented within the document; generating respective semantic feature representations of the plurality of text elements based at least in part on the plurality of text elements and the layout information; determining, using a visual information processing system, visual information of the textual information, the visual information indicating at least one: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; combining the respective semantic feature representations of the plurality of text elements with the visual information as an input to a decoder; and performing, using the decoder, a downstream processing task for document understanding based on the respective semantic feature representations of the plurality of text elements and the visual information, the document understanding comprising form understanding, receipt understanding, and document classification; wherein form understanding comprises extracting and structuring the textual content of forms; wherein receipt understanding comprises filling several pre-defined semantic slots according to the document; and wherein document classification is to predict the corresponding category for each document and assign one or more categorical labels to the document.

2. The device of claim 1, wherein the layout information indicates at least one of the following: respective positions of the plurality of text elements within the document, and a positioning range of the textual information within the document.

3. The device of claim 2, wherein the document comprises an image and the image comprises the plurality of text elements, and wherein the layout information comprises the respective positions of the plurality of text elements, and determining the layout information comprises: determining a plurality of bounding boxes bounding the plurality of text elements in the image; and determining respective positions of the plurality of bounding boxes in the image as the respective positions of the plurality of text elements.

4. The device of claim 1, wherein the acts further comprise: determining visual information indicating at least one of the following: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; and wherein generating the semantic feature representations further comprises: generating the semantic feature representations further based on the visual information.

5. The device of claim 4, wherein the visual information comprises at least one of the following: information of respective formats of the plurality of text elements, and information of a format of the document.

6. The device of claim 4, wherein the visual information indicates the respective visual appearances, and determining the visual information comprises: extracting a plurality of image blocks presenting the plurality of text elements in the document; and generating a plurality of visual feature representations characterizing the visual appearances of the plurality of image blocks.

7. The device of claim 1, wherein generating the semantic feature representations comprises: determining the semantic feature representations by applying the plurality of text elements and the layout information as inputs to a neural network.

8. The device of claim 7, wherein the neural network is pre-trained based on a plurality of sample text elements in a sample image and sample layout information indicating a layout of the plurality of sample text elements presented within the sample image, and wherein the pre-training of the neural network is performed by: masking at least one of the plurality of sample text elements, and training the neural network to predict the at least one masked sample text element given remaining ones of the plurality of sample text elements and the sample layout information.

9. A computer-implemented method for determining a semantic representation of text in a document comprising: determining textual information presented in the document, the textual information comprising a plurality of text elements; determining layout information indicating a spatial arrangement of the plurality of text elements presented within the document; generating respective semantic feature representations of the plurality of text elements based at least in part on the plurality of text elements and the layout information; determining, using a visual information processing system, visual information of the textual information, the visual information indicating at least one: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; combining the respective semantic feature representations of the plurality of text elements with the visual information as an input to a decoder; and performing, using the decoder, a downstream processing task for document understanding based on the respective semantic feature representations of the plurality of text elements and the visual information, the document understanding comprising form understanding, receipt understanding, and document classification; wherein form understanding comprises extracting and structuring the textual content of forms; wherein receipt understanding comprises filling several pre-defined semantic slots according to the document; and wherein document classification is to predict the corresponding category for each document and assign one or more categorical labels to the document.

10. The method of claim 9, wherein the layout information indicates at least one of the following: respective positions of the plurality of text elements within the document, and a positioning range of the textual information within the document.

11. The method of claim 10, wherein the document comprises an image and the image comprises the plurality of text elements, and wherein the layout information comprises the respective positions of the plurality of text elements, and determining the layout information comprises: determining a plurality of bounding boxes bounding the plurality of text elements in the image; and determining respective positions of the plurality of bounding boxes in the image as the respective positions of the plurality of text elements.

12. The method of claim 9, further comprising: determining visual information indicating at least one of the following: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; and wherein generating the semantic feature representations further comprises: generating the semantic feature representations further based on the visual information.

13. The method of claim 9, wherein the decoder comprises one or more aggregation layers.

14. The method of claim 9, wherein the decoder outputs an indication of a predefined label assigned to a text element.

15. The method of claim 9, further comprising: a
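Two of the dependent claims describe steps that are easy to illustrate: using bounding-box positions as layout information (claims 3 and 11) and pre-training by masking text elements while keeping their layout (claim 8). The following is a rough sketch of the data preparation only, not of the claimed neural network. The function names, the `[MASK]` placeholder, the 0..1000 coordinate grid, and the 15% mask ratio are all assumptions made for illustration; the claims only require that at least one sample text element be masked.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked text element


def normalize_box(box, width, height, grid=1000):
    """Scale a pixel bounding box (x0, y0, x1, y1) to an integer 0..grid range."""
    x0, y0, x1, y1 = box
    return (
        x0 * grid // width,
        y0 * grid // height,
        x1 * grid // width,
        y1 * grid // height,
    )


def mask_elements(texts, boxes, mask_ratio=0.15, seed=0):
    """Hide some text elements but keep every box, as in masked pre-training.

    Returns (inputs, boxes, targets), where targets maps each masked index
    to the original text the model would be trained to predict.
    """
    rng = random.Random(seed)
    # Mask at least one element when any exist, never more than are available.
    n_mask = min(len(texts), max(1, round(len(texts) * mask_ratio)))
    masked_idx = sorted(rng.sample(range(len(texts)), n_mask))
    inputs = list(texts)
    targets = {}
    for i in masked_idx:
        targets[i] = texts[i]
        inputs[i] = MASK
    # Layout is left untouched: the position of a masked element stays visible.
    return inputs, list(boxes), targets
```

A training loop would feed `inputs` and `boxes` to the network and compare its predictions at the masked indices against `targets`. Keeping the boxes visible is what lets the model learn to use position as a cue for the missing text.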

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • based on markings or identifiers characterising the document or the area

  • by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis

  • Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables

  • using neural networks

  • Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12374141B2 cover?
It covers a solution for semantic representation of text in a document. Textual information comprising a sequence of text elements, and layout information indicating the spatial arrangement of those elements, are determined from the document, and respective semantic feature representations of the text elements are generated from both. The independent claims further combine these representations with visual information as input to a decoder that performs form understanding, receipt understanding, and document classification.
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 29, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).