Knowledge-based question answering system for the diy domain
US-2020110835-A1 · Apr 9, 2020 · US
US12374141B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12374141-B2 |
| Application number | US-202017926996-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 12, 2020 |
| Priority date | Jun 12, 2020 |
| Publication date | Jul 29, 2025 |
| Grant date | Jul 29, 2025 |
Official abstract text for this publication.
There is provided a solution for semantic representation of text in a document. In this solution, textual information comprising a sequence of text elements (220) and layout information (230) of the text elements are determined from a document. The layout information (230) indicates a spatial arrangement of the plurality of text elements (220) presented within the document. Based at least in part on the plurality of text elements (220) and the layout information (230), respective semantic feature representations (180) of the plurality of text elements (220) are generated. By jointly using both the textual information and the layout information (230), rich semantics of the text elements (220) in the document can be effectively captured in the feature representations.
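The abstract's idea of using text and layout jointly can be illustrated with a minimal sketch. It assumes each text element comes with a pixel bounding box, normalizes the box onto a fixed grid, and sums a text embedding with four coordinate embeddings to form the input a neural network would consume. The names `TextElement`, `normalize_box`, `embed`, and `input_representations`, the grid size, the embedding width, and the choice to sum embeddings are all assumptions made for illustration; the abstract does not specify them, and `embed` is a deterministic stand-in for a learned lookup table.

```python
import random
from dataclasses import dataclass

DIM = 8      # hypothetical embedding width
GRID = 1000  # hypothetical size of the normalized coordinate grid


@dataclass
class TextElement:
    text: str
    box: tuple  # (x0, y0, x1, y1) in page pixels


def normalize_box(box, page_w, page_h, grid=GRID):
    """Scale pixel coordinates onto a fixed integer grid so page size does not matter."""
    x0, y0, x1, y1 = box
    return (x0 * grid // page_w, y0 * grid // page_h,
            x1 * grid // page_w, y1 * grid // page_h)


def embed(key, dim=DIM):
    """Deterministic stand-in for a learned embedding table lookup (assumption)."""
    rng = random.Random(key)  # string seeds give the same vector on every run
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def input_representations(elements, page_w, page_h):
    """Sum one text embedding with four coordinate embeddings per text element."""
    reps = []
    for el in elements:
        x0, y0, x1, y1 = normalize_box(el.box, page_w, page_h)
        parts = [embed(f"text:{el.text}"),
                 embed(f"x:{x0}"), embed(f"y:{y0}"),
                 embed(f"x:{x1}"), embed(f"y:{y1}")]
        reps.append([sum(column) for column in zip(*parts)])
    return reps


# The same word at two different positions on a 1000x500 page.
elements = [TextElement("Total", (100, 50, 300, 100)),
            TextElement("Total", (100, 400, 300, 450))]
reps = input_representations(elements, page_w=1000, page_h=500)
```

In this sketch the two occurrences of "Total" share a text embedding but receive different vectors because their vertical positions differ, which is the kind of layout signal the abstract refers to.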
Opening claim text (preview).
What is claimed is:
1. A device for determining a semantic representation of text in a document, comprising: a processing unit; and a memory coupled to the processing unit and having instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform acts comprising: determining textual information presented in the document, the textual information comprising a plurality of text elements; determining layout information indicating a spatial arrangement of the plurality of text elements presented within the document; generating respective semantic feature representations of the plurality of text elements based at least in part on the plurality of text elements and the layout information; determining, using a visual information processing system, visual information of the textual information, the visual information indicating at least one: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; combining the respective semantic feature representations of the plurality of text elements with the visual information as an input to a decoder; and performing, using the decoder, a downstream processing task for document understanding based on the respective semantic feature representations of the plurality of text elements and the visual information, the document understanding comprising form understanding, receipt understanding, and document classification; wherein form understanding comprises extracting and structuring the textual content of forms; wherein receipt understanding comprises filling several pre-defined semantic slots according to the document; and wherein document classification is to predict the corresponding category for each document and assign one or more categorical labels to the document.
2. The device of claim 1, wherein the layout information indicates at least one of the following: respective positions of the plurality of text elements within the document, and a positioning range of the textual information within the document.
3. The device of claim 2, wherein the document comprises an image and the image comprises the plurality of text elements, and wherein the layout information comprises the respective positions of the plurality of text elements, and determining the layout information comprises: determining a plurality of bounding boxes bounding the plurality of text elements in the image; and determining respective positions of the plurality of bounding boxes in the image as the respective positions of the plurality of text elements.
4. The device of claim 1, wherein the acts further comprise: determining visual information indicating at least one of the following: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; and wherein generating the semantic feature representations further comprises: generating the semantic feature representations further based on the visual information.
5. The device of claim 4, wherein the visual information comprises at least one of the following: information of respective formats of the plurality of text elements, and information of a format of the document.
6. The device of claim 4, wherein the visual information indicates the respective visual appearances, and determining the visual information comprises: extracting a plurality of image blocks presenting the plurality of text elements in the document; and generating a plurality of visual feature representations characterizing the visual appearances of the plurality of image blocks.
7. The device of claim 1, wherein generating the semantic feature representations comprises: determining the semantic feature representations by applying the plurality of text elements and the layout information as inputs to a neural network.
8. The device of claim 7, wherein the neural network is pre-trained based on a plurality of sample text elements in a sample image and sample layout information indicating a layout of the plurality of sample text elements presented within the sample image, and wherein the pre-training of the neural network is performed by: masking at least one of the plurality of sample text elements, and training the neural network to predict the at least one masked sample text element given remaining ones of the plurality of sample text elements and the sample layout information.
9. A computer-implemented method for determining a semantic representation of text in a document comprising: determining textual information presented in the document, the textual information comprising a plurality of text elements; determining layout information indicating a spatial arrangement of the plurality of text elements presented within the document; generating respective semantic feature representations of the plurality of text elements based at least in part on the plurality of text elements and the layout information; determining, using a visual information processing system, visual information of the textual information, the visual information indicating at least one: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; combining the respective semantic feature representations of the plurality of text elements with the visual information as an input to a decoder; and performing, using the decoder, a downstream processing task for document understanding based on the respective semantic feature representations of the plurality of text elements and the visual information, the document understanding comprising form understanding, receipt understanding, and document classification; wherein form understanding comprises extracting and structuring the textual content of forms; wherein receipt understanding comprises filling several pre-defined semantic slots according to the document; and wherein document classification is to predict the corresponding category for each document and assign one or more categorical labels to the document.
10. The method of claim 9, wherein the layout information indicates at least one of the following: respective positions of the plurality of text elements within the document, and a positioning range of the textual information within the document.
11. The method of claim 10, wherein the document comprises an image and the image comprises the plurality of text elements, and wherein the layout information comprises the respective positions of the plurality of text elements, and determining the layout information comprises: determining a plurality of bounding boxes bounding the plurality of text elements in the image; and determining respective positions of the plurality of bounding boxes in the image as the respective positions of the plurality of text elements.
12. The method of claim 9, further comprising: determining visual information indicating at least one of the following: respective visual appearances of the plurality of text elements presented in the document, and an overall visual appearance of the document; and wherein generating the semantic feature representations further comprises: generating the semantic feature representations further based on the visual information.
13. The method of claim 9, wherein the decoder comprises one or more aggregation layers.
14. The method of claim 9, wherein the decoder outputs an indication of a predefined label assigned to a text element.
15. The method of claim 9, further comprising: a
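Claim 8 describes pre-training in which at least one sample text element is masked and the network learns to predict it from the remaining text elements and the layout. The data-preparation side of that step might look like the following sketch. The function name `mask_elements`, the `[MASK]` symbol, the default masking ratio, and the random selection strategy are hypothetical; the claim only requires that at least one element be masked while the layout information stays available.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder symbol for a masked text element


def mask_elements(texts, boxes, mask_ratio=0.15, seed=0):
    """Hide some text elements but keep every bounding box visible.

    Returns (masked_texts, boxes, targets), where targets maps each masked
    index to the original text the network would be trained to predict.
    """
    if not texts or len(texts) != len(boxes):
        raise ValueError("texts and boxes must be non-empty and the same length")
    rng = random.Random(seed)
    # Always mask at least one element, as the claim requires.
    n_mask = max(1, round(len(texts) * mask_ratio))
    masked_idx = set(rng.sample(range(len(texts)), n_mask))
    masked_texts = [MASK if i in masked_idx else t for i, t in enumerate(texts)]
    targets = {i: texts[i] for i in sorted(masked_idx)}
    return masked_texts, list(boxes), targets


sample_texts = ["Invoice", "No.", "12345", "Total"]
sample_boxes = [(10, 10, 90, 30), (100, 10, 130, 30),
                (140, 10, 200, 30), (10, 400, 60, 420)]
masked, kept_boxes, targets = mask_elements(sample_texts, sample_boxes, mask_ratio=0.25)
```

Because the boxes are returned unchanged, a model trained on `masked` and `kept_boxes` to recover `targets` has to use position as well as surrounding text, which matches the intent of the claimed pre-training step.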
based on markings or identifiers characterising the document or the area · CPC title
by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis · CPC title
Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title
using neural networks · CPC title
Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title