Document structure extraction using machine learning

US11769072B2 · US · B2

Patent metadata
Publication number: US-11769072-B2
Application number: US-201615231294-A
Country: US
Kind code: B2
Filing date: Aug 8, 2016
Priority date: Aug 8, 2016
Publication date: Sep 26, 2023
Grant date: Sep 26, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-value pairs (for example, “font=Arial” or “alignment=centered”). Any suitable machine learning algorithm can be used to train the predictive model based on the document part feature vectors extracted from the training documents. Once the predictive model has been trained, it can receive feature-value pairs corresponding to a portion of an untagged document and make predictions with respect to how that document part should be categorized. The predictive model can therefore generate tag metadata that defines a structure of the untagged document in an automated fashion.
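
The abstract's training and prediction steps can be illustrated with a minimal sketch. It assumes a categorical naive Bayes classifier as a stand-in for "any suitable machine learning algorithm"; the patent does not name one. The training data, the feature names, and the functions `train` and `predict` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import math

# Hypothetical document part feature vectors: a category paired with
# feature-value pairs. None of this data comes from the patent.
TRAINING_VECTORS = [
    ("title", {"font": "Arial", "alignment": "centered", "size": "largest"}),
    ("title", {"font": "Arial", "alignment": "centered", "size": "largest"}),
    ("heading", {"font": "Arial", "alignment": "left", "size": "intermediate"}),
    ("heading", {"font": "Arial", "alignment": "left", "size": "intermediate"}),
    ("body paragraph", {"font": "Times", "alignment": "full", "size": "smallest"}),
    ("body paragraph", {"font": "Times", "alignment": "full", "size": "smallest"}),
    ("body paragraph", {"font": "Times", "alignment": "left", "size": "smallest"}),
]


def train(vectors):
    """Count categories and (feature, value) occurrences per category."""
    category_counts = Counter()
    pair_counts = defaultdict(Counter)
    values_per_feature = defaultdict(set)
    for category, pairs in vectors:
        category_counts[category] += 1
        for feature, value in pairs.items():
            pair_counts[category][(feature, value)] += 1
            values_per_feature[feature].add(value)
    return category_counts, pair_counts, values_per_feature


def predict(model, pairs):
    """Return (category, confidence) for one untagged document part."""
    category_counts, pair_counts, values_per_feature = model
    total = sum(category_counts.values())
    log_scores = {}
    for category, count in category_counts.items():
        score = math.log(count / total)
        for feature, value in pairs.items():
            # Laplace smoothing so an unseen value does not zero out a category.
            n_values = len(values_per_feature[feature] | {value})
            seen = pair_counts[category][(feature, value)]
            score += math.log((seen + 1) / (count + n_values))
        log_scores[category] = score
    # Turn log scores into normalized probabilities.
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


model = train(TRAINING_VECTORS)
category, confidence = predict(
    model, {"font": "Arial", "alignment": "centered", "size": "largest"}
)
print(category, round(confidence, 2))
```

The second return value is one possible reading of a per-part confidence level; a production system would more likely use an established library and a much larger tagged corpus.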

First claim

Opening claim text (preview).

What is claimed is:

1. A document structure extraction method comprising: receiving, by a document structure analytics server, an untagged document that comprises a plurality of document parts, wherein certain of the document parts have a visual appearance that is defined by formatting information included in the untagged document; receiving, by the document structure analytics server, a command to generate a table of contents for the untagged document; in response to receiving the command to generate the table of contents, invoking a document tagging process that comprises: identifying a document type category to which the untagged document belongs; extracting at least a portion of the formatting information from the untagged document; for each of two or more of the plurality of document parts, generating one or more feature-value pairs using the extracted formatting information, wherein each of the generated feature-value pairs characterizes the visual appearance of the corresponding document part by associating a particular value with a particular formatting feature; making a selection of a particular predictive model, from amongst a plurality of predictive models hosted by the document structure analytics server, wherein the selection is made based on the particular predictive model having been trained using a corpus of tagged training documents belonging to the identified document type category to which the untagged document belongs, and wherein each of the predictive models is configured to categorize document parts for documents sharing a common document type categorization for a respective predictive model; using the particular predictive model to predict a categorization for each of the two or more document parts that form part of the untagged document based on the corresponding one or more feature-value pairs, wherein the particular predictive model applies a machine learning algorithm to make predictions based on a collection of categorized feature-value pairs aggregated from, and characterizing document parts included in, the corpus of tagged training documents belonging to the identified document type category; and defining tag metadata that associates each of the two or more document parts with the corresponding predicted categorization generated by the particular predictive model; generating the table of contents based on the defined tag metadata, wherein the table of contents correlates a document part identified as a heading by the particular predictive model with a location of the heading within the untagged document; and modifying the untagged document to include the generated table of contents.

2. The document structure extraction method of claim 1, wherein one of the generated feature-value pairs associates a font size formatting feature with a particular font size value.

3. The document structure extraction method of claim 1, wherein the untagged document is received from a client computing device; and the method further comprises applying the tag metadata to the untagged document to produce a tagged document that includes the table of contents, and sending the tagged document that includes the table of contents to the client computing device.

4. The document structure extraction method of claim 1, wherein one of the generated feature-value pairs associates a font size formatting feature with a particular value that is selected from a group consisting of a largest font in the untagged document, an intermediate-sized font in the untagged document, and a smallest font in the untagged document.

5. The document structure extraction method of claim 1, wherein one of the generated feature-value pairs associates a font size formatting feature with a particular value that is selected from a group consisting of a font size that is larger than a preceding paragraph, a font size that is smaller than the preceding paragraph, a font size that is larger than a following paragraph, and a font size that is smaller than the following paragraph.

6. The document structure extraction method of claim 1, wherein the particular value defines the particular formatting feature in relation to a formatting feature for a second document part.

7. The document structure extraction method of claim 1, wherein the particular value is selected from a group consisting of left justification, center justification, right justification, and full justification; and the particular formatting feature is a paragraph alignment formatting feature.

8. The document structure extraction method of claim 1, the document tagging process further comprising using the particular predictive model to determine a confidence level in the categorization for at least some of the two or more document parts that form part of the untagged document.

9. The document structure extraction method of claim 1, wherein receiving the untagged document further comprises receiving, from a document viewer executing on a client computing device, the plurality of document parts and the formatting information.

10. The document structure extraction method of claim 1, wherein receiving the untagged document further comprises receiving, by the document structure analytics server, a plurality of untagged documents from a document management system.

11. The document structure extraction method of claim 1, further comprising embedding the tag metadata into the untagged document to produce a tagged document that also includes the table of contents.

12. The document structure extraction method of claim 1, further comprising embedding the tag metadata into the untagged document to produce a tagged document that also includes the table of contents, and sending the tagged document to a client computing device.

13. The document structure extraction method of claim 1, further comprising modifying the untagged document such that the visual appearance of at least some of the two or more document parts is further defined by the predicted categorization generated by the particular predictive model.

14. A non-transitory computer readable medium encoded with instructions that, when executed by one or more processors, cause a document structure analysis process to be invoked, the process comprising: identifying a plurality of training documents, each of which is associated with a particular document type category; accessing a particular one of the training documents, the particular training document comprising a plurality of document parts, wherein a particular one of the document parts has (a) a visual appearance defined by formatting information included in the particular training document, and (b) a document part categorization; generating, for the particular document part, one or more feature-value pairs using the formatting information, wherein each of the generated one or more feature-value pairs characterizes the visual appearance of the particular document part by correlating a particular value with a particular formatting feature, wherein a particular one of the generated feature-value pairs defines a proportion of content comprising the particular training document having a particular visual appearance; defining a document part feature vector that links the generated one or more feature-value pairs with the document part categorization, wherein the document part feature vector links a feature-value pair that correlates a document part comprising 90% or more of document content with a body paragraph categorization, and a feature-value pair that correlates a document part comprising less than 0.1% of document content with a title categorization; storing the document part
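
Two steps from the claim preview can be sketched in code: deriving relative feature-value pairs from formatting information (claims 4, 5, and 7) and building a table of contents from tag metadata (claim 1). This is an illustration under stated assumptions, not the claimed implementation. The part layout (`text`, `font_size`, `alignment`), the feature names, the `"same"` comparison value, and both function names are hypothetical. The tag metadata is written by hand here to stand in for a predictive model's output, and a part's list index stands in for its location in the document.

```python
# Hypothetical document parts with formatting information.
PARTS = [
    {"text": "Annual Report", "font_size": 24, "alignment": "center"},
    {"text": "Introduction", "font_size": 14, "alignment": "left"},
    {"text": "This year we expanded.", "font_size": 10, "alignment": "full"},
    {"text": "Results", "font_size": 14, "alignment": "left"},
    {"text": "Revenue grew.", "font_size": 10, "alignment": "full"},
]

# Hand-written stand-in for model output: part index -> predicted category.
TAG_METADATA = {
    0: "title",
    1: "heading",
    2: "body paragraph",
    3: "heading",
    4: "body paragraph",
}


def compare(size, other):
    """Describe one font size relative to another ("same" is an assumption)."""
    if size > other:
        return "larger"
    if size < other:
        return "smaller"
    return "same"


def generate_feature_value_pairs(parts):
    """Build one dict of feature-value pairs per document part."""
    sizes = [part["font_size"] for part in parts]
    largest, smallest = max(sizes), min(sizes)
    all_pairs = []
    for index, part in enumerate(parts):
        size = part["font_size"]
        # Rank the font size within the whole document.
        if size == largest:
            rank = "largest"
        elif size == smallest:
            rank = "smallest"
        else:
            rank = "intermediate"
        pairs = {"alignment": part["alignment"], "font_size_rank": rank}
        # Compare against neighbouring parts where they exist.
        if index > 0:
            pairs["vs_preceding"] = compare(size, parts[index - 1]["font_size"])
        if index < len(parts) - 1:
            pairs["vs_following"] = compare(size, parts[index + 1]["font_size"])
        all_pairs.append(pairs)
    return all_pairs


def build_table_of_contents(parts, tag_metadata):
    """Pair each part tagged as a heading with its location (the part index)."""
    return [
        (parts[index]["text"], index)
        for index in sorted(tag_metadata)
        if tag_metadata[index] == "heading"
    ]


for pairs in generate_feature_value_pairs(PARTS):
    print(pairs)
print(build_table_of_contents(PARTS, TAG_METADATA))
```

Expressing font size relative to the document and to neighbouring parts, rather than as an absolute point size, lets the same feature apply across documents that use different base sizes.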

Assignees

Inventors

Classifications

  • G06N20/00 (Primary)

    Machine learning · CPC title

  • Distances to prototypes · CPC title

  • Knowledge representation; Symbolic representation · CPC title

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title

  • Tagging; Marking up (details of markup languages G06F40/143); Designating a block; Setting of attributes (style sheets, e.g. eXtensible Stylesheet Language Transformation [XSLT], G06F40/154) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11769072B2 cover?
The structure of an untagged document can be derived using a predictive model that is trained in a supervised learning framework based on a corpus of tagged training documents. Analyzing the training documents results in a plurality of document part feature vectors, each of which correlates a category defining a document part (for example, “title” or “body paragraph”) with one or more feature-v…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 26, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).