Adaptive document understanding

US11568276B1 · US · B1

Patent metadata
Field               Value
Publication number  US-11568276-B1
Application number  US-202117411534-A
Country             US
Kind code           B1
Filing date         Aug 25, 2021
Priority date       Aug 25, 2021
Publication date    Jan 31, 2023
Grant date          Jan 31, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An approach is provided in which a method, system, and program create a plurality of page clusters in feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages. The method, system, and program product assign one of a plurality of machine learning models to each one of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models. The method, system, and program product identify one of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages, and transform the selected unstructured page into a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
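The flow the abstract describes can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: it assumes clusters are compared by the Euclidean distance between their centers (the wording of claim 5), and every name in it (`centroid`, `nearest`, `assign_models`, `transform_page`, the toy clusters, and the stand-in "models") is hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest(point: Vector, centers: Dict[str, Vector]) -> str:
    """Name of the center with the smallest Euclidean distance to point."""
    return min(centers, key=lambda name: math.dist(point, centers[name]))


def assign_models(
    page_clusters: Dict[str, List[Vector]],
    training_clusters: Dict[str, List[Vector]],
) -> Dict[str, str]:
    """Map each page cluster to the model whose training cluster center is closest."""
    training_centers = {m: centroid(vs) for m, vs in training_clusters.items()}
    return {
        cluster: nearest(centroid(vs), training_centers)
        for cluster, vs in page_clusters.items()
    }


def transform_page(
    page_vector: Vector,
    page_text: str,
    page_clusters: Dict[str, List[Vector]],
    assignment: Dict[str, str],
    models: Dict[str, Callable[[str], dict]],
) -> dict:
    """Find the page's cluster, then apply the model assigned to that cluster."""
    page_centers = {name: centroid(vs) for name, vs in page_clusters.items()}
    cluster = nearest(page_vector, page_centers)
    return models[assignment[cluster]](page_text)


# Hypothetical toy data: training clusters are keyed by the model trained on them.
TRAINING_CLUSTERS = {
    "invoice_model": [[0.9, 0.1], [1.0, 0.2]],
    "letter_model": [[0.1, 0.9], [0.2, 1.0]],
}
PAGE_CLUSTERS = {
    "cluster_a": [[0.8, 0.0], [0.9, 0.3]],
    "cluster_b": [[0.0, 0.8], [0.3, 0.9]],
}
# Stand-ins for trained models: each turns page text into a structured record.
MODELS = {
    "invoice_model": lambda text: {"type": "invoice", "text": text},
    "letter_model": lambda text: {"type": "letter", "text": text},
}

assignment = assign_models(PAGE_CLUSTERS, TRAINING_CLUSTERS)
structured = transform_page([0.7, 0.1], "Total: 42", PAGE_CLUSTERS, assignment, MODELS)
print(assignment)
print(structured)
```

In this sketch the model choice depends only on where a page falls in feature space, so pages of an unseen layout would reuse whichever existing model was trained on the most similar pages.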

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising: creating a plurality of page clusters in feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages; assigning one of a plurality of machine learning models to each one of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models; identifying one of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages; and transforming the selected unstructured page into a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.

2. The method of claim 1 further comprising: dividing a plurality of unstructured documents into the plurality of unstructured pages; selecting one of the plurality of unstructured pages; defining a set of character areas and a corresponding set of positions in the selected unstructured page; and computing a set of character area feature vectors corresponding to the set of character areas based on their corresponding set of positions and a set of content within their corresponding character area.

3. The method of claim 2 further comprising: computing a selected one of the plurality of feature vectors for the selected unstructured page based on the set of character area feature vectors; and mapping the selected feature vector to the feature space.

4. The method of claim 3 further comprising: performing hierarchical clustering on the selected feature vector, wherein the hierarchical clustering further comprises: identifying one of a plurality of page cluster centers corresponding to the plurality of page clusters that is closest in feature space to the selected feature vector; and adding the selected feature vector to an identified one of the plurality of page clusters corresponding to the identified page cluster center.

5. The method of claim 1 further comprising: computing a plurality of page cluster centers based on the plurality of page clusters; computing a plurality of training cluster centers based on the plurality of training clusters; selecting one of the plurality of page cluster centers; identifying one of the plurality of training cluster centers closest to the selected page cluster center in the feature space; and assigning one of the plurality of machine learning models that corresponds to the identified training center cluster to the page cluster corresponding to the selected page cluster center.

6. The method of claim 1 further comprising: identifying a different one of the plurality of page clusters that corresponds to a different one of the plurality of unstructured pages; and transforming the different unstructured page into a different structured page using a different one of the plurality of machine learning models assigned to the different page cluster.

7. The method of claim 1 further comprising: training the selected machine learning model using a portion of the plurality of unstructured documents corresponding to the identified page cluster; performing the transforming using the trained machine learning model; and adding the trained machine learning model to the plurality of machine learning models.

8. The method of claim 1 wherein the plurality of unstructured pages comprises a plurality of unstructured page types, and wherein each one of the plurality of unstructured page types is assigned one of the plurality of machine learning models to perform the transforming.

9. An information handling system comprising: one or more processors; a memory coupled to at least one of the processors; a set of computer program instructions stored in the memory and executed by at least one of the processors in order to perform actions of: creating a plurality of page clusters in feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages; assigning one of a plurality of machine learning models to each one of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models; identifying one of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages; and transforming the selected unstructured page into a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.

10. The information handling system of claim 9 wherein the processors perform additional actions comprising: dividing a plurality of unstructured documents into the plurality of unstructured pages; selecting one of the plurality of unstructured pages; defining a set of character areas and a corresponding set of positions in the selected unstructured page; and computing a set of character area feature vectors corresponding to the set of character areas based on their corresponding set of positions and a set of content within their corresponding character area.

11. The information handling system of claim 10 wherein the processors perform additional actions comprising: computing a selected one of the plurality of feature vectors for the selected unstructured page based on the set of character area feature vectors; and mapping the selected feature vector to the feature space.

12. The information handling system of claim 11 wherein the processors perform additional actions comprising: performing hierarchical clustering on the selected feature vector, wherein the hierarchical clustering further comprises: identifying one of a plurality of page cluster centers corresponding to the plurality of page clusters that is closest in feature space to the selected feature vector; and adding the selected feature vector to an identified one of the plurality of page clusters corresponding to the identified page cluster center.

13. The information handling system of claim 9 wherein the processors perform additional actions comprising: computing a plurality of page cluster centers based on the plurality of page clusters; computing a plurality of training cluster centers based on the plurality of training clusters; selecting one of the plurality of page cluster centers; identifying one of the plurality of training cluster centers closest to the selected page cluster center in the feature space; and assigning one of the plurality of machine learning models that corresponds to the identified training center cluster to the page cluster corresponding to the selected page cluster center.

14. The information handling system of claim 9 wherein the processors perform additional actions comprising: identifying a different one of the plurality of page clusters that corresponds to a different one of the plurality of unstructured pages; and transforming the different unstructured page into a different structured page using a different one of the plurality of machine learning models assigned to the different page cluster.

15. The information handling system of claim 9 wherein the processors perform additional actions comprising: training the selected machine learning model using a portion of the plurality of unstructured documents corresponding to the identified page cluster; performing the transforming using the trained machine learning model; and adding the trained machine learning model to the plurality of machine lea
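Claims 2 through 4 describe turning a page into a feature vector and placing it in the closest page cluster. The Python sketch below illustrates that sequence under loud assumptions: the claims do not say which features are computed, so the three used here (normalized x, normalized y, and the share of digits in the text) are invented for illustration, the page vector is assumed to be the mean of its character area vectors, and only the nearest-center step named in claim 4 is shown, not a full hierarchical clustering. All names (`CharacterArea`, `area_features`, `page_features`, `add_to_nearest_cluster`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CharacterArea:
    """A block of text on a page with its position (assumed normalized to 0..1)."""
    x: float
    y: float
    text: str


def area_features(area: CharacterArea) -> List[float]:
    """Feature vector from an area's position and content (illustrative features)."""
    digits = sum(ch.isdigit() for ch in area.text)
    digit_ratio = digits / len(area.text) if area.text else 0.0
    return [area.x, area.y, digit_ratio]


def page_features(areas: List[CharacterArea]) -> List[float]:
    """Page-level vector, assumed here to be the mean of the area vectors."""
    vectors = [area_features(a) for a in areas]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def add_to_nearest_cluster(
    vector: List[float],
    centers: Dict[str, List[float]],
    members: Dict[str, List[List[float]]],
) -> str:
    """Find the closest page cluster center and add the vector to that cluster."""
    name = min(centers, key=lambda c: math.dist(vector, centers[c]))
    members.setdefault(name, []).append(vector)
    return name


# Hypothetical page with two character areas.
page = [CharacterArea(0.1, 0.1, "Invoice 2021"), CharacterArea(0.8, 0.9, "1234")]
centers = {"numeric": [0.5, 0.5, 0.7], "text": [0.5, 0.5, 0.1]}
members: Dict[str, List[List[float]]] = {}

vector = page_features(page)
chosen = add_to_nearest_cluster(vector, centers, members)
print(vector, chosen)
```

A production system would likely recompute the cluster center after each addition; that update is left out here to keep the sketch short.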

Classifications

  • G06N5/022 · Primary

    Knowledge engineering; Knowledge acquisition · CPC title

  • G06F16/93 · Primary

    Document management systems · CPC title

  • Clustering or classification · CPC title

  • Data format conversion from or to a database · CPC title

  • Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11568276B1 cover?
An approach is provided in which a method, system, and program create a plurality of page clusters in feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages. The method, system, and program product assign one of a plurality of machine learning models to each one of the plurality of page clusters based on a relationship in the feature space between t…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N5/022. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 31, 2023 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).