Document analysis using model intersections
US-2022245378-A1 · Aug 4, 2022 · US
US11568276B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11568276-B1 |
| Application number | US-202117411534-A |
| Country | US |
| Kind code | B1 |
| Filing date | Aug 25, 2021 |
| Priority date | Aug 25, 2021 |
| Publication date | Jan 31, 2023 |
| Grant date | Jan 31, 2023 |
Official abstract text for this publication.
An approach is provided in which a method, system, and program product create a plurality of page clusters in feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages. The method, system, and program product assign one of a plurality of machine learning models to each one of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models. The method, system, and program product identify one of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages, and transform the selected unstructured page into a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
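The assignment and routing steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the page clusters and training clusters already exist as lists of feature vectors, uses the cluster mean as the cluster center, and uses Euclidean distance as the "relationship in the feature space" (the closest-center reading that claim 5 describes). All names (`centroid`, `nearest`, `assign_models`, `route_page`, the cluster ids, and the model names) are hypothetical.

```python
from math import dist
from statistics import fmean


def centroid(vectors):
    """Return the mean of a non-empty list of equal-length vectors."""
    return tuple(fmean(axis) for axis in zip(*vectors))


def nearest(point, centers):
    """Return the key of the center closest to `point` (Euclidean distance)."""
    return min(centers, key=lambda key: dist(point, centers[key]))


def assign_models(page_clusters, training_clusters):
    """Map each page cluster id to the model whose training cluster center is closest."""
    training_centers = {model: centroid(vs) for model, vs in training_clusters.items()}
    return {
        cluster_id: nearest(centroid(vs), training_centers)
        for cluster_id, vs in page_clusters.items()
    }


def route_page(vector, page_clusters, assignment):
    """Find the page cluster closest to `vector` and return its assigned model."""
    page_centers = {cluster_id: centroid(vs) for cluster_id, vs in page_clusters.items()}
    return assignment[nearest(vector, page_centers)]


# Toy two-dimensional feature space (assumed data, for illustration only).
page_clusters = {
    "A": [(0.0, 0.1), (0.2, 0.0)],
    "B": [(5.0, 5.1), (4.8, 5.0)],
}
training_clusters = {
    "invoice_model": [(0.1, 0.0), (0.0, 0.2)],
    "form_model": [(5.0, 4.9), (5.2, 5.0)],
}

assignment = assign_models(page_clusters, training_clusters)
chosen = route_page((4.9, 4.9), page_clusters, assignment)
```

Here each page cluster receives exactly one model, and a new page is transformed by whichever model belongs to its nearest page cluster. The transformation itself (unstructured page to structured page) is left out because the abstract does not specify it.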
Opening claim text (preview).
The invention claimed is:
1. A computer-implemented method comprising: creating a plurality of page clusters in feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages; assigning one of a plurality of machine learning models to each one of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models; identifying one of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages; and transforming the selected unstructured page into a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
2. The method of claim 1 further comprising: dividing a plurality of unstructured documents into the plurality of unstructured pages; selecting one of the plurality of unstructured pages; defining a set of character areas and a corresponding set of positions in the selected unstructured page; and computing a set of character area feature vectors corresponding to the set of character areas based on their corresponding set of positions and a set of content within their corresponding character area.
3. The method of claim 2 further comprising: computing a selected one of the plurality of feature vectors for the selected unstructured page based on the set of character area feature vectors; and mapping the selected feature vector to the feature space.
4. The method of claim 3 further comprising: performing hierarchical clustering on the selected feature vector, wherein the hierarchical clustering further comprises: identifying one of a plurality of page cluster centers corresponding to the plurality of page clusters that is closest in feature space to the selected feature vector; and adding the selected feature vector to an identified one of the plurality of page clusters corresponding to the identified page cluster center.
5. The method of claim 1 further comprising: computing a plurality of page cluster centers based on the plurality of page clusters; computing a plurality of training cluster centers based on the plurality of training clusters; selecting one of the plurality of page cluster centers; identifying one of the plurality of training cluster centers closest to the selected page cluster center in the feature space; and assigning one of the plurality of machine learning models that corresponds to the identified training center cluster to the page cluster corresponding to the selected page cluster center.
6. The method of claim 1 further comprising: identifying a different one of the plurality of page clusters that corresponds to a different one of the plurality of unstructured pages; and transforming the different unstructured page into a different structured page using a different one of the plurality of machine learning models assigned to the different page cluster.
7. The method of claim 1 further comprising: training the selected machine learning model using a portion of the plurality of unstructured documents corresponding to the identified page cluster; performing the transforming using the trained machine learning model; and adding the trained machine learning model to the plurality of machine learning models.
8. The method of claim 1 wherein the plurality of unstructured pages comprises a plurality of unstructured page types, and wherein each one of the plurality of unstructured page types is assigned one of the plurality of machine learning models to perform the transforming.
9. An information handling system comprising: one or more processors; a memory coupled to at least one of the processors; a set of computer program instructions stored in the memory and executed by at least one of the processors in order to perform actions of: creating a plurality of page clusters in feature space from a plurality of feature vectors corresponding to a plurality of unstructured pages; assigning one of a plurality of machine learning models to each one of the plurality of page clusters based on a relationship in the feature space between the plurality of page clusters and a plurality of training clusters corresponding to the plurality of machine learning models; identifying one of the plurality of page clusters that corresponds to a selected one of the plurality of unstructured pages; and transforming the selected unstructured page into a structured page using a selected one of the plurality of machine learning models assigned to the identified page cluster.
10. The information handling system of claim 9 wherein the processors perform additional actions comprising: dividing a plurality of unstructured documents into the plurality of unstructured pages; selecting one of the plurality of unstructured pages; defining a set of character areas and a corresponding set of positions in the selected unstructured page; and computing a set of character area feature vectors corresponding to the set of character areas based on their corresponding set of positions and a set of content within their corresponding character area.
11. The information handling system of claim 10 wherein the processors perform additional actions comprising: computing a selected one of the plurality of feature vectors for the selected unstructured page based on the set of character area feature vectors; and mapping the selected feature vector to the feature space.
12. The information handling system of claim 11 wherein the processors perform additional actions comprising: performing hierarchical clustering on the selected feature vector, wherein the hierarchical clustering further comprises: identifying one of a plurality of page cluster centers corresponding to the plurality of page clusters that is closest in feature space to the selected feature vector; and adding the selected feature vector to an identified one of the plurality of page clusters corresponding to the identified page cluster center.
13. The information handling system of claim 9 wherein the processors perform additional actions comprising: computing a plurality of page cluster centers based on the plurality of page clusters; computing a plurality of training cluster centers based on the plurality of training clusters; selecting one of the plurality of page cluster centers; identifying one of the plurality of training cluster centers closest to the selected page cluster center in the feature space; and assigning one of the plurality of machine learning models that corresponds to the identified training center cluster to the page cluster corresponding to the selected page cluster center.
14. The information handling system of claim 9 wherein the processors perform additional actions comprising: identifying a different one of the plurality of page clusters that corresponds to a different one of the plurality of unstructured pages; and transforming the different unstructured page into a different structured page using a different one of the plurality of machine learning models assigned to the different page cluster.
15. The information handling system of claim 9 wherein the processors perform additional actions comprising: training the selected machine learning model using a portion of the plurality of unstructured documents corresponding to the identified page cluster; performing the transforming using the trained machine learning model; and adding the trained machine learning model to the plurality of machine lea
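Claims 2 and 3 describe computing a feature vector per character area from its position and content, then deriving one page-level feature vector from those. The claims do not say which features or which aggregation are used, so the following Python sketch is only one possible reading: it assumes a character area is a normalized bounding box plus its text, uses the box geometry and the share of digit characters as features, and averages the area vectors to get the page vector. The `CharacterArea` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class CharacterArea:
    """A text region on a page; coordinates are fractions of page size (assumed)."""
    x: float
    y: float
    width: float
    height: float
    text: str


def area_features(area):
    """Combine position features with one simple content feature (digit share)."""
    n = len(area.text)
    digit_ratio = sum(c.isdigit() for c in area.text) / n if n else 0.0
    return (area.x, area.y, area.width, area.height, digit_ratio)


def page_feature_vector(areas):
    """Average the character area feature vectors into one page-level vector."""
    vectors = [area_features(a) for a in areas]
    return tuple(fmean(column) for column in zip(*vectors))


# Toy page with two character areas (assumed data, for illustration only).
areas = [
    CharacterArea(0.1, 0.05, 0.5, 0.04, "Invoice 2021"),
    CharacterArea(0.1, 0.15, 0.3, 0.04, "Total"),
]
page_vector = page_feature_vector(areas)
```

The resulting `page_vector` is what would be mapped into the feature space and then attached to the page cluster with the closest center, as claim 4 describes. Averaging is chosen here only because it yields a fixed-length vector regardless of how many character areas a page contains.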