System And Method For Extracting Structured Information From Implicit Tables
US-2020073878-A1 · Mar 5, 2020 · US
US12056948B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12056948-B2 |
| Application number | US-202117379154-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 19, 2021 |
| Priority date | Jul 19, 2021 |
| Publication date | Aug 6, 2024 |
| Grant date | Aug 6, 2024 |
Official abstract text for this publication.
In an approach, a processor identifies a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table. A processor classifies the plurality of text separators into a number of target clusters comprised in a target group based on property information related to the plurality of text separators, the number of target clusters corresponding to a number of separator types. A processor provides indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.
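The first two steps of the abstract (identifying separators, then classifying them into clusters) can be illustrated with a minimal sketch. It rests on assumptions that are not in the patent text: each text line is reduced to hypothetical `top` and `bottom` coordinates, the only property used is the vertical gap height, and the clustering is a plain 1-D k-means chosen for brevity. The names `TextLine`, `Separator`, `find_separators`, and `cluster_1d` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    top: float      # hypothetical y coordinate of the line's upper edge
    bottom: float   # hypothetical y coordinate of the line's lower edge
    text: str


@dataclass
class Separator:
    index: int      # separator lies between line `index` and line `index + 1`
    height: float   # assumed property: height of the non-text region


def find_separators(lines):
    """Return one separator per pair of consecutive text lines."""
    ordered = sorted(lines, key=lambda line: line.top)
    return [
        Separator(i, max(0.0, below.top - above.bottom))
        for i, (above, below) in enumerate(zip(ordered, ordered[1:]))
    ]


def cluster_1d(values, k, iters=20):
    """Assign each value to one of k clusters with a simple 1-D k-means.

    Centers start evenly spaced across the value range (an assumption made
    so the result is deterministic).
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest center wins.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


lines = [
    TextLine(0, 10, "row 1"),
    TextLine(12, 22, "row 2"),
    TextLine(24, 34, "row 3"),
    TextLine(44, 54, "row 4"),   # larger gap above this line
    TextLine(56, 66, "row 5"),
]
separators = find_separators(lines)
labels = cluster_1d([s.height for s in separators], k=2)
```

In this toy input the third separator is taller than the others, so it falls into its own cluster. Which cluster corresponds to which separator type is a separate step that the sketch does not cover.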
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: identifying, by one or more processors, a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table; classifying, by one or more processors, each text separator of the plurality of text separators into one of a plurality of target clusters, each target cluster corresponding to property information of a separator type; selecting, by one or more processors, a target group that includes the plurality of target clusters based on: determining, by one or more processors, whether each text separator of the plurality of separators separates text lines that meet a similarity threshold; and determining, by one or more processors, an accuracy level for the target group, the accuracy level based on a distribution of the text separators within the plurality of target clusters and the corresponding similarity threshold designation for each respective text separator within each respective target cluster; and providing, by one or more processors, indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.
2. The computer-implemented method of claim 1, wherein the property information comprises information about a selection of visual properties of the non-text regions defined by the plurality of text separators from the group consisting of: lines, bold lines, and dashed lines.
3. The computer-implemented method of claim 1, wherein the target group is selected from a plurality of candidate groups comprising different numbers of candidate clusters, and selecting the target group further comprises: for each of a plurality of candidate groups comprising different numbers of candidate clusters: classifying, by one or more processors, each text separator of the plurality of text separators into a certain number of candidate clusters comprised in the candidate group; and determining, by one or more processors, an overall accuracy level for the candidate group based on respective distributions of the similarities for text separators classified in the certain number of candidate clusters; and selecting, by one or more processors, the target group from the plurality of candidate groups based on the overall accuracy levels determined for the plurality of candidate groups.
4. The computer-implemented method of claim 3, wherein determining the overall accuracy level for the candidate group comprises: for each of the certain number of candidate clusters comprised in the candidate group, determining, by one or more processors, a distribution of the similarities for the text separators classified into the candidate cluster by: determining, by one or more processors, a first count of text separators that are classified into the candidate cluster and have the similarities above a predetermined threshold; and determining, by one or more processors, a second count of text separators that are classified into the candidate cluster and have the similarities below the predetermined threshold; for each of the certain number of candidate clusters, determining, by one or more processors, a cluster accuracy level for the candidate cluster based on the first count and the second count; and calculating, by one or more processors, the overall accuracy level for the candidate group by aggregating the cluster accuracy levels determined for the certain number of candidate clusters.
5. The computer-implemented method of claim 4, wherein determining the cluster accuracy level for the candidate cluster based on the first count and the second count comprises: calculating, by one or more processors, a ratio of a higher one of the first and second counts to a sum of the first and second counts; and determining, by one or more processors, the cluster accuracy level based on the ratio.
6. The computer-implemented method of claim 3, wherein selecting the target group from the plurality of candidate groups comprises: sorting, by one or more processors, the overall accuracy levels for the plurality of candidate groups; and selecting, by one or more processors, a candidate group with a highest overall accuracy level from the plurality of candidate groups, as the target group.
7. The computer-implemented method of claim 6, wherein selecting a candidate group with a highest overall accuracy level further comprises selecting, by one or more processor, a candidate group comprising a lowest number of candidate clusters.
8. The computer-implemented method of claim 1, further comprising: comparing, by one or more processor, the property information of the text separators classified into respective clusters of the plurality of target clusters with reference property information; and assigning, by one or more processor, the plurality of target clusters to be corresponding to each separator type, respectively, based on a result of the comparing.
9. The computer-implemented method of claim 1, wherein providing the indication information comprises: assigning, by one or more processor, a first separator type to at least one of the plurality of text separators classified in a first cluster of the plurality of target clusters, the first target cluster aligned with the first separator type; assigning, by one or more processor, a second separator type to at least one of the plurality of text separators classified in a second cluster of the plurality of target clusters, the second target cluster aligned with the second separator type; and providing, by one or more processor, the indication information to at least indicate the first and second separator types assigned to the text separators classified in the first and second target clusters.
10. A computer program product comprising: one or more computer readable storage devices, and program instructions collectively stored on the one or more computer readable storage devices, the program instructions comprising: program instructions to identify a plurality of text separators in a borderless table, a text separator of the plurality of text separators defining a non-text region between two consecutive text lines in the borderless table; program instructions to classify each text separator of the plurality of text separators into one of a plurality of target clusters, each target cluster corresponding to property information of a separator type; program instructions to select a target group that includes the plurality of target clusters based on: determining whether each text separator of the plurality of separators separates text lines that meet a similarity threshold; and determining an accuracy level for the target group, the accuracy level based on a distribution of the text separators within the plurality of target clusters and the corresponding similarity threshold designation for each respective text separator within each respective target cluster; and program instructions to provide indication information to indicate respective separator types of the plurality of text separators based on a result of the classifying.
11. The computer program product of claim 10, wherein the property information comprises information about a selection of visual properties of the non-text regions defined by the plurality of text separators from the group consisting of: lines, bold lines, and dashed lines.
12. The computer program product of claim 10, wherein the target group is selected from a plurality of candidate groups comprising different numbers of candidate clusters, and selecting the target gro
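The accuracy computation and group selection recited in claims 3 through 7 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claims say the cluster accuracy levels are aggregated without naming the aggregation, so the mean is used here as an assumption; a similarity exactly equal to the threshold is counted with the "below" group, which the claims leave open; and the function names and the `candidates` mapping (cluster count to per-separator labels) are hypothetical.

```python
from collections import defaultdict


def cluster_accuracy(first_count, second_count):
    """Ratio of the higher count to the sum of both counts (claim 5)."""
    total = first_count + second_count
    return max(first_count, second_count) / total if total else 0.0


def overall_accuracy(labels, similarities, threshold):
    """Aggregate per-cluster accuracy levels for one candidate group (claim 4).

    Assumption: aggregation is the arithmetic mean over non-empty clusters.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [above, not above]
    for label, similarity in zip(labels, similarities):
        # Assumption: similarity equal to the threshold counts as "not above".
        counts[label][0 if similarity > threshold else 1] += 1
    scores = [cluster_accuracy(above, below) for above, below in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


def select_target_group(candidates, similarities, threshold):
    """Pick the cluster count with the highest overall accuracy (claim 6),
    preferring the lowest number of clusters on a tie (claim 7)."""
    return min(
        candidates,
        key=lambda k: (-overall_accuracy(candidates[k], similarities, threshold), k),
    )


# Hypothetical similarities between the two text lines each separator divides.
similarities = [0.9, 0.8, 0.1, 0.85]
candidates = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 0],
    3: [0, 1, 2, 0],
}
best_k = select_target_group(candidates, similarities, threshold=0.5)
```

With these toy values the single-cluster group mixes high and low similarities, while the two-cluster and three-cluster groups are each internally uniform. The tie between them is resolved in favour of the smaller number of clusters.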
Clustering techniques · CPC title
Character recognition · CPC title
Classification of content, e.g. text, photographs or tables · CPC title
using character size, text spacings or pitch estimation · CPC title
Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title