Document segmentation, interpretation, and re-organization
US-2018225259-A1 · Aug 9, 2018 · US
US10318554B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10318554-B2 |
| Application number | US-201615246256-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 24, 2016 |
| Priority date | Jun 20, 2016 |
| Publication date | Jun 11, 2019 |
| Grant date | Jun 11, 2019 |
Official abstract text for this publication.
A system and method for data cleansing are disclosed. The method comprises receiving one or more data records pre-categorized into one or more categories; identifying at least one concept associated with the one or more data records; and grouping the at least one concept associated with the one or more data records into a plurality of category lists based on the predefined category associated with each of the one or more data records. The method further comprises determining one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists. The method also comprises replacing each of at least one common concept of the set of one or more common concepts associated with each intersection set by a category name based on an occurrence rate of the common concepts.
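The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that a "concept" is a single lowercase word (claim 2 allows concepts made of one or more words), that a record is a `(text, category)` pair, and that a common concept is replaced in place by the name of the category in which it occurs most often (the "occurrence frequency" wording of claim 6). The confusion-matrix selection step of claim 1 is left out, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_category_lists(records):
    """Group word-level concepts by each record's pre-assigned category."""
    lists = {}
    for text, category in records:
        lists.setdefault(category, Counter()).update(text.lower().split())
    return lists


def pairwise_intersections(category_lists):
    """Return the set of common concepts for every pair of category lists."""
    return {
        (a, b): set(category_lists[a]) & set(category_lists[b])
        for a, b in combinations(sorted(category_lists), 2)
    }


def replacement_for(concept, category_lists):
    """Pick the category where the concept occurs most often.

    Assumption: ties are broken alphabetically by category name.
    """
    return max(sorted(category_lists), key=lambda c: category_lists[c][concept])


def cleanse(records):
    """Replace every concept shared by two or more categories with a category name."""
    category_lists = build_category_lists(records)
    common = set().union(*pairwise_intersections(category_lists).values())
    mapping = {c: replacement_for(c, category_lists) for c in common}
    return [
        (" ".join(mapping.get(w, w) for w in text.lower().split()), category)
        for text, category in records
    ]


# Made-up example records, for illustration only.
records = [
    ("printer driver crash", "hardware"),
    ("printer jam", "hardware"),
    ("driver update failed", "software"),
    ("driver install error", "software"),
    ("login failed", "access"),
    ("password reset failed", "access"),
]
cleansed = cleanse(records)
```

In this toy data, "driver" appears in both the hardware and software lists and "failed" in both the software and access lists, so those two words are the ones rewritten. Replacing in place is a simplification: claim 6 describes removing the common concept and appending the category name instead.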
Opening claim text (preview).
What is claimed is:

1. A method of data cleansing, the method comprising: receiving, by a data categorizer, one or more data records pre-categorized into one or more categories; identifying, by the data categorizer, at least one concept associated with the one or more data records; grouping, by the data categorizer, the at least one concept associated with the one or more data records into a plurality of category lists based on the pre-categorized one or more categories associated with each of the one or more data records; determining, by the data categorizer, one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists; and replacing, by the data categorizer, each of at least one common concept of the set of one or more common concepts associated with the each intersection set by at least one category name based on an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: assigning an intensity of confusion value to the one or more intersection sets based on a confusion matrix; identifying at least one intersection set from the one or more intersection sets having the intensity of confusion value higher than or equal to an intensity of confusion value of a largest intersection set; and replacing the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set by the at least one category name based on the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set.

2. The method of claim 1, wherein the at least one concept comprises one or more words.

3. The method of claim 1, further comprising determining the largest intersection set from the one or more intersection sets, wherein the largest intersection set includes a highest number of common concepts.

4. The method of claim 3, wherein the highest number of common concepts is replaced by the at least one category name based on an occurrence rate of each of the highest number of common concepts associated with the largest intersection set.

5. The method of claim 1, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set comprises: generating the confusion matrix of the pre-categorized one or more data records; identifying, from the confusion matrix, an intersection set associated with a maximum confusion value; and replacing the each of the at least one common concept associated with the intersection set with the maximum confusion value by the at least one category name based on the occurrence rate of the each of the at least one common concept associated with the intersection set with the maximum confusion value.

6. The method of claim 1, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: removing, by the data categorizer, the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set from the one or more data records; and appending, by the data categorizer, the at least one category name to the one or more data records, for improved data classification, wherein the appending the category name further comprises: computing, by the data categorizer an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set comprises the number of the plurality of category lists in which the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set occurs; and appending, by the data categorizer the category name based on the number of the plurality of category lists and occurrence rate in the one or more data records, wherein the category name comprises name of the predefined category in which the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set has the highest occurrence frequency, further wherein the occurrence frequency is the number of times, the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set occurs in each of the plurality of category lists.

7. A system for data cleansing, comprising: a hardware processor; and a memory storing instructions executable by the hardware processor for: receiving, by a data categorizer one or more data records pre-categorized into one or more categories; identifying, by the data categorizer, at least one concept associated with the one or more data records; grouping, by the data categorizer, the at least one concept associated with the one or more data records into a plurality of category lists based on the pre-categorized one or more categories associated with each of the one or more data records; determining, by the data categorizer, one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists; and replacing, by the data categorizer, each of at least one common concept of the set of one or more common concepts associated with the each intersection set by at least one category name based on an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: assigning an intensity of confusion value to the one or more intersection sets based on a confusion matrix; identifying at least one intersection set from the one or more intersection sets having the intensity of confusion value higher than or equal to an intensity of confusion value of a largest intersection set; and replacing the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set by the at least one category name based on the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set.

8. The system of claim 7, wherein the at least one concept comprises one or more words.

9. The system of claim 7, further comprising determining the largest intersection set from the one or more intersection sets, wherein the largest intersection set includes a highest number of common concepts.

10. The system of claim 9, wherein the highest number of common concepts is replaced by the at least one category name based on an occurrence rate of each of the highest number of common concepts associated with the largest intersection set.

11. The system of claim 7, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each
Clustering; Classification · CPC title
Semantic analysis · CPC title
Clustering or classification · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
Database tuning (G06F16/2282 takes precedence; database performance monitoring G06F11/3409) · CPC title