System and method for data cleansing

US10318554B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10318554-B2
Application number  US-201615246256-A
Country             US
Kind code           B2
Filing date         Aug 24, 2016
Priority date       Jun 20, 2016
Publication date    Jun 11, 2019
Grant date          Jun 11, 2019

Abstract

Official abstract text for this publication.

A system and method for data cleansing are disclosed. The method comprises receiving one or more data records pre-categorized into one or more categories; identifying at least one concept associated with the one or more data records; and grouping the at least one concept into a plurality of category lists based on the predefined category associated with each of the one or more data records. The method further comprises determining one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists. Finally, the method comprises replacing at least one common concept of each intersection set with a category name based on an occurrence rate of the common concepts.
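
The grouping and comparison steps described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: a concept is taken to be a lowercase, whitespace-separated word (claim 2 allows concepts to be words), and the function names `build_category_lists` and `intersection_sets` as well as the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_category_lists(records):
    """Group the concepts (assumed here: lowercase words) of each record by category."""
    category_lists = defaultdict(list)
    for text, category in records:
        category_lists[category].extend(text.lower().split())
    return dict(category_lists)


def intersection_sets(category_lists):
    """Compare every pair of category lists and collect their common concepts."""
    result = {}
    for a, b in combinations(sorted(category_lists), 2):
        common = set(category_lists[a]) & set(category_lists[b])
        if common:  # pairs that share no concept are skipped
            result[(a, b)] = common
    return result


# Hypothetical pre-categorized records: (text, category)
records = [
    ("printer not working", "hardware"),
    ("email not working", "software"),
    ("reset email password", "access"),
]
lists = build_category_lists(records)
common = intersection_sets(lists)
```

With these sample records, `common` holds `{"email"}` for the access/software pair and `{"not", "working"}` for the hardware/software pair; the access/hardware pair shares no concept and is omitted.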

First claim

Opening claim text (preview).

What is claimed is:

1. A method of data cleansing, the method comprising: receiving, by a data categorizer, one or more data records pre-categorized into one or more categories; identifying, by the data categorizer, at least one concept associated with the one or more data records; grouping, by the data categorizer, the at least one concept associated with the one or more data records into a plurality of category lists based on the pre-categorized one or more categories associated with each of the one or more data records; determining, by the data categorizer, one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists; and replacing, by the data categorizer, each of at least one common concept of the set of one or more common concepts associated with the each intersection set by at least one category name based on an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: assigning an intensity of confusion value to the one or more intersection sets based on a confusion matrix; identifying at least one intersection set from the one or more intersection sets having the intensity of confusion value higher than or equal to an intensity of confusion value of a largest intersection set; and replacing the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set by the at least one category name based on the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set.

2. The method of claim 1, wherein the at least one concept comprises one or more words.

3. The method of claim 1, further comprising determining the largest intersection set from the one or more intersection sets, wherein the largest intersection set includes a highest number of common concepts.

4. The method of claim 3, wherein the highest number of common concepts is replaced by the at least one category name based on an occurrence rate of each of the highest number of common concepts associated with the largest intersection set.

5. The method of claim 1, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set comprises: generating the confusion matrix of the pre-categorized one or more data records; identifying, from the confusion matrix, an intersection set associated with a maximum confusion value; and replacing the each of the at least one common concept associated with the intersection set with the maximum confusion value by the at least one category name based on the occurrence rate of the each of the at least one common concept associated with the intersection set with the maximum confusion value.

6. The method of claim 1, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: removing, by the data categorizer, the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set from the one or more data records; and appending, by the data categorizer, the at least one category name to the one or more data records, for improved data classification, wherein the appending the category name further comprises: computing, by the data categorizer an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set comprises the number of the plurality of category lists in which the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set occurs; and appending, by the data categorizer the category name based on the number of the plurality of category lists and occurrence rate in the one or more data records, wherein the category name comprises name of the predefined category in which the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set has the highest occurrence frequency, further wherein the occurrence frequency is the number of times, the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set occurs in each of the plurality of category lists.

7. A system for data cleansing, comprising: a hardware processor; and a memory storing instructions executable by the hardware processor for: receiving, by a data categorizer one or more data records pre-categorized into one or more categories; identifying, by the data categorizer, at least one concept associated with the one or more data records; grouping, by the data categorizer, the at least one concept associated with the one or more data records into a plurality of category lists based on the pre-categorized one or more categories associated with each of the one or more data records; determining, by the data categorizer, one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists; and replacing, by the data categorizer, each of at least one common concept of the set of one or more common concepts associated with the each intersection set by at least one category name based on an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: assigning an intensity of confusion value to the one or more intersection sets based on a confusion matrix; identifying at least one intersection set from the one or more intersection sets having the intensity of confusion value higher than or equal to an intensity of confusion value of a largest intersection set; and replacing the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set by the at least one category name based on the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set.

8. The system of claim 7, wherein the at least one concept comprises one or more words.

9. The system of claim 7, further comprising determining the largest intersection set from the one or more intersection sets, wherein the largest intersection set includes a highest number of common concepts.

10. The system of claim 9, wherein the highest number of common concepts is replaced by the at least one category name based on an occurrence rate of each of the highest number of common concepts associated with the largest intersection set.

11. The system of claim 7, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each
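
The selection and replacement steps recited in claims 1 and 6 can be sketched in Python. This is an illustrative sketch only, not the patented implementation. The claims do not say how the intensity of confusion is derived from the confusion matrix, so the code assumes it is the sum of the two off-diagonal counts for a category pair. They also do not say how the occurrence rate gates a replacement, so the sketch only computes the rate and replaces every selected common concept. All function names and sample data are hypothetical.

```python
def select_confused_sets(intersections, confusion):
    """Keep intersection sets at least as confused as the largest intersection set.

    Assumption: intensity of pair (a, b) = confusion[a][b] + confusion[b][a].
    """
    if not intersections:
        return {}

    def intensity(pair):
        a, b = pair
        return confusion[a][b] + confusion[b][a]

    # Largest intersection set = the one with the most common concepts (claim 3).
    largest = max(intersections, key=lambda pair: len(intersections[pair]))
    threshold = intensity(largest)
    return {p: s for p, s in intersections.items() if intensity(p) >= threshold}


def occurrence_rate(concept, category_lists):
    """Number of category lists in which the concept occurs (claim 6)."""
    return sum(concept in concepts for concepts in category_lists.values())


def best_category(concept, category_lists):
    """Category with the highest occurrence frequency; ties go to the first category."""
    return max(category_lists, key=lambda c: category_lists[c].count(concept))


def cleanse(records, category_lists, selected):
    """Remove selected common concepts from each record and append category names."""
    to_replace = set().union(*selected.values())
    names = {c: best_category(c, category_lists) for c in to_replace}
    cleaned = []
    for text, category in records:
        kept, appended = [], []
        for word in text.lower().split():
            if word in names:
                if names[word] not in appended:  # append each name once per record
                    appended.append(names[word])
            else:
                kept.append(word)
        cleaned.append((" ".join(kept + appended), category))
    return cleaned


# Hypothetical sample data.
category_lists = {
    "hardware": ["printer", "not", "working", "printer", "not", "printing"],
    "software": ["email", "not", "working", "email", "stopped", "working"],
    "access": ["reset", "email", "password"],
}
intersections = {
    ("access", "software"): {"email"},
    ("hardware", "software"): {"not", "working"},
}
# Rows are true categories, columns are predicted categories (made-up counts).
confusion = {
    "hardware": {"hardware": 8, "software": 3, "access": 0},
    "software": {"hardware": 2, "software": 7, "access": 1},
    "access": {"hardware": 0, "software": 1, "access": 9},
}
records = [
    ("printer not working", "hardware"),
    ("email not working", "software"),
    ("reset email password", "access"),
]
selected = select_confused_sets(intersections, confusion)
cleaned = cleanse(records, category_lists, selected)
```

Here only the hardware/software set is selected, because its assumed intensity (3 + 2 = 5) is the threshold set by the largest intersection set, while the access/software pair scores 2. The word "email" therefore stays in the records, and "not" and "working" are replaced by the categories in which they occur most often.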

Assignees

Wipro Ltd

Inventors

Classifications

  • Clustering; Classification

  • Semantic analysis

  • G06F16/285 (primary): Clustering or classification

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

  • Database tuning (G06F16/2282 takes precedence; database performance monitoring G06F11/3409)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10318554B2 cover?
It covers a system and method for data cleansing in which concepts shared between the category lists of pre-categorized data records are replaced by category names based on their occurrence rate. The full text is given in the Abstract section of this page.
Who is the assignee on this patent?
Wipro Ltd
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date: June 11, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).