What technology area does this patent fall under?

Primary CPC classification G06F16/285. Mapped technology areas include Physics.

When was this patent published?

Published Jun 11, 2019 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method for data cleansing

US10318554B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10318554-B2
Application number	US-201615246256-A
Country	US
Kind code	B2
Filing date	Aug 24, 2016
Priority date	Jun 20, 2016
Publication date	Jun 11, 2019
Grant date	Jun 11, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

System and method for data cleansing are disclosed. The method comprises receiving one or more data records pre-categorized into one or more categories. Identifying at least one concept associated with one or more data records, and grouping, the at least one concept associated with the one or more data records into a plurality of category lists based on the predefined category associated with each of the one or more data records. Determining, one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists. The method comprises replacing each of at least one common concept of the set of one or more common concepts associated with each intersection set by a category name based on an occurrence rate of the common concepts.
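The steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes a concept is a lowercase whitespace-separated word, that records arrive as `(text, category)` pairs, and that a common concept is replaced by the name of the category in which it occurs most often. The function names (`build_category_lists`, `intersection_sets`, `cleanse`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_category_lists(records):
    """Group concepts (assumed here: lowercase words) by each record's category."""
    lists = {}
    for text, category in records:
        lists.setdefault(category, Counter()).update(text.lower().split())
    return lists


def intersection_sets(category_lists):
    """Return the common concepts for every pair of category lists."""
    return {
        (a, b): set(category_lists[a]) & set(category_lists[b])
        for a, b in combinations(sorted(category_lists), 2)
    }


def cleanse(records):
    """Replace each common concept with a category name (illustrative rule)."""
    lists = build_category_lists(records)
    common = set().union(*intersection_sets(lists).values())
    # Assumption: pick the category where the concept is most frequent;
    # ties go to the alphabetically first category.
    replacement = {
        concept: max(sorted(lists), key=lambda cat: lists[cat][concept])
        for concept in common
    }
    return [
        (" ".join(replacement.get(w, w) for w in text.lower().split()), category)
        for text, category in records
    ]
```

For example, with two "hardware" records and one "software" record that all mention "printer", the word "printer" is the only common concept, and `cleanse` rewrites it to "hardware" in every record because it occurs more often in that category.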

First claim

Opening claim text (preview).

What is claimed is:
1. A method of data cleansing, the method comprising: receiving, by a data categorizer, one or more data records pre-categorized into one or more categories; identifying, by the data categorizer, at least one concept associated with the one or more data records; grouping, by the data categorizer, the at least one concept associated with the one or more data records into a plurality of category lists based on the pre-categorized one or more categories associated with each of the one or more data records; determining, by the data categorizer, one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists; and replacing, by the data categorizer, each of at least one common concept of the set of one or more common concepts associated with the each intersection set by at least one category name based on an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: assigning an intensity of confusion value to the one or more intersection sets based on a confusion matrix; identifying at least one intersection set from the one or more intersection sets having the intensity of confusion value higher than or equal to an intensity of confusion value of a largest intersection set; and replacing the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set by the at least one category name based on the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set.
2. The method of claim 1, wherein the at least one concept comprises one or more words.
3. The method of claim 1, further comprising determining the largest intersection set from the one or more intersection sets, wherein the largest intersection set includes a highest number of common concepts.
4. The method of claim 3, wherein the highest number of common concepts is replaced by the at least one category name based on an occurrence rate of each of the highest number of common concepts associated with the largest intersection set.
5. The method of claim 1, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set comprises: generating the confusion matrix of the pre-categorized one or more data records; identifying, from the confusion matrix, an intersection set associated with a maximum confusion value; and replacing the each of the at least one common concept associated with the intersection set with the maximum confusion value by the at least one category name based on the occurrence rate of the each of the at least one common concept associated with the intersection set with the maximum confusion value.
6. The method of claim 1, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: removing, by the data categorizer, the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set from the one or more data records; and appending, by the data categorizer, the at least one category name to the one or more data records, for improved data classification, wherein the appending the category name further comprises: computing, by the data categorizer an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set comprises the number of the plurality of category lists in which the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set occurs; and appending, by the data categorizer the category name based on the number of the plurality of category lists and occurrence rate in the one or more data records, wherein the category name comprises name of the predefined category in which the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set has the highest occurrence frequency, further wherein the occurrence frequency is the number of times, the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set occurs in each of the plurality of category lists.
7. A system for data cleansing, comprising: a hardware processor; and a memory storing instructions executable by the hardware processor for: receiving, by a data categorizer one or more data records pre-categorized into one or more categories; identifying, by the data categorizer, at least one concept associated with the one or more data records; grouping, by the data categorizer, the at least one concept associated with the one or more data records into a plurality of category lists based on the pre-categorized one or more categories associated with each of the one or more data records; determining, by the data categorizer, one or more intersection sets based on a comparison between each pair of the plurality of category lists, wherein each intersection set comprises a set of one or more common concepts associated with a pair of category lists; and replacing, by the data categorizer, each of at least one common concept of the set of one or more common concepts associated with the each intersection set by at least one category name based on an occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each intersection set further comprises: assigning an intensity of confusion value to the one or more intersection sets based on a confusion matrix; identifying at least one intersection set from the one or more intersection sets having the intensity of confusion value higher than or equal to an intensity of confusion value of a largest intersection set; and replacing the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set by the at least one category name based on the occurrence rate of the each of the at least one common concept of the set of one or more common concepts associated with the at least one intersection set.
8. The system of claim 7, wherein the at least one concept comprises one or more words.
9. The system of claim 7, further comprising determining the largest intersection set from the one or more intersection sets, wherein the largest intersection set includes a highest number of common concepts.
10. The system of claim 9, wherein the highest number of common concepts is replaced by the at least one category name based on an occurrence rate of each of the highest number of common concepts associated with the largest intersection set.
11. The system of claim 7, wherein replacing the each of the at least one common concept of the set of one or more common concepts associated with the each
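To make the selection logic in claims 1, 3 and 6 easier to follow, here is a small hedged sketch. The claims do not define how the intensity of confusion is derived from the confusion matrix, so this example assumes it is the sum of the two off-diagonal counts for a category pair. The data shapes and the names `occurrence_rate`, `dominant_category` and `select_confused_sets` are likewise assumptions made for illustration.

```python
def occurrence_rate(concept, category_lists):
    """Number of category lists in which the concept occurs (per claim 6)."""
    return sum(1 for counts in category_lists.values() if counts.get(concept, 0) > 0)


def dominant_category(concept, category_lists):
    """Category where the concept has the highest occurrence frequency.

    Ties go to the alphabetically first category (an arbitrary choice).
    """
    return max(
        sorted(category_lists),
        key=lambda cat: category_lists[cat].get(concept, 0),
    )


def select_confused_sets(intersections, confusion):
    """Keep intersection sets at least as confused as the largest one.

    intersections: {(cat_a, cat_b): set of common concepts}
    confusion:     {(actual, predicted): count}, a sparse confusion matrix
    """
    if not intersections:
        return {}

    def intensity(pair):
        a, b = pair
        # Assumption: intensity is the number of mix-ups in both directions.
        return confusion.get((a, b), 0) + confusion.get((b, a), 0)

    # Claim 3: the largest intersection set has the most common concepts.
    largest = max(intersections, key=lambda p: len(intersections[p]))
    threshold = intensity(largest)
    return {p: s for p, s in intersections.items() if intensity(p) >= threshold}
```

Under these assumptions, the concepts in the selected sets would then be replaced by `dominant_category(concept, ...)`, while `occurrence_rate` reports how many category lists share each concept.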

Assignees

Wipro Ltd

Inventors

Classifications

G06F16/35
Clustering; Classification · CPC title
G06F40/30
Semantic analysis · CPC title
G06F16/285 · Primary
Clustering or classification · CPC title
G06F16/215
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
G06F16/217
Database tuning (G06F16/2282 takes precedence; database performance monitoring G06F11/3409) · CPC title

Patent family

Related publications grouped by family.

Patent family 57133001

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10318554B2 cover?: System and method for data cleansing are disclosed. The method comprises receiving one or more data records pre-categorized into one or more categories. Identifying at least one concept associated with one or more data records, and grouping, the at least one concept associated with the one or more data records into a plurality of category lists based on the predefined category associated with e…
Who is the assignee on this patent?: Wipro Ltd

Related patents

Document segmentation, interpretation, and re-organization

System and method of data cleansing for improved data classification

Automatic taxonomy alignment

Discriminative language model training using a confusion matrix

Method for classifying unknown electronic documents based upon at least one classificaton