Data quality assessment

US9558230B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9558230-B2
Application number  US-201313764880-A
Country             US
Kind code           B2
Filing date         Feb 12, 2013
Priority date       Feb 12, 2013
Publication date    Jan 31, 2017
Grant date          Jan 31, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

According to one embodiment of the present invention, a system assesses the quality of column data. The system assigns a pre-defined domain to one or more columns of the data based on a validity condition for the domain, applies the validity condition for the domain assigned to a column to data values in the column to compute a data quality metric for the column, and computes and displays a metric for a group of columns based on the computed data quality metric of at least one column in the group. Embodiments of the present invention further include a method and computer program product for assessing the quality of column data in substantially the same manners described above.
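The flow the abstract describes (assign a domain to a column by its validity condition, score the column against that domain, then roll the column scores up to a group) can be sketched as follows. This is a minimal illustration, not the patented implementation: the domain names, the regular expression, the 0.8 assignment threshold, and the use of a plain average for the group metric are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Optional

# A validity condition is modeled as a predicate over one value (assumption).
Condition = Callable[[str], bool]

# Hypothetical pre-defined domains; neither name comes from the patent.
DOMAINS: Dict[str, Condition] = {
    "us_zip": lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v) is not None,
    "yes_no": lambda v: v in {"Y", "N"},
}


def validity_ratio(values: List[str], is_valid: Condition) -> float:
    """Fraction of values that satisfy the validity condition."""
    if not values:
        return 0.0
    return sum(1 for v in values if is_valid(v)) / len(values)


def assign_domain(values: List[str], threshold: float = 0.8) -> Optional[str]:
    """Pick the best-matching domain, or None if no domain reaches the threshold."""
    best_name, best_ratio = None, 0.0
    for name, is_valid in DOMAINS.items():
        ratio = validity_ratio(values, is_valid)
        if ratio >= threshold and ratio > best_ratio:
            best_name, best_ratio = name, ratio
    return best_name


def column_quality(values: List[str], domain_name: str) -> float:
    """Data quality metric for one column: share of values valid for its domain."""
    return validity_ratio(values, DOMAINS[domain_name])


def group_quality(columns: Dict[str, List[str]]) -> Optional[float]:
    """Group metric: average column metric over columns that received a domain."""
    scores = []
    for values in columns.values():
        domain_name = assign_domain(values)
        if domain_name is not None:
            scores.append(column_quality(values, domain_name))
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    table = {
        "zip": ["12345", "98765-4321", "abcde", "00501", "10001"],
        "flag": ["Y", "N", "Y", "Y"],
        "notes": ["hello", "x"],  # matches no domain, so it stays unassigned
    }
    for name, values in table.items():
        print(name, assign_domain(values))
    print("group metric:", group_quality(table))
```

Columns that match no domain are left out of the group metric here; a real system might instead count them against the score or report them separately.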

First claim

Opening claim text (preview).

What is claimed is:

1. A system for assessing the quality of data comprising: at least one processor and a metadata repository, wherein the at least one processor is configured to: apply a set of validity conditions for each of a plurality of pre-defined domains in the metadata repository to each of a plurality of columns of the data, wherein at least one pre-defined domain is associated with a plurality of rules specifying the set of validity conditions for data of that at least one pre-defined domain; assign pre-defined domains selected from among the plurality of pre-defined domains to corresponding columns of the data based on satisfaction of the set of validity conditions for the pre-defined domains, wherein each of at least two of the columns of the data remains unassigned to a corresponding domain; generate a group of two or more of the unassigned columns based on characteristics of the unassigned columns; create a new domain for the group of unassigned columns with a corresponding set of validity conditions and assign the columns of the group to the new domain; apply the set of validity conditions for the domain assigned to a column to data values in the column to compute a data quality metric for the column; and display a metric for one or more sets of columns that is computed based on the computed data quality metric of at least one column in each set of columns.

2. The system of claim 1, wherein the set of validity conditions of each domain assigned to a column of the data includes at least one of a list of valid values, a list of invalid values, a regular expression, a list of valid formats, a list of invalid formats, and a list of data rules that operate on values within a single column.

3. The system of claim 1, wherein the at least one processor is further configured to: compute a signature for each of a plurality of columns; assign two or more of the unassigned columns with similar computed signatures to the new domain, wherein the set of validity conditions for the new domain is based on characteristics of the signatures of the columns assigned to the new domain.

4. The system of claim 3, wherein the at least one processor is further configured to: name the new domain; and generate the set of validity conditions for the new domain by modifying validity criteria based on the characteristics of the signatures and defining a validity condition for the new domain.

5. The system of claim 1, wherein the at least one processor is further configured to: display values of a data column with indicia of which values comply with an original validity condition; update the display in response to a modification to the original validity condition; modify a validity condition to include a value in a list of invalid values in response to receiving an indication that the value is invalid; and assign the modified validity condition to a domain associated with the column.

6. The system of claim 1, wherein the one or more sets of columns comprise tables and databases; the metric for one or more sets of columns comprises: a measure of the number of values of a column not matching the set of validity conditions for the domain assigned to the column, a table metric, where the table metric is a measure of the number of rows of a table having at least one value in a column that does not match the set of validity conditions for the domain assigned to the column, a sum or average of table metrics for a group of tables, and a measure of the number of columns of a set of columns for which a domain having a set of validity conditions is defined; and displaying a metric comprises displaying a visual cue indicating that the metric satisfies a pre-determined threshold.

7. A computer program product for assessing data quality comprising: a computer readable storage device having computer readable program code embodied therewith for execution on a first processing system, the computer readable program code comprising computer readable program code configured to: apply a set of validity conditions for each of a plurality of pre-defined domains to each of a plurality of columns of the data, wherein at least one pre-defined domain is associated with a plurality of rules specifying the set of validity conditions for data of that at least one pre-defined domain; assign pre-defined domains selected from among the plurality of pre-defined domains to corresponding columns of the data based on satisfaction of the set of validity conditions for the pre-defined domains, wherein each of at least two of the columns of the data remains unassigned to a corresponding domain; generate a group of two or more of the unassigned columns based on characteristics of the unassigned columns; create a new domain for the group of unassigned columns with a corresponding set of validity conditions and assign the columns of the group to the new domain; apply the set of validity conditions for the domain assigned to a column to data values in the column to compute a data quality metric for the column; and display a metric for one or more sets of columns that is computed based on the computed data quality metric of at least one column in each set of columns.

8. The computer program product of claim 7, wherein the set of validity conditions of each domain assigned to a column of the data includes at least one of a list of valid values, a list of invalid values, a regular expression, a list of valid formats, a list of invalid formats, and a list of data rules that operate on values within a single column.

9. The computer program product of claim 7, wherein the computer readable program code is further configured to: compute a signature for each of a plurality of columns; assign two or more of the unassigned columns with similar computed signatures to the new domain, wherein the set of validity conditions for the new domain is based on characteristics of the signatures of the columns assigned to the new domain.

10. The computer program product of claim 9, wherein the computer readable program code is further configured to: name the new domain; and generate the set of validity conditions for the new domain by modifying validity criteria based on the characteristics of the signatures and defining a validity condition for the new domain.

11. The computer program product of claim 7, wherein the computer readable program code is further configured to: display values of a data column with indicia of which values comply with an original validity condition; update the display in response to a modification to the original validity condition; modify a validity condition to include a value in a list of invalid values in response to receiving an indication that the value is invalid; and assign the modified validity condition to a domain associated with the column.

12. The computer program product of claim 7, wherein the one or more sets of columns comprise tables and databases; the metric for one or more sets of columns comprises: a measure of the number of values of a column not matching the set of validity conditions for the domain assigned to the column, a table metric, where the table metric is a measure of the number of rows of a table having at least one value in a column that does not match the set of validity conditions for the domain assigned to the column, a sum or average of table metrics for a group of tables, and a measure of the number of columns of a set of columns for which a domain having a validity condition is defined; and displaying a metric comprises displaying a visual cue indicating that the metric satisfies a pre-determined threshold.
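Two ideas from the claims lend themselves to a short sketch: grouping unassigned columns by similar signatures to form a new domain (claims 3 and 9), and the table metric that counts rows with at least one invalid value (claims 6 and 12). The claim text shown here does not define how a signature is computed or compared, so the code below assumes one possible reading: a signature is the set of character-class formats seen in a column, and similarity is Jaccard overlap. The function names, the 0.5 similarity cutoff, and the greedy grouping are all hypothetical.

```python
from typing import Callable, Dict, List, Set


def value_format(value: str) -> str:
    """Map a value to a format mask: digits -> 9, letters -> A, others kept."""
    mask = []
    for ch in value:
        if ch.isdigit():
            mask.append("9")
        elif ch.isalpha():
            mask.append("A")
        else:
            mask.append(ch)
    return "".join(mask)


def signature(values: List[str]) -> Set[str]:
    """Assumed column signature: the set of format masks present in the column."""
    return {value_format(v) for v in values}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similar(columns: Dict[str, List[str]],
                  min_similarity: float = 0.5) -> List[List[str]]:
    """Greedily group columns whose signature is close to a group's first member."""
    groups: List[List[str]] = []
    for name, values in columns.items():
        sig = signature(values)
        for group in groups:
            if jaccard(sig, signature(columns[group[0]])) >= min_similarity:
                group.append(name)
                break
        else:
            groups.append([name])
    # The claims call for a group of two or more columns.
    return [g for g in groups if len(g) >= 2]


def new_domain_condition(columns: Dict[str, List[str]],
                         group: List[str]) -> Callable[[str], bool]:
    """Derive a validity condition for the new domain from the group's signatures."""
    allowed = set().union(*(signature(columns[name]) for name in group))
    return lambda v: value_format(v) in allowed


def table_metric(rows: List[Dict[str, str]],
                 conditions: Dict[str, Callable[[str], bool]]) -> int:
    """Count rows with at least one value failing its column's validity condition."""
    return sum(
        1 for row in rows
        if any(not is_valid(row[col]) for col, is_valid in conditions.items())
    )
```

With columns `part_a = ["AB-123", "CD-456"]` and `part_b = ["EF-789", "GH-012", "IJ-34"]`, the two signatures share one of two distinct masks, so `group_similar` places them in one group, and the derived condition accepts any value whose mask is `AA-999` or `AA-99`.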

Assignees

Inventors

Classifications

  • Physics · mapped topic

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Ensuring data consistency and integrity · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9558230B2 cover?
According to one embodiment of the present invention, a system assesses the quality of column data. The system assigns a pre-defined domain to one or more columns of the data based on a validity condition for the domain, applies the validity condition for the domain assigned to a column to data values in the column to compute a data quality metric for the column, and computes and displays a met…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/215 (improving data quality; data cleansing). Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 31, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).