Automated selection and ordering of data quality rules during data ingestion

US12353375B2 · US · B2

Patent metadata
Publication number: US-12353375-B2
Application number: US-202318505942-A
Country: US
Kind code: B2
Filing date: Nov 9, 2023
Priority date: Nov 9, 2023
Publication date: Jul 8, 2025
Grant date: Jul 8, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Selecting and ordering the execution of data quality rules includes generating a snapshot of a table-formatted dataset. The snapshot comprises a reduced number of rows of the dataset such that each column variation of the dataset is included in the snapshot. A predetermined collection of data quality (DQ) rules is executed on the snapshot. One or more performance statistics is determined for each of the DQ rules. The performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency. Based on the performance statistics, a subset of the DQ rules is generated. Each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency. An ordered subset of selected DQ rules is generated by ordering the application of each of the subset of DQ rules selected. The ordering specifies a sequence for executing each selected DQ rule.
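
The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes "each column variation" means every distinct (column, value) pair appears in at least one snapshot row, builds the snapshot with a greedy cover, uses the snapshot hit rate as the only performance statistic, and orders rules by hit rate per unit cost. The function names, the example rules, and the per-rule cost figures are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
# Each rule is (predicate that returns True on a deficiency, hypothetical relative cost).
Rule = Tuple[Callable[[Row], bool], float]


def make_snapshot(rows: List[Row]) -> List[Row]:
    """Greedy cover: repeatedly keep the row that adds the most unseen (column, value) pairs."""
    uncovered = {pair for row in rows for pair in row.items()}
    remaining = list(rows)
    snapshot: List[Row] = []
    while uncovered:
        best = max(remaining, key=lambda r: len(uncovered & set(r.items())))
        snapshot.append(best)
        uncovered -= set(best.items())
        remaining.remove(best)
    return snapshot


def hit_rates(rules: Dict[str, Rule], snapshot: List[Row]) -> Dict[str, float]:
    """Fraction of snapshot rows on which each rule reports a deficiency."""
    return {
        name: sum(1 for row in snapshot if check(row)) / len(snapshot)
        for name, (check, _cost) in rules.items()
    }


def select_and_order(rules: Dict[str, Rule], rates: Dict[str, float],
                     min_rate: float = 0.0) -> List[str]:
    """Keep rules whose hit rate exceeds min_rate, then sort by hit rate per unit cost."""
    selected = [name for name, rate in rates.items() if rate > min_rate]
    return sorted(selected, key=lambda name: rates[name] / rules[name][1], reverse=True)


# Illustrative data only.
rows = [
    {"country": "US", "age": 34},
    {"country": "US", "age": -5},
    {"country": None, "age": 34},
    {"country": None, "age": -5},
    {"country": "US", "age": 34},
    {"country": "DE", "age": 34},
]
rules: Dict[str, Rule] = {
    "country_missing": (lambda r: r["country"] is None, 1.0),
    "age_negative": (lambda r: r["age"] < 0, 2.0),
    "age_over_150": (lambda r: r["age"] > 150, 1.0),
}

snapshot = make_snapshot(rows)
rates = hit_rates(rules, snapshot)
ordered = select_and_order(rules, rates)
print(len(snapshot), ordered)
```

In this sketch a rule that never fires on the snapshot is dropped, and cheaper rules with the same hit rate run first. The patent text only says that selection depends on the likelihood of detecting a deficiency; the specific threshold and sort key here are illustrative choices.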

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: generating a snapshot of a table-formatted dataset, wherein the snapshot provides a sample comprising a reduced number of rows of the table-formatted dataset such that each column variation of the table-formatted dataset is included in the snapshot; executing a predetermined collection of data quality (DQ) rules on the snapshot; determining one or more performance statistics for each of the DQ rules, wherein the performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency; generating, based on the performance statistics, a subset of the DQ rules, wherein each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency; and generating an order of executing the subset of DQ rules selected, wherein the order generated specifies a sequence for applying each DQ rule of the subset to the table-formatted dataset.

2. The computer-implemented method of claim 1, further comprising: applying each of ordered subset of selected DQ rules to the table-formatted dataset, wherein the applying is performed in accordance with the ordering; and flagging each row for which a data deficiency is detected by a DQ rule belonging to the ordered subset of DQ rules.

3. The computer-implemented method of claim 1, further comprising: generating an updated snapshot in response to a newly received dataset; and generating, based on the updated snapshot, another ordered subset of selected DQ rules.

4. The computer-implemented method of claim 1, further comprising: generating an explanation for selecting and ordering execution of each DQ rule belonging to the ordered subset of DQ rules.

5. The computer-implemented method of claim 1, further comprising: performing data cleaning on each row of the table-formatted dataset having a quality deficiency detected by a DQ rule belonging to the ordered subset of DQ rules, wherein each row is marked for cleaning when a data deficiency is first detected by one of the DQ rules of the ordered subset of DQ rules and no other of the DQ rules executes on the row after the data deficiency is first detected.

6. The computer-implemented method of claim 1, wherein the generating a snapshot includes constructing a bipartite graph comprising two sets of vertices, one set of vertices corresponding to rows of the table-formatted dataset, and one set of vertices corresponding to cell values for each column of the table-formatted dataset.

7. The computer-implemented method of claim 1, wherein the ordering is based on minimizing processing time for detecting one or more quality deficiencies of one or more rows of the table-formatted dataset.

8. A system, comprising: one or more processors configured to initiate operations including: generating a snapshot of a table-formatted dataset, wherein the snapshot provides a sample comprising a reduced number of rows of the table-formatted dataset such that each column variation of the table-formatted dataset is included in the snapshot; executing a predetermined collection of data quality (DQ) rules on the snapshot; determining one or more performance statistics for each of the DQ rules, wherein the performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency; generating, based on the performance statistics, a subset of the DQ rules, wherein each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency; and generating an order of executing the subset of DQ rules selected, wherein the order generated specifies a sequence for applying each DQ rule of the subset to the table-formatted dataset.

9. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: applying each of ordered subset of selected DQ rules to the table-formatted dataset, wherein the applying is performed in accordance with the ordering; and flagging each row for which a data deficiency is detected by a DQ rule belonging to the ordered subset of DQ rules.

10. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: generating an updated snapshot in response to a newly received dataset; and generating, based on the updated snapshot, another ordered subset of selected DQ rules.

11. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: generating an explanation for selecting and ordering execution of each DQ rule belonging to the ordered subset of DQ rules.

12. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: performing data cleaning on each row of the table-formatted dataset having a quality deficiency detected by a DQ rule belonging to the ordered subset of DQ rules, wherein a row is marked for cleaning when a data deficiency is first detected by one of the DQ rules belong to the ordered subset of DQ rules and no other of the DQ rules executes on the row after the data deficiency is first detected.

13. The system of claim 8, wherein the generating a snapshot includes constructing a bipartite graph comprising two sets of vertices, one set of vertices corresponding to rows of the table-formatted dataset, and one set of vertices corresponding to cell values for each column of the table-formatted dataset.

14. A computer program product, the computer program product comprising: one or more computer-readable storage media and program instructions collectively stored on the one or more computer-readable storage media, the program instructions executable by a processor to cause the processor to initiate operations including: generating a snapshot of a table-formatted dataset, wherein the snapshot provides a sample comprising a reduced number of rows of the table-formatted dataset such that each column variation of the table-formatted dataset is included in the snapshot; executing a predetermined collection of data quality (DQ) rules on the snapshot; determining one or more performance statistics for each of the DQ rules, wherein the performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency; generating, based on the performance statistics, a subset of the DQ rules, wherein each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency; and generating an order of executing the subset of DQ rules selected, wherein the order generated specifies a sequence for applying each DQ rule of the subset to the table-formatted dataset.

15. The computer program product of claim 14, wherein the program instructions are executable by the processor to cause the processor to initiate operations further including: applying each of ordered subset of selected DQ rules to the table-formatted dataset, wherein the applying is performed in accordance with the ordering; and flagging each row for which a data deficiency is detected by a DQ rule belonging to the ordered subset of DQ rules.

16. The computer program product of claim 14, wherein the program instructions are executable by the processor to cause the processor to initiate operations further including: generating an updated snapshot in response to a newly received dataset; and generating, based on the updated snapshot, another ordered subset of selected DQ rules.

17. The computer program product of claim 14, wherein the program
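
Claims 2 and 5 describe applying the ordered rules to the full dataset, flagging deficient rows, and not running further rules on a row once a deficiency is first detected. The following is a minimal sketch of that short-circuit behavior, assuming each rule is a plain Python predicate; the function name `flag_rows`, the rule names, and the sample rows are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
NamedRule = Tuple[str, Callable[[Row], bool]]  # predicate returns True on a deficiency


def flag_rows(rows: List[Row], ordered_rules: List[NamedRule]) -> Tuple[Dict[int, str], int]:
    """Apply rules in the given order; stop at the first rule that fires on a row.

    Returns a map of row index to the name of the rule that flagged it,
    plus the total number of rule evaluations performed.
    """
    flagged: Dict[int, str] = {}
    evaluations = 0
    for index, row in enumerate(rows):
        for name, check in ordered_rules:
            evaluations += 1
            if check(row):
                flagged[index] = name  # mark the row for cleaning
                break                  # no other rule runs on this row
    return flagged, evaluations


# Illustrative data only.
ordered_rules: List[NamedRule] = [
    ("country_missing", lambda r: r["country"] is None),
    ("age_negative", lambda r: r["age"] < 0),
]
rows = [
    {"country": "US", "age": 34},   # clean: both rules run
    {"country": None, "age": -5},   # flagged by the first rule, second is skipped
    {"country": "DE", "age": -1},   # flagged by the second rule
]

flagged, evaluations = flag_rows(rows, ordered_rules)
print(flagged, evaluations)
```

Because a flagged row skips all later rules, placing rules that fire often and run cheaply near the front reduces the total number of evaluations, which is one plausible reading of the processing-time goal in claim 7.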

Assignees

Inventors

Classifications

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12353375B2 cover?
Selecting and ordering the execution of data quality rules includes generating a snapshot of a table-formatted dataset. The snapshot comprises a reduced number of rows of the dataset such that each column variation of the dataset is included in the snapshot. A predetermined collection of data quality (DQ) rules is executed on the snapshot. One or more performance statistics is determined for ea…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 8, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 10 related publications on this page (citations in our corpus or others sharing the same primary CPC).