US12353375B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12353375-B2 |
| Application number | US-202318505942-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 9, 2023 |
| Priority date | Nov 9, 2023 |
| Publication date | Jul 8, 2025 |
| Grant date | Jul 8, 2025 |
Official abstract text for this publication.
Selecting and ordering the execution of data quality rules includes generating a snapshot of a table-formatted dataset. The snapshot comprises a reduced number of rows of the dataset such that each column variation of the dataset is included in the snapshot. A predetermined collection of data quality (DQ) rules is executed on the snapshot. One or more performance statistics is determined for each of the DQ rules. The performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency. Based on the performance statistics, a subset of the DQ rules is generated. Each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency. An ordered subset of selected DQ rules is generated by ordering the application of each of the subset of DQ rules selected. The ordering specifies a sequence for executing each selected DQ rule.
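The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the greedy row cover used to build the snapshot, the use of a simple hit rate as the only performance statistic, the `min_rate` threshold, and all function, rule, and column names are hypothetical choices made for the example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Rule = Callable[[Row], bool]  # returns True when the row shows a deficiency


def make_snapshot(rows: List[Row]) -> List[Row]:
    """Pick a reduced set of rows covering every (column, value) pair.

    Assumption: a greedy cover stands in for whatever sampling the
    patent actually uses. Cell values must be hashable.
    """
    uncovered = {(col, val) for row in rows for col, val in row.items()}
    remaining = list(rows)
    snapshot: List[Row] = []
    while uncovered:
        # Take the row that covers the most still-uncovered pairs.
        best = max(remaining, key=lambda r: len(uncovered & set(r.items())))
        snapshot.append(best)
        uncovered -= set(best.items())
        remaining.remove(best)
    return snapshot


def rule_hit_rates(snapshot: List[Row], rules: Dict[str, Rule]) -> Dict[str, float]:
    """Fraction of snapshot rows on which each rule reports a deficiency."""
    n = len(snapshot)
    if n == 0:
        return {name: 0.0 for name in rules}
    return {name: sum(1 for r in snapshot if rule(r)) / n
            for name, rule in rules.items()}


def select_and_order(hit_rates: Dict[str, float], min_rate: float = 0.0) -> List[str]:
    """Keep rules whose hit rate exceeds min_rate, most likely to fire first."""
    chosen = [name for name, rate in hit_rates.items() if rate > min_rate]
    return sorted(chosen, key=lambda name: (-hit_rates[name], name))


# Hypothetical sample data and rules.
rows = [
    {"country": "US", "age": 34, "email": "a@x.com"},
    {"country": "US", "age": 34, "email": "a@x.com"},  # adds no new variation
    {"country": "DE", "age": -1, "email": "b@x.com"},
    {"country": "US", "age": -5, "email": None},
]
rules: Dict[str, Rule] = {
    "age_negative": lambda r: r["age"] < 0,
    "email_missing": lambda r: r["email"] is None,
    "country_blank": lambda r: not r["country"],
}

snapshot = make_snapshot(rows)
hit_rates = rule_hit_rates(snapshot, rules)
ordered = select_and_order(hit_rates)
print(len(snapshot), ordered)
```

In this sketch the duplicate row is left out of the snapshot, and the rule that never fires on the snapshot (`country_blank`) is dropped from the ordered subset. A fuller version might also record per-rule execution time and weigh it in the ordering.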
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: generating a snapshot of a table-formatted dataset, wherein the snapshot provides a sample comprising a reduced number of rows of the table-formatted dataset such that each column variation of the table-formatted dataset is included in the snapshot; executing a predetermined collection of data quality (DQ) rules on the snapshot; determining one or more performance statistics for each of the DQ rules, wherein the performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency; generating, based on the performance statistics, a subset of the DQ rules, wherein each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency; and generating an order of executing the subset of DQ rules selected, wherein the order generated specifies a sequence for applying each DQ rule of the subset to the table-formatted dataset.
2. The computer-implemented method of claim 1, further comprising: applying each of ordered subset of selected DQ rules to the table-formatted dataset, wherein the applying is performed in accordance with the ordering; and flagging each row for which a data deficiency is detected by a DQ rule belonging to the ordered subset of DQ rules.
3. The computer-implemented method of claim 1, further comprising: generating an updated snapshot in response to a newly received dataset; and generating, based on the updated snapshot, another ordered subset of selected DQ rules.
4. The computer-implemented method of claim 1, further comprising: generating an explanation for selecting and ordering execution of each DQ rule belonging to the ordered subset of DQ rules.
5. The computer-implemented method of claim 1, further comprising: performing data cleaning on each row of the table-formatted dataset having a quality deficiency detected by a DQ rule belonging to the ordered subset of DQ rules, wherein each row is marked for cleaning when a data deficiency is first detected by one of the DQ rules of the ordered subset of DQ rules and no other of the DQ rules executes on the row after the data deficiency is first detected.
6. The computer-implemented method of claim 1, wherein the generating a snapshot includes constructing a bipartite graph comprising two sets of vertices, one set of vertices corresponding to rows of the table-formatted dataset, and one set of vertices corresponding to cell values for each column of the table-formatted dataset.
7. The computer-implemented method of claim 1, wherein the ordering is based on minimizing processing time for detecting one or more quality deficiencies of one or more rows of the table-formatted dataset.
8. A system, comprising: one or more processors configured to initiate operations including: generating a snapshot of a table-formatted dataset, wherein the snapshot provides a sample comprising a reduced number of rows of the table-formatted dataset such that each column variation of the table-formatted dataset is included in the snapshot; executing a predetermined collection of data quality (DQ) rules on the snapshot; determining one or more performance statistics for each of the DQ rules, wherein the performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency; generating, based on the performance statistics, a subset of the DQ rules, wherein each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency; and generating an order of executing the subset of DQ rules selected, wherein the order generated specifies a sequence for applying each DQ rule of the subset to the table-formatted dataset.
9. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: applying each of ordered subset of selected DQ rules to the table-formatted dataset, wherein the applying is performed in accordance with the ordering; and flagging each row for which a data deficiency is detected by a DQ rule belonging to the ordered subset of DQ rules.
10. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: generating an updated snapshot in response to a newly received dataset; and generating, based on the updated snapshot, another ordered subset of selected DQ rules.
11. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: generating an explanation for selecting and ordering execution of each DQ rule belonging to the ordered subset of DQ rules.
12. The system of claim 8, wherein the one or more processors are configured to initiate operations further including: performing data cleaning on each row of the table-formatted dataset having a quality deficiency detected by a DQ rule belonging to the ordered subset of DQ rules, wherein a row is marked for cleaning when a data deficiency is first detected by one of the DQ rules belong to the ordered subset of DQ rules and no other of the DQ rules executes on the row after the data deficiency is first detected.
13. The system of claim 8, wherein the generating a snapshot includes constructing a bipartite graph comprising two sets of vertices, one set of vertices corresponding to rows of the table-formatted dataset, and one set of vertices corresponding to cell values for each column of the table-formatted dataset.
14. A computer program product, the computer program product comprising: one or more computer-readable storage media and program instructions collectively stored on the one or more computer-readable storage media, the program instructions executable by a processor to cause the processor to initiate operations including: generating a snapshot of a table-formatted dataset, wherein the snapshot provides a sample comprising a reduced number of rows of the table-formatted dataset such that each column variation of the table-formatted dataset is included in the snapshot; executing a predetermined collection of data quality (DQ) rules on the snapshot; determining one or more performance statistics for each of the DQ rules, wherein the performance statistics indicate a likelihood that a DQ rule determines a data quality deficiency; generating, based on the performance statistics, a subset of the DQ rules, wherein each DQ rule of the subset is selected based on the likelihood that the DQ rule selected detects a quality deficiency; and generating an order of executing the subset of DQ rules selected, wherein the order generated specifies a sequence for applying each DQ rule of the subset to the table-formatted dataset.
15. The computer program product of claim 14, wherein the program instructions are executable by the processor to cause the processor to initiate operations further including: applying each of ordered subset of selected DQ rules to the table-formatted dataset, wherein the applying is performed in accordance with the ordering; and flagging each row for which a data deficiency is detected by a DQ rule belonging to the ordered subset of DQ rules.
16. The computer program product of claim 14, wherein the program instructions are executable by the processor to cause the processor to initiate operations further including: generating an updated snapshot in response to a newly received dataset; and generating, based on the updated snapshot, another ordered subset of selected DQ rules.
17. The computer program product of claim 14, wherein the program
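The flagging and early-exit behavior described in claims 2 and 5 (a row is marked when a deficiency is first detected, and no further rule runs on that row) can be illustrated as follows. This is a hedged sketch only: the function name `flag_rows`, the sample rows, and the two rules are hypothetical and do not come from the patent text.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Rule = Callable[[Row], bool]  # returns True when the row shows a deficiency


def flag_rows(rows: List[Row], ordered_rules: List[Tuple[str, Rule]]) -> Dict[int, str]:
    """Apply rules in the given order; stop at the first rule that fires.

    Returns a mapping from row index to the name of the first rule
    that detected a deficiency on that row.
    """
    flagged: Dict[int, str] = {}
    for idx, row in enumerate(rows):
        for name, rule in ordered_rules:
            if rule(row):
                flagged[idx] = name  # mark the row for cleaning
                break                # later rules are skipped for this row
    return flagged


# Hypothetical data: the last row fails both rules, but only the
# first rule in the order is ever evaluated on it.
rows = [
    {"age": 34, "email": "a@x.com"},
    {"age": -1, "email": "b@x.com"},
    {"age": 40, "email": None},
    {"age": -5, "email": None},
]
ordered_rules: List[Tuple[str, Rule]] = [
    ("age_negative", lambda r: r["age"] < 0),
    ("email_missing", lambda r: r["email"] is None),
]

flagged = flag_rows(rows, ordered_rules)
print(flagged)
```

Placing the rule most likely to fire first means that, for deficient rows, fewer rule evaluations are needed on average, which is one plausible reading of the processing-time goal in claim 7.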
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title