Automated ETL workflow generation
US-2022043826-A1 · Feb 10, 2022 · US
US2022374442A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022374442-A1 |
| Application number | US-202117303167-A |
| Country | US |
| Kind code | A1 |
| Filing date | May 21, 2021 |
| Priority date | May 21, 2021 |
| Publication date | Nov 24, 2022 |
| Grant date | — |
Official abstract text for this publication.
In some implementations, a monitoring device may receive configuration information associated with an extract, transform, load (ETL) pipeline that includes one or more data sources and one or more data sinks. The monitoring device may generate, based on the configuration information, lineage data related to a data flow from the one or more data sources to the one or more data sinks in the ETL pipeline. The monitoring device may generate one or more predicted quality metrics associated with the ETL pipeline using a machine learning model. The monitoring device may generate a visualization in which multiple nodes are arranged to indicate the data flow from the one or more data sources to the one or more data sinks and further in which the one or more predicted quality metrics are encoded within the visualization.
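As a rough illustration of the lineage step described in the abstract, the sketch below derives table-to-table edges from configuration information and assigns each table to a display column. The configuration schema (`job`, `sources`, `target`), the table names, and the longest-path column rule are all assumptions made for this example; the publication does not specify a configuration format or a layout algorithm.

```python
from collections import defaultdict

# Hypothetical configuration: each ETL job reads source tables and writes one target.
CONFIG = [
    {"job": "stage_orders", "sources": ["raw_orders"], "target": "stg_orders"},
    {"job": "stage_customers", "sources": ["raw_customers"], "target": "stg_customers"},
    {"job": "build_report", "sources": ["stg_orders", "stg_customers"], "target": "sales_report"},
]


def build_lineage(config):
    """Return sorted directed edges (upstream table, downstream table)."""
    edges = set()
    for job in config:
        for src in job["sources"]:
            edges.add((src, job["target"]))
    return sorted(edges)


def assign_columns(edges):
    """Map each table to a column index equal to its longest distance from a source.

    Assumes the lineage graph is acyclic.
    """
    parents = defaultdict(set)
    nodes = set()
    for up, down in edges:
        parents[down].add(up)
        nodes.update((up, down))

    column = {}

    def depth(node):
        if node not in column:
            # Tables with no parents are sources and land in column 0.
            column[node] = 1 + max((depth(p) for p in parents[node]), default=-1)
        return column[node]

    for node in nodes:
        depth(node)
    return column


edges = build_lineage(CONFIG)
columns = assign_columns(edges)
```

With this sample configuration, the raw tables fall in column 0, the staging tables in column 1, and the report table in column 2, which is one way the source, intermediate, and target tables could be spread across columns in a visualization.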
Opening claim text (preview).
What is claimed is:
1. A system for monitoring an extract, transform, load (ETL) pipeline, comprising: one or more memories; and one or more processors, coupled to the one or more memories, configured to: receive configuration information associated with the ETL pipeline that includes one or more data sources and one or more data sinks, wherein the configuration information indicates data records to be extracted from the one or more data sources, transformed from a source format to a target format, and loaded into the one or more data sinks; generate, based on the configuration information, lineage data related to a data flow from the one or more data sources to the one or more data sinks in the ETL pipeline; generate one or more predicted quality metrics associated with the ETL pipeline using a machine learning model, wherein the machine learning model is trained using historical execution data associated with one or more ETL jobs; and generate a visualization in which multiple nodes are arranged to indicate the data flow from the one or more data sources to the one or more data sinks and further in which the one or more predicted quality metrics are encoded within the visualization.
2. The system of claim 1, wherein the multiple nodes represent one or more source tables storing the data records to be extracted, transformed, and loaded, one or more target tables into which the data records are to be loaded, and one or more intermediate source tables in the data flow from the one or more source tables to the one or more target tables.
3. The system of claim 2, wherein the lineage data includes dependencies among the one or more source tables, the one or more intermediate source tables, and the one or more target tables.
4. The system of claim 2, wherein the multiple nodes are arranged across multiple columns and the visualization includes user interface elements linking the multiple nodes to indicate the data flow from the one or more source tables to the one or more target tables.
5. The system of claim 4, wherein the multiple nodes and the user interface elements linking the multiple nodes are each depicted in the visualization using a color in a color palette.
6. The system of claim 2, wherein the one or more predicted quality metrics relate to one or more of a timeliness, a service level agreement, or an accuracy associated with an ETL task configured to process the data records in the one or more source tables, the one or more intermediate source tables, or the one or more target tables.
7. The system of claim 1, wherein the one or more processors are further configured to: detect, using the machine learning model, a failure or an anomaly in the data flow from the one or more data sources to the one or more data sinks in the ETL pipeline; and cause one or more of the multiple nodes in the visualization to be depicted using one or more colors to indicate a portion of the data flow affected by the failure or the anomaly.
8. The system of claim 7, wherein the one or more processors are further configured to: terminate an ETL task associated with the ETL pipeline based on the failure or the anomaly in the data flow.
9. The system of claim 7, wherein the one or more processors are further configured to: send a message to one or more users based on the failure or the anomaly in the data flow, wherein the message includes information related to the failure or the anomaly in the data flow and information related to one or more suggested actions to remediate the failure or the anomaly in the data flow.
10. The system of claim 1, wherein the one or more predicted quality metrics are encoded within the visualization such that information related to the one or more predicted quality metrics are depicted in the visualization based on interaction with one or more user interface elements.
11. A method for visualizing information related to an extract, transform, load (ETL) pipeline, comprising: receiving, by an ETL monitoring device, configuration information associated with the ETL pipeline that includes one or more data sources and one or more data sinks, wherein the configuration information indicates data records to be extracted from the one or more data sources, transformed from a source format to a target format, and loaded into the one or more data sinks; generating, by the ETL monitoring device, based on the configuration information, lineage data related to a data flow from the one or more data sources to the one or more data sinks in the ETL pipeline; and generating, by the ETL monitoring device, based on the lineage data, a visualization including multiple nodes that are linked by user interface elements to indicate the data flow from the one or more data sources to the one or more data sinks.
12. The method of claim 11, wherein the multiple nodes represent one or more source tables storing the data records to be extracted, transformed, and loaded, one or more target tables into which the data records are to be loaded, and one or more intermediate source tables in the data flow from the one or more source tables to the one or more target tables.
13. The method of claim 11, further comprising: generating one or more predicted quality metrics associated with the ETL pipeline using a machine learning model that is trained using historical execution data associated with one or more ETL jobs; and configuring the visualization to indicate the one or more predicted quality metrics by depicting one or more of the multiple nodes or one or more of the user interface elements linking the multiple nodes using a color in a color palette.
14. The method of claim 11, further comprising: generating one or more predicted quality metrics associated with the ETL pipeline using a machine learning model that is trained using historical execution data associated with one or more ETL jobs; and configuring the visualization to depict information related to the one or more predicted quality metrics based on interaction with one or more of the multiple nodes or the user interface elements linking the multiple nodes.
15. The method of claim 11, further comprising: detecting a failure or an anomaly in the data flow from the one or more data sources to the one or more data sinks in the ETL pipeline; and performing one or more actions based on the failure or the anomaly in the data flow, wherein performing the one or more actions includes one or more of: causing one or more of the multiple nodes or the user interface elements linking the multiple nodes to be depicted in the visualization using one or more colors to indicate a portion of the data flow affected by the failure or the anomaly, terminating an ETL task associated with the ETL pipeline, or generating a message that includes information related to the failure or the anomaly in the data flow and information related to one or more suggested actions to remediate the failure or the anomaly in the data flow.
16. A non-transitory computer-readable medium storing a set of instructions, the set of instructions comprising: one or more instructions that, when executed by one or more processors of an extract, transform, load (ETL) monitoring device, cause the ETL monitoring device to: generate lineage data related to a data flow from one or more data sources to one or more data sinks in an ETL pipeline; detect a failure or an anomaly in the ETL pipeline using a machine learning model that is trained using historical execution data associated with one or more ETL jobs; and gener
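Claims 7 and 15 describe depicting nodes in colors that indicate the portion of the data flow affected by a failure or anomaly. The sketch below shows one plausible way to find that portion: a breadth-first walk downstream from the failing table. The edge list, the table names, and the color names are hypothetical, and the machine-learning detection step is not modeled here; the failing table is simply passed in as an input.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream table, downstream table).
EDGES = [
    ("raw_orders", "stg_orders"),
    ("raw_customers", "stg_customers"),
    ("stg_orders", "sales_report"),
    ("stg_customers", "sales_report"),
]


def affected_nodes(edges, failed):
    """Return the failed table plus every table downstream of it."""
    children = defaultdict(list)
    for up, down in edges:
        children[up].append(down)

    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def color_nodes(edges, failed):
    """Assign an illustrative color to each table based on the failure's reach."""
    impacted = affected_nodes(edges, failed)
    all_nodes = {name for edge in edges for name in edge}
    colors = {}
    for node in all_nodes:
        if node == failed:
            colors[node] = "red"      # origin of the failure or anomaly
        elif node in impacted:
            colors[node] = "orange"   # downstream of the failure
        else:
            colors[node] = "green"    # unaffected
    return colors


colors = color_nodes(EDGES, "stg_orders")
```

Here a failure at `stg_orders` marks that table and `sales_report` as affected, while the customer branch and the raw order table stay unaffected. A real system would take the failing node from the trained model's output rather than from a hard-coded argument.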
Visual data mining; Browsing structured data · CPC title
Tablespace storage structures; Management thereof · CPC title
Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title
Machine learning · CPC title
Inference or reasoning models · CPC title