Dynamically performing data processing in a data pipeline system

US11314698B2 · US · B2

Patent metadata
Publication number: US-11314698-B2
Application number: US-201816208435-A
Country: US
Kind code: B2
Filing date: Dec 3, 2018
Priority date: Jul 6, 2017
Publication date: Apr 26, 2022
Grant date: Apr 26, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for automatically scheduling builds of derived datasets in a distributed database system that supports pipelined data transformations are described herein. In an embodiment, a data processing method comprises, in association with a distributed database system that implements one or more data transformation pipelines, each of the data transformation pipelines comprising at least a first dataset, a first transformation, a second derived dataset and dataset dependency and timing metadata, detecting an arrival of a new raw dataset or new derived dataset; in response to the detecting, obtaining from the dataset dependency and timing metadata a dataset subset comprising those datasets that depend on at least the new raw dataset or new derived dataset; for each member dataset in the dataset subset, determining if the member dataset has a dependency on any other dataset that is not yet arrived, and in response to determining that the member dataset does not have a dependency on any other dataset that is not yet arrived: initiating a build of a portion of the data transformation pipeline comprising the member dataset and all other datasets on which the member dataset is dependent, without waiting for arrival of other datasets.
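
The scheduling loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `PipelineMetadata`, `depends_on`, `arrived`, `dependents_of`, and `on_arrival` are hypothetical, and the dependency and timing metadata is reduced to an in-memory dictionary. On each arrival, the sketch looks up the datasets that depend on the new dataset and returns those whose dependencies have all arrived, so that their builds can start without waiting for unrelated datasets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PipelineMetadata:
    """Hypothetical stand-in for the dataset dependency and timing metadata."""

    # Maps each derived dataset to the datasets it is built directly from.
    depends_on: dict[str, set[str]]
    # Datasets whose current version has already arrived.
    arrived: set[str] = field(default_factory=set)

    def dependents_of(self, dataset: str) -> list[str]:
        """Return the datasets that depend directly on `dataset`, sorted by name."""
        return sorted(d for d, deps in self.depends_on.items() if dataset in deps)


def on_arrival(meta: PipelineMetadata, new_dataset: str) -> list[str]:
    """Record an arrival and return the dependents that are ready to build."""
    meta.arrived.add(new_dataset)
    ready = []
    for member in meta.dependents_of(new_dataset):
        # Build only if no dependency of this member is still outstanding.
        if meta.depends_on[member] <= meta.arrived:
            ready.append(member)
    return ready


# Assumed example pipeline: two single-input datasets and one join.
meta = PipelineMetadata(
    depends_on={
        "second_derived": {"first"},
        "fourth_derived": {"third"},
        "joined": {"first", "third"},
    }
)
ready_after_third = on_arrival(meta, "third")
ready_after_first = on_arrival(meta, "first")
```

In this sketch, the arrival of `third` makes only `fourth_derived` buildable, while `joined` waits until `first` has also arrived. A real system would call `on_arrival` again once a derived dataset finishes building, since the abstract treats a new derived dataset as an arrival too.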

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: in association with a distributed data processing system that implements one or more data transformation pipelines, a data transformation pipeline of the data transformation pipelines comprising at least a first dataset, a first transformation, a second derived dataset, a third dataset, a fourth derived dataset, and dataset dependency and timing metadata; wherein the dataset dependency and timing metadata includes timestamp values for each of the first dataset, the second derived dataset, the third dataset, and the fourth derived dataset that indicate a last time at which those datasets were created or updated; determining that the first dataset on which the second derived dataset depends has not arrived by comparing a timestamp value that is stored with the first dataset to a corresponding timestamp value for the first dataset from the dataset dependency and timing metadata; determining that the third dataset on which the fourth derived dataset depends has arrived by comparing a timestamp that is stored with the third dataset to a corresponding timestamp value for the third dataset from the dataset dependency and timing metadata; in response thereto, initiating build operations for each dataset that depends only on the third dataset and any other datasets that have arrived but excluding building the second derived dataset; in response to determining that the third dataset has arrived, obtaining from the dataset dependency and timing metadata a dataset subset comprising at least the fourth derived dataset that depends on the third dataset; determining that the fourth derived dataset does not have a dependency on any other dataset that has not yet arrived and, in response, initiating a build of a portion of the data transformation pipeline comprising the fourth derived dataset and all other datasets on which the fourth derived dataset is dependent, without waiting for arrival of other datasets; wherein the method is performed using one or more processors.

2. The method of claim 1, further comprising, in response to determining that the first dataset on which the second derived dataset depends has not yet arrived, recording that a partial dependency of the second derived dataset has been satisfied.

3. The method of claim 1, the first dataset comprising any of a first raw dataset, or a first derived dataset that was derived via a second transformation.

4. The method of claim 1, the first transformation comprising any of: creating the second derived dataset without a column that is in the first dataset; creating the second derived dataset with a column that is in the first dataset and using a different name of the column in the second derived dataset.

5. The method of claim 1, further comprising: detecting that a cutoff time has occurred; in response to detecting that the cutoff time has occurred, transmitting a notification to a specified account or address; wherein the step of determining that the first dataset on which the second derived dataset depends has not arrived is performed in response to detecting that the cutoff time has occurred.

6. The method of claim 1, further comprising: detecting that a cutoff time has occurred; in response to detecting that the cutoff time has occurred: determining that a particular dataset on which the second derived dataset depends has not arrived, and that the particular dataset is marked with a critical dataset flag value; in response thereto, transmitting a notification to a specified account or address.

7. The method of claim 1, further comprising detecting an arrival of a new raw dataset or new derived dataset only for datasets that are identified in a list of raw datasets to track.

8. The method of claim 1, further comprising detecting an arrival of a new raw dataset or new derived dataset only during an expected arrival period that is defined in stored configuration data.

9. The method of claim 1, in which obtaining the dataset subset from the dataset dependency and timing metadata occurs just after the dataset dependency and timing metadata has been updated.

10. The method of claim 1, wherein detecting an arrival of a new raw dataset or new derived dataset comprises determining that a timestamp of the new raw dataset or new derived dataset is not older, compared to a current time, than a specified recent time.

11. The method of claim 1, wherein initiating a build comprises instantiating a build worker process and instructing the build worker process to build the portion of the data transformation pipeline comprising the second derived dataset and all other datasets on which the second derived dataset is dependent.

12. The method of claim 1, the dataset dependency and timing metadata defining a non-directional dependency group of a plurality of datasets that are dependent upon one another, the method further comprising determining whether every dataset in the non-directional dependency group is updated, and initiating build operations for derived datasets depending upon the non-directional dependency group only when all datasets in the non-directional dependency group have received updates.

13. The method of claim 1, the dataset dependency and timing metadata defining a directional dependency group of raw datasets all of which are dependent on a second group of datasets, the method further comprising determining that the directional dependency group of raw datasets is updated only after all datasets in the second group are updated, and initiating build operations for derived datasets depending upon the directional dependency group only when all datasets in the directional dependency group have received updates.

14. A computer system comprising: one or more processors; one or more computer-readable storage media coupled to the one or more processors and storing one or more sequences of instructions which, when executed using the one or more processors, cause the one or more processors to perform: in association with a distributed data processing system that implements one or more data transformation pipelines, a data transformation pipeline of the data transformation pipelines comprising at least a first dataset, a first transformation, a second derived dataset, a third dataset, a fourth derived dataset, and dataset dependency and timing metadata; wherein the dataset dependency and timing metadata includes timestamp values for each of the first dataset, the second derived dataset, the third dataset, and the fourth derived dataset that indicate a last time at which those datasets were created or updated; determining that the first dataset on which the second derived dataset depends has not arrived by comparing a timestamp value that is stored with the first dataset to a corresponding timestamp value for the first dataset from the dataset dependency and timing metadata; determining that the third dataset on which the fourth derived dataset depends has arrived by comparing a timestamp that is stored with the third dataset to a corresponding timestamp value for the third dataset from the dataset dependency and timing metadata; in response thereto, initiating build operations for each dataset that depends only on the third dataset and any other datasets that have arrived but excluding building the second derived dataset; in response to determining that the third dataset has arrived, obtaining from the dataset dependency and timing metadata a dataset subset comprising at least the fourth derived dataset that depends on the third dataset; determining that the fourth derived dataset does not have a depende
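
The timestamp-based checks recited in claims 1, 6, and 10 can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, and the claims do not say which direction of the timestamp comparison counts as "arrived". The sketch assumes that a dataset has arrived when the timestamp stored with it is newer than the one recorded in the dependency and timing metadata.

```python
from datetime import datetime, timedelta


def has_arrived(stored_ts: datetime, recorded_ts: datetime) -> bool:
    """Assumed rule: a dataset has arrived if the timestamp stored with it is
    newer than the timestamp recorded in the timing metadata (claim 1)."""
    return stored_ts > recorded_ts


def is_recent(dataset_ts: datetime, now: datetime, max_age: timedelta) -> bool:
    """Treat a dataset as new only if its timestamp is no older than
    `max_age` relative to the current time (claim 10)."""
    return now - dataset_ts <= max_age


def missing_critical_at_cutoff(
    dependencies: set[str], arrived: set[str], critical: set[str]
) -> list[str]:
    """At the cutoff time, list the dependencies that have not arrived and
    carry the critical dataset flag; these would trigger a notification
    (claim 6). Sending the notification itself is out of scope here."""
    return sorted(d for d in dependencies if d not in arrived and d in critical)


# Assumed example values, chosen only for illustration.
recorded = datetime(2018, 12, 2, 6, 0)
fresh = datetime(2018, 12, 3, 6, 0)
third_arrived = has_arrived(fresh, recorded)
first_arrived = has_arrived(recorded, recorded)
to_notify = missing_critical_at_cutoff({"first", "third"}, {"third"}, {"first"})
```

Here `third_arrived` is true and `first_arrived` is false, matching the scenario in claim 1 where the third dataset has arrived and the first has not. Because `first` is flagged as critical in the example, it is the dataset that a cutoff-time notification would report.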

Assignees

Palantir Technologies Inc

Inventors

Classifications

  • G06F16/254 (Primary)

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

  • Data stream processing; Continuous queries

  • Ensuring data consistency and integrity

  • Hypervisors; Virtual machine monitors

  • G06F16/182 (Primary)

    Distributed file systems

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11314698B2 cover?
Techniques for automatically scheduling builds of derived datasets in a distributed database system that supports pipelined data transformations are described herein. In an embodiment, a data processing method comprises, in association with a distributed database system that implements one or more data transformation pipelines, each of the data transformation pipelines comprising at least a fir…
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/254. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 26, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).