Column lineage and metadata propagation

US11599539B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11599539-B2
Application number  US-201916287631-A
Country             US
Kind code           B2
Filing date         Feb 27, 2019
Priority date       Dec 26, 2018
Publication date    Mar 7, 2023
Grant date          Mar 7, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A logical query plan to derive a target dataset from one or more source datasets is identified. The logical query plan defines source columns of the one or more source datasets and respective target columns of the target dataset. The logical query plan is parsed to derive relationships between the source columns of the one or more source datasets and the respective target columns of the target dataset. Target column metadata is generated for a target column of the target dataset. The target column metadata reflects a derived relationship between one or more source columns and the target column and existing source column metadata of each of the one or more source columns. The target column metadata is stored for the target column of the target dataset.
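The flow in the abstract can be illustrated with a minimal sketch. It assumes a toy plan representation rather than any real query engine's: `PlanNode`, its `"Relation"` and `"Project"` operator names, and the helpers `leaf_columns`, `derive_lineage`, and `target_metadata` are all hypothetical names introduced here for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """One logical operator in a hypothetical plan tree (illustrative only)."""
    op: str                                        # "Relation" (leaf) or "Project" (root)
    dataset: str = ""                              # dataset name, set on Relation nodes
    columns: list = field(default_factory=list)    # source columns exposed by a Relation
    outputs: dict = field(default_factory=dict)    # Project: target -> (relationship, [source names])
    children: list = field(default_factory=list)


def leaf_columns(node):
    """Map each source column name to the dataset whose Relation leaf exposes it."""
    if node.op == "Relation":
        return {col: node.dataset for col in node.columns}
    found = {}
    for child in node.children:
        found.update(leaf_columns(child))
    return found


def derive_lineage(plan):
    """Return {target: (relationship, [(dataset, column), ...])} for a Project root."""
    owners = leaf_columns(plan)
    return {
        target: (relationship, [(owners[name], name) for name in sources])
        for target, (relationship, sources) in plan.outputs.items()
    }


def target_metadata(lineage, source_metadata):
    """Combine each derived relationship with the existing metadata of its sources."""
    return {
        target: {
            "relationship": relationship,
            "sources": sources,
            "inherited": [source_metadata.get(src, {}) for src in sources],
        }
        for target, (relationship, sources) in lineage.items()
    }


# Assumed example: target column "total" is computed from two "orders" columns.
orders = PlanNode(op="Relation", dataset="orders", columns=["price", "qty"])
plan = PlanNode(
    op="Project",
    children=[orders],
    outputs={"total": ("price * qty", ["price", "qty"])},
)
source_metadata = {("orders", "price"): {"comment": "unit price in USD"}}

# "Storing" is modeled here as keeping the result in a plain dictionary.
metadata_store = target_metadata(derive_lineage(plan), source_metadata)
```

In this sketch the stored entry for `total` records both the derived relationship (`price * qty`) and whatever metadata already existed on each source column, which is the combination the abstract describes. A real plan would have many more operator types than the two modeled here.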

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: identifying a logical query plan to derive a target dataset from one or more source datasets, wherein the logical query plan was generated from transformation code in a first programming language and comprises a hierarchical structure expressed as a tree of nodes of logical operators, wherein the logical query plan identifies a plurality of source columns of the one or more source datasets and respective target columns of the target dataset; parsing the logical query plan to derive relationships between the plurality of source columns of the one or more source datasets and the respective target columns of the target dataset; generating target column metadata for a target column of the target dataset, the target column metadata reflecting a derived relationship derived from the logical query plan and between one or more source columns and the target column and reflecting existing source column metadata of each of the one or more source columns; and storing the target column metadata for the target column of the target dataset.

2. The method of claim 1, further comprising: responsive to determining that the logical query plan is not available for the transformation code in the first programming language, inferring the relationships between the plurality of source columns of the one or more source datasets and the respective target columns of the target dataset.

3. The method of claim 1, wherein parsing the logical query plan to derive the relationships between the plurality of source columns of the one or more source datasets and the respective target columns of the target dataset, comprises: finding, in the logical query plan, one or more keywords associated with one or more first logical query plan portions that each identify a source dataset of the one or more source datasets; finding, in the logical query plan, one or more keywords associated with a second logical query plan portion that identifies the plurality of source columns of the one or more source datasets; finding, in the logical query plan, one or more keywords associated with a third logical query plan portion that identifies the respective target columns of the target dataset; and finding, for each of the respective target columns of the target dataset, one or more keywords associated with a fourth logical query plan portion describing a relationship between at least one of the one or more source columns of the one or more source datasets and the respective target column of the target dataset.

4. The method of claim 3, wherein the relationship between the at least one source column of the one or more source datasets and the respective target column of the target dataset is at least one of a mapping between a name of the one or more source columns and a name of the respective target column, a database operation performed on the one or more source columns to derive the respective target column, or a function used to calculate values of the respective target column using values of the one or more source columns.

5. The method of claim 1, wherein generating the target column metadata for the target column of the target dataset, further comprises: determining existing lineage metadata associated with each of the one or more source columns; and providing the existing lineage metadata associated with each of the one or more source columns for inclusion with the target column metadata for the target column of the target dataset.

6. The method of claim 1, wherein generating the target column metadata for the target column of the target dataset further comprises: identifying user comments within the existing source column metadata of each of the one or more source columns; and providing the user comments for inclusion with the target column metadata for the target column of the target dataset.

7. The method of claim 1, further comprising: determining whether the one or more source columns are associated with a column level access control policy; and responsive to determining the one or more source columns are associated with the column level access control policy, propagating the column level access control policy to the target column of the target dataset.

8. The method of claim 7, wherein the one or more source columns comprise at least two source columns, the method further comprising: determining that the at least two source columns are associated with a plurality of column level access control policies; and selecting one of the plurality of column level access control policies to propagate to the target column of the target dataset.

9. The method of claim 2, wherein inferring the relationships between the plurality of source columns of the one or more source datasets and the respective target columns of the target dataset, comprises: identifying, based on a list of datasets, one or more datasets that are source dataset candidates; finding, for each of the respective target columns, one or more source column candidates from the source dataset candidates; and inferring, for each of the respective target columns, a relationship between the one or more source column candidates and a respective target column of the respective target columns of the target dataset based on values in the one or more source column candidates and the target column of the target dataset.

10. The method of claim 9, wherein finding, for each of the respective target columns, the one or more source column candidates from the source dataset candidates comprises: comparing at least one of data types or column names of a plurality of columns of the source dataset candidates to data types or column names of the respective target columns of the target dataset.

11. The method of claim 1, further comprising: providing a graphical user interface comprising a graph representing column lineage of the target column; and modifying the column lineage of the target column based on user input via the graphical user interface.

12. A system comprising: a memory; and a processing device, coupled to the memory, to: identify a logical query plan to derive a target dataset from one or more source datasets, wherein the logical query plan was generated from transformation code in a first programming language and comprises a hierarchical structure expressed as a tree of nodes of logical operators, wherein the logical query plan identifies a plurality of source columns of the one or more source datasets and respective target columns of the target dataset; parse the logical query plan to derive relationships between the plurality of source columns of the one or more source datasets and the respective target columns of the target dataset; generate target column metadata for a target column of the target dataset, the target column metadata reflecting a derived relationship derived from the logical query plan and between one or more source columns and the target column and reflecting existing source column metadata of each of the one or more source columns; and store the target column metadata for the target column of the target dataset.

13. The system of claim 12, the processing device further to: responsive to determining that the logical query plan is not available for the transformation code in the first programming language, infer the relationships between the plurality of source columns of the one or more source datasets and the respective target columns of the target dataset.

14. The system of claim 12, wherein to parse the logical query plan to derive the relationships between the plurality of source columns of the one or more source
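The fallback in claims 2, 9, and 10 (inferring relationships when no logical query plan is available) can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed method: the `Column` type, the function names, the relationship labels `"identity"` and `"rename"`, and the choice of exact value equality as the value-based test are all hypothetical simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """A column with sample values (hypothetical structure for illustration)."""
    dataset: str
    name: str
    dtype: str
    values: list


def find_candidates(target, source_columns):
    """Keep source columns whose name or data type matches the target column's."""
    return [
        col for col in source_columns
        if col.name.lower() == target.name.lower() or col.dtype == target.dtype
    ]


def infer_relationship(target, candidates):
    """Return (candidate, label) for the first candidate whose values equal the target's.

    Exact equality is an assumed, deliberately simple value-based test; it
    returns None when no candidate's values explain the target column.
    """
    for col in candidates:
        if col.values == target.values:
            label = "identity" if col.name == target.name else "rename"
            return col, label
    return None


# Assumed example data: two source dataset candidates and one target column.
source_columns = [
    Column("customers", "cust_id", "int", [1, 2, 3]),
    Column("customers", "name", "string", ["a", "b", "c"]),
    Column("orders", "amount", "int", [10, 20, 30]),
]
target = Column("report", "customer_id", "int", [1, 2, 3])

candidates = find_candidates(target, source_columns)
match = infer_relationship(target, candidates)
```

Here the type comparison narrows the search to the two integer columns, and the value comparison then singles out `cust_id` as the likely source of `customer_id`. A production system would need a more tolerant value test (for example, one that handles reordered rows or computed columns), which this sketch does not attempt.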

Assignees

  • Palantir Technologies Inc

Classifications

  • to a system of files or objects, e.g. local or distributed file system or database

  • Plan optimisation

  • Managing data history or versioning (querying versioned data G06F16/2474; querying temporal data G06F16/2477)

  • Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

  • G06F16/254 (Primary): Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11599539B2 cover?
A logical query plan to derive a target dataset from one or more source datasets is identified. The logical query plan defines source columns of the one or more source datasets and respective target columns of the target dataset. The logical query plan is parsed to derive relationships between the source columns of the one or more source datasets and the respective target columns of the target …
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/24542. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 7, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).