Data revision control in large-scale data analytic systems

US10007674B2 · US · B2

Patent metadata
Publication number: US-10007674-B2
Application number: US-201615262207-A
Country: US
Kind code: B2
Filing date: Sep 12, 2016
Priority date: Jun 13, 2016
Publication date: Jun 26, 2018
Grant date: Jun 26, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented system and method for data revision control in large-scale data analytic systems. In one embodiment, for example, a computer-implemented method comprises the operations of storing a first version of a dataset that is derived by executing a first version of a driver program associated with the dataset; and storing a first build catalog entry comprising an identifier of the first version of the dataset and comprising an identifier of the first version of the driver program.
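
The two operations in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names, the `build` method, and the counter-based version identifiers are all assumptions made for illustration. The abstract only states that a dataset version is stored and that a build catalog entry records the dataset version identifier together with the driver program version identifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class BuildCatalogEntry:
    """Hypothetical record linking a dataset version to the driver version that built it."""
    dataset_version_id: str
    driver_version_id: str


class BuildCatalog:
    """Hypothetical in-memory store for dataset versions and build catalog entries."""

    def __init__(self) -> None:
        self.datasets: Dict[str, List[str]] = {}      # dataset version id -> stored rows
        self.entries: List[BuildCatalogEntry] = []    # one entry per build

    def build(self, driver: Callable[[], List[str]], driver_version_id: str) -> BuildCatalogEntry:
        # Execute the driver program to derive the dataset version.
        rows = driver()
        # Assumption: a simple counter stands in for a real version identifier.
        dataset_version_id = f"dataset-v{len(self.entries) + 1}"
        # Operation 1: store the derived dataset version.
        self.datasets[dataset_version_id] = rows
        # Operation 2: store a build catalog entry with both identifiers.
        entry = BuildCatalogEntry(dataset_version_id, driver_version_id)
        self.entries.append(entry)
        return entry


catalog = BuildCatalog()
entry = catalog.build(lambda: ["row-a", "row-b"], driver_version_id="driver-v1")
```

Given such an entry, a reader can look up which driver program version produced any stored dataset version.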

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for data revision control in a large-scale data analytic system: at one or more machines comprising one or more processors and memory storing one or more programs executed by the one or more processors to perform the method, performing operations comprising:
    storing a first version of a first dataset that is derived from a first version of a second dataset based on a first execution of a first version of a driver program;
    storing a first build catalog entry comprising an identifier of the first version of the first dataset, an identifier of the first version of the second dataset, a first branch name, and an identifier of the first version of the driver program;
    storing a second version of the first dataset that is derived from a second version of the second dataset based on a second execution of the first version of the driver program;
    storing a second build catalog entry comprising an identifier of the second version of the first dataset, an identifier of the second version of the second dataset, a second branch name that is different from the first branch name, and an identifier of the first version of the driver program;
    storing a first transaction entry in a database, the first transaction entry comprising a first transaction commit identifier of the first version of the first dataset;
    wherein the first build catalog entry comprises the first transaction commit identifier;
    storing a second transaction entry in the database, the second transaction entry comprising a second transaction commit identifier of the first version of the second dataset;
    wherein the identifier of the first version of the second dataset in the first build catalog entry is the second transaction commit identifier;
    storing a third transaction entry in the database, the third transaction entry comprising a third transaction commit identifier of the second version of the second dataset;
    wherein the identifier of the second version of the second dataset in the second build catalog entry is the third transaction commit identifier; and
    causing display of a provenance graph in a graphical user interface based on the first build catalog entry, the provenance graph display including display of: a first node representing the first version of the first dataset, a second node representing the first version of the second dataset, and a first directed edge from the first node to the second node.

2. The method of claim 1, further comprising storing the first version of the first dataset in a distributed file system.

3. The method of claim 1, wherein the identifier of the first version of the first dataset is an identifier assigned to a commit of a transaction in context of which the first version of the first dataset is stored.

4. The method of claim 1, wherein the first version of the driver program, when executed to produce the first version of the first dataset, transforms data of the first version of the second dataset to produce data of the first version of the first dataset.

5. One or more non-transitory computer-readable media storing a set of instructions for execution by one or more processors, the set of instructions configured for performing operations comprising:
    storing a first version of a first dataset that is derived from a first version of a second dataset based on a first execution of a first version of a driver program;
    storing a first build catalog entry comprising an identifier of the first version of the first dataset, an identifier of the first version of the second dataset, a first branch name, and an identifier of the first version of the driver program;
    storing a second version of the first dataset that is derived from a second version of the second dataset based on a second execution of the first version of the driver program;
    storing a second build catalog entry comprising an identifier of the second version of the first dataset, an identifier of the second version of the second dataset, a second branch name that is different from the first branch name, and an identifier of the first version of the driver program;
    storing a first transaction entry in a database, the first transaction entry comprising a first transaction commit identifier of the first version of the first dataset;
    wherein the first build catalog entry comprises the first transaction commit identifier;
    storing a second transaction entry in the database, the second transaction entry comprising a second transaction commit identifier of the first version of the second dataset;
    wherein the identifier of the first version of the second dataset in the first build catalog entry is the second transaction commit identifier;
    storing a third transaction entry in the database, the third transaction entry comprising a third transaction commit identifier of the second version of the second dataset;
    wherein the identifier of the second version of the second dataset in the second build catalog entry is the third transaction commit identifier; and
    causing display of a provenance graph in a graphical user interface based on the second build catalog entry, the provenance graph display including display of: a first node representing the second version of the first dataset, a second node representing the second version of the second dataset, and a first directed edge from the first node to the second node.

6. The one or more non-transitory computer-readable media of claim 5, wherein the operations further comprise storing the first version of the first dataset in a distributed file system.

7. The one or more non-transitory computer-readable media of claim 5, wherein the identifier of the first version of the first dataset is an identifier assigned to a commit of a transaction in context of which the first version of the first dataset is stored.

8. The one or more non-transitory computer-readable media of claim 5, wherein the first version of the driver program, when executed to produce the first version of the first dataset, transforms data of the first version of the second dataset to produce data of the first version of the first dataset.
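
As a reading aid, the relationship between transaction entries, build catalog entries, branch names, and provenance edges in claims 1 and 5 can be sketched in Python. This is an illustrative sketch only, not the claimed implementation: the class and function names, the `tx-N` commit identifier format, the dataset names, and the branch names `master` and `experiment` are all hypothetical. The sketch also records a transaction entry for the second version of the first dataset, which the claims do not recite, purely so that every node has a commit identifier.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TransactionEntry:
    """Hypothetical database row: a commit identifier for one dataset version."""
    commit_id: str
    dataset: str


@dataclass(frozen=True)
class BuildCatalogEntry:
    """Hypothetical build record; dataset versions are identified by commit ids."""
    output_commit_id: str    # version of the derived (first) dataset
    input_commit_id: str     # version of the source (second) dataset
    branch: str              # branch name under which the build ran
    driver_version_id: str   # version of the driver program that ran


class Catalog:
    def __init__(self) -> None:
        self._next = count(1)
        self.transactions: Dict[str, TransactionEntry] = {}
        self.builds: List[BuildCatalogEntry] = []

    def commit(self, dataset: str) -> str:
        # Store a transaction entry and return its commit identifier.
        commit_id = f"tx-{next(self._next)}"
        self.transactions[commit_id] = TransactionEntry(commit_id, dataset)
        return commit_id

    def record_build(self, output_commit_id: str, input_commit_id: str,
                     branch: str, driver_version_id: str) -> BuildCatalogEntry:
        entry = BuildCatalogEntry(output_commit_id, input_commit_id, branch, driver_version_id)
        self.builds.append(entry)
        return entry


def provenance_edges(catalog: Catalog, entry: BuildCatalogEntry) -> List[Tuple[str, str]]:
    """Return directed edges (derived version -> source version) for one build."""
    out = catalog.transactions[entry.output_commit_id]
    src = catalog.transactions[entry.input_commit_id]
    return [(f"{out.dataset}@{out.commit_id}", f"{src.dataset}@{src.commit_id}")]


catalog = Catalog()
# First build: driver-v1 derives the first dataset from version 1 of the second dataset.
in_v1 = catalog.commit("second_dataset")
out_v1 = catalog.commit("first_dataset")
first = catalog.record_build(out_v1, in_v1, "master", "driver-v1")
# Second build: same driver version, new input version, different branch name.
in_v2 = catalog.commit("second_dataset")
out_v2 = catalog.commit("first_dataset")
second = catalog.record_build(out_v2, in_v2, "experiment", "driver-v1")

print(provenance_edges(catalog, first))
print(provenance_edges(catalog, second))
```

Each edge points from the derived dataset version to the dataset version it was built from, matching the direction recited in the claims (first node to second node). A graphical interface would render these node pairs instead of printing them.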

Assignees

Palantir Technologies Inc

Inventors

Classifications

  • Physics · mapped topic

  • G06F16/219 · Primary

    Managing data history or versioning (querying versioned data G06F16/2474; querying temporal data G06F16/2477) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10007674B2 cover?
A computer-implemented system and method for data revision control in large-scale data analytic systems. In one embodiment, for example, a computer-implemented method comprises the operations of storing a first version of a dataset that is derived by executing a first version of a driver program associated with the dataset; and storing a first build catalog entry comprising an identifier of the…
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F17/3023. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 26, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).