Lineage information management in data analytics

US9811573B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9811573-B1
Application number  US-201314039537-A
Country             US
Kind code           B1
Filing date         Sep 27, 2013
Priority date       Sep 27, 2013
Publication date    Nov 7, 2017
Grant date          Nov 7, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data analytics workload is obtained, wherein the data analytics workload includes one or more execution parameters and an input data set. An identifier specific to the data analytics workload is generated. The data analytics workload is at least partially executed based on the one or more execution parameters and the input data set to generate an output data set. Meta data associated with the output data set generated by execution of the data analytics workload is obtained, wherein the meta data includes lineage information corresponding to the output data set generated by execution of the input data set. The meta data is registered in a meta data store.
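The flow in the abstract (derive an identifier, execute, then register lineage meta data) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the function names `workload_id` and `run_and_register`, the use of SHA-256 over the input bytes plus canonical JSON of the parameters as the identifier, and a plain dictionary standing in for the meta data store.

```python
import hashlib
import json


def workload_id(input_data: bytes, params: dict) -> str:
    """Hypothetical identifier: one hash covering input data and parameters."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(input_data).digest())
    # Canonical JSON so that key order does not change the identifier.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    h.update(canonical.encode("utf-8"))
    return h.hexdigest()


def run_and_register(input_data, params, workload, meta_store):
    """Execute the workload, then register lineage meta data under its identifier."""
    wid = workload_id(input_data, params)
    output = workload(input_data, params)
    meta_store[wid] = {
        "lineage": {
            "input_sha256": hashlib.sha256(input_data).hexdigest(),
            "params": params,
        },
        "output_sha256": hashlib.sha256(output).hexdigest(),
    }
    return wid, output


# Usage: a toy "workload" that upper-cases its input.
store = {}
wid, out = run_and_register(
    b"a,b\n1,2\n",
    {"op": "upper", "location": "/data/in.csv"},  # illustrative parameters
    lambda data, p: data.upper(),
    store,
)
```

Because the identifier depends only on the input bytes and the parameters, resubmitting the same workload yields the same key, which is what makes the lookup-based functions in the claims possible.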

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising steps of:

    obtaining an input data set and a first data analytics workload, wherein the first data analytics workload comprises one or more execution parameters for executing the first data analytics workload based on the input data set, and wherein the execution parameters comprise a location of the input data set;

    obtaining an identifier specific to the first data analytics workload according to one or more predefined rules, wherein obtaining the identifier comprises creating at least one content address according to and uniquely identifying both of the input data set and the one or more execution parameters to generate the identifier;

    commencing execution of the first data analytics workload based on the one or more execution parameters and the input data set, wherein the first data analytics workload writes output data generated during the execution of the first data analytics workload;

    registering meta data associated with the output data generated during the execution of the first data analytics workload at a meta data store using said at least one identifier, wherein the meta data comprises lineage information for tracing and verifying at least one of creation, movement, use and alteration of the output data generated during execution of the first data analytics workload based at least in part on said identifier;

    storing the meta data associated with the output data temporarily in a temporary store prior to completing execution of the first data analytics workload, wherein the lineage information is stored with both parent meta data associated with a parent file and child meta data associated with one or more child data files associated with the parent file;

    merging the temporarily stored meta data from the temporary store into the meta data store upon completing execution of the first data analytics workload; and

    performing one or more functions utilizing the lineage information from the merged meta data, the one or more functions comprising one or more of a data provenance function, a data analytics work scheduling function, and a data de-duplication function;

    wherein performing the data provenance function comprises replaying the first data analytics workload by tracking a footprint derived from the lineage information to validate the output data generated during the execution of the first data analytics workload;

    wherein performing the data analytics workload scheduling function comprises querying the meta data server to locate the lineage information responsive to re-obtaining the first data analytics workload, and returning data from an enterprise data store without re-executing the first data analytics workload responsive to locating the lineage information;

    wherein performing the data de-duplication function comprises, responsive to obtaining the first data analytics workload and a second data analytics workload, comparing the first identifier with a second identifier specific to the second data analytics workload and, responsive to the first identifier matching the second identifier, registering the first and second data analytics workloads to point to common data; and

    wherein one or more of the above steps are performed via at least one processing device.

2. The method of claim 1, wherein the at least one processing device is part of a distributed computing platform.

3. The method of claim 2, wherein the distributed computing platform is a massively distributed computing platform.

4. An article of manufacture comprising a processor-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by the at least one processing device implement the steps of the method of claim 1.

5. The method of claim 1, wherein the lineage information is stored in a data analytics result hierarchy, wherein the data analytics hierarchy comprises first meta data for a first input data file and at least second meta data extending from the first meta data for at least a second input data file, the second input data file being part of the input data set of the data analytics workload.

6. The method of claim 5, further comprising updating the output data generated during execution of the data analytics workload responsive to identifying a change to the first input data file.

7. An apparatus, comprising: a memory; and a processor operatively coupled to the memory and configured to:

    obtain an input data set and a first data analytics workload, wherein the first data analytics workload comprises one or more execution parameters for executing the first data analytics workload based on the input data set, and wherein the execution parameters comprise a location of the input data set;

    obtain an identifier specific to the first data analytics workload according to one or more predefined rules, wherein, in obtaining the identifier, the processor is configured to create at least one content address according to and uniquely identifying both of the input data set and the one or more execution parameters to generate the identifier;

    commence execution of the first data analytics workload based on the one or more execution parameters and the input data set, wherein the data analytics workload writes output data generated during the execution of the first data analytics workload;

    register meta data associated with the output data generated during the execution of the first data analytics workload at a meta data store using said at least one identifier, wherein the meta data comprises lineage information for tracing and verifying at least one of creation, movement, use and alteration of the output data generated during execution of the first data analytics workload based at least in part on said identifier;

    store the meta data associated with the output data temporarily in a temporary store prior to completing execution of the first data analytics workload, wherein the lineage information is stored with both parent meta data associated with a parent file and child meta data associated with one or more child data files associated with the parent file;

    merge the temporarily stored meta data from the temporary store into the meta data store upon completing execution of the data analytics workload; and

    perform one or more functions utilizing the lineage information from the merged meta data, the one or more functions comprising one or more of a data provenance function, a data analytics work scheduling function, and a data de-duplication function;

    wherein, in performing the data provenance function, the processor is configured to replay the first data analytics workload by tracking a footprint derived from the lineage information to validate the output data generated during the execution of the first data analytics workload;

    wherein, in performing the data analytics workload scheduling function, the processor is configured to query the meta data server to locate the lineage information responsive to re-obtaining the first data analytics workload, and return data from an enterprise data store without re-executing the first data analytics workload responsive to locating the lineage information; and

    wherein, in performing the data de-duplication function, the processor is configured to, responsive to obtaining the first data analytics workload and a second data analytics workload, compare the first identifier with a second identifier specific to the second data analytics workload and, responsive to the first identifier matching the second identifier, register the first and second data analytics workloads to point to common data.

8. The apparatus of claim 7, wherein the processor is part of a distributed computing platform.
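The temporary-store, merge, and scheduling steps recited in claims 1 and 7 can be sketched roughly as follows. This is an illustrative reading only, not the patented implementation: the class `LineageCatalog`, the function `submit`, the field names, and the in-memory dictionaries standing in for the meta data store, temporary store, and enterprise data store are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineageCatalog:
    meta_store: dict = field(default_factory=dict)  # identifier -> merged meta data
    temp_store: dict = field(default_factory=dict)  # identifier -> in-flight meta data
    data_store: dict = field(default_factory=dict)  # data key -> output bytes

    def stage(self, wid, parent):
        # Meta data is held in the temporary store while the workload runs.
        self.temp_store[wid] = {"parent": parent, "children": [], "data_key": wid}

    def complete(self, wid, output):
        # On completion, merge the staged meta data into the meta data store.
        self.data_store[wid] = output
        self.meta_store[wid] = self.temp_store.pop(wid)

    def lookup(self, wid):
        meta = self.meta_store.get(wid)
        return None if meta is None else self.data_store[meta["data_key"]]


def submit(catalog, wid, parent, run):
    """Return (output, executed). Reuses stored output when lineage is found."""
    cached = catalog.lookup(wid)
    if cached is not None:
        return cached, False  # scheduling: skip re-execution
    catalog.stage(wid, parent)
    output = run()
    # Record a child file under the parent (hypothetical naming scheme).
    catalog.temp_store[wid]["children"].append(f"{wid}/part-0")
    catalog.complete(wid, output)
    return output, True
```

In this sketch, two workloads whose identifiers match resolve to the same `data_key`, so the second submission points at the data already stored rather than producing a duplicate copy, which loosely mirrors the de-duplication function.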

Assignees

  • EMC Corp

  • EMC IP Holding Co LLC

Inventors

Classifications

  • Physics · mapped topic

  • Visual data mining; Browsing structured data · CPC title

  • G06F16/25 · Primary

    Integrating or interfacing systems involving database management systems · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9811573B1 cover?
A data analytics workload is obtained, wherein the data analytics workload includes one or more execution parameters and an input data set. An identifier specific to the data analytics workload is generated. The data analytics workload is at least partially executed based on the one or more execution parameters and the input data set to generate an output data set. Meta data associated with the…
Who is the assignee on this patent?
EMC Corp, EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F17/30557. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 7, 2017 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).