Data lineage tracking

US9659042B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9659042-B2
Application number: US-201213494449-A
Country: US
Kind code: B2
Filing date: Jun 12, 2012
Priority date: Jun 12, 2012
Publication date: May 23, 2017
Grant date: May 23, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data lineage tracking system may include a memory storing a module comprising machine readable instructions to obtain trace log entries representing an interaction with, a manipulation of, and/or a creation of a data value. The data lineage tracking system may further include machine readable instructions to select the trace log entries that are associated with commands performed by an application, cluster similar trace log entries from the selected trace log entries, and analyze mappings between the clustered trace log entries to determine data lineage flow associated with the data value.
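
As an illustration only, the select and cluster steps described in the abstract could be sketched in Python as follows. The `TraceLogEntry` fields, the regular expression, and the helper names are assumptions made for this example and do not come from the patent. The clustering key uses command type and table name, two of the attributes that claim 2 names as possible bases for clustering.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceLogEntry:
    timestamp: float   # assumed: seconds since an arbitrary epoch
    application: str   # assumed field: application that issued the command
    command: str       # raw command text, e.g. a SQL statement


# Deliberately naive, hypothetical pattern: command type plus the table it touches.
_PATTERN = re.compile(
    r"^\s*(INSERT\s+INTO|UPDATE|DELETE\s+FROM|SELECT\b.*?\bFROM)\s+(\w+)",
    re.IGNORECASE | re.DOTALL,
)


def cluster_key(entry):
    """Return (command type, table name) for an entry, or a fallback key."""
    match = _PATTERN.match(entry.command)
    if match is None:
        return ("UNKNOWN", "")
    command_type = match.group(1).split()[0].upper()
    return (command_type, match.group(2).lower())


def select_and_cluster(entries, application):
    """Keep entries issued by one application and group them by cluster key."""
    clusters = defaultdict(list)
    for entry in entries:
        if entry.application == application:
            clusters[cluster_key(entry)].append(entry)
    return dict(clusters)


entries = [
    TraceLogEntry(1.0, "billing", "INSERT INTO orders VALUES (1)"),
    TraceLogEntry(2.0, "billing", "insert into Orders values (2)"),
    TraceLogEntry(3.0, "billing", "UPDATE invoices SET total = 5"),
    TraceLogEntry(4.0, "crm", "INSERT INTO orders VALUES (3)"),
]
clusters = select_and_cluster(entries, "billing")
```

A real system would need a proper SQL parser and support for non-SQL sources; the regular expression here only keeps the sketch short.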

First claim

Opening claim text (preview).

What is claimed is:

1. A data lineage tracking system comprising:
    a processor; and
    a memory storing machine readable instructions that when executed by the processor cause the processor to:
    obtain trace log entries representing at least one of an interaction with a data value, a manipulation of the data value, and a creation of the data value;
    select, from the obtained trace log entries, trace log entries that are associated with commands performed by an application;
    cluster similar trace log entries from the selected trace log entries;
    measure variability of temporal differences between the trace log entries in cluster pairs by calculating entropy of the temporal differences to determine a consistency of the temporal differences, wherein the entropy represents a measure of uncertainty associated with the temporal differences, a relatively high entropy score represents a high variation in the temporal differences, and a relatively low entropy score represents a low variation in the temporal differences;
    map a command-timestamp pair, (s1, t1), for a cluster c1 to another command-timestamp pair, (s2, t2), for a cluster c2, when there does not exist a s1′ in cluster c1 such that |t1′ − t2| < |t1 − t2|, and there does not exist a s2′ in cluster c2 such that |t1′ − t2| < |t1 − t2|, wherein the s1 is a trace log entry command from the cluster c1 and the t1 is a timestamp for the trace log entry command s1, the s1′ is a trace log entry command from the cluster c1 and the t1′ is a timestamp for the trace log entry command s1′, the s2 is a trace log entry command from the cluster c2 and the t2 is a timestamp for the trace log entry command s2, and the s2′ is a trace log entry command from the cluster c2;
    analyze the mappings between the clustered trace log entries to determine data lineage flow associated with the data value by identifying each cluster of a plurality of clusters for which an entropy falls below a predetermined entropy threshold, wherein entropies below the predetermined entropy threshold represent a low entropy, and constructing a cluster chain including clusters with the low entropies to generate the data lineage flow;
    determine data value lineage by determining a first command associated with at least one of an interaction with, a manipulation of, and a creation of the data value, determining a second command associated with at least one of an interaction with and a manipulation of the data value, and linking the second command to the first command;
    determine, based on the data value lineage associated with the data value, whether the data value is authentic; and
    in response to a determination that the data value is authentic, generate, based on the data value, a report with respect to different systems associated with the data value and the application.

2. The data lineage tracking system of claim 1, wherein the similar trace log entries are clustered based on at least one of a command type, a table name, and an attribute name.

3. The data lineage tracking system of claim 1, wherein the machine readable instructions to determine the data value lineage further comprise machine readable instructions that when executed by the processor further cause the processor to: link the second command to the first command by setting a reference value for the second command to a unique identification (ID) for the first command.

4. The data lineage tracking system of claim 1, further comprising machine readable instructions that when executed by the processor further cause the processor to: determine a reason for a command of the commands based on an analysis of an asset, a resource and the application registered with the data lineage tracking system, wherein the reason for the command is based on a historical analysis of interactions with the asset, the resource and the application.

5. The data lineage tracking system of claim 1, further comprising machine readable instructions that when executed by the processor further cause the processor to: identify an anomaly in the data value lineage based on a determination of whether a change in the data value exceeds a predetermined percentage.

6. The data lineage tracking system of claim 1, further comprising machine readable instructions that when executed by the processor further cause the processor to: generate a graph illustrating the data lineage flow identifying at least one of an asset, a resource and the application that have interacted with the data value.

7. The data lineage tracking system of claim 1, further comprising machine readable instructions that when executed by the processor further cause the processor to: receive calls from data sources, wherein the calls include structured query language (SQL) queries and NoSQL inserts and updates.

8. The data lineage tracking system of claim 1, further comprising machine readable instructions that when executed by the processor further cause the processor to: poll data sources for structured query language (SQL) queries and NoSQL inserts and updates.

9. A data lineage tracking system comprising:
    a processor; and
    a memory storing machine readable instructions that when executed by the processor cause the processor to:
    obtain trace log entries representing at least one of an interaction with a data value, a manipulation of the data value, and a creation of the data value;
    select, from the obtained trace log entries, trace log entries that are associated with commands performed by an application;
    cluster similar trace log entries from the selected trace log entries;
    measure variability of temporal differences between the trace log entries in cluster pairs by calculating entropy of the temporal differences to determine a consistency of the temporal differences, wherein the entropy represents a measure of uncertainty associated with the temporal differences, a relatively high entropy score represents a high variation in the temporal differences, and a relatively low entropy score represents a low variation in the temporal differences;
    map a command-timestamp pair, (s1, t1), for a cluster c1 to another command-timestamp pair, (s2, t2), for a cluster c2, when there does not exist a s1′ in cluster c1 such that |t1′ − t2| < |t1 − t2|, and there does not exist a s2′ in cluster c2 such that |t1′ − t2| < |t1 − t2|, wherein the s1 is a trace log entry command from the cluster c1 and the t1 is a timestamp for the trace log entry command s1, the s1′ is a trace log entry command from the cluster c1 and the t1′ is a timestamp for the trace log entry command s1′, the s2 is a trace log entry command from the cluster c2 and the t2 is a timestamp for the trace log entry command s2, and the s2′ is a trace log entry command from the cluster c2;
    analyze the mappings between the clustered trace log entries to determine data lineage flow associated with the data value by identifying each cluster of a plurality of clusters for which an entropy falls below a predetermined entropy threshold, wherein entropies below the predetermined entropy threshold represent a low entropy, and constructing a cluster chain including clusters with the low entropies to generate the data lineage flow;
    determine data value lineage by determining a first command associated with at least one of an interaction with, a manipulation of, and a creation of the data value, determining a second command associated with at least one of an interaction with and a manipulation of the data value, and linking the second command to the first command by setting a reference value for the second command to a unique identification (ID) for the
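
The mapping and entropy steps recited in the independent claims can be sketched roughly as below. This is a reading of the claim language under stated assumptions, not the patented implementation: the second mapping condition is interpreted here as "no other timestamp in c2 is strictly closer to t1", the temporal differences are binned with a hypothetical `bin_width` before Shannon entropy is computed, and the cluster names and `threshold` value are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_nearest_pairs(times_c1, times_c2):
    """Pair t1 with t2 when neither cluster holds a strictly closer timestamp."""
    pairs = []
    if not times_c2:
        return pairs
    for t1 in times_c1:
        # Closest timestamp in c2, so no t2' is strictly closer to t1.
        t2 = min(times_c2, key=lambda t: abs(t1 - t))
        gap = abs(t1 - t2)
        # Keep the pair only if no other t1' in c1 is strictly closer to t2.
        if all(abs(other - t2) >= gap for other in times_c1):
            pairs.append((t1, t2))
    return pairs


def temporal_difference_entropy(pairs, bin_width=1.0):
    """Shannon entropy (in bits) of the binned differences t2 - t1."""
    if not pairs:
        raise ValueError("entropy is undefined without mapped pairs")
    bins = Counter(math.floor((t2 - t1) / bin_width) for t1, t2 in pairs)
    total = len(pairs)
    return sum(-(n / total) * math.log2(n / total) for n in bins.values())


def low_entropy_links(clusters, threshold, bin_width=1.0):
    """Return cluster-name pairs whose temporal differences are consistent."""
    links = []
    for (name_a, times_a), (name_b, times_b) in combinations(clusters.items(), 2):
        pairs = mutual_nearest_pairs(times_a, times_b)
        if pairs and temporal_difference_entropy(pairs, bin_width) < threshold:
            links.append((name_a, name_b))
    return links


# Hypothetical timestamps (seconds) for three clusters of trace log entries.
clusters = {
    "insert_orders": [0.0, 10.0, 20.0],
    "update_invoices": [2.0, 12.0, 22.0],
    "unrelated": [1.0, 15.0, 29.0],
}
links = low_entropy_links(clusters, threshold=0.5)
```

In this toy data, `update_invoices` always follows `insert_orders` by two seconds, so that pair has zero entropy and is the only link returned. The low-entropy links could then be joined into a cluster chain to approximate the data lineage flow.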

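Claims 3 and 5 above describe linking a second command to a first command through a reference value, and flagging an anomaly when a data value changes by more than a predetermined percentage. A minimal sketch of both ideas follows. The `CommandRecord` class, its field names, and the 25 percent threshold are hypothetical choices for illustration, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandRecord:
    record_id: str                   # unique ID for this command (assumed field)
    command: str                     # command text or label
    value: float                     # data value after the command ran
    reference: Optional[str] = None  # ID of the earlier command, once linked


def link(first, second):
    """Link the second command to the first by storing the first command's ID."""
    second.reference = first.record_id


def trace_back(record, by_id):
    """Follow reference values back to the command that created the value."""
    chain = [record]
    seen = {record.record_id}
    while chain[-1].reference is not None:
        parent = by_id[chain[-1].reference]
        if parent.record_id in seen:
            raise ValueError("cycle in lineage references")
        seen.add(parent.record_id)
        chain.append(parent)
    return chain


def exceeds_change_threshold(previous, current, max_change_pct):
    """Return True when the relative change is larger than max_change_pct."""
    if previous == 0:
        return current != 0
    return abs(current - previous) / abs(previous) * 100.0 > max_change_pct


created = CommandRecord("cmd-1", "INSERT", 100.0)
updated = CommandRecord("cmd-2", "UPDATE", 150.0)
link(created, updated)
by_id = {r.record_id: r for r in (created, updated)}
lineage = [r.record_id for r in trace_back(updated, by_id)]
suspicious = exceeds_change_threshold(created.value, updated.value, 25.0)
```

Storing only the predecessor's ID keeps each record small, and the full lineage is recovered by walking the references backward.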
Assignees

Inventors

Classifications

  • G06F16/219 (Primary)

    Managing data history or versioning (querying versioned data G06F16/2474; querying temporal data G06F16/2477) · CPC title

  • Clustering; Classification · CPC title

  • Change logging, detection, and notification (replication G06F16/27) · CPC title

  • Physics · mapped topic


Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9659042B2 cover?
A data lineage tracking system may include a memory storing a module comprising machine readable instructions to obtain trace log entries representing an interaction with, a manipulation of, and/or a creation of a data value. The data lineage tracking system may further include machine readable instructions to select the trace log entries that are associated with commands performed by an applic…
Who is the assignee on this patent?
Puri Colin A, Kim Doo Soon, Yeh Peter Z, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06F16/219. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 23, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).