Propagated deletion of database records and derived data

US10956406B2 · US · B2

Patent metadata
Publication number: US-10956406-B2
Application number: US-201815990338-A
Country: US
Kind code: B2
Filing date: May 25, 2018
Priority date: Jun 12, 2017
Publication date: Mar 23, 2021
Grant date: Mar 23, 2021

Abstract

Official abstract text for this publication.

Using a distributed database system that manages a plurality of different raw datasets and a plurality of derived datasets that have been derived from the raw datasets based on a plurality of derivation relationships that link the raw datasets to the derived datasets, a subset of records that are candidates for propagated deletion of specified data values is determined. One or more particular raw datasets that contain the subset of records are determined. The specified data values are deleted from the particular raw datasets. Based on the plurality of derivation relationships and the particular raw datasets, one or more particular derived datasets that have been derived from the particular raw datasets are identified. A build of the one or more particular derived datasets is generated and executed, resulting in the creation and storage of one or more particular derived datasets without the specified data values that were deleted from the particular raw datasets.
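The steps in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: `DerivedDataset`, `propagate_deletion`, and the sample datasets are hypothetical names invented for this example, plain Python lists stand in for the distributed database, and whole records are removed rather than individual data values. Historical builds are simply dropped, which corresponds to the variant in claim 2.

```python
from dataclasses import dataclass, field
from typing import Callable

Record = dict[str, str]


@dataclass
class DerivedDataset:
    name: str
    sources: list[str]  # names of the raw datasets this one is derived from
    transform: Callable[[list[list[Record]]], list[Record]]
    builds: list[list[Record]] = field(default_factory=list)  # oldest first

    def build(self, raw: dict[str, list[Record]]) -> list[Record]:
        """Run the transform over the current source data and store the result."""
        result = self.transform([raw[s] for s in self.sources])
        self.builds.append(result)
        return result


def propagate_deletion(
    raw: dict[str, list[Record]],
    derived: list[DerivedDataset],
    is_candidate: Callable[[Record], bool],
) -> list[str]:
    """Delete candidate records from raw data, then rebuild affected derived data."""
    # Determine which raw datasets contain at least one candidate record.
    touched = {
        name for name, rows in raw.items() if any(is_candidate(r) for r in rows)
    }
    # Delete the candidates from those raw datasets.
    for name in touched:
        raw[name] = [r for r in raw[name] if not is_candidate(r)]
    # Identify derived datasets via their derivation relationships, drop their
    # historical builds, and execute a fresh build from the cleaned sources.
    rebuilt = []
    for dataset in derived:
        if touched.intersection(dataset.sources):
            dataset.builds.clear()
            dataset.build(raw)
            rebuilt.append(dataset.name)
    return rebuilt


raw = {
    "customers": [
        {"id": "1", "email": "a@example.com"},
        {"id": "2", "email": "b@example.com"},
    ],
    "orders": [{"order": "10", "item": "book"}],
}
emails = DerivedDataset(
    "customer_emails", ["customers"],
    lambda srcs: [{"email": r["email"]} for r in srcs[0]],
)
items = DerivedDataset(
    "order_items", ["orders"],
    lambda srcs: [{"item": r["item"]} for r in srcs[0]],
)
for dataset in (emails, items):
    dataset.build(raw)

rebuilt = propagate_deletion(
    raw, [emails, items], lambda r: r.get("email") == "a@example.com"
)
```

In this sketch only `customer_emails` is rebuilt, because `order_items` has no derivation relationship to the raw dataset that contained the deleted record.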

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: using a distributed database system that is programmed to manage a plurality of different raw datasets and a plurality of derived resilient distributed datasets that have been derived from the plurality of different raw datasets based on a plurality of derivation relationships that link the plurality of different raw datasets to the plurality of derived resilient distributed datasets; determining one or more particular raw datasets of the plurality of different raw datasets that contain a subset of records that are candidates for propagated deletion of specified data values; deleting the specified data values from the one or more particular raw datasets; based on one or more of the plurality of derivation relationships, identifying one or more particular derived resilient distributed datasets, of the plurality of derived resilient distributed datasets, that have been derived from the one or more particular raw datasets; wherein each particular derived resilient distributed dataset of the one or more particular derived resilient distributed datasets is a read-only partitioned collection of records in the distributed database system; generating and executing a particular build of the one or more particular derived resilient distributed datasets from the one or more particular raw datasets from which the specified data values are deleted to result in creating and storing one or more new particular derived resilient distributed datasets without the specified data values that were deleted from the one or more particular raw datasets; and deleting the specified data values from one or more historical builds of the one or more particular derived resilient distributed datasets from the one or more particular raw datasets that were built prior to the particular build; wherein the method is performed using one or more processors.

2. The method of claim 1, further comprising deleting one or more historical builds of the same derived resilient distributed datasets for which a build was generated and executed.

3. The method of claim 2, further comprising: determining the one or more particular raw datasets that contain the subset of records by accessing metadata in the distributed database system to determine one or more raw datasets in which a copy or trace of the subset of records resides; as part of the deleting the one or more historical builds, also deleting any metadata in the distributed database system that contain data traces from the subset of records that are candidates for propagated deletion of the specified data values.

4. The method of claim 1, wherein the distributed database system has a property of immutability.

5. The method of claim 1, wherein the steps of identifying, generating, and executing comprise traversing a directed graph, which is represented in metadata stored in the distributed database system, in which nodes represent raw datasets or derived resilient distributed datasets and in which links represent derivation relationships of the nodes.

6. The method of claim 1, wherein the steps of identifying, generating, and executing comprise inspecting metadata stored in the plurality of derived resilient distributed datasets that specifies which of the plurality of different raw datasets were sources for the plurality of derived resilient distributed datasets.

7. The method of claim 1, wherein the steps of the method are executed via programmatic cooperation of an SQL interface, a core resilient distributed dataset processor, one or more worker processes and the distributed database system.

8. The method of claim 1, wherein at least one of the plurality of derived resilient distributed datasets is managed in a POSTGRES system.

9. The method of claim 1, further comprising receiving a deletion request that specifies parameters of data to be deleted via a programmatic call, input at a host computing device, from a cron job or from a script that executes according to a schedule.

10. A computer system comprising: one or more processors; one or more storage media; one or more sequences of instructions stored in the one or more storage media which, when executed by the one or more processors, cause performance of: using a distributed database system that is programmed to manage a plurality of different raw datasets and a plurality of derived resilient distributed datasets that have been derived from the plurality of different raw datasets based on a plurality of derivation relationships that link the plurality of different raw datasets to the plurality of derived resilient distributed datasets; determining one or more particular raw datasets of the plurality of different raw datasets that contain a subset of records that are candidates for propagated deletion of specified data values; deleting the specified data values from the one or more particular raw datasets; based on one or more of the plurality of derivation relationships, identifying one or more particular derived resilient distributed datasets, of the plurality of derived resilient distributed datasets, that have been derived from the one or more particular raw datasets; wherein each particular derived resilient distributed dataset of the one or more particular derived resilient distributed datasets is a read-only partitioned collection of records in the distributed database system; generating and executing a particular build of the one or more particular derived resilient distributed datasets from the one or more particular raw datasets from which the specified data values are deleted to result in creating and storing one or more new particular derived resilient distributed datasets without the specified data values that were deleted from the one or more particular raw datasets; and deleting the specified data values from one or more historical builds of the one or more particular derived resilient distributed datasets from the one or more particular raw datasets that were built prior to the particular build.

11. The system of claim 10, further comprising sequences of instructions which when executed cause deleting one or more historical builds of the same derived resilient distributed datasets for which a build was generated and executed.

12. The system of claim 11, further comprising sequences of instructions which when executed cause: determining the one or more particular raw datasets that contain the subset of records by accessing metadata in the distributed database system to determine one or more raw datasets in which a copy or trace of the subset of records resides; as part of the deleting the one or more historical builds, also deleting any metadata in the distributed database system that contain data traces from the subset of records that are candidates for propagated deletion of the specified data values.

13. The system of claim 10, wherein the distributed database system has a property of immutability.

14. The system of claim 10, wherein the sequences of instructions which cause identifying, generating, and executing further comprise sequences of instructions which when executed cause traversing a directed graph, which is represented in metadata stored in the distributed database system, in which nodes represent raw datasets or derived resilient distributed datasets and in which links represent derivation relationships of the nodes.

15. The system of claim 10, wherein the sequences of instructions which cause identifying, generating, and executing comprise sequences of instructions which when executed cause inspecting metadata stored in the pluralit
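Claims 5 and 14 describe traversing a directed graph whose nodes are datasets and whose links are derivation relationships. One way this could look is sketched below, assuming the graph is available as a plain adjacency mapping. The function name `affected_derived`, the `edges` layout, and the dataset names are hypothetical, and the ordering step uses the standard-library `graphlib.TopologicalSorter` as a stand-in for whatever scheduling the actual system performs.

```python
from collections import deque
from graphlib import TopologicalSorter


def affected_derived(edges: dict[str, list[str]], changed_raw: set[str]) -> list[str]:
    """Return every dataset downstream of the changed raw datasets.

    `edges` maps a dataset name to the names of datasets derived directly
    from it. The result is ordered so that each dataset appears after the
    affected datasets it is derived from, which is a safe rebuild order.
    """
    # Breadth-first traversal collects all transitively derived datasets.
    reachable: set[str] = set()
    queue = deque(changed_raw)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in reachable:
                reachable.add(child)
                queue.append(child)

    # Build a predecessor map restricted to the affected datasets, then
    # sort it topologically so parents are rebuilt before their children.
    predecessors: dict[str, set[str]] = {name: set() for name in reachable}
    for parent, children in edges.items():
        for child in children:
            if parent in reachable and child in reachable:
                predecessors[child].add(parent)
    return list(TopologicalSorter(predecessors).static_order())


# Hypothetical derivation graph: two raw datasets feeding three derived ones.
edges = {
    "raw_a": ["d1"],
    "raw_b": ["d2"],
    "d1": ["d3"],
    "d2": ["d3"],
    "d3": [],
}
order = affected_derived(edges, {"raw_a"})
```

Here a deletion in `raw_a` marks `d1` and `d3` for rebuild, in that order, while `d2` is left alone because nothing upstream of it changed.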

Assignees

Palantir Technologies Inc

Inventors

Classifications

  • G06F16/254 (Primary)

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

  • Bulk updating operations (data conversion details G06F16/258)

  • Query languages

  • Ensuring data consistency and integrity

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10956406B2 cover?
Using a distributed database system that manages a plurality of different raw datasets and a plurality of derived datasets that have been derived from the raw datasets based on a plurality of derivation relationships that link the raw datasets to the derived datasets, a subset of records that are candidates for propagated deletion of specified data values is determined. One or more particular r…
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/254. Mapped technology areas include Physics.
When was this patent published?
Publication date: March 23, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).