Systems and methods for pruning data by sampling

US9600503B2 · US · B2

Patent metadata
Publication number: US-9600503-B2
Application number: US-201313951435-A
Country: US
Kind code: B2
Filing date: Jul 25, 2013
Priority date: Jul 25, 2013
Publication date: Mar 21, 2017
Grant date: Mar 21, 2017

Abstract

Techniques provided herein allow for management of data. In various embodiments, systems and methods prune and retain data being managed by a data management system, where the managed data can include log data aggregated from one or more servers for analysis purposes. According to some embodiments, pruning can be triggered according to one or more constraints, such as the age of managed data (e.g., retain only 30 days of managed data) or the memory space required to store the managed data (e.g., retain only 100 GB worth of managed data). The constraints that trigger data pruning can be based on a data retention policy. When triggered, pruning can be performed on a fraction of the managed data stored based on the data retention policy (e.g., 3 days of full managed data, 27 days of pruned managed data). The pruning may be performed by sampling, at a desired rate, the managed data.
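The trigger and the split between full and pruned data described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `RetentionPolicy` class, its field names, and the helper functions are hypothetical, and the default values simply reuse the abstract's examples (30 days, 100 GB, 3 days of full data).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    # Hypothetical fields; defaults mirror the examples in the abstract.
    max_age_days: int = 30          # "retain only 30 days of managed data"
    max_bytes: int = 100 * 10**9    # "retain only 100 GB worth of managed data"
    full_days: int = 3              # newest days kept in full (unsampled)


def constraint_exceeded(oldest: datetime, stored_bytes: int,
                        now: datetime, policy: RetentionPolicy) -> bool:
    """Return True if the age constraint or the space constraint is exceeded."""
    too_old = now - oldest > timedelta(days=policy.max_age_days)
    too_big = stored_bytes > policy.max_bytes
    return too_old or too_big


def is_prunable(ts: datetime, now: datetime, policy: RetentionPolicy) -> bool:
    """Return True if an element is older than the full-retention window.

    Elements inside the window are kept in full; older ones are candidates
    for pruning by sampling.
    """
    return now - ts > timedelta(days=policy.full_days)
```

With the default policy, a data set whose oldest element is 31 days old, or whose stored size is above 100 GB, would trigger pruning, and only elements older than 3 days would be candidates for sampling.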

First claim

What is claimed is:

1. A computer system comprising: at least one processor; and a memory storing instructions configured to instruct the at least one processor to perform: detecting when a constraint for storing a data set has been exceeded; identifying, based on the constraint, an initial data subset from the data set for each of a plurality of time periods, from which at least some data elements will be removed by sampling; determining a sampling rate for data element retention; identifying a secondary data subset from the initial data subset for each of the plurality of time periods, based on sampling the initial data subset according to the sampling rate, the sampling rate applied to the initial data subset for each of the plurality of time periods; and removing from the data set one or more data elements of the initial data subset for each of the plurality of time periods while retaining data elements of the secondary data subset for each of the plurality of time periods, wherein the sampling rate is uniform, and wherein the sampling rate is determined such that a representative portion of the data set is retained when the one or more data elements of the initial data subset for each of the plurality of time periods are removed from the data set.

2. The computer system of claim 1, wherein the data set comprises log data.

3. The computer system of claim 2, wherein the log data is associated with operation of a social networking system.

4. The computer system of claim 3, wherein the log data comprises one or more time-stamped data elements regarding user activity occurring on the social networking system.

5. The computer system of claim 1, wherein the constraint relates to age of data elements in the data set.

6. The computer system of claim 1, wherein the constraint relates to storage space occupied by data elements in the data set.

7. The computer system of claim 1, wherein the constraint is based on a data retention policy.

8. The computer system of claim 1, wherein the data set comprises data sampled from a larger data set.

9. The computer system of claim 1, wherein the initial data subset for each of the plurality of time periods is identified according to a data retention policy.

10. The computer system of claim 9, wherein the data retention policy prohibits removal of data elements from the data set that have been maintained for less than a threshold period of time.

11. The computer system of claim 1, wherein the sampling rate is defined by a ratio of data elements.

12. The computer system of claim 1, wherein the sampling rate is determined based on a type of data element included in the data set.

13. The computer system of claim 12, wherein the data set comprises event log data and the type of data element is based on an event type.

14. The computer system of claim 1, wherein the data set is a database table.

15. The computer system of claim 14, wherein the sampling rate is determined based on a table type associated with the database table.

16. The computer system of claim 1, wherein the instructions are further configured to instruct the at least one processor to perform: designating data of the secondary data subset as being data retained during a data removal process.

17. The computer system of claim 1, wherein the instructions are further configured to instruct the at least one processor to perform: associating the sampling rate with data of the secondary data subset.

18. The computer system of claim 1, wherein the data set is being stored in an in-memory database.

19. A non-transitory computer-storage medium storing computer-executable instructions that, when executed, cause a computer system to perform a computer-implemented method comprising: detecting when a constraint for storing a data set has been exceeded; identifying, based on the constraint, an initial data subset from the data set for each of a plurality of time periods, from which at least some data elements will be removed by sampling; determining a sampling rate for data element retention; identifying a secondary data subset from the initial data subset for each of the plurality of time periods, based on sampling the initial data subset according to the sampling rate, the sampling rate applied to the initial data subset for each of the plurality of time periods; and removing from the data set one or more data elements of the initial data subset for each of the plurality of time periods while retaining data elements of the secondary data subset for each of the plurality of time periods, wherein the sampling rate is uniform, and wherein the sampling rate is determined such that a representative portion of the data set is retained when the one or more data elements of the initial data subset for each of the plurality of time periods are removed from the data set.

20. A computer implemented method comprising: detecting, by a computer system, when a constraint for storing a data set has been exceeded; identifying, by the computer system, based on the constraint, an initial data subset from the data set for each of a plurality of time periods, from which at least some data elements will be removed by sampling; determining, by the computer system, a sampling rate for data element retention; identifying, by the computer system, a secondary data subset from the initial data subset for each of the plurality of time periods, based on sampling the initial data subset according to the sampling rate, the sampling rate applied to the initial data subset for each of the plurality of time periods; and removing, by the computer system, from the data set one or more data elements of the initial data subset for each of the plurality of time periods while retaining data elements of the secondary data subset for each of the plurality of time periods, wherein the sampling rate is uniform, and wherein the sampling rate is determined such that a representative portion of the data set is retained when the one or more data elements of the initial data subset for each of the plurality of time periods are removed from the data set.
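The steps recited in the independent claims (an initial subset per time period, one uniform sampling rate, a retained secondary subset, removal of the rest) might look roughly like the sketch below. It is an illustration under stated assumptions rather than the claimed implementation: the function name `prune_by_sampling`, the `(period, payload)` tuple layout, the dictionary record shape, and the use of seeded random sampling without replacement are all choices made for this example. Storing the rate on each retained element loosely mirrors claim 17.

```python
import random
from collections import defaultdict


def prune_by_sampling(elements, eligible_periods, rate, seed=0):
    """Prune a data set by sampling each eligible time period at one rate.

    elements: iterable of (period, payload) tuples (hypothetical layout).
    eligible_periods: periods whose elements form the initial data subsets.
    rate: fraction of each initial subset to retain, applied uniformly.
    Returns the retained elements, each tagged with the rate it was kept at.
    """
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    initial = defaultdict(list)
    kept = []
    for period, payload in elements:
        if period in eligible_periods:
            initial[period].append(payload)  # initial data subset
        else:
            # Not eligible for pruning: retained in full (rate 1.0).
            kept.append({"period": period, "payload": payload, "rate": 1.0})
    for period, subset in initial.items():
        k = round(len(subset) * rate)  # same rate for every period
        for payload in rng.sample(subset, k):  # secondary data subset
            kept.append({"period": period, "payload": payload, "rate": rate})
    return kept  # elements not returned here are the ones removed


# Usage: three periods of 100 elements each; the two older ones are pruned.
data = [(day, i) for day in ("day1", "day2", "day3") for i in range(100)]
retained = prune_by_sampling(data, {"day1", "day2"}, rate=0.1)
print(len(retained))  # 120: 100 from day3 plus 10 each from day1 and day2
```

Because the same rate is applied within every time period, each period stays proportionally represented in what remains, which is one way to read the claims' requirement that a representative portion of the data set be retained.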

Assignees

Facebook Inc

Inventors

Classifications

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Design, administration or maintenance of databases · CPC title

  • characterised by the use of retention policies (retention policies for HSM systems G06F16/185) · CPC title

  • Triggers; Constraints · CPC title

  • Physics · mapped topic

Frequently asked questions

Who is the assignee on this patent?
Facebook Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 21, 2017 (B2). Legal status and post-grant events are not shown on this page.