Anonymization for data having a relational part and sequential part

US9230132B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9230132-B2
Application number: US-201314132945-A
Country: US
Kind code: B2
Filing date: Dec 18, 2013
Priority date: Dec 18, 2013
Publication date: Jan 5, 2016
Grant date: Jan 5, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system, method and computer program product for anonymizing data. Datasets anonymized according to the method have a relational part having multiple tables of relational data, and a sequential part having tables of time-ordered data. The sequential part may include data representing a “sequences-of-sequences”. A “sequence-of-sequences” is a sequence which, itself, consists of a number of sequences. Each of these kinds of data may be anonymized using k-anonymization techniques and offers privacy protection to individuals or entities from attackers whose knowledge spans the two (or more) kinds of attribute data.
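
As a reading aid, the two kinds of data named in the abstract can be pictured with a small data model. This is a minimal sketch, not the patent's own representation: the class name `EntityRecord`, its field names, and the patient example are all hypothetical. It only shows one entity carrying a relational part (one value per attribute) together with a sequence whose elements are themselves sequences.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

Value = Union[int, float, str]
# One element of the sequential part: itself an ordered sequence of values.
Element = Tuple[Value, ...]


@dataclass
class EntityRecord:
    """Hypothetical intermediate representation of one entity."""
    direct_id: str                # direct identifier, to be masked or suppressed
    relational: Dict[str, Value]  # relational part: one value per attribute
    sequence: List[Element] = field(default_factory=list)  # time-ordered elements


def sequence_length(record: EntityRecord) -> int:
    """Number of elements in the entity's sequence."""
    return len(record.sequence)


# Illustrative data only: a patient with two time-ordered visits.
patient = EntityRecord(
    direct_id="P-001",
    relational={"age": 34, "gender": "F"},
    sequence=[
        ("2013-01-04", "diagnosis-a", "drug-x"),
        ("2013-02-11", "diagnosis-b", "drug-y"),
    ],
)
```

The element count returned by `sequence_length` is one of the things an attacker may know according to claim 12, which is why the sequential part has to be anonymized alongside the relational attributes.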

First claim

Opening claim text (preview).

What is claimed is:

1. A method of anonymizing data comprising: receiving at a hardware processor, input comprising a dataset having both a relational data part and a sequential data part, the sequential part is data representing a sequence-of-sequences in which a sequence comprises elements that are sequences; identifying from said dataset direct identifier attributes corresponding to entities; masking or suppressing attribute values corresponding to said identified direct identifier attributes; ranking records based on a similarity with respect to a defined cost function F; selecting and iteratively anonymizing each set of at least k first records as ranked using the defined cost function F, each set of at least k records comprising a group, said anonymizing attribute values along both the relational part and the sequential part, wherein k is a specified k-anonymization parameter; and repeating said selecting and iteratively anonymizing each successive set of at least k records of successive groups said anonymizing attribute values along both the relational part and the sequential part of records therein to generate anonymized table representations of said dataset resulting from said anonymization, and outputting said anonymized table representations to an output device, said anonymized table representations guaranteeing no attacker can re-identify the direct identifier attributes of any entity in the dataset with a certain probability.

2. The method as claimed in claim 1, wherein said dataset comprises: data from a set of relational data tables, one or more said relational data tables having one or more multiple time-ordered records for an entity forming the sequential part, each record having a set of attributes; and at least two tables having one record per entity forming the relational part, each record having a further set of attributes, said method further generating, by said hardware processor, an intermediate representation of the dataset having both the relational data part and the sequential data part.

3. The method as claimed in claim 2, further comprising: determining if any further records for an entity remain after said repeating said anonymization; and if so, for each remaining record, determining a relevant anonymous group for the remaining record, and assigning the remaining record to the most relevant anonymous group.

4. The method as claimed in claim 3, further comprising: determining, for each record of the group: whether an attribute A is a numerical attribute value; and applying an aggregate function f_A to numerical attribute values of the records in said group; replacing a corresponding value of function f_A for an original value in the records.

5. The method as claimed in claim 4, wherein the applying an aggregate function f_A to the numerical attribute values includes one or more of: computing a mean of ages of some or all k individuals in said group; or computing a mean or a randomized-average of recorded date or time stamp events of some or all k individuals in said group.

6. The method as claimed in claim 3, further comprising: determining, for each record of the group: whether an attribute A is a categorical attribute; and applying an aggregate function f_A to categorical attribute values of the records in said group; replacing a corresponding value of function f_A for the original value in the records.

7. The method as claimed in claim 6, wherein the applying an aggregate function f_A to said categorical attribute values includes one or more of: creating a new categorical value which does not belong to a domain of the attribute A; and replacing the original value of attribute A with the new created value for the records that belong to said group.

8. The method as claimed in claim 1, wherein said anonymizing attribute values along both the relational part and the sequential part generates anonymized attribute values, said method further comprising: generating a mapping table, said mapping table mapping original attribute values of a dataset table with said anonymized attribute values.

9. The method as claimed in claim 8, further comprising: storing the resulting anonymized tables of the data set in their original form, along with their corresponding mapping tables.

10. The method as claimed in claim 8, wherein said ranking comprises: quantifying, using a first cost function, a similarity between two records; and optionally using a second cost function for quantifying the similarity of two elements/sequences.

11. The method as claimed in claim 1, further comprising: identifying from the dataset quasi-identifier attributes of said entities; and, masking or suppressing attribute values of said quasi-identifier attributes.

12. The method as claimed in claim 1, wherein said anonymized table representations protect data in the dataset from attackers who know one or more of: values of all explicit identifiers of an individual or entity; values of all quasi-identifying relational attributes of the individual or entity; a sequence of the individual or entity; and the number of elements for an individual record.

13. A system for anonymizing data comprising: a memory; a hardware processor coupled to the memory for receiving instructions configuring said hardware processor to perform a method comprising: receiving an input comprising a dataset having both a relational data part and a sequential data part, the sequential part is data representing a sequence-of-sequences in which a sequence comprises elements that are sequences; identifying from said dataset direct identifier attribute values corresponding to entities; masking or suppressing attribute values corresponding to said identified direct identifier attributes; ranking records based on a similarity with respect to a defined cost function F; selecting and iteratively anonymizing each set of at least k first records as ranked using the defined cost function F, each set of at least k records comprising a group, said anonymizing attribute values along both the relational part and the sequential part, wherein k is a specified k-anonymization parameter; repeating said selecting and iteratively anonymizing each successive set of at least k records of successive groups said anonymizing attribute values along both the relational part and the sequential part of records therein to generate anonymized table representations of said dataset resulting from said anonymization, and outputting said anonymized table representations to an output device, said anonymized table representations guaranteeing no attacker can re-identify the direct identifier attributes of any entity in the dataset with a certain probability.

14. The system as claimed in claim 13, wherein said dataset comprises: data from a set of relational data tables, one or more relational data tables having one or more multiple time-ordered records for an entity forming the sequential part, each record having a set of attributes; and at least two tables having one record per entity forming the relational part, each record having a further set of attributes, the method further comprising: generating, by said hardware processor, an intermediate representation of the dataset having both the relational data part and the sequential data part.

15. The system as claimed in claim 14, said hardware processor configured to further perform: determining if any further records for an entity remain after said repeating said anonymization; and if so, for each remaining record, determining a relevant anonymous group for the remaining record, an
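
The grouping and aggregation steps recited in claims 1 and 3 to 8 can be illustrated with a short sketch. It makes several assumptions that the claim text does not fix: the cost function `cost` (sum of numeric gaps plus categorical mismatches), the attribute names, the greedy seed-based grouping, and the brace-joined label used as the new categorical value are all hypothetical. It also covers only flat relational attributes, so it is a simplification of the claimed method, which anonymizes the sequential part as well.

```python
from statistics import mean
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]
NUMERIC = ["age"]        # hypothetical numerical attributes
CATEGORICAL = ["city"]   # hypothetical categorical attributes
DIRECT_IDS = ["name"]    # hypothetical direct identifiers


def cost(a: Record, b: Record) -> float:
    """Assumed cost function F: smaller means more similar."""
    numeric_gap = sum(abs(a[n] - b[n]) for n in NUMERIC)
    categorical_gap = sum(a[c] != b[c] for c in CATEGORICAL)
    return float(numeric_gap + categorical_gap)


def make_groups(records: List[Record], k: int) -> List[List[int]]:
    """Greedily form groups of at least k record indices."""
    if k < 1 or len(records) < k:
        raise ValueError("k must be >= 1 and at most the number of records")
    remaining = list(range(len(records)))
    groups: List[List[int]] = []
    while len(remaining) >= k:
        seed = remaining[0]
        # Rank the remaining records by similarity to the seed, keep the first k.
        ranked = sorted(remaining, key=lambda i: cost(records[seed], records[i]))
        group = ranked[:k]
        groups.append(group)
        remaining = [i for i in remaining if i not in group]
    # Leftover records join the closest existing group (compare claim 3).
    for i in remaining:
        closest = min(groups, key=lambda g: min(cost(records[i], records[j]) for j in g))
        closest.append(i)
    return groups


def anonymize(records: List[Record], k: int) -> Tuple[List[Record], List[Tuple[int, str, Any, Any]]]:
    """Return anonymized copies plus a mapping table of replaced values."""
    out = [dict(r) for r in records]
    mapping: List[Tuple[int, str, Any, Any]] = []
    for r in out:
        for d in DIRECT_IDS:
            r[d] = "*"  # mask direct identifiers
    for group in make_groups(records, k):
        for n in NUMERIC:
            avg = mean([records[i][n] for i in group])  # aggregate: group mean
            for i in group:
                mapping.append((i, n, records[i][n], avg))
                out[i][n] = avg
        for c in CATEGORICAL:
            # New value outside the attribute's original domain.
            label = "{" + "|".join(sorted({str(records[i][c]) for i in group})) + "}"
            for i in group:
                mapping.append((i, c, records[i][c], label))
                out[i][c] = label
    return out, mapping


# Illustrative data only.
records = [
    {"name": "r0", "age": 30, "city": "A"},
    {"name": "r1", "age": 31, "city": "A"},
    {"name": "r2", "age": 50, "city": "B"},
    {"name": "r3", "age": 52, "city": "B"},
    {"name": "r4", "age": 33, "city": "A"},
]
anonymized, mapping = anonymize(records, k=2)
```

With `k=2`, every record in the output shares its `age` and `city` values with at least one other record, and the `mapping` list plays the role of the mapping table of claim 8 by pairing each original value with its replacement.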

Assignees

Inventors

Classifications

  • where protection concerns the structure of data, e.g. records, types, queries

  • by anonymising data, e.g. decorrelating personal data from the owner's identification

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9230132B2 cover?
A system, method and computer program product for anonymizing data. Datasets anonymized according to the method have a relational part having multiple tables of relational data, and a sequential part having tables of time-ordered data. The sequential part may include data representing a “sequences-of-sequences”. A “sequence-of-sequences” is a sequence which, itself, consists of a number of sequ…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F21/6227. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 5, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).