Efficient statistical techniques for detecting sensitive data

US11599667B1 · US · B1

Patent metadata
Field: Value
Publication number: US-11599667-B1
Application number: US-202016990809-A
Country: US
Kind code: B1
Filing date: Aug 11, 2020
Priority date: Aug 11, 2020
Publication date: Mar 7, 2023
Grant date: Mar 7, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A candidate attribute combination of a first data set is identified, such that the candidate attribute combination meets a data type similarity criterion with respect to a collection of data types of sensitive information for which the first data set is to be analyzed. A collection of input features is generated for a machine learning model from the candidate attribute combination, including at least one feature indicative of a statistical relationship between the values of the candidate attribute combination and a second data set. An indication of a predicted probability of a presence of sensitive information in the first data set is obtained using the machine learning model.
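As an illustration of how a candidate attribute combination might be screened, here is a minimal sketch. It assumes the sensitive data types are latitude and longitude in decimal degrees, and it uses the valid-range ratio recited in claim 4 as the screening test. The function names, the coordinate ranges, and the 0.95 threshold are assumptions made for this example; the patent text shown on this page does not specify them.

```python
from itertools import combinations

# Assumed valid ranges for decimal-degree coordinates (not given in the source).
LAT_RANGE = (-90.0, 90.0)
LON_RANGE = (-180.0, 180.0)


def valid_range_ratio(values, valid_range):
    """Ratio of in-range values to non-empty values (the ratio described in claim 4)."""
    low, high = valid_range
    non_empty = [v for v in values if v is not None]
    if not non_empty:
        return 0.0
    in_range = sum(1 for v in non_empty if low <= v <= high)
    return in_range / len(non_empty)


def candidate_pairs(sample, threshold=0.95):
    """Return (lat_attr, lon_attr) pairs whose values look like coordinates.

    `sample` maps attribute names to lists of values from sampled records.
    The threshold is a hypothetical cut-off chosen for illustration.
    """
    # Data type check: keep only attributes whose values are numeric or empty.
    numeric = {
        name: col
        for name, col in sample.items()
        if all(v is None or isinstance(v, (int, float)) for v in col)
    }
    pairs = []
    for a, b in combinations(numeric, 2):
        # Try both orderings, since either attribute could hold the latitude.
        for lat, lon in ((a, b), (b, a)):
            if (valid_range_ratio(numeric[lat], LAT_RANGE) >= threshold
                    and valid_range_ratio(numeric[lon], LON_RANGE) >= threshold):
                pairs.append((lat, lon))
    return pairs


sample = {
    "x": [47.6, 40.7, None, 34.0],
    "y": [-122.3, -74.0, -118.2, None],
    "order_total": [1200.0, 350.5, 99.0, 15000.0],
    "name": ["a", "b", "c", "d"],
}
pairs = candidate_pairs(sample)
```

In this sketch a pair whose values all lie within [-90, 90] is reported in both orderings. A real system would presumably narrow such cases with further filters, such as the semantic filtration criteria mentioned in the claims.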

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: one or more computing devices; wherein the one or more computing devices include instructions that upon execution on or across the one or more computing devices cause the one or more computing devices to: obtain a first data set indicating human population density as a function of geographical location; identify a plurality of structured data objects to be analyzed for a presence of geographical location details pertaining to individuals, wherein the geographical location details are expressed using a plurality of numeric data types, wherein individual ones of the structured data objects comprise a plurality of records, and wherein individual ones of plurality of records comprise values of a plurality of attributes; select a sample of records from a particular structured data object of the plurality of structured data objects; identify one or more candidate attribute combinations from the plurality of attributes of the records of the sample, wherein (a) individual ones of the candidate attribute combinations meet a data type similarity criterion with respect to the plurality of numeric data types and (b) attribute values of individual ones of the candidate attribute combinations satisfy one or semantic filtration criteria associated with geographical location details; generate, corresponding to individual ones of the one or more candidate attribute combinations and the first data set, a collection of input features for a classification model, including at least one feature indicative of a statistical relationship between human population density and attribute values of the candidate attribute combinations; and transmit an indication of a probability of a presence of geographical location details in the particular structured data object, wherein the probability is obtained from the classification model using at least the collection of input features.

2. The system as recited in claim 1, wherein the one or more computing devices include further instructions that upon execution on or across the one or more computing devices further cause the one or more computing devices to: obfuscate, by applying one or more transformation operations, raw values of one or more attributes of the plurality of attributes, such that obfuscated versions of the raw values are used to generate the collection of input features.

3. The system as recited in claim 1, wherein the one or more computing devices include further instructions that upon execution on or across the one or more computing devices further cause the one or more computing devices to: obtain, via one or more programmatic interfaces, a request for sensitive data presence analysis of one or more data stores, including a data store at which the plurality of structured data objects is stored, wherein the plurality of structured data objects is identified in response to the request for sensitive data presence analysis.

4. The system as recited in claim 1, wherein to identify one or more candidate attribute combinations from the plurality of attributes, the one or more computing devices include further instructions that upon execution on or across the one or more computing devices further cause the one or more computing devices to: compute a ratio of (a) a number of records of the sample whose attribute values for a particular attribute lie within a valid range of values corresponding to a particular representation of a geographic location and (b) a number of records of the sample whose attribute values for the particular attribute are non-empty.

5. The system as recited in claim 1, wherein to identify one or more candidate attribute combinations from the plurality of attributes, the one or more computing devices include further instructions that upon execution on or across the one or more computing devices further cause the one or more computing devices to: compute a similarity metric associated with respective values of a first attribute of the plurality of attributes and a second attribute of the plurality of attributes.

6. A computer-implemented method, comprising: obtaining a first data set indicating a distribution of one or more properties of a group of entities with respect to which targeted information presence analysis is to be performed; identifying one or more candidate attribute combinations from a plurality of attributes of records of a second data set, wherein individual ones of the candidate attribute combinations meet a data type similarity criterion with respect to a collection of data types of targeted information of entities of the group of entities; generating, corresponding to individual ones of the one or more candidate attribute combinations and the first data set, a collection of input features for a machine learning model, including at least one feature indicative of a statistical relationship between the distribution of the one or more properties and attribute values of an individual candidate attribute combination; and obtaining an indication of a predicted probability of a presence of targeted information in the second data set, wherein the predicted probability is obtained from the machine learning model using at least the collection of input features.

7. The computer-implemented method as recited in claim 6, wherein the one or more properties of the group of entities comprise a population density as a function of geographical location.

8. The computer-implemented method as recited in claim 6, further comprising: obtaining an indication, via one or more programmatic interfaces, that a data store is to be analyzed for presence of targeted information; and selecting a subset of the data store in response to obtaining the indication, wherein the subset comprises the second data set.

9. The computer-implemented method as recited in claim 6, further comprising: determining at least a portion of a topology of a data store comprising the second data set; and selecting the second data set from the data store based at least in part on the topology.

10. The computer-implemented method as recited in claim 6, wherein identifying the one or more candidate attribute combinations from the plurality of attributes of records comprises: applying one or more semantic filters to values of individual attributes of the plurality of attributes, wherein the one or more semantic filters are defined based at least in part on characteristics of the targeted information.

11. The computer-implemented method as recited in claim 6, wherein the first data set comprises a plurality of data points, and wherein generating the collection of input features comprises: assigning individual data points of the plurality of data points to respective cells of a grid; and assigning respective values of a particular candidate attribute combination to respective cells of the grid.

12. The computer-implemented method as recited in claim 11, wherein generating the collection of input features comprises: identifying a group of cells of the grid for which counts of assigned values of the particular candidate attribute combination exceed zero; and determining a metric of correlation between (a) the counts of assigned values of the particular candidate attribute combination of individual cells of the group of cells and (b) values obtained from the data points of the first data set which were assigned to individual cells of the group of cells.

13. The computer-implemented method as recited in claim 6, wherein generating the collection of input features comprises: applying one or more smoothing functions to values of a particular candidate attribute combination,
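
The grid-based feature described in claims 11 and 12 can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the one-degree cell size, the choice of Pearson correlation as the "metric of correlation", the summing of reference densities per cell, and all function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def cell_of(lat, lon, cell_deg=1.0):
    """Map a coordinate to a (row, col) grid cell that is cell_deg degrees wide."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def density_correlation(reference_points, candidate_values, cell_deg=1.0):
    """Correlate per-cell candidate counts with per-cell reference density.

    reference_points: iterable of (lat, lon, density) from the first data set.
    candidate_values: iterable of (lat, lon) taken from a candidate
    attribute combination.
    """
    # Assign reference data points to grid cells (assumption: sum per cell).
    density = defaultdict(float)
    for lat, lon, value in reference_points:
        density[cell_of(lat, lon, cell_deg)] += value

    # Assign candidate values to grid cells and count them.
    counts = Counter(cell_of(lat, lon, cell_deg) for lat, lon in candidate_values)

    # Keep only cells whose candidate count exceeds zero, as in claim 12.
    cells = [cell for cell, n in counts.items() if n > 0]
    if len(cells) < 2:
        return 0.0
    xs = [counts[cell] for cell in cells]
    ys = [density.get(cell, 0.0) for cell in cells]
    return pearson(xs, ys)


reference = [(47.5, -122.5, 300.0), (40.5, -74.5, 1000.0), (34.5, -118.5, 500.0)]
candidates = ([(47.1, -122.9)] * 3) + ([(40.7, -74.2)] * 10) + ([(34.1, -118.2)] * 5)
r = density_correlation(reference, candidates)
```

A correlation near 1 suggests the candidate values cluster where people live, which is the kind of statistical relationship the claims describe feeding to the model as an input feature. Values near 0 would be expected for numeric columns that merely happen to fall in coordinate ranges.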

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Protecting personal data, e.g. for financial or medical purposes · CPC title

  • G06N20/10 · Primary

    using kernel methods, e.g. support vector machines [SVM] · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Approximate or statistical queries · CPC title

  • Geographical information databases · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11599667B1 cover?
A candidate attribute combination of a first data set is identified, such that the candidate attribute combination meets a data type similarity criterion with respect to a collection of data types of sensitive information for which the first data set is to be analyzed. A collection of input features is generated for a machine learning model from the candidate attribute combination, including at…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F21/6245. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 7, 2023 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).