Automated database schema matching
US-2020081899-A1 · Mar 12, 2020 · US
US11574186B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11574186-B2 |
| Application number | US-201916669894-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 31, 2019 |
| Priority date | Oct 31, 2019 |
| Publication date | Feb 7, 2023 |
| Grant date | Feb 7, 2023 |
Official abstract text for this publication.
Computer systems, methods and program products for automating pseudonymization of personal identifying information (PII) using machine learning, metadata, and crowdsourcing patterns to identify and replace PII. Machine learning models are trained for classifying known column names or key names for processing, using metadata. Column or key names are classified to be unprocessed, anonymized or pseudonymized by a pseudonymizer without revealing PII or scrubbing data into a useless format. A library of crowdsourced patterns are utilized for matching PII to data values within column or key names and PII is mapped to replacement methods. Feedback from user annotations retrains the algorithms to improve classification accuracy and Deep Learning algorithms automate the identification of PII using regular expression generation to concisely articulate how pseudonymizers search for PII patterns within a data set. PII replacement is mapped consistently across entire data packages and the crowdsourced pattern library is updated with generated regular expressions.
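The first stage described in the abstract, classifying column or key names from metadata as unprocessed, anonymized or pseudonymized, can be illustrated with a minimal sketch. It assumes a simple keyword lookup in place of the trained machine learning model, and the names `classify_column`, `build_config`, the hint lists, and the metadata layout are all hypothetical; none of them come from the patent.

```python
import json

# Hypothetical keyword lists standing in for a trained classification model.
PSEUDONYMIZE_HINTS = ("email", "name", "phone", "ssn")
ANONYMIZE_HINTS = ("password", "token", "secret")


def classify_column(column_name: str) -> str:
    """Return 'anonymized', 'pseudonymized', or 'unprocessed' for a column name."""
    lowered = column_name.lower()
    if any(hint in lowered for hint in ANONYMIZE_HINTS):
        return "anonymized"
    if any(hint in lowered for hint in PSEUDONYMIZE_HINTS):
        return "pseudonymized"
    return "unprocessed"


def build_config(metadata: dict) -> dict:
    """Extract column names from metadata and map each one to a classification."""
    columns = metadata.get("columns", [])
    return {"columns": {name: classify_column(name) for name in columns}}


# Assumed metadata shape: a table name plus a list of column names.
metadata = {
    "table": "customers",
    "columns": ["customer_email", "api_token", "order_total"],
}
config = build_config(metadata)

# The configuration file is represented here as JSON text.
print(json.dumps(config, indent=2))
```

Only the metadata is inspected in this sketch, so no data values are read while deciding how each column should be handled. A real classifier would be retrained from user feedback rather than relying on fixed keywords.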
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method comprising: analyzing, by a processor, metadata of a data package comprising personal identifying information; extracting, by the processor, column names or key names from the metadata; mapping, by the processor, the column names or the key names, to a classification indicating whether data values associated with the column names or the key names are configured to be unprocessed, anonymized or pseudonymized during processing of the data values; outputting, by the processor, a configuration file instructing a pseudonymizer to pseudonymize the data values associated with the column names or the key names classified to be pseudonymized; parsing, by the processor, the data values associated with the column names or the key names configured to be pseudonymized; matching, by the processor, a personal identifying information pattern stored by a pattern library to personal identifying information stored within the data values configured to be pseudonymized; mapping, by the processor, a replacement method to the personal identifying information stored in the data values using a classification model and the personal identifying information pattern as a reference to map the replacement method; and updating, by the processor, the configuration file to include the personal identifying information pattern for recognizing the personal identifying information associated with the column names or the key names and the replacement method for pseudonymizing the data values identified as the personal identifying information consistently across all datasets of the data package.

2. The computer-implemented method of claim 1, further comprising: receiving, by the processor, user feedback amending the classification of one or more column name or key name; retraining, by the processor, a classification model as a function of the user feedback; and outputting, by the processor, a revised configuration file adopting amendments to the classification of the one or more column name or key name.

3. The computer-implemented method of claim 1, wherein the personal identifying information pattern of the pattern library for matching the data values to personal identifying information is a regular expression.

4. The computer-implemented method of claim 1, wherein the pattern library is generated by crowdsourcing user-submitted patterns for identifying one or more types of personal identifying information from a plurality of users.

5. The computer-implemented method of claim 1, further comprising: receiving, by the processor, user annotations modifying the personal identifying information identified by the configuration file; loading, by the processor, user-annotated personal identifying information into a plurality of regular expression generators; generating, by the processor, a regular expression by each of the regular expression generators, wherein at least two different regular expressions are outputted by the regular expression generators that capture user-annotated personal identifying information from the data package; generating, by the processor, a new regular expression using an ensemble of deep learning models based on the at least two different regular expressions, wherein said new regular expression captures the user-annotated personal identifying information stored within the column names or the key names of the data package of the at least two different regular expressions using a single regular expression; updating, by the processor, the configuration file to further comprise the new regular expression; and pseudonymizing, by the processor, the data values of each column name and key name consistently across the data package, in accordance with instructions provided by the configuration file.

6. The computer-implemented method of claim 5, further comprising: modifying, by the processor, the pattern library by adding the new regular expression associated with the user-annotated personal identifying information.

7. A computer system comprising: a processor; and a computer-readable storage media coupled to the processor, wherein the computer-readable storage media contains program instructions executing a computer-implemented method comprising: analyzing, by the processor, metadata of a data package comprising personal identifying information; extracting, by the processor, column names or key names from the metadata; mapping, by the processor, the column names or the key names, to a classification indicating whether data values associated with the column names or the key names are configured to be unprocessed, anonymized or pseudonymized during processing of the data values; outputting, by the processor, a configuration file instructing a pseudonymizer to pseudonymize the data values associated with the column names or the key names classified to be pseudonymized; parsing, by the processor, the data values associated with the column names or the key names configured to be pseudonymized; matching, by the processor, a personal identifying information pattern stored by a pattern library to personal identifying information stored within the data values configured to be pseudonymized; mapping, by the processor, a replacement method to the personal identifying information stored in the data values using a classification model and the personal identifying information pattern as a reference to map the replacement method; and updating, by the processor, the configuration file to include the personal identifying information pattern for recognizing the personal identifying information associated with the column names or the key names and the replacement method for pseudonymizing the data values identified as the personal identifying information consistently across all datasets of the data package.

8. The computer system of claim 7, further comprising: receiving, by the processor, user feedback amending the classification of one or more column name or key name; retraining, by the processor, a classification model as a function of the user feedback; and outputting, by the processor, a revised configuration file adopting amendments to the classification of the one or more column name or key name.

9. The computer system of claim 7, wherein the personal identifying information pattern of the pattern library for matching the data values to personal identifying information is a regular expression.

10. The computer system of claim 7, wherein the pattern library is generated by crowdsourcing user-submitted patterns for identifying one or more types of personal identifying information from a plurality of users.

11. The computer system of claim 7, further comprising: receiving, by the processor, user annotations modifying the personal identifying information identified by the configuration file; loading, by the processor, user-annotated personal identifying information into a plurality of regular expression generators; generating, by the processor, a regular expression by each of the regular expression generators, wherein at least two different regular expressions are outputted by the regular expression generators that capture user-annotated personal identifying information from the data package; generating, by the processor, a new regular expression using an ensemble of deep learning models based on the at least two different regular expressions, wherein said new regular expression captures the user-annotated personal identifying information stored within the column names or the key names of the data package of the at least two different regular expressions using a single regular expression; updating, by the processor, the configuration
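
The claim preview is cut off at this point in the source. As an illustration of the matching and replacement steps recited in claims 1 and 3, the following is a minimal sketch, assuming regular expressions as the pattern library and a keyed hash as the replacement method. The claims do not specify any particular replacement method, so the HMAC approach, the two patterns, the key, and the names `PATTERN_LIBRARY`, `pseudonym`, and `pseudonymize_text` are all hypothetical.

```python
import hashlib
import hmac
import re

# Hypothetical pattern library: PII type -> regular expression.
PATTERN_LIBRARY = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Illustrative key only; a real deployment would manage this secret securely.
SECRET_KEY = b"example-key-not-for-production"


def pseudonym(pii_type: str, value: str) -> str:
    """Derive a stable replacement token: the same input always maps to the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<{pii_type}:{digest[:12]}>"


def pseudonymize_text(text: str) -> str:
    """Replace every match of every library pattern with its pseudonym."""
    for pii_type, pattern in PATTERN_LIBRARY.items():
        text = pattern.sub(lambda m, t=pii_type: pseudonym(t, m.group(0)), text)
    return text


# Two values from different datasets of one data package (assumed sample data).
ticket = pseudonymize_text("Contact ann@example.com or 555-123-4567.")
log_line = pseudonymize_text("Login failure for ann@example.com")
print(ticket)
print(log_line)
```

Because the token is derived from the matched value rather than generated randomly, the same email address receives the same pseudonym in both datasets. This is one way to obtain the consistency "across all datasets of the data package" that claim 1 requires.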
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Supervised learning · CPC title
Parsing · CPC title
based on feedback from supervisors · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title