Automatic feature extraction from a relational database
US-2022035842-A1 · Feb 3, 2022 · US
US12561345B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12561345-B2 |
| Application number | US-202318485348-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 12, 2023 |
| Priority date | Oct 12, 2023 |
| Publication date | Feb 24, 2026 |
| Grant date | Feb 24, 2026 |
Official abstract text for this publication.
A computer-implemented method for generating an artificial data set is provided. Aspects include obtaining an input data set, calculating an association between the plurality of categorical variables of the input data set, and creating, based on the association, a plurality of clusters of categorical variables. Aspects also include identifying a key variable for each of the plurality of clusters of categorical variables, creating a key cluster for each of the plurality of clusters, and creating a cluster contingency table for each of the clusters. Aspects further include generating, based on the cluster contingency table for each of the plurality of clusters and for the key cluster, a data set for each of the plurality of clusters and the key cluster and generating the artificial data set based on a combination of the data set for each of the plurality of clusters and the key cluster.
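The association and key-variable steps can be illustrated with a minimal sketch. It assumes the association is the uncorrected Cramér's V (the measure named in claims 3 and 10) and picks the key variable by largest average association (claims 2 and 9). The function names `cramers_v` and `key_variable`, the column-dictionary layout, and the sample data are hypothetical and are not taken from the patent.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(xs)
    row, col, joint = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    k = min(len(row), len(col))
    if n == 0 or k < 2:
        return 0.0  # assumption: treat a constant (or empty) variable as unassociated
    chi2 = 0.0
    for a, row_count in row.items():
        for b, col_count in col.items():
            expected = row_count * col_count / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def key_variable(columns, cluster):
    """Return the variable in `cluster` with the largest average association to the others."""
    def average_association(var):
        others = [v for v in cluster if v != var]
        if not others:
            return 0.0
        return sum(cramers_v(columns[var], columns[o]) for o in others) / len(others)
    return max(cluster, key=average_association)


# Hypothetical data: "product" determines both "category" and "channel",
# while "category" and "channel" are independent of each other.
columns = {
    "product":  ["p0", "p1", "p2", "p3", "p0", "p1", "p2", "p3"],
    "category": ["lo", "lo", "hi", "hi", "lo", "lo", "hi", "hi"],
    "channel":  ["web", "app", "web", "app", "web", "app", "web", "app"],
}
print(key_variable(columns, ["category", "product", "channel"]))  # prints "product"
```

In this toy cluster, `product` has an average association of 1.0 with the other two variables, while `category` and `channel` each average 0.5, so `product` is selected. How the clusters themselves are formed from the association values is not shown here.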
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for generating an artificial data set with improved data privacy and reduced computational resource requirements, the method comprising:
obtaining an input data set having a plurality of entries, wherein each entry includes a plurality of categorical variables;
calculating, based on the input data set, an association between each of the plurality of categorical variables using a statistical correlation technique that identifies relationships between variables;
creating, based on the association, a plurality of clusters of categorical variables, wherein each of the plurality of clusters includes at least a number of the plurality of categorical variables to reduce dimensionality of the data processing;
dynamically subdividing one or more of the plurality of clusters, prior to generating a contingency table for the cluster, until a size of a contingency table for each resulting cluster is less than or equal to an available volatile memory of a computing system at the time of table creation;
identifying a key variable for each of the plurality of clusters of categorical variables from the at least the number of the plurality of categorical variables to serve as a representative variable that maintains statistical properties of the cluster;
creating a key cluster including the key variable for each of the plurality of clusters to enable efficient joining of data while preserving statistical relationships;
creating a cluster contingency table for each of the plurality of clusters and for the key cluster, wherein the cluster contingency tables include combinations of values of the categorical variables obtained from the input data set and corresponding frequency of each combination to transform raw data into a structured format that preserves statistical distributions;
generating, based on the cluster contingency table for each of the plurality of clusters and for the key cluster, a data set for each of the plurality of clusters and the key cluster that maintains statistical properties of the original data while removing direct connections to actual data entries; and
generating the artificial data set based on a combination of the data set for each of the plurality of clusters and the key cluster, wherein the artificial data set maintains statistical properties of the input data set while providing enhanced data privacy and enabling analysis without exposing original sensitive information.

2. The computer-implemented method of claim 1, wherein the key variable for a cluster of categorical variables is identified as the categorical variable of the cluster having a largest average association with other categorical variables of the cluster.

3. The computer-implemented method of claim 1, wherein the association between each of the plurality of categorical variables is calculated as a Cramér's V value.

4. The computer-implemented method of claim 1, wherein the cluster contingency table for each of the plurality of clusters includes combinations with corresponding frequency values above a threshold minimum.

5. The computer-implemented method of claim 4, wherein the threshold minimum is received from a user.

6. The computer-implemented method of claim 1, wherein generating the artificial data set includes joining the data set for each of the plurality of clusters using the data set of the key cluster as a join key.

7. The computer-implemented method of claim 1, further comprising receiving a requested number of records in the artificial data set, and wherein generating the data set for each of the plurality of clusters and the key cluster includes generating the requested number of records for each of the plurality of clusters and the key cluster.

8. A computer program product having one or more computer readable storage media having computer readable program code collectively stored on the one or more computer readable storage media, the computer readable program code being executed by a processor of a computer system to cause the computer system to perform operations for generating an artificial data set with improved data privacy and reduced computational resource requirements comprising:
obtaining an input data set having a plurality of entries, wherein each entry includes a plurality of categorical variables;
calculating, based on the input data set, an association between each of the plurality of categorical variables using a statistical correlation technique that identifies relationships between variables;
creating, based on the association, a plurality of clusters of categorical variables, wherein each of the plurality of clusters includes at least a number of the plurality of categorical variables to reduce dimensionality of the data processing;
dynamically subdividing one or more of the plurality of clusters, prior to generating a contingency table for the cluster, until a size of a contingency table for each resulting cluster is less than or equal to an available volatile memory of a computing system at the time of table creation;
identifying a key variable for each of the plurality of clusters of categorical variables from the at least the number of the plurality of categorical variables to serve as a representative variable that maintains statistical properties of the cluster;
creating a key cluster including the key variable for each of the plurality of clusters to enable efficient joining of data while preserving statistical relationships;
creating a cluster contingency table for each of the plurality of clusters and for the key cluster, wherein the cluster contingency tables include combinations of values of the categorical variables obtained from the input data set and corresponding frequency of each combination to transform raw data into a structured format that preserves statistical distributions;
generating, based on the cluster contingency table for each of the plurality of clusters and for the key cluster, a data set for each of the plurality of clusters and the key cluster that maintains statistical properties of the original data while removing direct connections to actual data entries; and
generating the artificial data set based on a combination of the data set for each of the plurality of clusters and the key cluster, wherein the artificial data set maintains statistical properties of the input data set while providing enhanced data privacy and enabling analysis without exposing original sensitive information.

9. The computer program product of claim 8, wherein the key variable for a cluster of categorical variables is identified as the categorical variable of the cluster having a largest average association with other categorical variables of the cluster.

10. The computer program product of claim 8, wherein the association between each of the plurality of categorical variables is calculated as a Cramér's V value.

11. The computer program product of claim 8, wherein the cluster contingency table for each of the plurality of clusters includes combinations with corresponding frequency values above a threshold minimum.

12. The computer program product of claim 11, wherein the threshold minimum is received from a user.

13. The computer program product of claim 8, wherein generating the artificial data set includes joining the data set for each of the plurality of clusters using the data set of the key cluster as a join key.

14. The computer program product of claim 8, wherein the operations further comprise receiving a requested number of records in the artificial data set, and wherein generating the data set for e
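
The contingency-table and generation steps recited in the claims can be sketched as follows. This is one possible reading, not the patent's implementation: it assumes each table is a frequency count of observed value combinations, that synthetic rows are drawn with probability proportional to those counts, and that the key cluster acts as a join key by conditioning each cluster's draw on the sampled key value. The names `contingency_table`, `sample_rows`, and `synthesize`, the row layout, and the sample data are hypothetical. The memory-driven subdivision of clusters is not modeled.

```python
import random
from collections import Counter


def contingency_table(rows, variables, threshold=0):
    """Count each observed combination; keep those with frequency above `threshold`."""
    counts = Counter(tuple(row[v] for v in variables) for row in rows)
    return {combo: c for combo, c in counts.items() if c > threshold}


def sample_rows(table, variables, n, rng):
    """Draw n combinations with probability proportional to their frequency.

    Raises IndexError if `table` is empty (for example, threshold set too high).
    """
    combos = list(table)
    weights = [table[c] for c in combos]
    picks = rng.choices(combos, weights=weights, k=n)
    return [dict(zip(variables, combo)) for combo in picks]


def synthesize(rows, clusters, keys, n, threshold=0, seed=0):
    """Assumed join strategy: sample key-cluster rows, then fill in each cluster
    by sampling among its combinations that agree with the sampled key value."""
    rng = random.Random(seed)
    key_table = contingency_table(rows, keys, threshold)
    cluster_tables = [contingency_table(rows, c, threshold) for c in clusters]
    records = []
    for key_row in sample_rows(key_table, keys, n, rng):
        record = dict(key_row)
        for variables, table, key in zip(clusters, cluster_tables, keys):
            pos = variables.index(key)
            matching = {c: f for c, f in table.items() if c[pos] == key_row[key]}
            if matching:  # may be empty if thresholding removed every match
                record.update(sample_rows(matching, variables, 1, rng)[0])
        records.append(record)
    return records


# Hypothetical input: two clusters whose key variables are "product" and "channel".
rows = [
    {"product": "p0", "category": "lo", "channel": "web", "tier": "gold"},
    {"product": "p0", "category": "lo", "channel": "app", "tier": "basic"},
    {"product": "p1", "category": "hi", "channel": "web", "tier": "gold"},
    {"product": "p1", "category": "hi", "channel": "web", "tier": "basic"},
]
clusters = [["product", "category"], ["channel", "tier"]]
keys = ["product", "channel"]
synthetic = synthesize(rows, clusters, keys, n=5)
```

Each synthetic record combines value combinations that were observed within a cluster, linked through key values that were observed together in the key cluster, so no record needs to be a copy of a single input entry. The `n` parameter corresponds to the requested number of records in claim 7, and `threshold` to the threshold minimum in claims 4 and 11.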
Join operations · CPC title
Entity relationship models · CPC title
Clustering or classification · CPC title