Generating an artificial data set

US12561345B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12561345-B2
Application number  US-202318485348-A
Country             US
Kind code           B2
Filing date         Oct 12, 2023
Priority date       Oct 12, 2023
Publication date    Feb 24, 2026
Grant date          Feb 24, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for generating an artificial data set is provided. Aspects include obtaining an input data set, calculating an association between the plurality of categorical variables of the input data set, and creating, based on the association, a plurality of clusters of categorical variables. Aspects also include identifying a key variable for each of the plurality of clusters of categorical variables, creating a key cluster for each of the plurality of clusters, and creating a cluster contingency table for each of the clusters. Aspects further include generating, based on the cluster contingency table for each of the plurality of clusters and for the key cluster, a data set for each of the plurality of clusters and the key cluster and generating the artificial data set based on a combination of the data set for each of the plurality of clusters and the key cluster.
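
The association and key-variable steps can be illustrated with a small sketch. The claims name Cramér's V as one association measure and describe the key variable as the one with the largest average association with the other variables in its cluster. The code below is a minimal, standard-library-only illustration under those two statements; the function names (`cramers_v`, `key_variable`), the column-dictionary layout, and the choice of the plain (non-bias-corrected) Cramér's V formula are assumptions made for this example, not details taken from the patent.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Plain (uncorrected) Cramér's V between two equal-length categorical sequences."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        # Assumption: treat a constant or empty column as having no association.
        return 0.0
    chi2 = 0.0
    for a, row_count in row_totals.items():
        for b, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def key_variable(columns, cluster):
    """Return the variable with the largest average association with the rest of the cluster."""
    if len(cluster) == 1:
        return cluster[0]

    def average_association(name):
        others = [other for other in cluster if other != name]
        total = sum(cramers_v(columns[name], columns[other]) for other in others)
        return total / len(others)

    return max(cluster, key=average_association)


# Hypothetical toy data: "segment" determines both "group" and "parity",
# while "group" and "parity" are independent of each other.
columns = {
    "group": ["g0", "g0", "g1", "g1"],
    "segment": ["s0", "s1", "s2", "s3"],
    "parity": ["even", "odd", "even", "odd"],
}
chosen = key_variable(columns, ["group", "segment", "parity"])
```

In this toy data `segment` is fully associated with both other variables, so it is the natural representative of the cluster. How the clusters themselves are formed from the pairwise associations is not sketched here.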

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for generating an artificial data set with improved data privacy and reduced computational resource requirements, the method comprising:
    obtaining an input data set having a plurality of entries, wherein each entry includes a plurality of categorical variables;
    calculating, based on the input data set, an association between each of the plurality of categorical variables using a statistical correlation technique that identifies relationships between variables;
    creating, based on the association, a plurality of clusters of categorical variables, wherein each of the plurality of clusters includes at least a number of the plurality of categorical variables to reduce dimensionality of the data processing;
    dynamically subdividing one or more of the plurality of clusters, prior to generating a contingency table for the cluster, until a size of a contingency table for each resulting cluster is less than or equal to an available volatile memory of a computing system at the time of table creation;
    identifying a key variable for each of the plurality of clusters of categorical variables from the at least the number of the plurality of categorical variables to serve as a representative variable that maintains statistical properties of the cluster;
    creating a key cluster including the key variable for each of the plurality of clusters to enable efficient joining of data while preserving statistical relationships;
    creating a cluster contingency table for each of the plurality of clusters and for the key cluster, wherein the cluster contingency tables include combinations of values of the categorical variables obtained from the input data set and corresponding frequency of each combination to transform raw data into a structured format that preserves statistical distributions;
    generating, based on the cluster contingency table for each of the plurality of clusters and for the key cluster, a data set for each of the plurality of clusters and the key cluster that maintains statistical properties of the original data while removing direct connections to actual data entries; and
    generating the artificial data set based on a combination of the data set for each of the plurality of clusters and the key cluster, wherein the artificial data set maintains statistical properties of the input data set while providing enhanced data privacy and enabling analysis without exposing original sensitive information.

2. The computer-implemented method of claim 1, wherein the key variable for a cluster of categorical variables is identified as the categorical variable of the cluster having a largest average association with other categorical variables of the cluster.

3. The computer-implemented method of claim 1, wherein the association between each of the plurality of categorical variables is calculated as a Cramér's V value.

4. The computer-implemented method of claim 1, wherein the cluster contingency table for each of the plurality of clusters includes combinations with corresponding frequency values above a threshold minimum.

5. The computer-implemented method of claim 4, wherein the threshold minimum is received from a user.

6. The computer-implemented method of claim 1, wherein generating the artificial data set includes joining the data set for each of the plurality of clusters using the data set of the key cluster as a join key.

7. The computer-implemented method of claim 1, further comprising receiving a requested number of records in the artificial data set, and wherein generating the data set for each of the plurality of clusters and the key cluster includes generating the requested number of records for each of the plurality of clusters and the key cluster.

8. A computer program product having one or more computer readable storage media having computer readable program code collectively stored on the one or more computer readable storage media, the computer readable program code being executed by a processor of a computer system to cause the computer system to perform operations for generating an artificial data set with improved data privacy and reduced computational resource requirements comprising:
    obtaining an input data set having a plurality of entries, wherein each entry includes a plurality of categorical variables;
    calculating, based on the input data set, an association between each of the plurality of categorical variables using a statistical correlation technique that identifies relationships between variables;
    creating, based on the association, a plurality of clusters of categorical variables, wherein each of the plurality of clusters includes at least a number of the plurality of categorical variables to reduce dimensionality of the data processing;
    dynamically subdividing one or more of the plurality of clusters, prior to generating a contingency table for the cluster, until a size of a contingency table for each resulting cluster is less than or equal to an available volatile memory of a computing system at the time of table creation;
    identifying a key variable for each of the plurality of clusters of categorical variables from the at least the number of the plurality of categorical variables to serve as a representative variable that maintains statistical properties of the cluster;
    creating a key cluster including the key variable for each of the plurality of clusters to enable efficient joining of data while preserving statistical relationships;
    creating a cluster contingency table for each of the plurality of clusters and for the key cluster, wherein the cluster contingency tables include combinations of values of the categorical variables obtained from the input data set and corresponding frequency of each combination to transform raw data into a structured format that preserves statistical distributions;
    generating, based on the cluster contingency table for each of the plurality of clusters and for the key cluster, a data set for each of the plurality of clusters and the key cluster that maintains statistical properties of the original data while removing direct connections to actual data entries; and
    generating the artificial data set based on a combination of the data set for each of the plurality of clusters and the key cluster, wherein the artificial data set maintains statistical properties of the input data set while providing enhanced data privacy and enabling analysis without exposing original sensitive information.

9. The computer program product of claim 8, wherein the key variable for a cluster of categorical variables is identified as the categorical variable of the cluster having a largest average association with other categorical variables of the cluster.

10. The computer program product of claim 8, wherein the association between each of the plurality of categorical variables is calculated as a Cramér's V value.

11. The computer program product of claim 8, wherein the cluster contingency table for each of the plurality of clusters includes combinations with corresponding frequency values above a threshold minimum.

12. The computer program product of claim 11, wherein the threshold minimum is received from a user.

13. The computer program product of claim 8, wherein generating the artificial data set includes joining the data set for each of the plurality of clusters using the data set of the key cluster as a join key.

14. The computer program product of claim 8, wherein the operations further comprise receiving a requested number of records in the artificial data set, and wherein generating the data set for e
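
The contingency-table, generation, and joining steps recited in the claims can be sketched roughly as follows. The claims do not spell out how the per-cluster data sets are joined on the key cluster, so this example makes an explicit assumption: it samples one key-cluster combination per artificial record and then, for each cluster, samples a combination that agrees with that record's key value. All names (`contingency_table`, `sample_combos`, `generate_artificial`), the row-dictionary layout, the inclusive `min_count` boundary, and the use of frequency-weighted sampling are hypothetical choices for illustration only.

```python
import random
from collections import Counter


def contingency_table(rows, variables, min_count=1):
    """Map each observed combination of `variables` to its frequency.

    Combinations with frequency below `min_count` are dropped (assumed
    inclusive boundary for the claimed threshold minimum).
    """
    counts = Counter(tuple(row[v] for v in variables) for row in rows)
    return {combo: c for combo, c in counts.items() if c >= min_count}


def sample_combos(table, n, rng):
    """Draw n combinations with probability proportional to their frequency."""
    combos = list(table)
    weights = [table[c] for c in combos]
    return rng.choices(combos, weights=weights, k=n)


def generate_artificial(rows, clusters, keys, n, seed=0):
    """Generate n artificial records; keys[i] is the key variable of clusters[i]."""
    rng = random.Random(seed)
    key_table = contingency_table(rows, keys)  # table for the key cluster
    cluster_tables = [contingency_table(rows, c) for c in clusters]
    records = []
    for key_combo in sample_combos(key_table, n, rng):
        record = {}
        for cluster, key, key_value, table in zip(clusters, keys, key_combo, cluster_tables):
            pos = cluster.index(key)
            # Assumed join rule: only combinations matching this key value are eligible.
            matching = {c: f for c, f in table.items() if c[pos] == key_value}
            record.update(zip(cluster, sample_combos(matching, 1, rng)[0]))
        records.append(record)
    return records
```

A call such as `generate_artificial(rows, [["region", "city"], ["plan", "device"]], ["region", "plan"], n=1000)` would return 1000 dictionaries, where `rows` is a list of dictionaries keyed by variable name. In this sketch every per-cluster combination in the output was observed in the input, while combinations across clusters are tied together only through the key variables. The memory-bounded subdivision of clusters described in the claims is not covered here.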

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12561345B2 cover?
A computer-implemented method for generating an artificial data set is provided. Aspects include obtaining an input data set, calculating an association between the plurality of categorical variables of the input data set, and creating, based on the association, a plurality of clusters of categorical variables. Aspects also include identifying a key variable for each of the plurality of cluster…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 24, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).