System and method for building predictive model for synthesizing data

US12014253B2 · US · B2

Patent metadata
Publication number: US-12014253-B2
Application number: US-202217967147-A
Country: US
Kind code: B2
Filing date: Oct 17, 2022
Priority date: Dec 12, 2013
Publication date: Jun 18, 2024
Grant date: Jun 18, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for constructing sets of synthetic data. A single data record is identified from a first set of data. The first set of data comprises a first plurality of data records, each of the data records including multiple items of data describing an entity. Using pattern recognition, the single data record is processed to identify a group of records from within the first set that have corresponding characteristics equivalent to the single data record. The identified group of records comprises a target set of variables and the group of records from the first set that are not identified comprises a control set of variables. The target set of variables and the control set of variables are processed, using probability estimation and optimization constraints, to determine a score for each of the records in the first set. The score describes how similar each of the records in the first set is to the single data record. The records associated with a percentage of the highest scores are identified. The data associated with the single data record is replaced with data associated with the identified records identified, item-by-item.
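
The score, select, and replace steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract names probability estimation and optimization constraints for scoring, while this sketch swaps in a simple range-normalized distance as a stand-in. The function names (`similarity`, `synthesize`), the `top_fraction` parameter, numeric-only fields, and the choice to exclude the seed record from the donor pool are all assumptions made for illustration.

```python
import random


def similarity(record, seed, ranges):
    """Stand-in score (assumption): 1 / (1 + range-normalized L1 distance).

    The patent describes scoring via probability estimation and optimization
    constraints; this simpler measure only illustrates the data flow.
    """
    dist = sum(abs(record[k] - seed[k]) / ranges[k] for k in seed)
    return 1.0 / (1.0 + dist)


def synthesize(seed, records, top_fraction=0.25, rng=None):
    """Replace each item of `seed` with a value drawn from its most similar records."""
    rng = rng or random.Random(0)
    # Per-field value range, so fields on different scales are comparable.
    ranges = {
        k: (max(r[k] for r in records) - min(r[k] for r in records)) or 1.0
        for k in seed
    }
    # Assumption: the seed itself is not allowed to donate values.
    donors = [r for r in records if r is not seed]
    ranked = sorted(donors, key=lambda r: similarity(r, seed, ranges), reverse=True)
    # Keep the records associated with a percentage of the highest scores.
    keep = max(1, int(len(ranked) * top_fraction))
    top = ranked[:keep]
    # Item-by-item replacement: each field is drawn independently.
    return {k: rng.choice(top)[k] for k in seed}


records = [
    {"age": 40, "income": 50000},
    {"age": 41, "income": 52000},
    {"age": 39, "income": 49000},
    {"age": 70, "income": 200000},
    {"age": 20, "income": 10000},
]
synthetic = synthesize(records[0], records, top_fraction=0.5)
```

Drawing each field independently from the top-scoring records means the synthetic record need not equal any single source record, which is the intuition behind the item-by-item replacement. Whether that preserves the statistical characteristics the claims require depends on the scoring and constraints, which this sketch does not model.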

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method performed by one or more processors, the computer-implemented method comprising: identifying a single data record from a first set of data records, each of the data records of the first set of data records including fields to store variables, respectively, describing an entity, at least one of the variables being associated with personal information; using pattern recognition, processing the single data record and identifying a group of records from within the first set of data records that have a target set of variables corresponding to the variables in the single data record, wherein a second group of the data records from the first set of data records that are not identified gave a control set of variables that are different than the variables of the single data record; determining scores for the data records of the first set of data records based on the variables of the data records of the first set of data records, the target set of variables, and the control set of variables, the scores corresponding to comparisons of the data records of the first set of data records, respectively, to the single data record; identifying ones of the data records having scores that are greater than a threshold; and replacing data in the single data record that is representative of the personal information with data associated with one or more of the ones of the data records having scores that are greater than the threshold under constraints of (a) maintaining one or more statistical characteristics of the fields and (b) removing the personal information; and training a predictive model using the ones of the data records having scores that are greater than the threshold, wherein, once trained, the predictive model generates a synthetic dataset that describes an original dataset without a possibility of matching an entry of the synthetic dataset back to the original dataset.

2. The computer-implemented method of claim 1 wherein determining the scores includes determining the scores using probability estimation.

3. The computer-implemented method of claim 1 wherein determining the scores further includes determining the scores using optimization constraints.

4. The computer-implemented method of claim 1 wherein identifying the ones of the data records having scores that are greater than a threshold includes identifying a percentage of the data records having a predetermined percentage of the highest scores that are greater than the threshold.

5. The computer-implemented method of claim 1 wherein the variables include at least age, gender, income, credit limit information.

6. The computer-implemented method of claim 1 wherein the synthetic dataset satisfies predetermined statistical characteristics relative to the original dataset.

7. The computer-implemented method of claim 1 wherein the original dataset includes data regarding financial information of users and the synthetic dataset includes data regarding insurance information for users.

8. The computer-implemented method of claim 1 further comprising obtaining the first set of data records from a second set of data records that includes a greater number of data records than the first set of data records.

9. The computer-implemented method of claim 8 wherein obtaining the first set of data records from the second set of data records includes removing ones of the data records of the second set of data records that include a variable that is different than a mean of values of the variable of the second set of data records.

10. The computer-implemented method of claim 8 wherein obtaining the first set of data records from the second set of data records includes removing ones of the data records of the second set of data records that include a variable that is at least a predetermined number of standard deviations from a mean of the values of the variable of the second set of data records.

11. A system comprising: one or more processors; and memory including instructions that, when executed by the one or more processors, perform to: identify a single data record from a first set of data records, each of the data records of the first set of data records including fields to store variables, respectively, describing an entity, at least one of the variables being associated with personal information; using pattern recognition, process the single data record and identify a group of records from within the first set of data records that have a target set of variables corresponding to the variables in the single data record, wherein a second group of the data records from the first set of data records that are not identified gave a control set of variables that are different than the variables of the single data record; determine scores for the data records of the first set of data records based on the variables of the data records of the first set of data records, the target set of variables, and the control set of variables, the scores corresponding to comparisons of the data records of the first set of data records, respectively, to the single data record; identify ones of the data records having scores that are greater than a threshold; and replace data in the single data record that is representative of the personal information with data associated with one or more of the ones of the data records having scores that are greater than the threshold under constraints of (a) maintaining one or more statistical characteristics of the fields and (b) removing the personal information; and train a predictive model using the ones of the data records having scores that are greater than the threshold, wherein, once trained, the predictive model generates a synthetic dataset that describes an original dataset without a possibility of matching an entry of the synthetic dataset back to the original dataset.

12. The system of claim 11 wherein the instructions include instructions that, when executed by the one or more processors, perform to determine the scores includes determining the scores using probability estimation and optimization constraints.

13. The system of claim 11 wherein the instructions include instructions that, when executed by the one or more processors, perform to identify the ones of the data records having scores that are greater than a threshold by identifying a percentage of the data records having a predetermined percentage of the highest scores that are greater than the threshold.

14. The system of claim 11 wherein the synthetic dataset satisfies predetermined statistical characteristics relative to the original dataset.

15. The system of claim 11 wherein the original dataset includes data regarding financial information of users and the synthetic dataset includes data regarding insurance information for users.

16. The system of claim 11 wherein the instructions further include instructions that, when executed by the one or more processors, perform to obtain the first set of data records from a second set of data records that includes a greater number of data records than the first set of data records.

17. The system of claim 16 wherein the instructions include instructions that, when executed by the one or more processors, perform to obtain the first set of data records from the second set of data records by removing ones of the data records of the second set of data records that include a variable that is different than a mean of the values of the variable of the second set of data records.

18. The system of claim 16 wherein the instructi
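
The pre-filtering step in claims 8 and 10 (deriving the first set of records from a larger second set by removing records whose variable lies at least a predetermined number of standard deviations from the mean) might look roughly like the Python sketch below. The function name `drop_outliers`, the `max_sigmas` parameter, and the use of the population standard deviation are assumptions; the claims do not specify which estimator is used or how multiple variables are combined.

```python
from statistics import mean, pstdev


def drop_outliers(records, field, max_sigmas=2.0):
    """Return the records whose `field` lies within `max_sigmas` standard
    deviations of the mean (hypothetical helper, single variable only)."""
    values = [r[field] for r in records]
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical, so nothing counts as an outlier.
        return list(records)
    # Claim 10 removes records that are *at least* the given number of
    # standard deviations from the mean, so the keep test is strict.
    return [r for r in records if abs(r[field] - mu) < max_sigmas * sigma]


second_set = [{"income": v} for v in (50, 52, 49, 51, 48, 500)]
first_set = drop_outliers(second_set, "income", max_sigmas=2.0)
```

Here the record with income 500 is dropped and the remaining five records form the smaller first set.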

Assignees

Cigna Intellectual Property Inc

Inventors

Classifications

  • Pattern matching networks; Rete networks · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12014253B2 cover?
Systems and methods for constructing sets of synthetic data. A single data record is identified from a first set of data. The first set of data comprises a first plurality of data records, each of the data records including multiple items of data describing an entity. Using pattern recognition, the single data record is processed to identify a group of records from within the first set that hav…
Who is the assignee on this patent?
Cigna Intellectual Property Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 18, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).