System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis

US10635939B2 · US · B2

Patent metadata
Publication number: US-10635939-B2
Application number: US-201816152072-A
Country: US
Kind code: B2
Filing date: Oct 4, 2018
Priority date: Jul 6, 2018
Publication date: Apr 28, 2020
Grant date: Apr 28, 2020

Abstract

Official abstract text for this publication.

An exemplary system, method, and computer-accessible medium can include, for example, receiving an original dataset(s), receiving a synthetic dataset(s), training a model(s) using the original dataset(s) and the synthetic dataset(s), and evaluating the synthetic dataset(s) based on the training of the model(s). The model(s) can include a first model and a second model, and the first model can be trained using the original dataset(s) and the second model can be trained using the synthetic dataset(s). The synthetic dataset(s) can be evaluated by comparing first results from the training of the first model to second results from the training of the second model.
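
The two-model comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the nearest-centroid classifier, the Gaussian toy data, the shared holdout set, and every name here (`fit_centroids`, `make_rows`, `accuracy_gap`) are assumptions chosen only to keep the example self-contained. The abstract itself says only that results from training the first model are compared to results from training the second.

```python
import random
from statistics import mean


def fit_centroids(rows):
    """Train a tiny nearest-centroid classifier: one mean vector per label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(predict(model, f) == y for f, y in rows) / len(rows)


def make_rows(rng, n, shift=0.0):
    """Hypothetical toy data: two Gaussian classes in two dimensions."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label + shift
        rows.append(([rng.gauss(center, 0.5), rng.gauss(center, 0.5)], label))
    return rows


rng = random.Random(0)
original = make_rows(rng, 200)
synthetic = make_rows(rng, 200, shift=0.1)  # stand-in for a generator's output
holdout = make_rows(rng, 100)               # assumed shared evaluation set

first_model = fit_centroids(original)       # first model: original data
second_model = fit_centroids(synthetic)     # second model: synthetic data

first_acc = accuracy(first_model, holdout)
second_acc = accuracy(second_model, holdout)
accuracy_gap = abs(first_acc - second_acc)

print(f"original-trained accuracy:  {first_acc:.3f}")
print(f"synthetic-trained accuracy: {second_acc:.3f}")
print(f"gap: {accuracy_gap:.3f}")
```

A small gap suggests the synthetic data supports a model about as well as the original data does; what counts as "small" is a policy choice the abstract does not specify.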

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for evaluating at least one synthetic dataset, wherein, when a computer arrangement executes the instructions, the computer arrangement is configured to perform procedures comprising: receiving at least one original dataset; receiving the at least one synthetic dataset; training at least one model using the at least one original dataset and the at least one synthetic dataset; generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset; generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score; determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.

2. The computer-accessible medium of claim 1, wherein the at least one model includes a first model and a second model, and wherein the computer arrangement is further configured to: train the first model using the at least one original dataset; and train the second model using the at least one synthetic dataset.

3. The computer-accessible medium of claim 2, wherein the computer arrangement is configured to evaluate the at least one synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model.

4. The computer-accessible medium of claim 3, wherein the computer arrangement is configured to compare the first results to the second results using an analysis of variance procedure.

5. The computer-accessible medium of claim 2, wherein the computer arrangement is configured to compare the first results to the second results using a threshold procedure.

6. The computer-accessible medium of claim 5, wherein the threshold procedure includes: summing first errors from the first results; summing second errors from the second results; and comparing the summed first errors to the summed second errors.

7. The computer-accessible medium of claim 6, wherein the computer arrangement is configured to compare the summed first errors to the summed second errors using a threshold criterion.

8. The computer-accessible medium of claim 5, wherein the threshold procedure includes determining a further statistical correlation based on a plurality of covariance matrices.

9. The computer-accessible medium of claim 2, wherein the first model is equivalent to the second model.

10. The computer-accessible medium of claim 1, wherein the at least one model is a classification model.

11. The computer-accessible medium of claim 1, wherein the computer arrangement is further configured to generate the at least one synthetic dataset.

12. The computer-accessible medium of claim 11, wherein the computer arrangement is configured to generate the at least one synthetic dataset based on the at least one original dataset.

13. The computer-accessible medium of claim 1, wherein the computer arrangement is further configured to generate at least one further synthetic dataset based on (i) the at least one synthetic dataset and (ii) the evaluation of the at least one synthetic dataset.

14. The computer-accessible medium of claim 1, wherein the at least one original dataset and the at least one synthetic dataset include at least one of (i) biographical information regarding a plurality of customers or (ii) financial information regarding the plurality of customers.

15. A method for evaluating at least one synthetic dataset, comprising: (a) receiving at least one original dataset; (b) generating the at least one synthetic dataset based on the at least one original dataset; (c) training at least one first model using the at least one original dataset; (d) training at least one second model using the at least one synthetic dataset; (e) generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset using a computer arrangement, generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one first model and the training of the at least one second model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score; determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.

16. The method of claim 15, further comprising generating at least one further synthetic dataset based on the evaluation score and the at least one synthetic dataset.

17. The method of claim 16, further comprising training the at least one second model based on the at least one further synthetic dataset.

18. The method of claim 17, further comprising evaluating the at least one further synthetic dataset based on the training of the at least one second model on the at least one further synthetic dataset.

19. A system, comprising: a computer hardware arrangement configured to: (a) receive at least one original dataset; (b) receive at least one synthetic dataset related to the at least one original dataset; (c) train at least one first model using the at least one original dataset; (d) train at least one second model using the at least one synthetic dataset; (e) generate a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset; (f) generate an evaluation score by comparing first results from the training of the first model to second results from the training of the second model, wherein the evaluation score includes (i) a statistical correlation score, (ii) a data similarity score, and (iii) a data quality score; (g) determine a region for the a
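
As a rough illustration of the summed-error threshold procedure (claims 6 and 7) and of the region and suggestion steps (claim 1), the sketch below may help. The claims do not give formulas or cutoffs, so everything concrete here is an assumption: the absolute-error measure, the ratio criterion `max_ratio`, the cutoffs `warn_at` and `red_flag_at`, and the choice to pick the region from a single similarity score in [0, 1]. All function names are hypothetical.

```python
def summed_error(predictions, targets):
    """Sum of absolute errors between predictions and targets (assumed measure)."""
    return sum(abs(p - t) for p, t in zip(predictions, targets))


def within_threshold(first_sum, second_sum, max_ratio=1.25):
    """Hypothetical threshold criterion: the synthetic-trained model's summed
    error may exceed the original-trained model's by at most max_ratio."""
    if first_sum == 0:
        return second_sum == 0
    return second_sum / first_sum <= max_ratio


def region_for(similarity_score, warn_at=0.6, red_flag_at=0.85):
    """Map an assumed similarity score in [0, 1] to one of the three regions.
    Higher similarity to the original data means higher risk."""
    if similarity_score >= red_flag_at:
        return "red flag"
    if similarity_score >= warn_at:
        return "warning"
    return "normal"


def suggestion_for(region):
    """Produce one of the two suggestion types named in claim 1."""
    if region == "normal":
        return "The synthetic dataset is adequate."
    return ("Warning: the synthetic dataset potentially contains "
            "information similar to the original dataset.")


# Toy results: binary targets and each model's predictions on the same rows.
targets = [1, 0, 1, 1, 0]
first_predictions = [1, 0, 1, 0, 0]   # model trained on original data
second_predictions = [1, 1, 1, 0, 0]  # model trained on synthetic data

first_sum = summed_error(first_predictions, targets)
second_sum = summed_error(second_predictions, targets)
passes = within_threshold(first_sum, second_sum)

region = region_for(0.7)  # 0.7 is an invented similarity score
print(first_sum, second_sum, passes)
print(region, "->", suggestion_for(region))
```

Separating the error comparison from the region mapping keeps each policy knob (error ratio, similarity cutoffs) independently adjustable, which seems useful given that the claims leave both unspecified.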

Assignees

Capital One Services, LLC

Inventors

Classifications

  • Non-supervised learning, e.g. competitive learning · CPC title

  • G06F9/541 · Primary

    via adapters, e.g. between incompatible applications · CPC title

  • Ensemble learning · CPC title

  • using kernel methods, e.g. support vector machines [SVM] · CPC title

  • Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06F9/541. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 28, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).