System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis

US11900178B2 · US · B2

Patent metadata
Publication number: US-11900178-B2
Application number: US-202217845786-A
Country: US
Kind code: B2
Filing date: Jun 21, 2022
Priority date: Jul 6, 2018
Publication date: Feb 13, 2024
Grant date: Feb 13, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An exemplary system, method, and computer-accessible medium can include, for example, receiving an original dataset(s), receiving a synthetic dataset(s), training a model(s) using the original dataset(s) and the synthetic dataset(s), and evaluating the synthetic dataset(s) based on the training of the model(s). The model(s) can include a first model and a second model, and the first model can be trained using the original dataset(s) and the second model can be trained using the synthetic dataset(s). The synthetic dataset(s) can be evaluated by comparing first results from the training of the first model to second results from the training of the second model.
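The two-model comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the least-squares line fit, the toy data generator, the held-out test set, and all names (`fit_line`, `summed_abs_error`, `make_data`, `error_ratio`) are assumptions introduced here. The summed-error comparison loosely follows the threshold procedure recited in claim 18.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def summed_abs_error(model, xs, ys):
    """Sum of absolute prediction errors of a fitted line on (xs, ys)."""
    a, b = model
    return sum(abs(a * x + b - y) for x, y in zip(xs, ys))

def make_data(n, slope, intercept, noise, rng):
    """Hypothetical toy data: a noisy line. Stands in for real datasets."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0, noise) for x in xs]
    return xs, ys

rng = random.Random(0)
orig_x, orig_y = make_data(200, 2.0, 1.0, 0.5, rng)    # "original" dataset
synth_x, synth_y = make_data(200, 2.1, 0.8, 0.5, rng)  # stand-in for generator output
test_x, test_y = make_data(100, 2.0, 1.0, 0.5, rng)    # held-out original-like data

# First model is trained on the original data, second on the synthetic data.
first_model = fit_line(orig_x, orig_y)
second_model = fit_line(synth_x, synth_y)

# Compare the two models by their summed errors on the same held-out data.
first_error = summed_abs_error(first_model, test_x, test_y)
second_error = summed_abs_error(second_model, test_x, test_y)
error_ratio = second_error / first_error

print(f"first model summed error:  {first_error:.2f}")
print(f"second model summed error: {second_error:.2f}")
print(f"error ratio (synthetic / original): {error_ratio:.2f}")
```

A ratio close to 1 would suggest that a model trained on the synthetic data behaves much like one trained on the original data. What ratio counts as acceptable is a design choice that the abstract does not specify.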

First claim

Opening claim text (preview).

What is claimed is:
1. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for evaluating a synthetic dataset, wherein, when a computer hardware arrangement executes the instructions, the computer hardware arrangement is configured to perform procedures comprising: training a model using an original dataset and a synthetic dataset; generating a statistical correlation score based on the synthetic dataset and the original dataset; generating a univariate distribution score based on the synthetic dataset and the original dataset; generating an evaluation score by evaluating the synthetic dataset based on the training of the model, wherein the evaluation score includes the statistical correlation score and the univariate distribution score; determining a region for the synthetic dataset based on the evaluation score, wherein the region defines the status of the synthetic dataset; and generating a suggestion based on the evaluation score and the determined region, wherein the suggestion provides information for a data application.
2. The non-transitory computer-accessible medium of claim 1, wherein the model comprises a behavior classification model.
3. The non-transitory computer-accessible medium of claim 1, wherein: training a model comprises training a first model and training a second model, the first model is trained using the original dataset, and the second model is trained using the synthetic dataset.
4. The non-transitory computer-accessible medium of claim 3, wherein the procedures further comprise evaluating the synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model.
5. The non-transitory computer-accessible medium of claim 4, wherein the first results are compared to the second results using an analysis of variance procedure.
6. The non-transitory computer-accessible medium of claim 5, wherein: the analysis of variance procedure comprises a degrees of freedom divisor and a sum of squares summation, the analysis of variance procedure results in a mean square, and the means square comprises square terms as deviations from a sample mean.
7. The non-transitory computer-accessible medium of claim 5, wherein the analysis of variance procedure estimates at least one of (a) a total variance based on all the observation deviations from a grand mean, (ii) an error variance based on all the observation deviations from their appropriate treatment means, or (iii) a treatment variance.
8. The non-transitory computer-accessible medium of claim 7, wherein the treatment variance is based on deviations of a treatment means from the grand mean multiplied by a number of observations in each treatment.
9. The non-transitory computer-accessible medium of claim 4, wherein the procedures further comprise generating a further synthetic dataset based on the synthetic dataset and the evaluation of the synthetic dataset.
10. The non-transitory computer-accessible medium of claim 9, wherein the procedures further comprise: training the second model based on the at least one further synthetic dataset, and evaluating the at least one further synthetic dataset based on the training of the at least one second model on the at least one further synthetic dataset.
11. A system, comprising: a computer hardware arrangement configured to: train a model using an original dataset and a synthetic dataset; generate a statistical correlation score based on the synthetic dataset and the original dataset; generate a univariate distribution score based on the synthetic dataset and the original dataset; generate an evaluation score by evaluating the synthetic dataset based on the training of the model, wherein the evaluation score includes the statistical correlation score and the univariate distribution score; determine a region for the synthetic dataset based on the evaluation score, wherein the region defines the status of the synthetic dataset; and generate a suggestion based on the evaluation score and the determined region, wherein the suggestion provides information for a data application.
12. The system of claim 11, wherein the suggestion includes at least one of (a) indicating that the at least one synthetic dataset is adequate or (b) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.
13. The system of claim 11, wherein the region includes one of (i) a normal region where the synthetic dataset is unlikely to contain synthetic data that is similar to original data within the original dataset, (ii) a warning region where the synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the original dataset, or (iii) a red flag region where the synthetic dataset is likely to contain the synthetic data that is similar to the original data.
14. The non-transitory computer-accessible medium of claim 1, wherein the region includes one of (i) a normal region where the synthetic dataset is unlikely to contain synthetic data that is similar to original data within the original dataset, (ii) a warning region where the synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the original dataset, or (iii) a red flag region where the synthetic dataset is likely to contain the synthetic data that is similar to the original data.
15. A method performed by a computer hardware arrangement, the method comprising: training a model using an original dataset and a synthetic dataset; generating a statistical correlation score based on the synthetic dataset and the original dataset; generating a univariate distribution score based on the synthetic dataset and the original dataset; generating an evaluation score by evaluating the synthetic dataset based on the training of the model, wherein the evaluation score includes the statistical correlation score and the univariate distribution score; determining a region for the synthetic dataset based on the evaluation score, wherein the region defines the status of the synthetic dataset; and generating a suggestion based on the evaluation score and the determined region, wherein the suggestion provides information for a data application.
16. The method of claim 15 wherein: training a model comprises training a first model and training a second model, the first model is trained using the original dataset, and the second model is trained using the synthetic dataset.
17. The method of claim 16, wherein the method further comprises evaluating the synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model.
18. The method of claim 17, wherein the comparison of first results to the second results uses a threshold procedure comprising: summing first errors from the first results, summing second errors from the second results, and comparing the summed first errors to the summed second errors.
19. The method of claim 18, wherein the threshold procedure includes determining a further statistical correlation based on a plurality of covariance matrices.
20. The method of claim 15, wherein the region includes one of (i) a normal region where sy
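The analysis of variance quantities named in claims 6 to 8 (sums of squares, degrees-of-freedom divisors, mean squares, and the total, error, and treatment variances) can be illustrated with a standard one-way ANOVA. The sketch below is textbook ANOVA, not code from the patent. Treating each model's results as one "treatment" is an assumption, and the per-fold scores are made-up illustrative numbers.

```python
from statistics import mean

def one_way_anova(groups):
    """Textbook one-way ANOVA over a list of groups (treatments)."""
    all_obs = [x for g in groups for x in g]
    grand_mean = mean(all_obs)
    group_means = [mean(g) for g in groups]

    # Total: deviations of every observation from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_obs)
    # Error: deviations of every observation from its own treatment mean.
    ss_error = sum((x - m) ** 2
                   for g, m in zip(groups, group_means) for x in g)
    # Treatment: deviations of treatment means from the grand mean,
    # weighted by the number of observations in each treatment.
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                       for g, m in zip(groups, group_means))

    # Degrees-of-freedom divisors turn sums of squares into mean squares.
    df_treatment = len(groups) - 1
    df_error = len(all_obs) - len(groups)
    ms_treatment = ss_treatment / df_treatment
    ms_error = ss_error / df_error

    return {
        "ss_total": ss_total,
        "ss_error": ss_error,
        "ss_treatment": ss_treatment,
        "ms_treatment": ms_treatment,
        "ms_error": ms_error,
        "f_statistic": ms_treatment / ms_error,
    }

# Hypothetical per-fold scores; these values are illustrative only.
first_results = [0.90, 0.92, 0.91, 0.89, 0.93]   # model trained on original data
second_results = [0.85, 0.88, 0.86, 0.87, 0.84]  # model trained on synthetic data

anova = one_way_anova([first_results, second_results])
for name, value in anova.items():
    print(f"{name}: {value:.6f}")
```

A large F statistic indicates that the difference between the two models' mean results is large relative to the variation within each model's results. How such a statistic would feed into the evaluation score or the region determination is not stated in the claim text shown here.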

Assignees

Inventors

Classifications

  • Texturing; Colouring; Generation of textures or colours (retouching, inpainting or scratch removal G06T5/77) · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title

  • Supervised learning · CPC title

  • Adversarial learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11900178B2 cover?
An exemplary system, method, and computer-accessible medium can include, for example, receiving an original dataset(s), receiving a synthetic dataset(s), training a model(s) using the original dataset(s) and the synthetic dataset(s), and evaluating the synthetic dataset(s) based on the training of the model(s). The model(s) can include a first model and a second model, and the first model can b…
Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06F9/541. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 13, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).