System, method, and computer-accessible medium for evaluating multi-dimensional synthetic data using integrated variants analysis

US12175308B2 · US · B2

Patent metadata
Publication number: US-12175308-B2
Application number: US-202418402937-A
Country: US
Kind code: B2
Filing date: Jan 3, 2024
Priority date: Jul 6, 2018
Publication date: Dec 24, 2024
Grant date: Dec 24, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An exemplary system, method, and computer-accessible medium can include, for example, receiving an original dataset(s), receiving a synthetic dataset(s), training a model(s) using the original dataset(s) and the synthetic dataset(s), and evaluating the synthetic dataset(s) based on the training of the model(s). The model(s) can include a first model and a second model, and the first model can be trained using the original dataset(s) and the second model can be trained using the synthetic dataset(s). The synthetic dataset(s) can be evaluated by comparing first results from the training of the first model to second results from the training of the second model.
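
The two-model comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the nearest-centroid classifier, the toy data generator `make_data`, the five-fold holdout split, and all function names are assumptions introduced here. The `one_way_anova` helper follows the textbook analysis of variance quantities that the dependent claims name (sums of squares divided by degrees of freedom to give mean squares).

```python
import random
from statistics import mean

def make_data(n, shift, rng):
    """Hypothetical toy data: two classes, class 1 offset from class 0 by `shift`."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        point = (rng.gauss(label * shift, 1.0), rng.gauss(label * shift, 1.0))
        rows.append((point, label))
    return rows

def train_centroids(rows):
    """'Train' a nearest-centroid classifier: one mean point per class."""
    model = {}
    for label in {lab for _, lab in rows}:
        pts = [p for p, lab in rows if lab == label]
        model[label] = tuple(mean(c) for c in zip(*pts))
    return model

def predict(model, point):
    """Return the class whose centroid is closest to `point`."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(point, model[lab])))

def accuracy(model, rows):
    return sum(predict(model, p) == lab for p, lab in rows) / len(rows)

def one_way_anova(groups):
    """Return (treatment mean square, error mean square, F) for lists of observations."""
    all_obs = [v for g in groups for v in g]
    grand = mean(all_obs)
    # Treatment sum of squares: group-mean deviations from the grand mean,
    # weighted by the number of observations in each group.
    ss_treat = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Error sum of squares: observation deviations from their own group mean.
    ss_error = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_treat = ss_treat / (len(groups) - 1)
    ms_error = ss_error / (len(all_obs) - len(groups))
    if ms_error > 0:
        f_stat = ms_treat / ms_error
    else:
        # Degenerate case: no within-group spread at all.
        f_stat = float("inf") if ms_treat > 0 else 0.0
    return ms_treat, ms_error, f_stat

rng = random.Random(0)
original = make_data(400, 3.0, rng)
synthetic = make_data(400, 3.0, rng)   # stand-in for a generator's output
holdout = make_data(500, 3.0, rng)
folds = [holdout[i::5] for i in range(5)]

first_model = train_centroids(original)    # trained on the original dataset
second_model = train_centroids(synthetic)  # trained on the synthetic dataset
first_results = [accuracy(first_model, f) for f in folds]
second_results = [accuracy(second_model, f) for f in folds]

ms_treat, ms_error, f_stat = one_way_anova([first_results, second_results])
print("first results: ", first_results)
print("second results:", second_results)
print("F statistic:", f_stat)
```

Under these assumptions, a small F statistic suggests that the model trained on synthetic data behaves much like the model trained on original data; how large an F should count as a meaningful difference is a design choice the sketch leaves open.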

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for evaluating a synthetic dataset, wherein, when a computer hardware arrangement executes the instructions, the computer hardware arrangement is configured to perform procedures comprising: training a model using an original dataset and a synthetic dataset; determining a data similarity score including a combined score of exact-match overlap score and fuzzy-match overlap score based on the synthetic dataset and the original dataset; determining a data quality score including a combined score of row-duplicate score, repeated-value score and schema-preservation score based on the synthetic dataset and the original dataset; evaluating the synthetic dataset based on the training of the model, the data similarity score, and the data quality score; determining a region for the synthetic dataset based on evaluating the synthetic dataset, wherein the region defines a status of the synthetic dataset; and generating a suggestion based on the determined region for building predicative models on the synthetic dataset, wherein the suggestion includes at least one of (a) indicating that the at least one synthetic dataset is adequate or (b) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.

2. The non-transitory computer-accessible medium 1, wherein the procedures further comprises generating a statistical correlation score based on the synthetic dataset and the original dataset.

3. The non-transitory computer-accessible medium 2, wherein the procedures further comprises generating a univariate distribution score based on the synthetic dataset and the original dataset.

4. The non-transitory computer-accessible medium 3, wherein the procedures further comprises evaluating the synthetic dataset further based on the statistical correlation score and the univariate distribution score.

5. The non-transitory computer-accessible medium 1, wherein the suggestion provides information for a data application.

6. The non-transitory computer-accessible medium 1, wherein the model comprises a behavior classification model.

7. The non-transitory computer-accessible medium 1, wherein: training a model comprises training a first model and training a second model, the first model is trained using the original dataset, and the second model is trained using the synthetic dataset.

8. The non-transitory computer-accessible medium 7, wherein the procedures further comprise evaluating the synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model.

9. The non-transitory computer-accessible medium 8, wherein the first results are compared to the second results using an analysis of variance procedure.

10. The non-transitory computer-accessible medium 72, wherein: the analysis of variance procedure comprises a degrees of freedom divisor and a sum of squares summation, the analysis of variance procedure results in a mean square, and the means square comprises square terms as deviations from a sample mean.

11. The non-transitory computer-accessible medium 72, wherein: the analysis of variance procedure estimates at least one of (a) a total variance based on all the observation deviations from a grand mean, (ii) an error variance based on all the observation deviations from their appropriate treatment means, or (iii) a treatment variance, and the treatment variance is based on deviations of a treatment means from the grand mean multiplied by a number of observations in each treatment.

12. A system comprising a computer hardware arrangement, wherein the computer hardware arrangement is configured to: train a model using an original dataset and a synthetic dataset; determine a data similarity score including a combined score of exact-match overlap score and fuzzy-match overlap score based on the synthetic dataset and the original dataset; evaluate the synthetic dataset based on the training of the model and the data similarity; determine a region for the synthetic dataset based on evaluating the synthetic dataset, wherein the region defines a status of the synthetic dataset; and generate a suggestion based on the determined region for building predicative models on the synthetic dataset, wherein the suggestion includes at least one of (a) indicating that the at least one synthetic dataset is adequate or (b) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.

13. The system of claim 12, wherein the computer hardware arrangement is further configured to: determine a data quality score including a combined score of row-duplicate score, repeated-value score and schema-preservation score based on the synthetic dataset and the original dataset; and evaluate the synthetic dataset further based on the data quality score.

14. The system of claim 12, wherein the computer hardware arrangement is further configured to generate a further synthetic dataset based on the synthetic dataset and the evaluation of the synthetic dataset.

15. The system of claim 14, wherein the computer hardware arrangement is further configured to train the second model based on the at least one further synthetic dataset, and evaluating the at least one further synthetic dataset based on the training of the at least one second model on the at least one further synthetic dataset.

16. A method performed by a computer hardware arrangement, the method comprising: training a model using an original dataset and a synthetic dataset; determining a data quality score including a combined score of row-duplicate score, repeated-value score and schema-preservation score based on the synthetic dataset and the original dataset; evaluating the synthetic dataset based on the training of the model and the data quality score; determining a region for the synthetic dataset based on evaluating the synthetic dataset, wherein the region defines a status of the synthetic dataset; and generating a suggestion based on the determined region for building predicative models on the synthetic dataset, wherein the suggestion includes at least one of (a) indicating that the at least one synthetic dataset is adequate or (b) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.

17. The method of claim 16, further comprising: determining a data similarity score including a combined score of exact-match overlap score and fuzzy-match overlap score based on the synthetic dataset and the original dataset; and evaluating the synthetic dataset further based on the data similarity score.

18. The method of claim 16, wherein: training a model comprises training a first model and training a second model, wherein the first model is trained using the original dataset, and the second model is trained using the synthetic dataset; and the method further comprises evaluating the synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model, wherein the comparison of first results to the second results uses a threshold procedure comprising determining a further statistical correlation based on a plurality of covariance matrices.

19. The method of claim 18, wherein the threshold procedure further comprises summing first
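
The scoring and region steps recited in the independent claims might be sketched as below. The claims name the score components but this page does not define them, so every formula, weight, threshold, region label, and function name here is an assumption made for illustration; rows are modeled as plain tuples, and the fuzzy match uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever matcher an implementation would actually use.

```python
from difflib import SequenceMatcher
from statistics import mean

def exact_match_overlap(original, synthetic):
    """Fraction of synthetic rows that appear verbatim in the original."""
    seen = set(original)
    return sum(row in seen for row in synthetic) / len(synthetic)

def fuzzy_match_overlap(original, synthetic, cutoff=0.9):
    """Fraction of synthetic rows whose text is close to some original row."""
    texts = ["|".join(map(str, o)) for o in original]
    def near(row):
        text = "|".join(map(str, row))
        return any(SequenceMatcher(None, text, t).ratio() >= cutoff for t in texts)
    return sum(near(r) for r in synthetic) / len(synthetic)

def similarity_score(original, synthetic, w_exact=0.5, w_fuzzy=0.5):
    """Combined score; the equal weights are an arbitrary assumption."""
    return (w_exact * exact_match_overlap(original, synthetic)
            + w_fuzzy * fuzzy_match_overlap(original, synthetic))

def repeat_rate(values):
    """Share of entries that repeat an earlier entry."""
    return 1 - len(set(values)) / len(values)

def row_duplicate_score(original, synthetic):
    """1.0 when both datasets duplicate whole rows at the same rate."""
    return 1 - abs(repeat_rate(synthetic) - repeat_rate(original))

def repeated_value_score(original, synthetic):
    """Per-column agreement of repeat rates, averaged over columns."""
    pairs = zip(zip(*original), zip(*synthetic))
    return mean(1 - abs(repeat_rate(s) - repeat_rate(o)) for o, s in pairs)

def schema_preservation_score(original, synthetic):
    """Fraction of columns whose value types also occur in the original column."""
    o_cols, s_cols = list(zip(*original)), list(zip(*synthetic))
    if len(o_cols) != len(s_cols):
        return 0.0
    ok = [{type(v) for v in s} <= {type(v) for v in o}
          for o, s in zip(o_cols, s_cols)]
    return sum(ok) / len(ok)

def quality_score(original, synthetic):
    return mean([row_duplicate_score(original, synthetic),
                 repeated_value_score(original, synthetic),
                 schema_preservation_score(original, synthetic)])

def determine_region(performance_gap, similarity, quality,
                     max_gap=0.05, max_similarity=0.5, min_quality=0.8):
    """Map the evaluation to a status label; thresholds are hypothetical."""
    if similarity > max_similarity:
        return "too_similar"
    if performance_gap > max_gap or quality < min_quality:
        return "needs_work"   # extra region, not named in the claims
    return "adequate"

SUGGESTIONS = {
    "adequate": "The synthetic dataset is adequate for building predictive models.",
    "too_similar": "Warning: the synthetic dataset potentially contains "
                   "information similar to the original dataset.",
    "needs_work": "Consider regenerating the synthetic dataset.",
}

original = [("alice", 34, "NY"), ("bob", 45, "CA"),
            ("carol", 29, "TX"), ("dave", 52, "WA")]
synthetic_ok = [("erin", 31, "FL"), ("frank", 48, "OR"),
                ("grace", 27, "NV"), ("heidi", 55, "IL")]
synthetic_leaky = [("alice", 34, "NY"), ("bob", 45, "CA"),
                   ("carol", 29, "TX"), ("dav", 52, "WA")]

for name, data in [("ok", synthetic_ok), ("leaky", synthetic_leaky)]:
    region = determine_region(0.01, similarity_score(original, data),
                              quality_score(original, data))
    print(name, "->", region, "|", SUGGESTIONS[region])
```

The `performance_gap` argument stands for whatever difference the model-training comparison yields; it is passed as a fixed illustrative value (0.01) here so the example stays focused on the similarity and quality scores.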

Assignees

Capital One Services, LLC

Inventors

Classifications

  • Texturing; Colouring; Generation of textures or colours (retouching, inpainting or scratch removal G06T5/77) · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title

  • Supervised learning · CPC title

  • Adversarial learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12175308B2 cover?
An exemplary system, method, and computer-accessible medium can include, for example, receiving an original dataset(s), receiving a synthetic dataset(s), training a model(s) using the original dataset(s) and the synthetic dataset(s), and evaluating the synthetic dataset(s) based on the training of the model(s). The model(s) can include a first model and a second model, and the first model can b…
Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06F9/541. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 24, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).