Synthetic data testing in machine learning applications

US2025139500A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2025139500-A1
Application number  US-202318496983-A
Country             US
Kind code           A1
Filing date         Oct 30, 2023
Priority date       Oct 30, 2023
Publication date    May 1, 2025
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Determining whether synthetic data is sufficient for utilization in connection with one or more machine learning models. The computing device accesses a protected batch of data associated with a machine learning model. The computing device accesses a simulated batch of data, the simulated batch of data based upon but anonymizing the protected batch of data. The computing device accesses one or more comparisons of one or more variables in the protected batch of data and the simulated batch of data to obtain a similarity value. The computing device performs a machine learning function utilizing at least in-part the simulated batch of data if the similarity value exceeds a similarity threshold.
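The gating step in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the use of a plain mean to combine comparison results into one similarity value, and the assumption that each comparison score lies in [0, 1] are all hypothetical choices made for illustration.

```python
from statistics import mean
from typing import Any, Callable, Optional, Sequence


def similarity_value(comparison_scores: Sequence[float]) -> float:
    """Combine per-comparison scores into a single similarity value.

    Assumption (not from the patent): scores lie in [0, 1] and are
    aggregated with an unweighted mean.
    """
    if not comparison_scores:
        raise ValueError("at least one comparison result is required")
    return mean(comparison_scores)


def gated_ml_function(
    comparison_scores: Sequence[float],
    similarity_threshold: float,
    ml_function: Callable[[Sequence[Any]], Any],
    simulated_batch: Sequence[Any],
) -> Optional[Any]:
    """Run ml_function on the simulated batch only when the similarity
    value strictly exceeds the threshold; otherwise return None."""
    value = similarity_value(comparison_scores)
    if value > similarity_threshold:
        return ml_function(simulated_batch)
    return None


if __name__ == "__main__":
    batch = [0.2, 0.4, 0.6]
    # `len` stands in for a training or inference routine.
    print(gated_ml_function([0.9, 0.8], 0.7, len, batch))
    print(gated_ml_function([0.5, 0.6], 0.7, len, batch))
```

In this sketch `ml_function` is any callable, so the same gate could wrap either a training routine or an inference routine. The strict `>` comparison mirrors the word "exceeds" in the abstract.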

First claim

Opening claim text (preview).

What is claimed is:
1. A method using a computing device to determine whether synthetic data is sufficient for utilization in connection with one or more machine learning models, the method comprising: accessing by a computing device a protected batch of data associated with a machine learning model; accessing by the computing device a simulated batch of data, the simulated batch of data based upon but anonymizing the protected batch of data; access results of one or more comparisons of one or more variables in the protected batch of data and the simulated batch of data to obtain a similarity value; and performing by the computing device a machine learning function utilizing at least in-part the simulated batch of data if the similarity value exceeds a similarity threshold.
2. The method of claim 1, wherein the machine learning function is performing by one or more machine learning model an inference utilizing at least in-part the simulated batch of data.
3. The method of claim 1, wherein the machine learning function is training a machine learning model with the simulated batch of data.
4. The method of claim 1, wherein the one or more comparisons include comparison of a distribution of one or more variables associated with the protected batch of data and a distribution of one or more variables associated with the simulated batch of data.
5. The method of claim 1, wherein the one or more comparisons include calculation and comparison of correlation matrices of two or more variables associated with the protected batch of data and the simulated batch of data.
6. The method of claim 1, wherein the one or more comparisons include generation of a hierarchy cluster to compare all variables in the protected batch of data and the simulated batch of data.
7. The method of claim 1, wherein the one or more comparisons include generation of a relationship correlation between one or more traits displayed by variables included in the protected batch of data and the simulated batch of data.
8. The method of claim 1, wherein the computing device displays an output of the one or more comparisons, the output displaying a difference in the protected batch of data and the simulated batch of data.
9. A method using a computing device to determine whether synthetic data is sufficient for utilization in connection with one or more machine learning models, the method comprising: accessing by a computing device a protected batch of data associated with a machine learning model; accessing by the computing device a simulated batch of data, the simulated batch of data based upon but anonymizing the protected batch of data; access results of one or more comparisons of one or more variables in the protected batch of data and the simulated batch of data to obtain a similarity value; and performing by the computing device a machine learning function utilizing at least in-part the simulated batch of data if the similarity value exceeds a similarity threshold, the machine learning function performing by one or more machine learning models an inference utilizing at least in part the simulated batch of data.
10. The method of claim 9, wherein the one or more comparisons include comparison of a distribution of one or more variables associated with the protected batch of data and a distribution of one or more variables associated with the simulated batch of data.
11. The method of claim 9, wherein the one or more comparisons include calculation and comparison of correlation matrices of two or more variables associated with the protected batch of data and the simulated batch of data.
12. The method of claim 9, wherein the one or more comparisons include generation of a hierarchy cluster to compare all variables in the protected batch of data and the simulated batch of data.
13. The method of claim 9, wherein the one or more comparisons include generation of a relationship correlation between one or more traits displayed by variables included in the protected batch of data and the simulated batch of data.
14. A method using a computing device to determine whether synthetic data is sufficient for utilization in connection with one or more machine learning models, the method comprising: accessing by a computing device a protected batch of data associated with a machine learning model; accessing by the computing device a simulated batch of data, the simulated batch of data based upon but anonymizing the protected batch of data; access results of one or more comparisons of one or more variables in the protected batch of data and the simulated batch of data to obtain a similarity value; and performing by the computing device a machine learning function utilizing at least in-part the simulated batch of data if the similarity value exceeds a similarity threshold, the machine learning function training a machine learning model with the simulated batch of data.
15. The method of claim 14, wherein the one or more comparisons include comparison of a distribution of one or more variables associated with the protected batch of data and a distribution of one or more variables associated with the simulated batch of data.
16. The method of claim 14, wherein the one or more comparisons include calculation and comparison of correlation matrices of two or more variables associated with the protected batch of data and the simulated batch of data.
17. The method of claim 14, wherein the one or more comparisons include generation of a hierarchy cluster to compare all variables in the protected batch of data and the simulated batch of data.
18. The method of claim 14, wherein the one or more comparisons include generation of a relationship correlation between one or more traits displayed by variables included in the protected batch of data and the simulated batch of data.
19. A computer system to determine whether synthetic data is sufficient for utilization in connection with one or more machine learning models, the computer system comprising: one or more computer processors; one or more computer-readable storage media; program instructions stored on the computer-readable storage media for execution by at least one of the one or more processors, the program instructions comprising: program instructions to access a protected batch of data associated with a machine learning model; program instructions to access a simulated batch of data, the simulated batch of data based upon but anonymizing the protected batch of data; program instructions to access results of one or more comparisons of one or more variables in the protected batch of data and the simulated batch of data to obtain a similarity value; and program instructions to perform a machine learning function utilizing at least in-part the simulated batch of data if the similarity value exceeds a similarity threshold.
20. The computer system of claim 19, wherein the one or more comparisons include comparison of a distribution of one or more variables associated with the protected batch of data and a distribution of one or more variables associated with the simulated batch of data.
21. The computer system of claim 19, wherein the one or more comparisons include calculation and comparison of correlation matrices of two or more variables associated with the protected batch of data and the simulated batch of data.
22. The computer system of claim 19, wherein the one or more comparisons inclu…
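Two of the comparison types named in the dependent claims, distribution comparison and correlation-matrix comparison, might be sketched as follows. The claims do not specify any particular statistic, so the choices here are assumptions made for illustration: a two-sample Kolmogorov-Smirnov statistic for distributions, Pearson correlation for the matrices, and the mean absolute element-wise difference as the matrix gap. All function names are hypothetical.

```python
from bisect import bisect_right
from math import sqrt
from typing import Dict, List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of a and b (0 = identical, 1 = fully separated)."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(sa) | set(sb))
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in points
    )


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(vx * vy)


def correlation_matrix(table: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Pairwise Pearson correlations, with variables in sorted-name order."""
    names = sorted(table)
    return [[pearson(table[r], table[c]) for c in names] for r in names]


def correlation_gap(
    protected: Dict[str, Sequence[float]],
    simulated: Dict[str, Sequence[float]],
) -> float:
    """Mean absolute element-wise difference between the two correlation
    matrices (0 = same correlation structure, 2 = the maximum possible)."""
    if sorted(protected) != sorted(simulated):
        raise ValueError("both batches must contain the same variables")
    p, s = correlation_matrix(protected), correlation_matrix(simulated)
    diffs = [abs(pv - sv) for pr, sr in zip(p, s) for pv, sv in zip(pr, sr)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    protected = {"age": [30, 40, 50, 60], "income": [40, 52, 61, 75]}
    simulated = {"age": [31, 39, 52, 58], "income": [42, 50, 63, 72]}
    for name in protected:
        print(name, ks_statistic(protected[name], simulated[name]))
    print("correlation gap:", correlation_gap(protected, simulated))
```

Both functions return a distance, where smaller means more similar. A pipeline that needs a similarity value would have to map these distances to a score (for example `1 - ks_statistic(...)`), which is again an assumption and not something the claims prescribe.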

Assignees

Inventors

Classifications

  • Inference or reasoning models · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025139500A1 cover?
Determining whether synthetic data is sufficient for utilization in connection with one or more machine learning models. The computing device accesses a protected batch of data associated with a machine learning model. The computing device accesses a simulated batch of data, the simulated batch of data based upon but anonymizing the protected batch of data. The computing device accesses one or …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date May 1, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).