Dataset quality for synthetic data generation in computer-based reasoning systems

US11640561B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11640561-B2
Application number	US-202117333671-A
Country	US
Kind code	B2
Filing date	May 28, 2021
Priority date	Dec 13, 2018
Publication date	May 2, 2023
Grant date	May 2, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for synthetic data generation in computer-based reasoning systems are discussed and include receiving a request for generation of synthetic data based on a set of training data cases. One or more focal training data cases are determined. For undetermined features (either all of them or those that are not subject to conditions), a value for the feature is determined based on the focal cases. In some embodiments, the generated synthetic data may be checked for similarity against the training data, and if similarity conditions are met, it may be modified (e.g., resampled), removed, and/or replaced.
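The flow in the abstract (pick focal cases, fill in undetermined features from them, then check similarity against the training data) can be illustrated with a minimal sketch. This is not the patented implementation. It assumes numeric features, Euclidean distance, focal cases chosen as the k nearest neighbors of a randomly picked training case, and uniform sampling within the focal range. The names generate_case, generate_dataset, too_similar, conditions, and min_distance are hypothetical.

```python
import math
import random


def nearest_cases(seed, cases, k):
    """Return the k cases closest to `seed` by Euclidean distance."""
    return sorted(cases, key=lambda c: math.dist(seed, c))[:k]


def generate_case(training, conditions=None, k=3, rng=None):
    """Build one synthetic case, feature by feature.

    conditions: hypothetical dict {feature_index: fixed_value}. Features not
    listed are treated as undetermined and are filled from the focal cases.
    """
    rng = rng or random.Random()
    conditions = conditions or {}
    seed = rng.choice(training)               # assumption: random starting case
    focal = nearest_cases(seed, training, k)  # assumption: k nearest = focal cases
    case = []
    for i in range(len(training[0])):
        if i in conditions:
            case.append(conditions[i])        # conditioned feature: keep the given value
        else:
            values = [c[i] for c in focal]
            # assumption: draw uniformly within the range spanned by the focal cases
            case.append(rng.uniform(min(values), max(values)))
    return tuple(case)


def too_similar(case, training, min_distance):
    """True if the case lies closer than min_distance to any training case."""
    return min(math.dist(case, t) for t in training) < min_distance


def generate_dataset(training, n, min_distance=0.05, max_tries=1000, rng=None):
    """Generate up to n cases, resampling any that are too close to training data."""
    rng = rng or random.Random(0)
    result, tries = [], 0
    while len(result) < n and tries < max_tries:
        tries += 1
        case = generate_case(training, rng=rng)
        if too_similar(case, training, min_distance):
            continue  # similarity condition met: discard and resample
        result.append(case)
    return result


training = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (3.0, 3.0)]
synthetic = generate_dataset(training, n=5)
```

Resampling is only one of the options the abstract mentions; removing or replacing the case would slot into the same branch of generate_dataset.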

First claim

Opening claim text (preview).

What is claimed is:

1. A non-transitory computer readable medium storing instructions which, when executed by one or more computing devices, cause the one or more computing devices to perform a process of: receiving a request for generation of synthetic data based on a set of training data cases that meets a dataset quality threshold; repeatedly determining new a synthetic data case based on a set of one or more focal training data cases to generate a set of two or more synthetic data cases, wherein each set of one or more focal cases are determined from the set of training data cases; determining a dataset quality metric for the set of two or more synthetic data cases based on the set of training data cases and the set of two or more synthetic data cases, wherein the dataset quality metric is determined based at least in part on at least one dataset privacy metric, which quantifies the likelihood of identification of private data in the set of training data cases from the set of two or more synthetic data cases; when the dataset quality metric for particular synthetic data cases in the set of two or more synthetic data cases does not meet a dataset quality threshold, taking corrective action for the particular synthetic data cases in the set of two or more synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases; when the dataset quality metric for the set of two or more synthetic data cases meets the dataset quality threshold, causing control of a controllable system using the set of two or more synthetic data cases.

2. The non-transitory computer readable medium of claim 1, wherein taking corrective action comprises one or more of: modifying one or more of the particular synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases; deleting one or more of the particular synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases; and replacing one or more of the particular synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases.

3. The non-transitory computer readable medium of claim 1, wherein determining at least one dataset quality metric comprises determining a dataset privacy metric based at least in part on a data element distance comparison metric, and when the dataset privacy metric for particular synthetic data cases in the set of two or more synthetic data cases does not meet the dataset quality threshold, taking corrective action for the particular synthetic data cases in set of two or more synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases.

4. The non-transitory computer readable medium of claim 1, wherein determining at least one dataset quality metric comprises determining a dataset privacy metric based at least in part on a minimum distance ratio metric, and when the dataset privacy metric for particular synthetic data cases in the set of two or more synthetic data cases does not meet the dataset quality threshold, taking corrective action for the particular synthetic data cases in set of two or more synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases.

5. The non-transitory computer readable medium of claim 1, wherein determining at least one dataset quality metric comprises determining a dataset privacy metric based at least in part on a minimum distance percentile metric, and when the dataset privacy metric for particular synthetic data cases in the set of two or more synthetic data cases does not meet the dataset quality threshold, taking corrective action for the particular synthetic data cases in set of two or more synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases.

6. The non-transitory computer readable medium of claim 1, wherein determining at least one dataset quality metric comprises determining a dataset privacy metric based at least in part on a minimum expected distance to actual distance metric, and when the dataset privacy metric for particular synthetic data cases in the set of two or more synthetic data cases does not meet the dataset quality threshold, taking corrective action for the particular synthetic data cases in set of two or more synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases.

7. The non-transitory computer readable medium of claim 1, wherein determining at least one dataset quality metric comprises determining a dataset privacy metric based at least in part on a probability-based minimum distance metric, and when the dataset privacy metric for particular synthetic data cases in the set of two or more synthetic data cases does not meet the dataset quality threshold, taking corrective action for the particular synthetic data cases in set of two or more synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases.

8. A system for executing instructions, wherein said instructions are instructions which, when executed by one or more computing devices, cause performance of a process including: receiving a request for generation of synthetic data based on a set of training data cases that meets a dataset quality threshold; repeatedly determining new a synthetic data case based on a set of one or more focal training data cases to generate a set of two or more synthetic data cases, wherein each set of one or more focal cases are determined from the set of training data cases; determining a dataset quality metric for the set of two or more synthetic data cases based on the set of training data cases and the set of two or more synthetic data cases, wherein the dataset quality metric is determined based at least in part on: at least one statistical quality metric that compares statistical properties of the set of training data cases and the set of two or more synthetic data cases; at least one model comparison metric that quantifies machine learning model properties and performance of the set of training data cases and the set of two or more synthetic data cases; at least one dataset privacy metric, which quantifies the likelihood of identification of private data in the set of training data cases from the set of two or more synthetic data cases; when the dataset quality metric for particular synthetic data cases in the set of two or more synthetic data cases does not meet a dataset quality threshold, taking corrective action for the particular synthetic data cases in the set of two or more synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases; when the dataset quality metric for the set of two or more synthetic data cases meets the dataset quality threshold, causing control of a controllable system using the set of two or more synthetic data cases.

9. The system of claim 8, wherein taking corrective action comprises one or more of: modifying one or more of the particular synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases; deleting one or more of the particular synthetic data cases to produce a new set of two or more synthetic data cases to use as the set of two or more synthetic data cases; and replacing one or more of the particular synthetic data cases t…
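The claims describe a quality gate: score the synthetic cases with a privacy metric, take corrective action on cases that miss the threshold, and use the set once it passes. The claim text above names a "minimum distance ratio metric" but does not define it, so the sketch below uses an assumed definition: the distance from a synthetic case to its nearest training case, divided by that training case's distance to its own nearest training neighbor. The function names, the threshold value, and the choice of deletion as the corrective action are hypothetical and are not taken from the patent.

```python
import math


def distance_ratio(case, training):
    """Assumed privacy score for one synthetic case (higher = less revealing).

    Ratio of (synthetic case -> nearest training case) to
    (that training case -> its nearest other training case).
    Requires at least two training cases.
    """
    nearest = min(training, key=lambda t: math.dist(case, t))
    d_synthetic = math.dist(case, nearest)
    others = [t for t in training if t is not nearest]
    d_training = min(math.dist(nearest, o) for o in others)
    if d_training == 0:
        # duplicate training cases: only an exact copy counts as a violation
        return 0.0 if d_synthetic == 0 else math.inf
    return d_synthetic / d_training


def apply_quality_gate(synthetic, training, threshold=0.5):
    """Split cases into kept and flagged; the corrective action here is deletion."""
    kept, flagged = [], []
    for case in synthetic:
        if distance_ratio(case, training) >= threshold:
            kept.append(case)
        else:
            flagged.append(case)  # too close to a real training case
    return kept, flagged


training = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
synthetic = [(0.1, 0.0), (1.5, 1.0)]
kept, flagged = apply_quality_gate(synthetic, training)
```

In this example the first synthetic case sits almost on top of the training case (0.0, 0.0) and is flagged, while the second is kept. Modifying or replacing flagged cases, the other corrective actions listed in claim 2, would take the place of the deletion step.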

Assignees

Diveplane Corp

Classifications

G06F18/2148
characterised by the process organisation or structure, e.g. boosting cascade · CPC title
G06V10/7796
based on specific statistical tests · CPC title
G06V10/761
Proximity, similarity or dissimilarity measures · CPC title
G06V10/763
Non-hierarchical techniques, e.g. based on statistics of modelling distributions · CPC title
G06N20/00 · Primary
Machine learning · CPC title

Patent family

Related publications grouped by family.

View patent family 86146003

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11640561B2 cover?: Techniques for synthetic data generation in computer-based reasoning systems are discussed and include receiving a request for generation of synthetic data based on a set of training data cases. One or more focal training data cases are determined. For undetermined features (either all of them or those that are not subject to conditions), a value for the feature is determined based on the focal…
Who is the assignee on this patent?: Diveplane Corp
What technology area does this patent fall under?: Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?: Publication date May 2, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).