Synthetic data generation in federated learning systems
US-2023088561-A1 · Mar 23, 2023 · US
US12346776B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12346776-B2 |
| Application number | US-202016953030-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 19, 2020 |
| Priority date | Nov 19, 2020 |
| Publication date | Jul 1, 2025 |
| Grant date | Jul 1, 2025 |
Official abstract text for this publication.
Techniques are disclosed relating to training a machine learning model to understand one or more rules without explicitly executing the rule. In some embodiments, a computer system generates synthetic samples for a trained machine learning model usable to make a classification decision, where the synthetic samples are generated from a rule and a set of existing samples. In some embodiments, the set of existing samples are selected based on exceeding a confidence threshold for the classification decision. In some embodiments, the computer system retrains the trained machine learning model using the synthetic samples.
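The generation step summarized in the abstract can be illustrated with a minimal Python sketch. It follows the remove, select, and reinsert sequence recited in the claims, but it is not the patent's implementation: the function name `generate_synthetic_samples`, the dictionary-based sample format, the toy scorer `toy_score`, the feature names `amount` and `region`, and the threshold of 0.8 are all assumptions made for illustration. In particular, how the trained model scores a sample whose feature has been removed is not specified in the text shown here, so the sketch simply passes the reduced sample to a scoring callable.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]


def generate_synthetic_samples(
    samples: List[Sample],
    feature: str,
    new_value: float,
    score_fn: Callable[[Sample], float],
    confidence_threshold: float,
    label: int,
) -> List[Tuple[Sample, int]]:
    """Remove `feature`, keep confidently scored samples, reinsert a new value."""
    synthetic = []
    for sample in samples:
        # Step 1: drop the rule-specified feature to get a reduced-feature sample.
        reduced = {k: v for k, v in sample.items() if k != feature}
        # Step 2: keep only samples whose score exceeds the confidence threshold.
        if score_fn(reduced) <= confidence_threshold:
            continue
        # Step 3: reinsert the feature only if the new value differs from the original.
        if sample.get(feature) == new_value:
            continue
        synthetic.append(({**reduced, feature: new_value}, label))
    return synthetic


def toy_score(sample: Sample) -> float:
    """Hypothetical stand-in for the trained model's classification score."""
    return sample.get("amount", 0.0) / 100.0


existing = [
    {"amount": 95.0, "region": 0.0},
    {"amount": 40.0, "region": 0.0},
    {"amount": 90.0, "region": 1.0},
]
synthetic = generate_synthetic_samples(
    existing, "region", 1.0, toy_score, confidence_threshold=0.8, label=1
)
print(synthetic)
```

In this sketch the returned pairs would then be added to the training data before retraining; the retraining step itself is left out because it depends on the model type.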
Opening claim text (preview).
What is claimed is:

1. A method, comprising: generating, by a computer system, synthetic samples for a trained machine learning model usable to make a classification decision, wherein the generating includes: removing, based on a rule specifying a particular feature, the particular feature from a set of existing samples to generate a reduced-feature set of training samples, wherein the removing is performed based on the particular feature failing to comply with the rule, and wherein the particular feature is associated with biased classification decisions in the trained machine learning model; selecting a subset of the reduced-feature set of training samples having classification decisions that exceed a confidence threshold, wherein the subset includes less training samples than the reduced-feature set of training samples; and reinserting the particular feature into samples in the selected subset, wherein values for the reinserted particular feature in samples in the selected subset are different than values of the particular feature of corresponding samples in the set of existing samples prior to the removing; and retraining, by the computer system, the trained machine learning model using the synthetic samples that include new values for the particular feature that is associated with biased classification decisions, wherein the retraining reduces bias in the trained machine learning model; and executing, by the computer system, the retrained machine learning model to generate unbiased classifications for one or more new samples.

2. The method of claim 1, wherein the synthetic samples are generated by: creating copies of samples in the set of existing samples; and replacing values in a particular feature of the copied samples with a value specified in the rule.

3. The method of claim 2, wherein the synthetic samples are further generated by: assigning a negative label to the synthetic samples.

4. The method of claim 1, wherein the confidence threshold for the classification decision is determined based on a decision boundary of the trained machine learning model.

5. The method of claim 1, wherein the retraining includes: weighting the synthetic samples based on a total number of generated synthetic samples, wherein the synthetic samples are weighted more heavily than the set of existing samples.

6. The method of claim 1, wherein the rule specifies that a first class of samples have a lower threshold for a positive classification label than a second class of samples, and wherein the set of existing samples are selected based on having a classification score that is lower than a standard classification score threshold.

7. The method of claim 6, wherein the synthetic samples are generated to include one or more features associated with the first class of samples, wherein the first class of samples includes a set of favored accounts, and wherein the second class of samples includes a set of non-favored accounts.

8. The method of claim 1, wherein the retraining is further performed using both the synthetic samples and a plurality of existing samples from which the set of existing samples were selected.

9. A non-transitory computer-readable medium having instructions stored thereon that are executable by a computing device to perform operations comprising: generating, from an initial set of samples used to train a machine learning model, an augmented set of samples, wherein the generating includes: removing a particular feature from one or more of the initial set of samples to generate a reduced-feature set of training samples; determining classification scores for samples in the reduced-feature set of training samples; selecting a group of the reduced-feature set of samples that meet a confidence threshold, wherein the group includes less training samples than the reduced-feature set of training samples; generating synthetic samples by reinserting the particular feature into samples in the selected group, wherein values for the reinserted particular feature in samples in the selected group are different than values of the particular feature in the initial set of samples prior to the removing; adding the synthetic samples to the initial set of samples to generate the augmented set of samples; and retraining the trained machine learning model using the augmented set of samples that include new values for the particular feature that is associated with biased classification decisions, wherein the retraining reduces bias in the trained machine learning model; and generating, using the retrained machine learning model, unbiased classifications for one or more unlabeled samples.

10. The non-transitory computer-readable medium of claim 9, wherein generating the augmented set of samples further includes: assigning a positive classification label to the synthetic samples.

11. The non-transitory computer-readable medium of claim 9, wherein the retraining includes: determining, based on a ratio of a total number of existing samples included in the augmented set to a total number of synthetic samples, weight values for the synthetic samples; and weighting the synthetic samples based on the determined weight values, wherein weight values for the synthetic samples are greater than weight values for the initial set of samples.

12. The non-transitory computer-readable medium of claim 9, wherein the confidence threshold specifies a threshold difference between a decision boundary of the trained machine learning model and classification scores.

13. The non-transitory computer-readable medium of claim 9, wherein reinserting the particular feature into samples in the selected group is performed based on a value specified in a rule.

14. The non-transitory computer-readable medium of claim 9, wherein reinserting the particular feature into samples of the selected group is performed based on a token.

15. The non-transitory computer-readable medium of claim 14, wherein the token corresponds to characteristics of data being classified by the trained machine learning model that are associated with bias, and wherein the retraining reduces bias of the machine learning model.

16. A method, comprising: generating, by a computer system, synthetic samples for a machine learning model trained to make a classification decision based on existing samples, wherein the generating includes: removing, based on a token, a particular feature from a set of existing samples to generate a reduced-feature set of training samples, wherein the particular feature is associated with biased classification decisions in the trained machine learning model; selecting a subset of the reduced-feature set of training samples having classification decisions that exceed a confidence threshold, wherein the subset includes less training samples than the reduced-feature set of training samples; and reinserting the particular feature into samples in the selected subset, wherein values for the reinserted particular feature in samples in the selected subset are different than values of the particular feature of corresponding samples in the set of existing samples prior to the removing; and retraining, by the computer system, the machine learning model using the synthetic samples that include new values for the particular feature that is associated with biased classification decisions, wherein the retraining reduces bias in the trained machine learning model; and executing, by the computer system, the retrained machine learning model to generate unbiased classifications for one or more new samples.

17.
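
Two of the dependent claims describe small calculations that a short Python sketch may make easier to picture: the ratio-based weighting of synthetic samples (claim 11) and a confidence threshold expressed as a distance from the decision boundary (claim 12). The sketch below is illustrative only. The function names `synthetic_weight` and `is_confident`, the baseline weight of 1.0 for existing samples, and the example numbers are assumptions and do not come from the patent text.

```python
def synthetic_weight(n_existing: int, n_synthetic: int) -> float:
    """Per-sample weight for synthetic samples, assuming existing samples weigh 1.0.

    Assumption: the existing-to-synthetic ratio is used directly as the weight.
    It is greater than 1.0 whenever there are fewer synthetic than existing samples.
    """
    if n_existing <= 0 or n_synthetic <= 0:
        raise ValueError("both sample counts must be positive")
    return n_existing / n_synthetic


def is_confident(score: float, decision_boundary: float, min_margin: float) -> bool:
    """True when the score lies at least `min_margin` away from the boundary."""
    return abs(score - decision_boundary) >= min_margin


# Hypothetical counts: 1000 existing samples and 50 synthetic samples.
weight = synthetic_weight(n_existing=1000, n_synthetic=50)
print(weight)  # 20.0

# Hypothetical scores checked against a boundary of 0.5 and a margin of 0.4.
print(is_confident(0.93, decision_boundary=0.5, min_margin=0.4))  # True
print(is_confident(0.62, decision_boundary=0.5, min_margin=0.4))  # False
```

With these assumed numbers each synthetic sample counts 20 times as much as an existing one during retraining, so the small synthetic group is not drowned out by the larger existing set.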