Training data augmentation for machine learning

US12346776B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12346776-B2
Application number  US-202016953030-A
Country             US
Kind code           B2
Filing date         Nov 19, 2020
Priority date       Nov 19, 2020
Publication date    Jul 1, 2025
Grant date          Jul 1, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are disclosed relating to training a machine learning model to understand one or more rules without explicitly executing the rule. In some embodiments, a computer system generates synthetic samples for a trained machine learning model usable to make a classification decision, where the synthetic samples are generated from a rule and a set of existing samples. In some embodiments, the set of existing samples are selected based on exceeding a confidence threshold for the classification decision. In some embodiments, the computer system retrains the trained machine learning model using the synthetic samples.
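
As a rough illustration of the generation step the abstract describes, the sketch below copies existing samples whose classification score exceeds a confidence threshold and overwrites one feature with a value taken from a rule. The `Rule` fields, the `score` callable, the function name, and the toy data are all assumptions made for this example. The patent does not prescribe this data layout, and retraining on the returned samples is left to whatever training routine is in use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape: samples with `feature == value` get `label`."""
    feature: str
    value: object
    label: int


def generate_synthetic_samples(
    existing: List[Sample],
    score: Callable[[Sample], float],
    rule: Rule,
    confidence_threshold: float,
) -> List[Tuple[Sample, int]]:
    """Return (sample, label) pairs built from confidently classified samples."""
    synthetic = []
    for sample in existing:
        # Keep only samples whose score exceeds the confidence threshold.
        if score(sample) <= confidence_threshold:
            continue
        # Copy the sample and overwrite the rule's feature with the rule's value.
        modified = dict(sample)
        modified[rule.feature] = rule.value
        synthetic.append((modified, rule.label))
    return synthetic


# Toy data and a stand-in scorer, both invented for illustration.
existing = [
    {"amount": 10, "country": "US"},
    {"amount": 500, "country": "US"},
]
toy_score = lambda s: 0.9 if s["amount"] < 100 else 0.6
rule = Rule(feature="country", value="XX", label=0)

synthetic = generate_synthetic_samples(existing, toy_score, rule, 0.8)
print(synthetic)
```

Because the originals are copied rather than mutated, the synthetic pairs can be appended to the existing training set without disturbing it.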

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: generating, by a computer system, synthetic samples for a trained machine learning model usable to make a classification decision, wherein the generating includes: removing, based on a rule specifying a particular feature, the particular feature from a set of existing samples to generate a reduced-feature set of training samples, wherein the removing is performed based on the particular feature failing to comply with the rule, and wherein the particular feature is associated with biased classification decisions in the trained machine learning model; selecting a subset of the reduced-feature set of training samples having classification decisions that exceed a confidence threshold, wherein the subset includes less training samples than the reduced-feature set of training samples; and reinserting the particular feature into samples in the selected subset, wherein values for the reinserted particular feature in samples in the selected subset are different than values of the particular feature of corresponding samples in the set of existing samples prior to the removing; and retraining, by the computer system, the trained machine learning model using the synthetic samples that include new values for the particular feature that is associated with biased classification decisions, wherein the retraining reduces bias in the trained machine learning model; and executing, by the computer system, the retrained machine learning model to generate unbiased classifications for one or more new samples.

2. The method of claim 1, wherein the synthetic samples are generated by: creating copies of samples in the set of existing samples; and replacing values in a particular feature of the copied samples with a value specified in the rule.

3. The method of claim 2, wherein the synthetic samples are further generated by: assigning a negative label to the synthetic samples.

4. The method of claim 1, wherein the confidence threshold for the classification decision is determined based on a decision boundary of the trained machine learning model.

5. The method of claim 1, wherein the retraining includes: weighting the synthetic samples based on a total number of generated synthetic samples, wherein the synthetic samples are weighted more heavily than the set of existing samples.

6. The method of claim 1, wherein the rule specifies that a first class of samples have a lower threshold for a positive classification label than a second class of samples, and wherein the set of existing samples are selected based on having a classification score that is lower than a standard classification score threshold.

7. The method of claim 6, wherein the synthetic samples are generated to include one or more features associated with the first class of samples, wherein the first class of samples includes a set of favored accounts, and wherein the second class of samples includes a set of non-favored accounts.

8. The method of claim 1, wherein the retraining is further performed using both the synthetic samples and a plurality of existing samples from which the set of existing samples were selected.

9. A non-transitory computer-readable medium having instructions stored thereon that are executable by a computing device to perform operations comprising: generating, from an initial set of samples used to train a machine learning model, an augmented set of samples, wherein the generating includes: removing a particular feature from one or more of the initial set of samples to generate a reduced-feature set of training samples; determining classification scores for samples in the reduced-feature set of training samples; selecting a group of the reduced-feature set of samples that meet a confidence threshold, wherein the group includes less training samples than the reduced-feature set of training samples; generating synthetic samples by reinserting the particular feature into samples in the selected group, wherein values for the reinserted particular feature in samples in the selected group are different than values of the particular feature in the initial set of samples prior to the removing; adding the synthetic samples to the initial set of samples to generate the augmented set of samples; and retraining the trained machine learning model using the augmented set of samples that include new values for the particular feature that is associated with biased classification decisions, wherein the retraining reduces bias in the trained machine learning model; and generating, using the retrained machine learning model, unbiased classifications for one or more unlabeled samples.

10. The non-transitory computer-readable medium of claim 9, wherein generating the augmented set of samples further includes: assigning a positive classification label to the synthetic samples.

11. The non-transitory computer-readable medium of claim 9, wherein the retraining includes: determining, based on a ratio of a total number of existing samples included in the augmented set to a total number of synthetic samples, weight values for the synthetic samples; and weighting the synthetic samples based on the determined weight values, wherein weight values for the synthetic samples are greater than weight values for the initial set of samples.

12. The non-transitory computer-readable medium of claim 9, wherein the confidence threshold specifies a threshold difference between a decision boundary of the trained machine learning model and classification scores.

13. The non-transitory computer-readable medium of claim 9, wherein reinserting the particular feature into samples in the selected group is performed based on a value specified in a rule.

14. The non-transitory computer-readable medium of claim 9, wherein reinserting the particular feature into samples of the selected group is performed based on a token.

15. The non-transitory computer-readable medium of claim 14, wherein the token corresponds to characteristics of data being classified by the trained machine learning model that are associated with bias, and wherein the retraining reduces bias of the machine learning model.

16. A method, comprising: generating, by a computer system, synthetic samples for a machine learning model trained to make a classification decision based on existing samples, wherein the generating includes: removing, based on a token, a particular feature from a set of existing samples to generate a reduced-feature set of training samples, wherein the particular feature is associated with biased classification decisions in the trained machine learning model; selecting a subset of the reduced-feature set of training samples having classification decisions that exceed a confidence threshold, wherein the subset includes less training samples than the reduced-feature set of training samples; and reinserting the particular feature into samples in the selected subset, wherein values for the reinserted particular feature in samples in the selected subset are different than values of the particular feature of corresponding samples in the set of existing samples prior to the removing; and retraining, by the computer system, the machine learning model using the synthetic samples that include new values for the particular feature that is associated with biased classification decisions, wherein the retraining reduces bias in the trained machine learning model; and executing, by the computer system, the retrained machine learning model to generate unbiased classifications for one or more new samples.

17.
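
The remove, select, and reinsert steps recited in the independent claims, together with the margin-based threshold of claims 4 and 12 and the ratio-based weighting of claim 11, might be sketched as follows. This is one possible reading under stated assumptions, not the claimed implementation: all function names are hypothetical, the margin test (`abs(score - boundary) > min_margin`) and the "first differing candidate" replacement policy are choices made for the example, and labeling and retraining are left out.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Dict[str, object]


def remove_feature(samples: Sequence[Sample], feature: str) -> List[Sample]:
    """Drop one feature from every sample, giving the reduced-feature set."""
    return [{k: v for k, v in s.items() if k != feature} for s in samples]


def select_confident(
    reduced: Sequence[Sample],
    score: Callable[[Sample], float],
    boundary: float,
    min_margin: float,
) -> List[int]:
    """Indices of samples scored more than `min_margin` from the boundary."""
    return [
        i for i, s in enumerate(reduced)
        if abs(score(s) - boundary) > min_margin
    ]


def reinsert_feature(
    originals: Sequence[Sample],
    reduced: Sequence[Sample],
    selected: Sequence[int],
    feature: str,
    candidate_values: Sequence[object],
) -> List[Sample]:
    """Re-add the feature with a value that differs from the original one."""
    synthetic = []
    for i in selected:
        old = originals[i].get(feature)
        alternatives = [v for v in candidate_values if v != old]
        if not alternatives:
            continue  # no differing value available for this sample
        # Assumed policy: take the first candidate that differs.
        synthetic.append({**reduced[i], feature: alternatives[0]})
    return synthetic


def sample_weights(n_existing: int, n_synthetic: int) -> Tuple[float, float]:
    """Return (existing_weight, synthetic_weight) from the count ratio."""
    if not 0 < n_synthetic < n_existing:
        raise ValueError("expected fewer synthetic than existing samples")
    return 1.0, n_existing / n_synthetic


# Toy data and scorer, invented for illustration only.
originals = [
    {"amount": 5, "region": "A"},
    {"amount": 45, "region": "A"},
    {"amount": 60, "region": "B"},
    {"amount": 95, "region": "B"},
]
toy_score = lambda s: s["amount"] / 100  # ignores the removed feature

reduced = remove_feature(originals, "region")
selected = select_confident(reduced, toy_score, boundary=0.5, min_margin=0.3)
synthetic = reinsert_feature(originals, reduced, selected, "region", ["A", "B"])
existing_w, synthetic_w = sample_weights(len(originals), len(synthetic))
print(selected, synthetic, synthetic_w)
```

Weighting by the count ratio lets the smaller synthetic group carry about as much total weight as the existing samples during retraining; many training APIs accept such values as per-sample weights.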

Assignees

Paypal Inc

Inventors

Classifications

  • G06F21/52 · Primary

    during program execution, e.g. stack integrity {; Preventing unwanted data erasure; Buffer overflow} · CPC title

  • Protecting personal data, e.g. for financial or medical purposes · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12346776B2 cover?
Techniques are disclosed relating to training a machine learning model to understand one or more rules without explicitly executing the rule. In some embodiments, a computer system generates synthetic samples for a trained machine learning model usable to make a classification decision, where the synthetic samples are generated from a rule and a set of existing samples. In some embodiments, the…
Who is the assignee on this patent?
Paypal Inc
What technology area does this patent fall under?
Primary CPC classification G06F21/52. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).