System and method for synthesizing data
US-10423890-B1 · Sep 24, 2019 · US
US11514515B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11514515-B2 |
| Application number | US-201816037700-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 17, 2018 |
| Priority date | Jul 17, 2018 |
| Publication date | Nov 29, 2022 |
| Grant date | Nov 29, 2022 |
Official abstract text for this publication.
Methods, systems, and non-transitory computer readable storage media are disclosed for using reject inference to generate synthetic data for modifying lead scoring models. For example, the disclosed system identifies an original dataset corresponding to an output of a lead scoring model that generates scores for a plurality of prospects to indicate a likelihood of success of prospects of the plurality of prospects. In one or more embodiments, the disclosed system selects a reject inference model by performing simulations on historical prospect data associated with the original dataset. Additionally, the disclosed system uses the selected reject inference model to generate an imputed dataset by generating synthetic outcome data representing simulated outcomes of rejected prospects in the original dataset. The disclosed system then uses the imputed dataset to modify the lead scoring model by modifying at least one parameter of the lead scoring model using the synthetic outcome data.
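The flow described in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes a one-feature logistic model as a stand-in for the lead scoring model, and uses a hard-cutoff labeling rule (one common reading of "simple augmentation" in the reject inference literature) as the reject inference step. All names (`fit_logistic`, `impute_rejects`, the toy data, the 0.5 cutoff) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, epochs=2000, lr=0.1):
    """Fit p(success | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def impute_rejects(reject_xs, params, cutoff=0.5):
    """Assign a synthetic outcome label to each rejected prospect (assumed rule)."""
    w, b = params
    return [1 if sigmoid(w * x + b) >= cutoff else 0 for x in reject_xs]

# Hypothetical toy data: one feature per prospect.
accepted_xs = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]  # prospects with known outcomes
accepted_ys = [0, 0, 1, 0, 1, 1]
rejected_xs = [0.5, 1.0, 1.5]                  # rejects: no outcome data

# 1. Model parameters learned from the original (accepted-only) dataset.
original_params = fit_logistic(accepted_xs, accepted_ys)

# 2. Reject inference: generate synthetic outcome labels for the rejects.
synthetic_ys = impute_rejects(rejected_xs, original_params)

# 3. Imputed dataset = original outcomes plus synthetic outcomes.
imputed_xs = accepted_xs + rejected_xs
imputed_ys = accepted_ys + synthetic_ys

# 4. Update the model parameters using the imputed dataset.
updated_params = fit_logistic(imputed_xs, imputed_ys)
print(original_params, updated_params)
```

The refit in step 4 is what the abstract calls modifying at least one parameter of the model; here that is simply the pair `(w, b)`.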
Opening claim text (preview).
What is claimed is:
1. In a digital medium environment for classifying lead prospects, a computer-implemented method for using reject inference to generate synthetic data for modifying machine-learning lead scoring models comprising: identifying, by at least one processor, an original dataset corresponding to an output of a machine-learning lead scoring model that generates scores for a plurality of prospects from the original dataset, the scores indicating a likelihood of success of prospects of the plurality of prospects; determining, by the at least one processor, a success rate threshold by performing a plurality of simulations that utilize the machine-learning lead scoring model with different parameters to determine success rates of historical data associated with the original dataset; determining, by the at least one processor, that a success rate based on a reject rate and a mislabel rate of the original dataset does not meet the success rate threshold; selecting, by the at least one processor, a reject inference model from a plurality of reject inference models in response to the success rate not meeting the success rate threshold, each reject inference model of the plurality of reject inference models comprising a model for generating synthetic outcomes for reject data in the original dataset having no outcome data; generating, by the at least one processor and based on the original dataset, an imputed dataset comprising synthetic outcome data for reject data in a subset of the plurality of prospects to augment the original dataset by using the reject inference model to generate synthetic outcome labels for the reject data in the subset of the plurality of prospects; and updating, by the at least one processor, the machine-learning lead scoring model using the imputed dataset by modifying at least one parameter of the machine-learning lead scoring model based on synthetic outcome data of the imputed dataset.
2. The computer-implemented method as recited in claim 1, wherein selecting the reject inference model from the plurality of reject inference models comprises selecting a simple augmentation model for augmenting the original dataset.
3. The computer-implemented method as recited in claim 1, wherein selecting the reject inference model from the plurality of reject inference models comprises selecting a fuzzy augmentation model for augmenting the original dataset.
4. The computer-implemented method as recited in claim 1, further comprising: determining a plurality of characteristics of the original dataset, the plurality of characteristics comprising a split effectiveness of the machine-learning lead scoring model for the original dataset, the success rate of the original dataset, and a size of a set of known labels in the original dataset; and generating the imputed dataset comprising the synthetic outcome data in response to determining that one or more of the plurality of characteristics do not meet one or more characteristic thresholds.
5. A non-transitory computer readable storage medium comprising instructions that, when executed by at least one processor, cause a computer system to: identify an original dataset corresponding to an output of a machine-learning lead scoring model that generates scores for a plurality of prospects from the original dataset, the scores indicating a likelihood of success of prospects of the plurality of prospects; determine a success rate threshold by performing a plurality of simulations that utilize the machine-learning lead scoring model with different parameters to determine success rates of historical data associated with the original dataset; determine that a success rate based on a reject rate and a mislabel rate of the original dataset does not meet the success rate threshold; select a reject inference model from a plurality of reject inference models in response to the success rate not meeting the success rate threshold, each reject inference model of the plurality of reject inference models comprising a model for generating synthetic outcomes for reject data in the original dataset having no outcome data; generate, based on the original dataset, an imputed dataset comprising synthetic outcome data for reject data in a subset of the plurality of prospects to augment the original dataset by using the reject inference model to generate synthetic outcome labels for the reject data in the subset of the plurality of prospects; and update the machine-learning lead scoring model using the imputed dataset by modifying at least one parameter of the machine-learning lead scoring model based on the synthetic outcome data of the imputed dataset.
6. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine the success rate threshold by performing the plurality of simulations to determine a threshold that meets a specified accuracy with a specified confidence level based on scoring splits for the original dataset.
7. The non-transitory computer readable storage medium as recited in claim 6, wherein the plurality of reject inference models comprises a simple augmentation model and a fuzzy augmentation model.
8. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to: identify a characteristic of the original dataset based on the plurality of prospects in the original dataset; determine that the characteristic of the original dataset does not meet a characteristic threshold indicating whether to use the original dataset or to generate the synthetic outcome data; and generate the synthetic outcome data in response to determining that the characteristic of the original dataset does not meet the characteristic threshold.
9. The non-transitory computer readable storage medium as recited in claim 8, wherein the characteristic comprises a split effectiveness of the machine-learning lead scoring model for the original dataset, the success rate of the original dataset, or a size of a set of known labels in the original dataset.
10. The non-transitory computer readable storage medium as recited in claim 8, further comprising instructions that, when executed by the at least one processor, cause the computer system to: compare a plurality of characteristics of the original dataset to a plurality of characteristic thresholds; and generate the synthetic outcome data in response to determining that the plurality of characteristics of the original dataset do not meet the plurality of characteristic thresholds.
11. The non-transitory computer readable storage medium as recited in claim 10, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine the plurality of characteristic thresholds based on the plurality of simulations on the historical data associated with the original dataset.
12. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine a plurality of features of the plurality of prospects for generating the synthetic outcome data, wherein determining the plurality of features comprises: performing a plurality of additional simulations on the historical data associated with the original dataset using variable combinations of the plurality of features; and selecting a set of features based on a performance of the variable combinations of the plurality of featu
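The claims name a fuzzy augmentation model but this preview does not define it. As a rough illustration only, the sketch below follows the usual textbook form of fuzzy augmentation from the credit-scoring literature: each rejected record is duplicated as one "success" copy weighted by its predicted probability and one "failure" copy weighted by the complement. Whether the patent uses exactly this scheme is an assumption, and `fuzzy_augment`, `weighted_success_rate`, and the constant score function are hypothetical names and values.

```python
def fuzzy_augment(accepted, rejected, score_fn):
    """Build a weighted imputed dataset.

    accepted: list of (features, label) pairs with known outcomes.
    rejected: list of features with no outcome data.
    score_fn: maps features to an estimated probability of success.
    Returns a list of (features, label, weight) records.
    """
    # Known outcomes keep full weight.
    records = [(x, y, 1.0) for x, y in accepted]
    for x in rejected:
        p = score_fn(x)
        records.append((x, 1, p))        # partial "success" copy
        records.append((x, 0, 1.0 - p))  # partial "failure" copy
    return records

def weighted_success_rate(records):
    """Success rate of the imputed dataset, counting each record by its weight."""
    total = sum(w for _, _, w in records)
    return sum(w for _, y, w in records if y == 1) / total

# Hypothetical example: two accepted prospects, one rejected prospect,
# and a placeholder score function that always returns 0.25.
accepted = [(3.0, 1), (2.0, 0)]
rejected = [1.0]
records = fuzzy_augment(accepted, rejected, lambda x: 0.25)
rate = weighted_success_rate(records)
```

Compared with a hard cutoff, the weighted copies keep the scoring model's uncertainty about each reject in the imputed dataset, at the cost of requiring a downstream learner that accepts per-record weights.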
Credit; Loans; Processing thereof · CPC title
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Ensemble learning · CPC title
Machine learning · CPC title