System and method for synthesizing data
US-10423890-B1 · Sep 24, 2019 · US
US2020027157A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2020027157-A1 |
| Application number | US-201816037700-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 17, 2018 |
| Priority date | Jul 17, 2018 |
| Publication date | Jan 23, 2020 |
| Grant date | — |
Official abstract text for this publication.
Methods, systems, and non-transitory computer readable storage media are disclosed for using reject inference to generate synthetic data for modifying lead scoring models. For example, the disclosed system identifies an original dataset corresponding to an output of a lead scoring model that generates scores for a plurality of prospects to indicate a likelihood of success of prospects of the plurality of prospects. In one or more embodiments, the disclosed system selects a reject inference model by performing simulations on historical prospect data associated with the original dataset. Additionally, the disclosed system uses the selected reject inference model to generate an imputed dataset by generating synthetic outcome data representing simulated outcomes of rejected prospects in the original dataset. The disclosed system then uses the imputed dataset to modify the lead scoring model by modifying at least one parameter of the lead scoring model using the synthetic outcome data.
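The abstract lists the steps but does not show how a rejected prospect receives a synthetic outcome. The following is a minimal Python sketch of the two reject inference techniques named in the claims, simple augmentation and fuzzy augmentation, as those terms are commonly used in the credit-scoring literature; the patent's own definitions may differ. `Record`, `simple_augmentation`, `fuzzy_augmentation`, `build_imputed_dataset`, and the 0.5 cutoff are hypothetical names and values chosen for illustration, and retraining the lead scoring model on the result is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One training row: prospect features, an outcome label, and a sample weight."""
    features: List[float]
    label: int           # 1 = success, 0 = failure
    weight: float = 1.0


def simple_augmentation(rejected: List[List[float]],
                        scores: List[float],
                        cutoff: float = 0.5) -> List[Record]:
    """Give each rejected prospect one hard synthetic label.

    Assumption: the label is 1 when the model score is at or above `cutoff`.
    """
    return [Record(x, 1 if p >= cutoff else 0, 1.0)
            for x, p in zip(rejected, scores)]


def fuzzy_augmentation(rejected: List[List[float]],
                       scores: List[float]) -> List[Record]:
    """Give each rejected prospect two weighted synthetic rows.

    One row is labeled success with weight p, the other failure with
    weight 1 - p, where p is the model score for that prospect.
    """
    rows: List[Record] = []
    for x, p in zip(rejected, scores):
        rows.append(Record(x, 1, p))
        rows.append(Record(x, 0, 1.0 - p))
    return rows


def build_imputed_dataset(known: List[Record],
                          synthetic: List[Record]) -> List[Record]:
    """Combine rows with observed outcomes and rows with synthetic outcomes."""
    return known + synthetic


# Hypothetical data: two prospects with observed outcomes, two rejected ones.
known = [Record([0.7], 1), Record([0.6], 0)]
rejected = [[0.2], [0.9]]
scores = [0.3, 0.8]      # lead scoring model output for the rejected prospects

imputed_hard = build_imputed_dataset(known, simple_augmentation(rejected, scores))
imputed_soft = build_imputed_dataset(known, fuzzy_augmentation(rejected, scores))
```

Under fuzzy augmentation each rejected prospect contributes a total weight of 1, split between the two possible outcomes, so a weighted learner treats it as one uncertain observation instead of one confident one.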
Opening claim text (preview).
What is claimed is:
1. In a digital medium environment for classifying lead prospects, a computer-implemented method for using reject inference to generate synthetic data for modify lead scoring models comprising: identifying, by at least one processor, an original dataset corresponding to an output of a lead scoring model that generates scores for a plurality of prospects, the scores indicating a likelihood of success of prospects of the plurality of prospects; a step for generating an imputed dataset by selecting a reject inference model from a plurality of reject inference models and generating outcome data by performing a plurality of simulations; and updating the lead scoring model using the imputed dataset by modifying at least one parameter of the lead scoring model based on synthetic outcome data of the imputed dataset.
2. The computer-implemented method as recited in claim 1, wherein the step for generating the imputed dataset by selecting a reject inference model from a plurality of reject inference models and generating outcome data by performing a plurality of simulations comprises selecting a simple augmentation model for augmenting the original dataset.
3. The computer-implemented method as recited in claim 1, wherein the step for generating the imputed dataset by selecting a reject inference model from a plurality of reject inference models and generating outcome data by performing a plurality of simulations comprises selecting a fuzzy augmentation model for augmenting the original dataset.
4. The computer-implemented method as recited in claim 1, further comprising identifying a plurality of characteristics of the original dataset, the plurality of characteristics comprising a split effectiveness of the lead scoring model for the original dataset, a success rate of the original dataset, and a size of a set of known labels in the original dataset.
5. A non-transitory computer readable storage medium comprising instructions that, when executed by at least one processor, cause a computer system to: identify an original dataset corresponding to an output of a lead scoring model that generates scores for a plurality of prospects, the scores indicating a likelihood of success of prospects of the plurality of prospects; generate, based on the original dataset, an imputed dataset using a reject inference model on a subset of the plurality of prospects to generate synthetic outcome data for the subset; and update the lead scoring model using the imputed dataset by modifying at least one parameter of the lead scoring model based on the synthetic outcome data.
6. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to select the reject inference model from a plurality of reject inference models by performing a plurality of simulations using the plurality of reject inference models on historical data associated with the original dataset.
7. The non-transitory computer readable storage medium as recited in claim 6, wherein the plurality of reject inference models comprises a simple augmentation model and a fuzzy augmentation model.
8. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to: identify a characteristic of the original dataset based on the plurality of prospects in the original dataset; determine that the characteristic of the original dataset does not meet a characteristic threshold indicating whether to use the original dataset or to generate the synthetic outcome data; and generate the synthetic outcome data in response to determining that the characteristic of the original dataset does not meet the characteristic threshold.
9. The non-transitory computer readable storage medium as recited in claim 8, wherein the characteristic comprises a split effectiveness of the lead scoring model for the original dataset, a success rate of the original dataset, or a size of a set of known labels in the original dataset.
10. The non-transitory computer readable storage medium as recited in claim 8, further comprising instructions that, when executed by the at least one processor, cause the computer system to: compare a plurality of characteristics of the original dataset to a plurality of characteristic thresholds; and generate the synthetic outcome data in response to determining that the plurality of characteristics of the original dataset do not meet the plurality of characteristic thresholds.
11. The non-transitory computer readable storage medium as recited in claim 10, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine the plurality of characteristic thresholds based on historical data associated with the original dataset.
12. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine a plurality of features of the plurality of prospects for generating the synthetic outcome data, wherein determining the plurality of features comprises: performing a plurality of simulations on historical data associated with the original dataset using variable combinations of the plurality of features; and selecting a set of features based on a performance of the variable combinations of the plurality of features in the plurality of simulations.
13. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to score a plurality of new prospects using the updated lead scoring model based on the synthetic outcome data.
14. In a digital medium environment for classifying lead prospects, a system for using reject inference to generate synthetic data for modify lead scoring models comprising: at least one processor; and a non-transitory computer memory comprising: an original dataset comprising data for a plurality of prospects; and instructions that, when executed by the at least one processor, cause the system to: identify an output of a lead scoring model that generates scores for a plurality of prospects, the scores indicating a likelihood of success of each prospect of the plurality of prospects; select a reject inference model from a plurality of reject inference models based on a plurality of simulations performed on historical prospect data associated with the original dataset using the plurality of reject inference models; generate an imputed dataset using the selected reject inference model on a subset of the plurality of prospects corresponding to rejected prospects to generate synthetic outcome data representing simulated outcomes of the subset of the plurality of prospects; and modify the lead scoring model based on the synthetic outcome data of the imputed dataset by modifying at least one parameter of the lead scoring model.
15. The system as recited in claim 14, further comprising instructions that, when executed by the at least one processor, cause the system to: identify a plurality of characteristics of the original dataset based on the plurality of prospects in the original dataset; determine that the plurality of characteristics of the original dataset does not meet a plurality of characteristic thresholds indicating whether to use the original d
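Claims 6 and 14 recite selecting a reject inference model by running simulations on historical data, without stating how the simulations are scored. One possible reading, sketched below in Python, is to take historical prospects whose outcomes are all known, hide the outcomes of those scoring below a simulated rejection cutoff, let each candidate impute them, and keep the candidate with the lowest total squared error. Everything here is an assumption made for illustration: the names `simple_expected`, `fuzzy_expected`, `simulation_error`, and `select_model`, the squared-error criterion, and the use of one simulation per cutoff.

```python
from typing import Callable, Dict, List, Tuple

# A candidate maps model scores to an expected outcome per rejected prospect.
Imputer = Callable[[List[float]], List[float]]
History = List[Tuple[float, int]]   # (model score, observed outcome 0 or 1)


def simple_expected(scores: List[float]) -> List[float]:
    """Hard label at a 0.5 cutoff (assumed), as in simple augmentation."""
    return [1.0 if p >= 0.5 else 0.0 for p in scores]


def fuzzy_expected(scores: List[float]) -> List[float]:
    """Expected outcome equals the score, as in fuzzy augmentation."""
    return list(scores)


def simulation_error(history: History, reject_below: float,
                     imputer: Imputer) -> float:
    """Hide outcomes of prospects scoring below the cutoff, impute them,
    and return the mean squared error against the true outcomes."""
    hidden = [(s, y) for s, y in history if s < reject_below]
    if not hidden:
        return 0.0
    imputed = imputer([s for s, _ in hidden])
    return sum((i - y) ** 2 for i, (_, y) in zip(imputed, hidden)) / len(hidden)


def select_model(history: History, candidates: Dict[str, Imputer],
                 cutoffs: List[float]) -> str:
    """Run one simulation per cutoff and return the name of the candidate
    with the lowest total error."""
    totals = {name: 0.0 for name in candidates}
    for cutoff in cutoffs:
        for name, imputer in candidates.items():
            totals[name] += simulation_error(history, cutoff, imputer)
    return min(totals, key=lambda name: totals[name])


CANDIDATES: Dict[str, Imputer] = {
    "simple": simple_expected,
    "fuzzy": fuzzy_expected,
}

# Hypothetical history in which scores match observed success rates.
calibrated: History = (
    [(0.2, 1)] * 2 + [(0.2, 0)] * 8 +
    [(0.4, 1)] * 4 + [(0.4, 0)] * 6 +
    [(0.8, 1)] * 8 + [(0.8, 0)] * 2
)
best = select_model(calibrated, CANDIDATES, cutoffs=[0.3, 0.5, 0.9])
```

With this scoring rule the soft candidate tends to win when scores are well calibrated, and the hard candidate wins when scores separate outcomes cleanly. A different criterion, such as the downstream performance of the retrained model, could rank the candidates differently.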
Ensemble learning · CPC title
Fuzzy inferencing · CPC title
Market modelling; Market analysis; Collecting market data · CPC title
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title