Generating synthetic data using reject inference processes for modifying lead scoring models

US11514515B2 · US · B2

Patent metadata
Publication number: US-11514515-B2
Application number: US-201816037700-A
Country: US
Kind code: B2
Filing date: Jul 17, 2018
Priority date: Jul 17, 2018
Publication date: Nov 29, 2022
Grant date: Nov 29, 2022

Abstract

Official abstract text for this publication.

Methods, systems, and non-transitory computer readable storage media are disclosed for using reject inference to generate synthetic data for modifying lead scoring models. For example, the disclosed system identifies an original dataset corresponding to an output of a lead scoring model that generates scores for a plurality of prospects to indicate a likelihood of success of prospects of the plurality of prospects. In one or more embodiments, the disclosed system selects a reject inference model by performing simulations on historical prospect data associated with the original dataset. Additionally, the disclosed system uses the selected reject inference model to generate an imputed dataset by generating synthetic outcome data representing simulated outcomes of rejected prospects in the original dataset. The disclosed system then uses the imputed dataset to modify the lead scoring model by modifying at least one parameter of the lead scoring model using the synthetic outcome data.
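The flow summarized in the abstract (score the prospects, impute outcomes for rejects, then adjust the model) can be illustrated with a minimal sketch. It assumes a one-feature logistic scoring model, a hard 0.5 cutoff for labeling rejects, and plain gradient descent for the update; none of these choices, nor the names `predict`, `impute_rejects`, and `update_model`, come from the patent text. The sample data is invented for illustration only.

```python
import math
from typing import List, Optional, Sequence, Tuple

# (features, outcome); an outcome of None marks a rejected prospect with no observed result
Row = Tuple[Sequence[float], Optional[int]]


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Logistic score in (0, 1); weights[0] is the bias term."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def impute_rejects(rows: List[Row], weights: Sequence[float],
                   cutoff: float = 0.5) -> List[Tuple[Sequence[float], int]]:
    """Give each reject a hard synthetic label from its score (assumed cutoff rule)."""
    imputed = []
    for features, outcome in rows:
        if outcome is None:
            outcome = 1 if predict(weights, features) >= cutoff else 0
        imputed.append((features, outcome))
    return imputed


def update_model(weights: Sequence[float],
                 labeled_rows: List[Tuple[Sequence[float], int]],
                 lr: float = 0.1, epochs: int = 200) -> List[float]:
    """Adjust the model parameters with gradient descent on log loss."""
    w = list(weights)
    for _ in range(epochs):
        for features, label in labeled_rows:
            error = predict(w, features) - label
            w[0] -= lr * error
            for i, x in enumerate(features):
                w[i + 1] -= lr * error * x
    return w


# Illustrative data: three prospects with known outcomes, two rejects.
original = [([0.9], 1), ([0.8], 1), ([0.3], 0), ([0.7], None), ([0.1], None)]
initial_weights = [-2.0, 4.0]  # hypothetical starting parameters

imputed = impute_rejects(original, initial_weights)
updated_weights = update_model(initial_weights, imputed)
```

Here the imputed dataset keeps the three observed outcomes and adds one synthetic label per reject, and the update step changes the model parameters using all five rows. A production system would likely use a real learning library and a validated labeling rule rather than this toy model.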

First claim

Opening claim text (preview).

What is claimed is:

1. In a digital medium environment for classifying lead prospects, a computer-implemented method for using reject inference to generate synthetic data for modifying machine-learning lead scoring models comprising: identifying, by at least one processor, an original dataset corresponding to an output of a machine-learning lead scoring model that generates scores for a plurality of prospects from the original dataset, the scores indicating a likelihood of success of prospects of the plurality of prospects; determining, by the at least one processor, a success rate threshold by performing a plurality of simulations that utilize the machine-learning lead scoring model with different parameters to determine success rates of historical data associated with the original dataset; determining, by the at least one processor, that a success rate based on a reject rate and a mislabel rate of the original dataset does not meet the success rate threshold; selecting, by the at least one processor, a reject inference model from a plurality of reject inference models in response to the success rate not meeting the success rate threshold, each reject inference model of the plurality of reject inference models comprising a model for generating synthetic outcomes for reject data in the original dataset having no outcome data; generating, by the at least one processor and based on the original dataset, an imputed dataset comprising synthetic outcome data for reject data in a subset of the plurality of prospects to augment the original dataset by using the reject inference model to generate synthetic outcome labels for the reject data in the subset of the plurality of prospects; and updating, by the at least one processor, the machine-learning lead scoring model using the imputed dataset by modifying at least one parameter of the machine-learning lead scoring model based on synthetic outcome data of the imputed dataset.

2. The computer-implemented method as recited in claim 1, wherein selecting the reject inference model from the plurality of reject inference models comprises selecting a simple augmentation model for augmenting the original dataset.

3. The computer-implemented method as recited in claim 1, wherein selecting the reject inference model from the plurality of reject inference models comprises selecting a fuzzy augmentation model for augmenting the original dataset.

4. The computer-implemented method as recited in claim 1, further comprising: determining a plurality of characteristics of the original dataset, the plurality of characteristics comprising a split effectiveness of the machine-learning lead scoring model for the original dataset, the success rate of the original dataset, and a size of a set of known labels in the original dataset; and generating the imputed dataset comprising the synthetic outcome data in response to determining that one or more of the plurality of characteristics do not meet one or more characteristic thresholds.

5. A non-transitory computer readable storage medium comprising instructions that, when executed by at least one processor, cause a computer system to: identify an original dataset corresponding to an output of a machine-learning lead scoring model that generates scores for a plurality of prospects from the original dataset, the scores indicating a likelihood of success of prospects of the plurality of prospects; determine a success rate threshold by performing a plurality of simulations that utilize the machine-learning lead scoring model with different parameters to determine success rates of historical data associated with the original dataset; determine that a success rate based on a reject rate and a mislabel rate of the original dataset does not meet the success rate threshold; select a reject inference model from a plurality of reject inference models in response to the success rate not meeting the success rate threshold, each reject inference model of the plurality of reject inference models comprising a model for generating synthetic outcomes for reject data in the original dataset having no outcome data; generate, based on the original dataset, an imputed dataset comprising synthetic outcome data for reject data in a subset of the plurality of prospects to augment the original dataset by using the reject inference model to generate synthetic outcome labels for the reject data in the subset of the plurality of prospects; and update the machine-learning lead scoring model using the imputed dataset by modifying at least one parameter of the machine-learning lead scoring model based on the synthetic outcome data of the imputed dataset.

6. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine the success rate threshold by performing the plurality of simulations to determine a threshold that meets a specified accuracy with a specified confidence level based on scoring splits for the original dataset.

7. The non-transitory computer readable storage medium as recited in claim 6, wherein the plurality of reject inference models comprises a simple augmentation model and a fuzzy augmentation model.

8. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to: identify a characteristic of the original dataset based on the plurality of prospects in the original dataset; determine that the characteristic of the original dataset does not meet a characteristic threshold indicating whether to use the original dataset or to generate the synthetic outcome data; and generate the synthetic outcome data in response to determining that the characteristic of the original dataset does not meet the characteristic threshold.

9. The non-transitory computer readable storage medium as recited in claim 8, wherein the characteristic comprises a split effectiveness of the machine-learning lead scoring model for the original dataset, the success rate of the original dataset, or a size of a set of known labels in the original dataset.

10. The non-transitory computer readable storage medium as recited in claim 8, further comprising instructions that, when executed by the at least one processor, cause the computer system to: compare a plurality of characteristics of the original dataset to a plurality of characteristic thresholds; and generate the synthetic outcome data in response to determining that the plurality of characteristics of the original dataset do not meet the plurality of characteristic thresholds.

11. The non-transitory computer readable storage medium as recited in claim 10, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine the plurality of characteristic thresholds based on the plurality of simulations on the historical data associated with the original dataset.

12. The non-transitory computer readable storage medium as recited in claim 5, further comprising instructions that, when executed by the at least one processor, cause the computer system to determine a plurality of features of the plurality of prospects for generating the synthetic outcome data, wherein determining the plurality of features comprises: performing a plurality of additional simulations on the historical data associated with the original dataset using variable combinations of the plurality of features; and selecting a set of features based on a performance of the variable combinations of the plurality of featu
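The claims name a fuzzy augmentation model without defining it. As commonly described in the reject inference literature, fuzzy augmentation copies each rejected record twice: once labeled as a success and weighted by the model's predicted success probability, and once labeled as a failure and weighted by the complement. The sketch below follows that general description and is not taken from the patent; the function names, the row layout, and the sample scores are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# (features, outcome, model score); an outcome of None marks a reject
ScoredRow = Tuple[Sequence[float], Optional[int], float]
# (features, label, sample weight)
WeightedRow = Tuple[Sequence[float], int, float]


def fuzzy_augmentation(rows: List[ScoredRow]) -> List[WeightedRow]:
    """Split each reject into a weighted success copy and a weighted failure copy."""
    augmented: List[WeightedRow] = []
    for features, outcome, score in rows:
        if outcome is None:
            augmented.append((features, 1, score))        # success copy, weight p
            augmented.append((features, 0, 1.0 - score))  # failure copy, weight 1 - p
        else:
            augmented.append((features, outcome, 1.0))    # observed outcome, full weight
    return augmented


def weighted_success_rate(rows: List[WeightedRow]) -> float:
    """Share of total sample weight carried by success-labeled rows."""
    total = sum(weight for _, _, weight in rows)
    return sum(weight for _, label, weight in rows if label == 1) / total


# Illustrative data: two prospects with known outcomes, two rejects.
scored = [
    ([0.9], 1, 0.8),
    ([0.3], 0, 0.2),
    ([0.7], None, 0.75),
    ([0.1], None, 0.25),
]
augmented = fuzzy_augmentation(scored)
rate = weighted_success_rate(augmented)
```

Each reject contributes a total weight of 1.0, so the augmented set represents the same number of prospects as the original. The weighted rows can then be passed to any learner that accepts per-sample weights.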

Assignees

Adobe Inc

Inventors

Classifications

  • G06Q40/03 (Primary)

    Credit; Loans; Processing thereof · CPC title

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Ensemble learning · CPC title

  • Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06Q40/03. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 29, 2022 (B2). Legal status and post-grant events are not shown on this page.