Curating Training Data For Incremental Re-Training Of A Predictive Model

US2020012963A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2020012963-A1
Application number	US-201916418232-A
Country	US
Kind code	A1
Filing date	May 21, 2019
Priority date	Oct 28, 2014
Publication date	Jan 9, 2020
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In general, embodiments of the present invention provide systems, methods and computer readable media for curating a training data set to ensure that training data being updated continuously from a data reservoir of verified possible training examples remain an accurate, high-quality representation of the distribution of data that are being input to a predictive model for processing.
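
As a rough illustration of what curating from a reservoir can mean in practice, the sketch below picks reservoir instances that are not already in the training set and that move each class toward a target count, in the spirit of the class-balance criterion in the dependent claims. The function name `select_for_balance`, the `per_class_target` parameter, and the `(features, true_label)` tuple layout are assumptions made for this example and do not come from the patent text.

```python
from collections import Counter

def select_for_balance(training_set, reservoir, per_class_target):
    """Hypothetical helper: pick reservoir instances that move each class
    toward per_class_target examples.

    training_set, reservoir: lists of (features, true_label) pairs, where
    features is hashable (for example a tuple).
    """
    counts = Counter(label for _, label in training_set)
    already_used = set(training_set)
    selected = []
    for instance in reservoir:
        _, label = instance
        if instance in already_used:
            continue  # skip instances already present in the training data
        if counts[label] < per_class_target:
            selected.append(instance)
            counts[label] += 1
            already_used.add(instance)
    return selected
```

A real system would likely combine this with other criteria mentioned in the claims, such as outlier removal or filtering by data source; the sketch covers class balance only.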

First claim

Opening claim text (preview).

1. A computer-implemented method for adaptively improving the performance of a current predictive model, the method comprising: selecting a set of labeled data instances from a labeled data reservoir, the labeled data reservoir comprising a pool of possible training data, wherein selecting the set of labeled data instances is based on a determination that re-training a current predictive model with updated training data likely will result in improved model performance; generating a candidate model using at least one candidate training data set generated based at least in part on a received training data set and the set of labeled data instances; in an instance in which a performance of the candidate model is improved from a performance of the current predictive model, instantiating the candidate training data set and the candidate model.
2-13. (canceled)
14. A computer program product, stored on a non-transitory computer readable medium, comprising instructions that when executed on one or more computers cause the one or more computers to perform operations comprising: selecting a set of labeled data instances from a labeled data reservoir, the labeled data reservoir comprising a pool of possible training data, wherein selecting the set of labeled data instances is based on a determination that re-training a current predictive model with updated training data likely will result in improved model performance; generating a candidate model using at least one candidate training data set generated based at least in part on a received training data set and the set of labeled data instances; in an instance in which a performance of the candidate model is improved from a performance of the current predictive model, instantiating the candidate training data set and the candidate model.
15. The computer program product of claim 14, wherein the labeled data reservoir comprises data that have been collected continuously over time from input data being processed by the current predictive model.
16. The computer program product of claim 14, wherein the set of labeled data instances is not included in the received training data set, and wherein each labeled data instance is associated with a true label representing the data instance.
17. The computer program product of claim 16, wherein the determination is based at least in part on a distribution and quality of the training data set.
18. The computer program product of claim 17, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, wherein a true label associated with a labeled data instance identifies the predictive category to which the labeled data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir is based at least in part on maintaining a class balance within the training data.
19. The computer program product of claim 14, wherein generating the candidate training data comprises identifying and removing outlier instances.
20. The computer program product of claim 19, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir comprises identifying and removing outlier instances in one predictive category.
21. The computer program product of claim 14, wherein the labeled data reservoir comprises labeled data instances that are received from multiple sources, and wherein selecting a labeled data instance from the set of labeled data instances comprises: selecting the labeled data instance in an instance in which a source of the labeled data instance matches with a pre-determined source.
22. The computer program product of claim 14, wherein generating at least one candidate training data set is based on a greedy algorithm, the generating comprising: generating a first candidate training data set by adding a first subset of the labeled data instances to the training data; and generating a second candidate training data set by adding a second subset of the labeled data instances to the first candidate training data set.
23. The computer program product of claim 14, wherein generating at least one candidate training data set is based on a non-greedy algorithm, the generating comprising: replacing the training data with a subset of the labeled data instances.
24. (canceled)
25. The computer program product of claim 24, wherein generating the assessment comprises calculating a cross-validation between the candidate model performance and the current predictive model performance.
26. The computer program product of claim 24, wherein there are multiple candidate models, and wherein generating the assessment for each of the multiple candidate models is implemented in parallel.
27. A system, comprising: one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: selecting a set of labeled data instances from a labeled data reservoir, the labeled data reservoir comprising a pool of possible training data, wherein selecting the set of labeled data instances is based on a determination that re-training a current predictive model with updated training data likely will result in improved model performance; generating a candidate model using at least one candidate training data set generated based at least in part on a received training data set and the set of labeled data instances; in an instance in which a performance of the candidate model is improved from a performance of the current predictive model, instantiating the candidate training data set and the candidate model.
28. The system of claim 27, wherein the labeled data reservoir comprises data that have been collected continuously over time from input data being processed by the current predictive model.
29. The system of claim 27, wherein the set of labeled data instances is not included in the received training data set, and wherein each labeled data instance is associated with a true label representing the data instance.
30. The system of claim 29, wherein the determination is based at least in part on a distribution and quality of the training data.
31. The system of claim 30, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, wherein a true label associated with a labeled data instance identifies the predictive category to which the labeled data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir is based at least in part on maintaining a class balance within the training data.
32. The system of claim 27, wherein generating the candidate training data comprises identifying and removing outlier instances.
33. The system of claim 32, wherein the current predictive model is a classifier predicting to which of a set of predictive categories an input data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir comprises identifying and removing outlier instances in one predictive category.
34. The system of claim 27, wherein the labeled data reservoir co
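
The independent claims describe a loop: build candidate training sets from the current training data plus selected reservoir instances, train a candidate model on each, and adopt a candidate only if it outperforms the current model. A minimal sketch of that loop follows, using the greedy variant in which subsets are added cumulatively. Everything concrete here is an assumption for illustration: the toy 1-D nearest-centroid classifier, holdout accuracy as the performance measure, and the names `train`, `predict`, `accuracy`, and `retrain_if_improved`. The patent text does not prescribe any of them.

```python
from statistics import mean

def train(data):
    """Toy stand-in model: 1-D nearest-centroid classifier.
    data is a list of (feature, label); returns {label: centroid}."""
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}

def predict(model, x):
    # Return the label whose centroid is closest to x.
    return min(model, key=lambda y: abs(model[y] - x))

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

def retrain_if_improved(training_set, selected, holdout, n_subsets=2):
    """Greedy sketch: add subsets of the selected reservoir instances one
    after another, and keep a candidate only if it beats the best score
    seen so far (starting from the current model) on the holdout data."""
    best_set = list(training_set)
    best_model = train(best_set)  # the current predictive model
    best_score = accuracy(best_model, holdout)
    candidate_set = list(training_set)
    step = max(1, -(-len(selected) // n_subsets))  # ceiling division
    for i in range(0, len(selected), step):
        # Each candidate set extends the previous candidate set.
        candidate_set = candidate_set + selected[i:i + step]
        candidate_model = train(candidate_set)
        score = accuracy(candidate_model, holdout)
        if score > best_score:
            best_set, best_model, best_score = candidate_set, candidate_model, score
    return best_set, best_model, best_score
```

If no candidate scores higher than the current model, the function returns the original training set and model unchanged, which mirrors the conditional "in an instance in which a performance of the candidate model is improved" wording of the claim.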

Assignees

Groupon Inc

Inventors

Classifications

G06N20/20 (Primary)
Ensemble learning · CPC title
G06N20/00 (Primary)
Machine learning · CPC title
G06N5/04
Inference or reasoning models · CPC title

Patent family

Related publications grouped by family.

View patent family 67069365

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2020012963A1 cover? In general, embodiments of the present invention provide systems, methods and computer readable media for curating a training data set to ensure that training data being updated continuously from a data reservoir of verified possible training examples remain an accurate, high-quality representation of the distribution of data that are being input to a predictive model for processing.
Who is the assignee on this patent? Groupon Inc
What technology area does this patent fall under? Primary CPC classification G06N20/20. Mapped technology areas include Physics.
When was this patent published? Publication date Jan 9, 2020 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb? We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).