Machine learning system and apparatus for sampling labelled data

US11481672B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11481672-B2
Application number: US-201916423315-A
Country: US
Kind code: B2
Filing date: May 28, 2019
Priority date: Nov 29, 2018
Publication date: Oct 25, 2022
Grant date: Oct 25, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A database including various datasets and metadata associated with each respective dataset is provided. These datasets were used to train predictive models. The database stores a performance value associated with the model trained with each dataset. When provided with a new dataset, a server can determine various metadata for the new dataset. Using the metadata, the server can search the database and retrieve datasets which have similar metadata values. The server can narrow the search based on the performance value associated with the dataset. Based on the retrieved datasets, the server can recommend at least one sampling technique. The sampling technique can be determined based on the one or more sampling techniques that were used in association with the retrieved datasets.
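The lookup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `StoredDataset` record, the function names, the equal default weights, and the tolerance of 0.1 are all assumptions made for the example. The weighted average of standard deviation, average, and median follows the wording of claim 1, and the 0.8 threshold echoes claim 5.

```python
from dataclasses import dataclass
from statistics import mean, median, pstdev
from typing import Optional, Sequence


@dataclass
class StoredDataset:
    """Hypothetical database record; field names are illustrative only."""
    name: str
    metadata_value: float
    sampling_technique: str
    performance: float  # efficacy of the model trained with this dataset


def metadata_value(points: Sequence[float],
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted average of standard deviation, average, and median.

    The weights are an assumption; the claims do not state their values.
    """
    stats = (pstdev(points), mean(points), median(points))
    return sum(w * s for w, s in zip(weights, stats)) / sum(weights)


def recommend(new_value: float,
              database: Sequence[StoredDataset],
              tolerance: float = 0.1,
              threshold: float = 0.8) -> Optional[StoredDataset]:
    """Return the best-performing stored dataset whose metadata value matches.

    A match means the stored value lies within `tolerance` of the new value,
    and the stored performance value is higher than `threshold`.
    """
    candidates = [
        d for d in database
        if abs(d.metadata_value - new_value) <= tolerance
        and d.performance > threshold
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.performance)


# Made-up records, purely for illustration.
database = [
    StoredDataset("A", 2.50, "Random Under-Sampling + Modified SMOTE", 0.90),
    StoredDataset("B", 2.45, "SMOTE", 0.70),
    StoredDataset("C", 9.00, "Random Over-Sampling", 0.95),
]
value = metadata_value([1, 2, 3, 4, 5])
match = recommend(value, database)
print(match.sampling_technique if match else "no match")
```

In this sketch, dataset B is close enough in metadata but is filtered out by the performance threshold, and dataset C performs well but does not match. That is the narrowing step the abstract describes.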

First claim

Claim text as published (claims 1 through 15).

The invention claimed is:

1. A method comprising: receiving, by a transceiver of a server, a first dataset including labeled data points belonging to two classes, a first number of labeled data points belonging to a first class is larger than a second number of labeled data points belonging to a second class; calculating, using a processor of the server, a first metadata value for the first dataset, wherein the first metadata value is a weighted average of a standard deviation, an average and a median of the labeled data points; selecting, using the processor, a selected sampling technique associated with a selected dataset, wherein: the selected dataset is one of a plurality of datasets stored in a database; each of the plurality of datasets includes selected data points and is associated with a metadata value, a sampling technique and a performance value, wherein the metadata value is another weighted average of another standard deviation, another average and another median of the selected data points of the respective dataset; the performance value is a measure of efficacy of a predictive model trained with the respective dataset and is specificity; and the first metadata value matches the metadata value associated with the selected dataset; and sampling, using the processor, the first dataset using the selected sampling technique to generate a new subset, wherein the selected sampling technique is a combination of Random Under-Sampling of the first class of data points by discarding a plurality of the labeled data points of the first class and Modified Synthetic Minority Over-Sampling the second class of data points by multiplying a plurality of the labeled data points of the second class.

2. The method of claim 1, wherein the selected sampling technique is one of the following: the sampling technique associated with the selected dataset; or based on the sampling technique associated with the selected dataset.

3. The method of claim 1, wherein the performance value associated with the selected dataset is higher than a threshold value.

4. The method of claim 3, wherein the performance value is additionally one of accuracy, precision, recall, or area under a curve.

5. The method of claim 4, wherein the performance value is an area under a curve and the threshold value is 0.8.

6. The method of claim 1, further comprising providing the new subset to a classifier as training data.

7. The method of claim 6, wherein the classifier uses the training data to train a predictive model.

8. The device of claim 1, wherein the selected sampling technique further includes one or a combination of the following: Synthetic Minority Over-sampling Technique; or Random Over-Sampling.

9. The method of claim 1, wherein the sampling technique is: Synthetic Minority Over-sampling Technique; Modified synthetic minority oversampling technique; Random Under-Sampling; or Random Over-Sampling.

10. The method of claim 1, wherein the first metadata value matches the metadata value associated with the selected dataset only if: the first metadata value is equal to the metadata value; or the first metadata value is within a tolerance range of the metadata value.

11. A device comprising: a processor, a memory, a reader, a transceiver and a display, wherein: the transceiver is configured to receive a payment request including a payment amount and an account number from a terminal; and the transceiver is configured to transmit a message to the terminal, wherein: the message is created by the processor using a predictive model and the predictive model was trained using training data, the training data was a subset of a first dataset sampled according to a selected sampling technique; the selected sampling technique is associated with a selected dataset, wherein in the selected dataset, a first number of labeled data points belonging to a first class is larger than a second number of labeled data points belonging to a second class; and the selected dataset is one of a plurality of datasets stored in a database, each dataset including selected data points and being associated with a sampling technique, a metadata value and a performance value, wherein: the performance value is a measure of efficacy of a model trained with the sampling technique associated with the respective dataset and is specificity; the metadata value is a weighted average of a standard deviation, an average and a median of the selected data points of the respective dataset; the selected sampling technique is a combination of Random Under-Sampling of the first class of data points by discarding a plurality of the labeled data points of the first class and Modified Synthetic Minority Over-Sampling the second class of data points by multiplying a plurality of the labeled data points of the second class; and a first metadata value of the first dataset matches the metadata value of the selected dataset.

12. The device of claim 11, wherein the selected sampling technique is: the sampling technique associated with the selected dataset; or based on the sampling technique associated with the selected dataset.

13. The device of claim 11, wherein the performance value associated with the selected dataset is higher than a threshold value.

14. The device of claim 11, wherein the selected sampling technique is one or more of the following: Synthetic Minority Over-sampling Technique; or Random Over-Sampling.

15. A system comprising: a server; and a terminal including a processor, a memory, a reader, a transceiver and a display, wherein: the reader is configured to scan a payment card for an account number; the transceiver is configured to transmit a payment request including a payment amount and the account number to the server; and the transceiver is configured to receive a message from the server, wherein: the message is created by the server using a predictive model and the predictive model was trained using training data, the training data was a subset of a first dataset sampled according to a selected sampling technique; the selected sampling technique is associated with a selected dataset, wherein in the selected dataset, a first number of labeled data points belonging to a first class is larger than a second number of labeled data points belonging to a second class; the selected dataset is one of a plurality of datasets stored in a database, each dataset including selected data points and being associated with a sampling technique, a metadata value and a performance value; the performance value is a measure of efficacy of a model trained with the sampling technique associated with the respective dataset and is specificity; a first metadata value of the first dataset matches the metadata value of the selected dataset; wherein the selected sampling technique is a combination of Random Under-Sampling of the first class of data points by discarding a plurality of the labeled data points of the first class and Modified Synthetic Minority Over-Sampling the second class of data points by multiplying a plurality of the labeled data points of the second class; and the first metadata value is a weighted average of a standard deviation, an average and a median of the selected data points of the respective dataset.
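The combined sampling step recited in the independent claims (discarding points from the larger class and multiplying points in the smaller class) might look something like the sketch below. It is illustrative only. The over-sampling part is a simplified interpolation between random pairs of minority points, used here as a stand-in; it is not the Modified Synthetic Minority Over-Sampling technique named in the claims, whose details are not given on this page. The function name `resample` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Features = Sequence[float]


def resample(majority: Sequence[Features],
             minority: Sequence[Features],
             keep_majority: int,
             extra_minority: int,
             seed: int = 0) -> Tuple[List[Features], List[Features]]:
    """Under-sample the majority class and over-sample the minority class.

    Returns (kept majority points, original plus synthetic minority points).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)

    # Random under-sampling: keep a random subset, discard the rest.
    kept = rng.sample(list(majority), keep_majority)

    # Stand-in over-sampling (an assumption, not Modified SMOTE): each
    # synthetic point lies on the segment between two random minority points.
    synthetic: List[Features] = []
    for _ in range(extra_minority):
        a, b = rng.sample(list(minority), 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])

    return kept, list(minority) + synthetic


# Made-up data: 100 majority points and 3 minority points.
majority = [[float(i), 0.0] for i in range(100)]
minority = [[0.0, 1.0], [1.0, 2.0], [2.0, 1.5]]
kept, oversampled = resample(majority, minority,
                             keep_majority=10, extra_minority=7)
print(len(kept), len(oversampled))
```

With these made-up inputs the two classes end up the same size. The resulting subset is what claim 6 describes passing to a classifier as training data.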

Assignees

Capital One Services, LLC

Inventors

Classifications

  • G06N20/00 (Primary)

    Machine learning

  • using ranking

  • using data annotations, e.g. user-defined metadata

  • Inference or reasoning models

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11481672B2 cover?
A database including various datasets and metadata associated with each respective dataset is provided. These datasets were used to train predictive models. The database stores a performance value associated with the model trained with each dataset. When provided with a new dataset, a server can determine various metadata for the new dataset. Using the metadata, the server can search the databa…
Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Published October 25, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).