Counter data generation for data profiling using only true samples

US12112268B2 · US · B2

Patent metadata
Publication number: US-12112268-B2
Application number: US-202318136830-A
Country: US
Kind code: B2
Filing date: Apr 19, 2023
Priority date: Mar 6, 2019
Publication date: Oct 8, 2024
Grant date: Oct 8, 2024

Abstract

Official abstract text for this publication.

A method for generating a dual-class dataset is disclosed. A single-class dataset and a context dataset are obtained. The context dataset can be labeled. A model can be trained using the combination of the single-class dataset and the labeled context dataset. The model can be run on the context dataset. The data points that are classified the same as the data points included in the single-class dataset can be removed from the labeled context dataset and added to the single-class dataset. These steps can be repeated until no data points are classified by the model.
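
The loop described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the digit/letter features, the nearest-centroid stand-in for the classification model, the function names, and the `max_rounds` cap are all hypothetical choices that do not appear in the patent text. The stopping rule used here (stop when a round moves no data points) is one reading of the abstract's final sentence.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def featurize(text: str) -> Vector:
    """Hypothetical features: share of digit characters and share of letters."""
    n = max(len(text), 1)
    digits = sum(ch.isdigit() for ch in text) / n
    letters = sum(ch.isalpha() for ch in text) / n
    return (digits, letters)


def centroid(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    dims = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / len(vectors) for d in range(dims))


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(first: Sequence[str], second: Sequence[str]) -> Callable[[str], bool]:
    """Stand-in model (assumption): nearest centroid over the two labeled sets.

    Returns a predicate that is True when a data point looks like the first
    (single-class) dataset rather than the context dataset.
    """
    first_c = centroid([featurize(t) for t in first])
    second_c = centroid([featurize(t) for t in second])

    def is_first(text: str) -> bool:
        v = featurize(text)
        return sq_dist(v, first_c) < sq_dist(v, second_c)

    return is_first


def build_dual_class(
    single_class: Sequence[str], context: Sequence[str], max_rounds: int = 10
) -> List[Tuple[str, int]]:
    """Move context points classified like the single class, until none move."""
    first: List[str] = list(single_class)   # assumed non-empty
    second: List[str] = list(context)
    for _ in range(max_rounds):
        if not second:
            break
        is_first = train(first, second)      # train on both labeled datasets
        moved = [t for t in second if is_first(t)]
        if not moved:                        # nothing reclassified: stop
            break
        first.extend(moved)
        second = [t for t in second if not is_first(t)]
    # Label 1 = first category, label 0 = second category.
    return [(t, 1) for t in first] + [(t, 0) for t in second]


if __name__ == "__main__":
    phones = ["555-123-4567", "(212) 555-0199", "800 555 0100"]
    scraped = ["hello world", "415-555-2671", "see you at noon",
               "order total due", "646 555 0142"]
    for text, label in build_dual_class(phones, scraped):
        print(label, text)
```

Each round that moves at least one point shrinks the context dataset, so the loop ends even without the cap; `max_rounds` is only a safety limit.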

First claim

Opening claim text (preview).

The invention claimed is:

1. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for generating a first dual-class dataset, wherein, when a computing hardware arrangement executes the instructions, the computing arrangement is configured to perform procedures comprising:
    (a) accessing a first dataset including data points belonging to a first category of data points;
    (b) accessing a second dataset including data points belonging to the first category of data points and a second category of data points;
    (c) labeling each data point in the first dataset with a first label to generate a first labeled dataset, and labeling each data point in the second dataset with a second label to generate a second labeled dataset;
    (d) training a classification model using the first labeled dataset and the second labeled dataset;
    (e) using the classification model, classifying each data point in the second labeled dataset as belonging to one of the first category of data points or the second category of data points;
    (f) for each data point in the second labeled dataset classified as belonging to the first category of data points, removing the data point from the second dataset and adding the data point to the first dataset; and
    (g) generating the first dual-class dataset using the first dataset and the second dataset.
2. The non-transitory computer-accessible medium of claim 1, further configured to perform procedures comprising: repeating steps (c)-(g) to generate a second dual-class dataset.
3. The non-transitory computer-accessible medium of claim 2, wherein in repeating the step (d), a second classification model is used.
4. The non-transitory computer-accessible medium of claim 1, further configured to perform procedures comprising: continue repeating steps (c)-(g) to generate a new dual-class dataset until the new dual-class dataset is the same as the dual-class dataset from a prior run.
5. The non-transitory computer-accessible medium of claim 1, further configured to perform procedures comprising: prior to accessing a second dataset, generating the second dataset by scraping data from Internet pages, Internet websites or databases.
6. The non-transitory computer-accessible medium of claim 1, wherein the first dataset includes data points relating to fraudulent transactions.
7. The non-transitory computer-accessible medium of claim 1, wherein the first dataset includes data points relating to phone numbers.
8. The non-transitory computer-accessible medium of claim 1, wherein the first category of data includes telephone numbers and the second category of data includes non-telephone number text.
9. The non-transitory computer-accessible medium of claim 1, further configured to perform procedures comprising: sampling the first dual-class dataset according to a sampling technique.
10. The non-transitory computer-accessible medium of claim 9, wherein the sampling technique is undersampling or oversampling the data points in the first dataset.
11. The non-transitory computer-accessible medium of claim 9, wherein the sampling technique is undersampling or oversampling the data points in the second dataset.
12. The non-transitory computer-accessible medium of claim 9, wherein the sampling technique is Synthetic Minority Oversampling Technique.
13. The non-transitory computer-accessible medium of claim 9, wherein the sampling technique is Modified Synthetic Minority Oversampling Technique.
14. The non-transitory computer-accessible medium of claim 9, wherein the sampling technique is Random Undersampling.
15. The non-transitory computer-accessible medium of claim 9, wherein the sampling technique is Random Oversampling.
16. The non-transitory computer-accessible medium of claim 1, further configured to perform procedures comprising: calculating a performance value for the classification model.
17. The non-transitory computer-accessible medium of claim 16, wherein the performance value is an area under a curve.
18. The non-transitory computer-accessible medium of claim 16, wherein the performance value is an accuracy rate.
19. The non-transitory computer-accessible medium of claim 16, wherein the performance value is a precision rate.
20. The non-transitory computer-accessible medium of claim 16, wherein the performance value is a recall rate.
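
Claims 9 to 20 name sampling techniques and performance values without spelling them out. As a rough illustration only, the sketch below shows plain random oversampling and random undersampling of a labeled dual-class dataset, plus accuracy, precision, and recall computed from true and predicted labels. The function names, the `(text, label)` tuple layout, and the fixed `seed` are assumptions made for this example; SMOTE, Modified SMOTE, and area under a curve are not covered here.

```python
import random
from typing import Dict, List, Sequence, Tuple

Labeled = Tuple[str, int]  # (data point, label); layout is an assumption


def _group_by_label(data: Sequence[Labeled]) -> Dict[int, List[Labeled]]:
    groups: Dict[int, List[Labeled]] = {}
    for item in data:
        groups.setdefault(item[1], []).append(item)
    return groups


def random_oversample(data: Sequence[Labeled], seed: int = 0) -> List[Labeled]:
    """Duplicate randomly chosen points of smaller classes up to the largest class size."""
    if not data:
        return []
    rng = random.Random(seed)
    groups = _group_by_label(data)
    target = max(len(points) for points in groups.values())
    out: List[Labeled] = []
    for points in groups.values():
        out.extend(points)
        out.extend(rng.choice(points) for _ in range(target - len(points)))
    return out


def random_undersample(data: Sequence[Labeled], seed: int = 0) -> List[Labeled]:
    """Keep a random subset of each class, sized to the smallest class."""
    if not data:
        return []
    rng = random.Random(seed)
    groups = _group_by_label(data)
    target = min(len(points) for points in groups.values())
    out: List[Labeled] = []
    for points in groups.values():
        out.extend(rng.sample(points, target))
    return out


def performance(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall, treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Oversampling keeps every original data point and adds duplicates, while undersampling discards points from the larger class, so the choice trades dataset size against information loss.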

Assignees

Capital One Services, LLC

Classifications

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Feedforward networks

  • Supervised learning

  • Combinations of networks

  • Text processing (natural language analysis G06F40/20; semantic analysis G06F40/30; processing or translation of natural language G06F40/40)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12112268B2 cover?
A method for generating a dual-class dataset is disclosed. A single-class dataset and a context dataset are obtained. The context dataset can be labeled. A model can be trained using the combination of the single-class dataset and the labeled context dataset. The model can be run on the context dataset. The data points that are classified the same as the data points included in the single-class…
Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 8, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).