Systems and methods for preventing sensitive data leakage during label propagation

US12406093B2 · US · B2

Patent metadata
Publication number: US-12406093-B2
Application number: US-202318452630-A
Country: US
Kind code: B2
Filing date: Aug 21, 2023
Priority date: Aug 21, 2023
Publication date: Sep 2, 2025
Grant date: Sep 2, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for novel uses and/or improvements to data labeling applications, particularly data labeling applications involving sensitive data. As one example, systems and methods are described herein for preventing sensitive data leakage, using weak learner libraries, during label propagation.

First claim

Opening claim text (preview).

What is claimed is:

1. A system for preventing sensitive data leakage, using weak learner libraries and a plurality of environments, during label propagation, the system comprising: one or more processors; and one or more non-transitory, computer-readable mediums comprising instructions that when executed by the one or more processors causes operations comprising: receiving a first data set at a first environment, wherein the first data set comprises a plurality of sensitive characteristics, wherein the first data set comprises actual data; generating a second data set at a second environment, wherein the second data set is a synthetic data set corresponding to the first data set; determining, based on the second data set at the second environment, a first weak learner for a first labeling task of a plurality of labeling tasks specific to the first data set; validating, based on the first data set at the first environment, the first weak learner; in response to validating the first weak learner at the first environment, adding the first weak learner to a first weak learner library for the first data set; determining, based on the second data set at the second environment, a second weak learner for a second labeling task of the plurality of labeling tasks; validating, based on the first data set at the first environment, the second weak learner; adding the second weak learner to the first weak learner library in response to validating the second weak learner; determining, for the first weak learner library, an aggregate labeling performance for the plurality of labeling tasks specific to the first data set; comparing the aggregate labeling performance to a threshold aggregate performance; and determining whether to approve the first weak learner library for use based on comparing the aggregate labeling performance to the threshold aggregate performance.

2. A method for preventing sensitive data leakage, using weak learner libraries, during label propagation, the method comprising: receiving a first data set, wherein the first data set comprises a plurality of sensitive characteristics; generating a second data set, wherein the second data set is a synthetic data set corresponding to the first data set; determining, based on the second data set, a first weak learner for a first labeling task, wherein the first labeling task is specific to the first data set; validating, based on the first data set, the first weak learner; in response to validating the first weak learner, adding the first weak learner to a first weak learner library for the first data set; validating, based on the first data set, the first weak learner library; and in response to validating the first weak learner library, generating for display, on a user interface, a recommendation related to an additional weak learner to the first weak learner library.

3. The method of claim 2, wherein generating the second data set further comprises: retrieving a first latent representation of a first characteristic from the first data set; comparing the first latent representation to characteristics of the second data set to determine whether first sensitive data of the first data set has been leaked; and determining whether to approve the second data set for use based on whether first sensitive data of the first data set has been leaked.

4. The method of claim 2, wherein determining the first weak learner for the first labeling task further comprises: retrieving a second characteristic from the first data set; comparing the second characteristic to characteristics of first weak learner to determine whether second sensitive data of the first data set has been leaked; and determining whether to approve the first weak learner for use based on whether second sensitive data of the first data set has been leaked.

5. The method of claim 2, wherein validating, based on the first data set, the first weak learner further comprises: determining, for the first weak learner, a labeling performance of the first labeling task; comparing the labeling performance to a threshold performance; and determining whether to approve the first weak learner for use based on comparing the labeling performance to the threshold performance.

6. The method of claim 2, wherein generating for display the recommendation related to the additional weak learner to the first weak learner library further comprises: determining, based on the second data set, a second weak learner for a second labeling task; validating, based on the first data set, the second weak learner; and recommending adding the second weak learner to the first weak learner library in response to validating the second weak learner.

7. The method of claim 2, wherein validating the first weak learner library further comprises: determining, for the first weak learner library, an aggregate labeling performance for a plurality of labeling tasks specific to the first data set; comparing the aggregate labeling performance to a threshold aggregate performance; and determining whether to approve the first weak learner library for use based on comparing the aggregate labeling performance to the threshold aggregate performance.

8. The method of claim 2, wherein validating the first weak learner library further comprises: determining a first weight for the first weak learner; determining a second weight for a second weak learner in the first weak learner library; and determining, based on the first weight and the second weight, an aggregate labeling performance for a plurality of labeling tasks specific to the first data set.

9. The method of claim 8, wherein determining the first weight for the first weak learner further comprises: determining, for the first weak learner, a labeling performance of the first labeling task; and determining the first weight based on the labeling performance.

10. The method of claim 2, wherein generating the second data set further comprises: determining a statistical property of the first data set; and generating the synthetic data set for the second data set using a random number generator and the statistical property.

11. The method of claim 2, wherein generating the second data set further comprises: determining the first data set is tabular data; and in response to determining that the first data set is tabular data, selecting a first interpolation algorithm for generating the second data set.

12. The method of claim 2, wherein generating the second data set further comprises: determining a correlation structure of the first data set; and determining the synthetic data set for the second data set using a copula model and the correlation structure.

13. The method of claim 2, wherein determining, based on the second data set, the first weak learner for the first labeling task further comprises: determining a third characteristic from the second data set; determining an importance of the third characteristic; and selecting the third characteristic as a feature for the first weak learner based on the importance.

14. The method of claim 13, wherein selecting the third characteristic as the feature for the first weak learner based on the importance further comprises: determining a first value for a first classification in the first labeling task; determining a second value for a second classification in the first labeling task; and determining a threshold value for the feature based on maximizing a difference between the first value and the second value.

15. The method of claim 14, wherein determining
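
The flow recited in claims 2, 5, and 7 through 10 can be illustrated with a minimal Python sketch. It is not the patentee's implementation. Everything in it is an assumption made for illustration: the function names (`synthesize`, `fit_stump`, `build_library`, `demo`), the toy two-feature data, the per-class Gaussian sampling used as the "statistical property", the midpoint-of-class-means threshold (a simplification, not the threshold rule of claim 14), the weighted vote, and both threshold values. The leakage checks of claims 3 and 4 and the separate environments of claim 1 are not modeled.

```python
import random
import statistics

def synthesize(real_rows, n_per_class, seed=0):
    """Sample synthetic rows from per-class mean/stdev of the real rows (assumed approach)."""
    rng = random.Random(seed)
    n_features = len(real_rows[0][0])
    synthetic = []
    for label in (0, 1):
        cols = [[x[j] for x, y in real_rows if y == label] for j in range(n_features)]
        stats = [(statistics.mean(c), statistics.stdev(c)) for c in cols]
        for _ in range(n_per_class):
            synthetic.append((tuple(rng.gauss(m, s) for m, s in stats), label))
    return synthetic

def fit_stump(rows, feature):
    """Weak learner: one threshold on one feature, placed midway between the class means."""
    mean0 = statistics.mean(x[feature] for x, y in rows if y == 0)
    mean1 = statistics.mean(x[feature] for x, y in rows if y == 1)
    return (feature, (mean0 + mean1) / 2, mean1 > mean0)

def predict(stump, x):
    feature, threshold, positive_above = stump
    return int((x[feature] > threshold) == positive_above)

def accuracy(stump, rows):
    return sum(predict(stump, x) == y for x, y in rows) / len(rows)

def weighted_vote_accuracy(library, rows):
    """Aggregate performance: each learner votes with its validated accuracy as weight."""
    if not library:
        return 0.0
    hits = 0
    for x, y in rows:
        vote = sum(w if predict(stump, x) else -w for stump, w in library)
        hits += (vote > 0) == bool(y)
    return hits / len(rows)

def build_library(real_rows, synthetic_rows, features,
                  learner_threshold=0.7, library_threshold=0.8):
    library = []
    for feature in features:
        stump = fit_stump(synthetic_rows, feature)  # determined on synthetic data only
        score = accuracy(stump, real_rows)          # validated on the real data
        if score >= learner_threshold:              # per-learner approval
            library.append((stump, score))          # weight = validated performance
    aggregate = weighted_vote_accuracy(library, real_rows)
    return library, aggregate, aggregate >= library_threshold

def demo():
    # Toy "real" data: feature 0 separates the classes, feature 1 is noise.
    rng = random.Random(1)
    real = [((rng.gauss(3.0 * y, 1.0), rng.gauss(0.0, 1.0)), y)
            for y in (0, 1) for _ in range(100)]
    synthetic = synthesize(real, n_per_class=100)
    return build_library(real, synthetic, features=(0, 1))

if __name__ == "__main__":
    library, aggregate, approved = demo()
    print(len(library), round(aggregate, 3), approved)
```

In this toy setup the learner built on the noise feature fails its per-learner validation and is left out, so the library is approved on the strength of the single informative learner. The point of the design is that the learner's parameters are only ever fitted on synthetic rows, while the real rows are used solely to score it.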

Assignees

Inventors

Classifications

  • Protecting personal data, e.g. for financial or medical purposes

  • by anonymising data, e.g. decorrelating personal data from the owner's identification

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12406093B2 cover?
Systems and methods for novel uses and/or improvements to data labeling applications, particularly data labeling applications involving sensitive data. As one example, systems and methods are described herein for preventing sensitive data leakage, using weak learner libraries, during label propagation.
Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06F21/6254. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 2, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).