Systems and methods for preventing sensitive data leakage during label propagation

US2025390605A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2025390605-A1
Application number: US-202519316864-A
Country: US
Kind code: A1
Filing date: Sep 2, 2025
Priority date: Aug 21, 2023
Publication date: Dec 25, 2025
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for novel uses and/or improvements to data labeling applications, particularly data labeling applications involving sensitive data. As one example, systems and methods are described herein for preventing sensitive data leakage, using weak learner libraries, during label propagation.

First claim

Opening claim text (preview).

What is claimed is:

1. A system for preventing sensitive data leakage, using weak learner libraries and a plurality of environments, during label propagation, the system comprising:
    one or more processors; and
    one or more non-transitory, computer-readable mediums comprising instructions that when executed by the one or more processors causes operations comprising:
        receiving a first data set at a first environment, wherein the first data set comprises a plurality of sensitive characteristics, wherein the first data set comprises actual data;
        generating a second data set at a second environment, wherein the second data set is a synthetic data set corresponding to the first data set;
        determining, based on the second data set at the second environment, a first learner for a first labeling task of a plurality of labeling tasks specific to the first data set;
        validating, based on the first data set at the first environment, the first learner;
        in response to validating the first learner at the first environment, adding the first learner to a first learner library for the first data set;
        determining, based on the second data set at the second environment, a second learner for a second labeling task of the plurality of labeling tasks, wherein the second learner has a second learning capability;
        validating, based on the first data set at the first environment, the second learner;
        adding the second learner to the first learner library in response to validating the second learner;
        determining, for the first learner library, an aggregate labeling performance for the plurality of labeling tasks specific to the first data set;
        comparing the aggregate labeling performance to a threshold aggregate performance; and
        determining whether to approve the first learner library for use based on comparing the aggregate labeling performance to the threshold aggregate performance.

2. A method for preventing sensitive data leakage, using weak learner libraries, during label propagation, the method comprising:
    receiving a first data set, wherein the first data set comprises a plurality of sensitive characteristics;
    generating a second data set, wherein the second data set is a synthetic data set corresponding to the first data set;
    determining, based on the second data set, a first learner for a first labeling task, wherein the first labeling task is specific to the first data set, and wherein the first learner has a first learning capability;
    validating, based on the first data set, the first learner; and
    in response to validating the first learner, adding the first learner to a first learner library for the first data set.

3. The method of claim 2, wherein generating the second data set further comprises:
    retrieving a first latent representation of a first characteristic from the first data set;
    comparing the first latent representation to characteristics of the second data set to determine whether first sensitive data of the first data set has been leaked; and
    determining whether to approve the second data set for use based on whether first sensitive data of the first data set has been leaked.

4. The method of claim 2, wherein determining the first learner for the first labeling task further comprises:
    retrieving a second characteristic from the first data set;
    comparing the second characteristic to characteristics of the first learner to determine whether second sensitive data of the first data set has been leaked; and
    determining whether to approve the first learner for use based on whether the second sensitive data of the first data set has been leaked.

5. The method of claim 2, wherein validating, based on the first data set, the first learner further comprises:
    determining, for the first learner, a labeling performance of the first labeling task;
    comparing the labeling performance to a threshold performance; and
    determining whether to approve the first learner for use based on comparing the labeling performance to the threshold performance.

6. The method of claim 2, further comprising:
    generating for display a recommendation related to the additional weak learner to the first learner library by:
        determining, based on the second data set, a second learner for a second labeling task, wherein the second learner has a second learning capability;
        validating, based on the first data set, the second learner; and
        recommending adding the second learner to the first learner library in response to validating the second learner.

7. The method of claim 2, wherein validating the first learner library further comprises:
    determining, for the first learner library, an aggregate labeling performance for a plurality of labeling tasks specific to the first data set;
    comparing the aggregate labeling performance to a threshold aggregate performance; and
    determining whether to approve the first learner library for use based on comparing the aggregate labeling performance to the threshold aggregate performance.

8. The method of claim 2, wherein validating the first learner library further comprises:
    determining a first weight for the first learner;
    determining a second weight for a second learner in the first learner library; and
    determining, based on the first weight and the second weight, an aggregate labeling performance for a plurality of labeling tasks specific to the first data set.

9. The method of claim 8, wherein determining the first weight for the first learner further comprises:
    determining, for the first learner, a labeling performance of the first labeling task; and
    determining the first weight based on the labeling performance.

10. The method of claim 2, wherein generating the second data set further comprises:
    determining a statistical property of the first data set; and
    generating the synthetic data set for the second data set using a random number generator and the statistical property.

11. The method of claim 2, wherein generating the second data set further comprises:
    determining the first data set is tabular data; and
    in response to determining that the first data set is tabular data, selecting a first interpolation algorithm for generating the second data set.

12. The method of claim 2, wherein generating the second data set further comprises:
    determining a correlation structure of the first data set; and
    determining the synthetic data set for the second data set using a copula model and the correlation structure.

13. The method of claim 2, wherein determining, based on the second data set, the first learner for the first labeling task further comprises:
    determining a third characteristic from the second data set;
    determining an importance of the third characteristic; and
    selecting the third characteristic as a feature for the first learner based on the importance.

14. The method of claim 13, wherein selecting the third characteristic as the feature for the first learner based on the importance further comprises:
    determining a first value for a first classification in the first labeling task;
    determining a second value for a second classification in the first labeling task; and
    determining a threshold value for the feature based on maximizing a difference between the first value and the second value.

15. The method of claim 14, wherein determining the threshold value for the feature based on maximizing the difference between the first value and the second value further comprises:
    determining a first classification error for the first classification; and
    further determining the first value based o
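
To make the flow recited in claims 2, 5, and 10 easier to follow, here is a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: the data layout (rows of `(feature, label)` with binary labels), the per-label Gaussian sampling, the single-threshold learner, the accuracy metric, and every function name (`generate_synthetic`, `determine_learner`, `validate_and_add`) are hypothetical choices made for this example. The point it shows is that the learner is fitted only on synthetic rows, while the actual rows are used only for the validation step.

```python
import random
import statistics


def generate_synthetic(actual_rows, n_per_label, seed=0):
    """Claim 10 style generation (assumed form): sample synthetic rows from a
    random number generator using a statistical property of the actual data,
    here the per-label mean and standard deviation."""
    rng = random.Random(seed)
    synthetic = []
    for label in sorted({lbl for _, lbl in actual_rows}):
        values = [x for x, lbl in actual_rows if lbl == label]
        mean, stdev = statistics.mean(values), statistics.stdev(values)
        synthetic += [(rng.gauss(mean, stdev), label) for _ in range(n_per_label)]
    return synthetic


def accuracy(threshold, rows):
    """Fraction of rows where 'feature > threshold' agrees with the label."""
    return sum((x > threshold) == bool(lbl) for x, lbl in rows) / len(rows)


def determine_learner(synthetic_rows):
    """Determine a weak learner (a single threshold) using synthetic rows only.
    Candidates are midpoints between adjacent sorted feature values."""
    xs = sorted(x for x, _ in synthetic_rows)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(t, synthetic_rows))


def validate_and_add(threshold, actual_rows, library, threshold_performance=0.8):
    """Claim 5 style validation (assumed metric): score the learner on the
    actual data and add it to the library only if it meets the threshold."""
    score = accuracy(threshold, actual_rows)
    if score >= threshold_performance:
        library.append({"threshold": threshold, "accuracy": score})
        return True
    return False


# Hypothetical "first data set" (actual data); values are made up for the demo.
actual = [(1.0, 0), (1.2, 0), (0.8, 0), (1.1, 0),
          (3.0, 1), (3.2, 1), (2.9, 1), (3.1, 1)]

synthetic = generate_synthetic(actual, n_per_label=50)  # "second data set"
library = []                                            # "first learner library"
learner = determine_learner(synthetic)                  # fitted on synthetic only
approved = validate_and_add(learner, actual, library)   # validated on actual
```

In a two-environment setup such as the one recited in claim 1, `generate_synthetic` and `validate_and_add` would presumably run where the actual data lives, and `determine_learner` would run where only the synthetic rows are visible; the sketch keeps everything in one process for brevity.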

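Claims 7 to 9 describe weighting each learner by its labeling performance, combining the weights into an aggregate labeling performance, and comparing that aggregate to a threshold. The claim text shown here does not give a formula, so the sketch below assumes one: each weight equals the learner's own performance, and the aggregate is the weighted mean of the performances. The dictionary layout, task names, scores, and the 0.75 threshold are all hypothetical.

```python
def learner_weight(labeling_performance):
    """Claim 9 style weight (assumed form): the weight is simply the
    learner's labeling performance on its task."""
    return labeling_performance


def aggregate_performance(learners):
    """Claim 8 style aggregate (assumed form): weighted mean of the
    per-learner labeling performances."""
    if not learners:
        raise ValueError("learner library is empty")
    weights = [learner_weight(entry["performance"]) for entry in learners]
    total = sum(weights)
    if total == 0:
        raise ValueError("all learner weights are zero")
    weighted = sum(w * entry["performance"] for w, entry in zip(weights, learners))
    return weighted / total


def approve_library(learners, threshold_aggregate=0.75):
    """Claim 7 style decision: approve the library only if the aggregate
    labeling performance meets the threshold aggregate performance."""
    return aggregate_performance(learners) >= threshold_aggregate


# Hypothetical library with two validated learners for two labeling tasks.
library = [
    {"task": "task_a", "performance": 0.9},
    {"task": "task_b", "performance": 0.6},
]

aggregate = aggregate_performance(library)
library_approved = approve_library(library)
```

Weighting by performance lets stronger learners dominate the aggregate; a plain unweighted mean of the same two scores would be 0.75 rather than the weighted value computed here.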
Assignees

Capital One Services LLC

Inventors

Classifications

  • Protecting personal data, e.g. for financial or medical purposes

  • by anonymising data, e.g. decorrelating personal data from the owner's identification

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025390605A1 cover?
Systems and methods for novel uses and/or improvements to data labeling applications, particularly data labeling applications involving sensitive data. As one example, systems and methods are described herein for preventing sensitive data leakage, using weak learner libraries, during label propagation.
Who is the assignee on this patent?
Capital One Services LLC
What technology area does this patent fall under?
Primary CPC classification G06F21/6254. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 25, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).