Token synthesis for machine learning models

US2023419102A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2023419102-A1
Application number  US-202217849391-A
Country             US
Kind code           A1
Filing date         Jun 24, 2022
Priority date       Jun 24, 2022
Publication date    Dec 28, 2023
Grant date          (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes generating a first vector associated with a token using a first set of parameters of a first learning model based on the token, and determining a prediction indicating that the token is associated with a first label based on a set of clustering criteria, the first vector, vectors of a first vector set, and vectors of a second vector set. The method includes generating a perturbed vector associated with a second label by modifying a value of the first vector and updating the second vector set to comprise the perturbed vector. The method also includes generating a synthesized token associated with the second vector set based on the perturbed vector using a second set of parameters of the first learning model and training a second learning model based on the synthesized token.
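
The perturbation step in the abstract ("modifying a value of the first vector") can be illustrated with a small sketch. It borrows the wording of claim 1 (randomly selected values, each shifted by a random offset bounded by a threshold). The function name `perturb`, the parameters `n_values` and `max_offset`, the uniform distribution, and all sample numbers are assumptions made for illustration, not details taken from the patent.

```python
import random

def perturb(vector, n_values, max_offset, rng):
    """Return a copy of `vector` with `n_values` randomly chosen entries shifted.

    Each chosen entry moves by an offset drawn uniformly from
    [-max_offset, max_offset]. The uniform draw is an assumption.
    """
    perturbed = list(vector)
    for i in rng.sample(range(len(vector)), n_values):
        perturbed[i] += rng.uniform(-max_offset, max_offset)
    return perturbed

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
first_vector = [0.9, 0.1, 0.4, 0.7]    # hypothetical latent vector for the token
perturbed_vectors = [perturb(first_vector, 2, 0.05, rng) for _ in range(3)]

# The second vector set is updated to include the perturbed vectors.
second_vector_set = [[1.5, 1.4, 1.6, 1.5]]   # hypothetical existing member
second_vector_set.extend(perturbed_vectors)
```

Keeping the offset small relative to the spacing between the two vector sets is what keeps each perturbed vector "similar to the first vector"; how the bound would be chosen in practice is not stated in the text shown here.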

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training a classification system to recognize sensitive data by perturbing vectors in a latent space representing sensitive data comprising:
    obtaining, from a neural network, a network-based prediction indicating that a candidate token associated with a sensitive data label should have a non-sensitive data label;
    encoding, using an encoder layer of an autoencoder, the candidate token to generate a first vector associated with the candidate token;
    determining a clustering-based prediction by applying a set of clustering criteria to first distances between the first vector and vectors of a non-sensitive label set in a latent space and second distances between the first vector and vectors of a sensitive label set in the latent space, the clustering-based prediction indicating that the candidate token should have the non-sensitive data label;
    in response to the clustering-based prediction not matching the sensitive data label associated with the candidate token, generating perturbed vectors associated with the sensitive data label that are similar to the first vector by randomly selecting multiple values of the first vector for each of the perturbed vectors and applying a random threshold-restricted offset to each of the selected multiple values;
    updating the sensitive label set to comprise the perturbed vectors, wherein applying the set of clustering criteria to the first distances and distances between the first vector and the perturbed vectors indicates that the candidate token is associated with the sensitive data label;
    decoding, with a decoder layer of the autoencoder, the perturbed vectors to generate synthesized sensitive tokens associated with the sensitive label set; and
    retraining the neural network based on the synthesized sensitive tokens, wherein the neural network associates the candidate token with the sensitive data label after the retraining.

2. The method of claim 1, further comprising:
    in response to detecting that the candidate token is not matched with the sensitive data label, determining a subset of vectors that are closest to the first vector in the latent space;
    decoding, with the decoder layer, the subset of vectors to determine a subset of tokens; and
    storing the subset of tokens in memory in association with the clustering-based prediction.

3. The method of claim 1, further comprising:
    obtaining a subsequent token after updating the sensitive label set to comprise the perturbed vectors;
    determining a second vector based on the subsequent token using the encoder layer;
    determining a first score based on distances between the second vector and the non-sensitive label set;
    determining a second score based on distances between the second vector and the sensitive label set, wherein the least distance between the second vector and the sensitive label set is a distance between the second vector and a vector of the perturbed vectors;
    associating the subsequent token with the sensitive data label based on the first and second scores; and
    displaying an association between the second vector and the first vector on a user interface based on a determination that the perturbed vector is generated from the first vector.

4. The method of claim 1, wherein determining the clustering-based prediction indicating that the candidate token is associated with the non-sensitive data label comprises:
    determining a first score based on a count of vectors in the non-sensitive label set in the k closest vectors to the first vector, wherein k is an integer;
    determining a second score based on a count of vectors in the sensitive label set in the k closest vectors to the first vector; and
    associating the candidate token with the non-sensitive data label based on a determination that the first score is greater than the second score by a certain threshold.

5. One or more tangible, non-transitory, machine-readable media storing instructions that, when executed by one or more processors, effectuate operations comprising:
    generating a first vector in a latent space associated with a token using a first set of neural network layers based on the token;
    determining a prediction indicating that the token is associated with a first label based on a set of clustering criteria, the first vector, a first vector set associated with the first label, and a second vector set associated with a second label, wherein the token is associated with a second label in a dataset, and wherein the first and second labels are mutually exclusive with respect to each other;
    in response to a determination that the prediction does not match the association between the token and the second label stored in the dataset, generating a perturbed vector associated with the second label by modifying a value of the first vector based on an offset parameter;
    updating the second vector set to comprise the perturbed vector, wherein applying the set of clustering criteria to the first vector and the perturbed vector indicates that the token is associated with the second label;
    generating a synthesized token associated with the second vector set based on the perturbed vector using a second set of neural network layers; and
    training a machine learning model based on the synthesized token.

6. The one or more tangible, non-transitory, machine-readable media of claim 5, wherein the perturbed vector is a first perturbed vector, the operations further comprising:
    obtaining a set of flagged character sets;
    detecting that a character set of the token matches with a flagged character set of the set of flagged character sets;
    segmenting the token into a plurality of sub-tokens based on the character set, wherein:
        generating the perturbed vector comprises generating the first perturbed vector based on a first sub-token of the plurality of sub-tokens and generating a second perturbed vector based on a second sub-token of the plurality of sub-tokens; and
        generating the synthesized token comprises:
            generating a first synthesized sub-token based on the first sub-token;
            generating a second synthesized sub-token based on the second sub-token; and
            generating the synthesized token by concatenating the first synthesized sub-token, the character set, and the second sub-token.

7. The one or more tangible, non-transitory, machine-readable media of claim 5, the operations further comprising:
    providing a second vector of the first vector set to the machine learning model to determine that the first label is associated with the second vector; and
    associating the first vector set with the first label.

8. The one or more tangible, non-transitory, machine-readable media of claim 5, the operations further comprising displaying a first point representing the first vector, first group of points representing the first vector set, and a second group of points representing the second vector set on a user interface, wherein:
    the first group of points is depicted with a first color;
    the second group of points is shown in a second color different from the first color; and
    the first point is shown in a third color, wherein the first point is closest to the first group of points.

9. The one or more tangible, non-transitory, machine-readable media of claim 5, the operations further comprising:
    storing a first set of model parameters for the machine learning model before the training of the machine learning model;
    storing a second set of model parameters for the machine learning model after the training of the machine learning model;
    determining that a rule indicates that a second token is associated with the first label;
    providing the second token to the machine learning model using the first set of mode
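
The count-based clustering criterion of claim 4, and the "wherein" condition of claim 1 that the prediction changes once perturbed vectors join the sensitive label set, can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: Euclidean distance, the name `knn_label`, treating the "certain threshold" as a `margin` on the score difference, falling back to "sensitive" when the margin is not met, and every sample vector are hypothetical choices.

```python
import math

def knn_label(vector, non_sensitive_set, sensitive_set, k, margin):
    """Label `vector` by counting set membership among its k nearest vectors.

    Returns "non-sensitive" only when the non-sensitive count exceeds the
    sensitive count by more than `margin`; otherwise returns "sensitive"
    (the fallback is an assumption, the claim does not spell it out).
    """
    labeled = [(math.dist(vector, v), "non-sensitive") for v in non_sensitive_set]
    labeled += [(math.dist(vector, v), "sensitive") for v in sensitive_set]
    nearest = sorted(labeled, key=lambda pair: pair[0])[:k]
    first_score = sum(1 for _, label in nearest if label == "non-sensitive")
    second_score = sum(1 for _, label in nearest if label == "sensitive")
    return "non-sensitive" if first_score - second_score > margin else "sensitive"

# Hypothetical 2-D latent vectors.
first_vector = [0.9, 0.1]                       # candidate token, labeled sensitive
non_sensitive = [[1.0, 0.0], [0.8, 0.2], [1.1, 0.1]]
sensitive = [[0.0, 1.0], [0.1, 0.9]]

before = knn_label(first_vector, non_sensitive, sensitive, k=3, margin=0)

# Hand-written stand-ins for perturbed vectors close to the first vector.
perturbed = [[0.92, 0.1], [0.9, 0.07], [0.88, 0.12]]
after = knn_label(first_vector, non_sensitive, sensitive + perturbed, k=3, margin=0)

print(before, "->", after)   # prints: non-sensitive -> sensitive
```

Before the update, the three nearest vectors all come from the non-sensitive set, so the criterion disagrees with the stored sensitive label. After the update, the three nearest vectors are the perturbed ones, and the criterion agrees with the label.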

Assignees

Capital One Services, LLC

Classifications

  • G06N3/08 (Primary)

    Learning methods · CPC title

  • G06N7/01 (Primary)

    Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023419102A1 cover?
A method includes generating a first vector associated with a token using a first set of parameters of a first learning model based on the token, and determining a prediction indicating that the token is associated with a first label based on a set of clustering criteria, the first vector, vectors of a first vector set, and vectors of a second vector set. The method includes generating a perturbed vect…
Who is the assignee on this patent?
Capital One Services, LLC
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 28, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).