Context-based pattern matching for sensitive data detection

US2023418971A1 · US · A1

Patent metadata
Publication number: US-2023418971-A1
Application number: US-202217849439-A
Country: US
Kind code: A1
Filing date: Jun 24, 2022
Priority date: Jun 24, 2022
Publication date: Dec 28, 2023
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method includes generating first patterns indicating a data label and associating a candidate token of a text sequence with the data label by removing first tokens from the text sequence based on a match of the first tokens with a token of second patterns and selecting the candidate token from other tokens of the text sequence based on a match between the candidate token and a token of the second patterns. The method also includes updating a token sequence collection to comprise the candidate token and a context token, updating the second patterns with new patterns that match the candidate token and the context token, and removing a first pattern from the second patterns based on a determination that the first pattern matches with a token sequence associated with the test tokens.
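
The two-stage selection in the abstract (first discard tokens that match exclusion patterns, then pick candidates from what remains) can be illustrated with a minimal sketch. The patterns, the whitespace tokenization, and the function name `detect_candidates` are assumptions made for illustration; the publication does not specify them, and the terms "inclusive" and "exclusive" are borrowed from claim 1.

```python
import re

# Hypothetical inclusive patterns for an illustrative "SSN-like" data label.
INCLUSIVE = [re.compile(r"\d{3}-\d{2}-\d{4}")]
# Hypothetical exclusive patterns that mark known false positives.
EXCLUSIVE = [re.compile(r"000-00-0000")]


def detect_candidates(tokens, inclusive=INCLUSIVE, exclusive=EXCLUSIVE):
    """Return tokens that survive exclusion and fully match an inclusive pattern."""
    # Stage 1: remove tokens matched by any exclusive pattern.
    remaining = [t for t in tokens if not any(p.fullmatch(t) for p in exclusive)]
    # Stage 2: select candidates matched by any inclusive pattern.
    return [t for t in remaining if any(p.fullmatch(t) for p in inclusive)]


tokens = "SSN 123-45-6789 placeholder 000-00-0000 ref 42".split()
print(detect_candidates(tokens))  # ['123-45-6789']
```

Running exclusion before inclusion means a known false positive is never labeled, even when it has the same shape as a sensitive token.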

First claim

Opening claim text (preview).

What is claimed is:

1. A method for updating a rules-based model for detecting sensitive data by updating and pruning regex patterns comprising: generating inclusive regex patterns based on sensitive tokens associated with a sensitive data label, wherein each pattern of the inclusive regex patterns matches with at least one token of the sensitive tokens; associating a candidate token of a document with the sensitive data label by removing false positive tokens with exclusive regex patterns from the document and selecting the candidate token from other tokens of the document based on a match between the candidate token and an inclusive regex pattern; in response to receiving a feedback message from a client-side computing device indicating that the candidate token is incorrectly associated with the sensitive data label, updating a token sequence collection to comprise an additional sequence, the additional sequence comprising the candidate token and context tokens surrounding the candidate token; updating the exclusive regex patterns with new patterns that match the additional sequence, wherein, for each respective pattern of the new patterns, a count of sequences of the token sequence collection that matches with the respective pattern is greater than a threshold; and pruning the exclusive regex patterns by removing an exclusive pattern of the exclusive regex patterns based on a determination that the exclusive pattern matches with a token sequence associated with the sensitive tokens.

2. The method of claim 1, wherein: the inclusive regex pattern is a first regex pattern; a second inclusive regex pattern of the inclusive regex patterns matches with the token sequence; the exclusive pattern is associated with an exclusive accuracy score indicating a count of matches between the exclusive pattern and any token sequence indicated to be associated with the sensitive data label by a first set of feedback messages; the inclusive regex pattern is associated with an inclusive accuracy score indicating a count of matches between the second inclusive regex pattern and any token sequence indicated as not associated with the sensitive data label by a second set of feedback messages; and removing the exclusive pattern comprises: determining a confidence score based on the exclusive accuracy score and the inclusive accuracy score; and in response to a determination that the confidence score satisfies a confidence threshold, removing the exclusive pattern.

3. The method of claim 1, wherein generating the inclusive regex patterns comprises: generating a set of vectors based on the candidate token; and providing the set of vectors to a neural network to generate the inclusive regex patterns.

4. The method of claim 1, further comprising: determining, for each respective pattern of the inclusive regex patterns, a respective count of matches between the respective pattern and sets of tokens in a corpus; selecting a regex pattern of the inclusive regex patterns based on a determination that the selected regex pattern has a greatest count of matches; and storing the selected regex pattern in a record of general regex patterns.

5. The method of claim 1, wherein: receiving the feedback message comprises receiving the feedback message from the client-side computing device; the client-side computing device presents, in a user interface, the candidate token, a highlight of the candidate token, and a user interface element; and an interaction with the user interface element causes the client-side computing device to send the feedback message to the client-side computing device.

6. One or more tangible, non-transitory, machine-readable media storing instructions that, when executed by one or more processors, effectuate operations comprising: obtaining a first set of patterns indicating a data label, wherein each respective pattern of the first set of patterns matches with a respective token of a set of test tokens associated with the data label; associating a set of candidate tokens of a text sequence with the data label by removing a first set of tokens from the text sequence based on a match of the first set of tokens with a token of a second set of patterns and selecting the set of candidate tokens from other tokens of the text sequence based on a match between the set of candidate tokens and a pattern of the second set of patterns; in response to receiving an indicator that the set of candidate tokens is incorrectly associated with the data label, updating a token sequence collection to comprise the set of candidate tokens and a context token, wherein the context token is within a pre-determined range of tokens between the context token and a candidate token of the set of candidate tokens; updating the second set of patterns with new patterns that match the set of candidate tokens and the context token; and removing a first pattern from the second set of patterns based on a determination that the first pattern matches with a token sequence associated with the set of test tokens.

7. The media of claim 6, wherein updating the token sequence collection comprises: providing a neural network with the candidate token to determine context position scores for tokens surrounding the candidate token; and selecting the context token based on a determination that a context position score of the context token is greatest.

8. The media of claim 6, wherein updating the second set of patterns comprises: determining whether a count of token sequences of the token sequence collection is greater than ten; and in response to a determination that the count of token sequences is greater than ten, updating the second set of patterns.

9. The media of claim 6, the operations further comprising: obtaining an image; obtaining an orientation direction from a template; determining whether a distance between a first and second token in the orientation direction satisfies each other based on their orientation; and determining that the first and second tokens are part of the set of candidate tokens.

10. The media of claim 6, wherein removing the first pattern of the second set of patterns comprises: determining whether a candidate token that is not labeled with the data label matches with a pattern of the second set of patterns; and in response to a determination that the candidate token that is not labeled with the data label matches with the pattern of the second set of patterns, removing the first pattern of the second set of patterns.

11. The media of claim 6, wherein the pre-determined range is less than or equal to ten.

12. The media of claim 6, wherein updating the second set of patterns comprises using a machine learning model to generate the new patterns based on the context token, wherein the first pattern comprises the context token, and wherein a second pattern of the new patterns does not comprise the context token.

13. A system comprising: one or more processors; and memory storing computer program instructions that, when executed by the one or more processors, cause the one or more processors to effectuate operations comprising: generating a first set of patterns indicating a data label, wherein each pattern of the first set of patterns selects each token of a set of test tokens associated with the data label; associating a candidate token of a text sequence with the data label by removing a first set of tokens from the text sequence based on a match of the first set of tokens with a token of a second set of patterns and selecting the candidate token from other tokens of the text sequence based o
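
The feedback-driven update and pruning steps recited in claim 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed implementation: `generalize` is a hypothetical pattern generator (the claims leave the generator open, and claim 12 mentions a machine learning model), only preceding context tokens are used, sequences are joined with single spaces, and the threshold value is arbitrary.

```python
import re


def generalize(sequence):
    """Hypothetical generator: keep text literal, abstract each digit to \\d."""
    parts = [re.sub(r"\d", lambda m: r"\d", re.escape(tok)) for tok in sequence]
    return r"\s+".join(parts)


def update_exclusive(exclusive, collection, candidate, context, threshold=1):
    """Record a flagged sequence; admit its pattern only with enough support."""
    # The additional sequence is the context tokens plus the flagged candidate.
    sequence = tuple(context) + (candidate,)
    collection.append(sequence)
    pattern = generalize(sequence)
    # Count sequences in the collection that the new pattern matches.
    support = sum(1 for seq in collection if re.fullmatch(pattern, " ".join(seq)))
    if support > threshold and pattern not in exclusive:
        exclusive.append(pattern)
    return exclusive


def prune_exclusive(exclusive, sensitive_sequences):
    """Drop exclusive patterns that match a sequence known to be sensitive."""
    return [
        p for p in exclusive
        if not any(re.fullmatch(p, " ".join(seq)) for seq in sensitive_sequences)
    ]


exclusive, collection = [], []
update_exclusive(exclusive, collection, "555-12-3456", ["order"])  # support 1: not added
update_exclusive(exclusive, collection, "777-98-7654", ["order"])  # support 2: added
print(len(exclusive))  # 1
print(prune_exclusive(exclusive, [("order", "123-45-6789")]))  # []
```

The support threshold keeps a single piece of feedback from creating a broad exclusion rule, and pruning removes any exclusion rule that would also suppress a token sequence known to be sensitive.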

Assignees

Capital One Services LLC

Inventors

Classifications

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Neural networks · CPC title

  • Phrasal analysis, e.g. finite state techniques or chunking · CPC title

  • Machine learning · CPC title

  • Protecting personal data, e.g. for financial or medical purposes · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023418971A1 cover?
A method includes generating first patterns indicating a data label and associating a candidate token of a text sequence with the data label by removing first tokens from the text sequence based on a match of the first tokens with a token of second patterns and selecting the candidate token from other tokens of the text sequence based on a match between the candidate token and a token of the se…
Who is the assignee on this patent?
Capital One Services LLC
What technology area does this patent fall under?
Primary CPC classification G06F21/6245. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 28, 2023 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).