Methods and devices for generating sensitive text detectors
US-2023214591-A1 · Jul 6, 2023 · US
US2023418971A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2023418971-A1 |
| Application number | US-202217849439-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 24, 2022 |
| Priority date | Jun 24, 2022 |
| Publication date | Dec 28, 2023 |
| Grant date | — |
Official abstract text for this publication.
A method includes generating first patterns indicating a data label and associating a candidate token of a text sequence with the data label by removing first tokens from the text sequence based on a match of the first tokens with a token of second patterns and selecting the candidate token from other tokens of the text sequence based on a match between the candidate token and a token of the second patterns. The method also includes updating a token sequence collection to comprise the candidate token and a context token, updating the second patterns with new patterns that match the candidate token and the context token, and removing a first pattern from the second patterns based on a determination that the first pattern matches with a token sequence associated with the test tokens.
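The filter-then-select step described in the abstract can be illustrated with a minimal sketch. It assumes, following the wording of claim 1, that one set of patterns is "inclusive" (it indicates the data label) and the other is "exclusive" (it removes false positives). The pattern values, the whitespace tokenizer, and the name `select_candidates` are hypothetical and do not come from the publication.

```python
import re

# Hypothetical patterns for an illustrative "SSN-like" data label.
INCLUSIVE = [re.compile(r"\d{3}-\d{2}-\d{4}")]   # patterns that indicate the label
EXCLUSIVE = [re.compile(r"000-00-0000")]         # assumed known placeholder value


def select_candidates(tokens, inclusive, exclusive):
    """Drop tokens matched by any exclusive pattern, then keep the
    remaining tokens fully matched by at least one inclusive pattern."""
    remaining = [t for t in tokens
                 if not any(p.fullmatch(t) for p in exclusive)]
    return [t for t in remaining
            if any(p.fullmatch(t) for p in inclusive)]


# Whitespace tokenization is an assumption made for this sketch.
tokens = "SSN 123-45-6789 sample 000-00-0000 end".split()
candidates = select_candidates(tokens, INCLUSIVE, EXCLUSIVE)
print(candidates)  # ['123-45-6789']
```

Running the exclusive pass first means a token that an inclusive pattern would otherwise flag is never offered as a candidate, which is one plausible reading of the order given in the abstract.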
Opening claim text (preview).
What is claimed is:
1. A method for updating a rules-based model for detecting sensitive data by updating and pruning regex patterns comprising: generating inclusive regex patterns based on sensitive tokens associated with a sensitive data label, wherein each pattern of the inclusive regex patterns matches with at least one token of the sensitive tokens; associating a candidate token of a document with the sensitive data label by removing false positive tokens with exclusive regex patterns from the document and selecting the candidate token from other tokens of the document based on a match between the candidate token and an inclusive regex pattern; in response to receiving a feedback message from a client-side computing device indicating that the candidate token is incorrectly associated with the sensitive data label, updating a token sequence collection to comprise an additional sequence, the additional sequence comprising the candidate token and context tokens surrounding the candidate token; updating the exclusive regex patterns with new patterns that match the additional sequence, wherein, for each respective pattern of the new patterns, a count of sequences of the token sequence collection that matches with the respective pattern is greater than a threshold; and pruning the exclusive regex patterns by removing an exclusive pattern of the exclusive regex patterns based on a determination that the exclusive pattern matches with a token sequence associated with the sensitive tokens.
2. The method of claim 1, wherein: the inclusive regex pattern is a first regex pattern; a second inclusive regex pattern of the inclusive regex patterns matches with the token sequence; the exclusive pattern is associated with an exclusive accuracy score indicating a count of matches between the exclusive pattern and any token sequence indicated to be associated with the sensitive data label by a first set of feedback messages; the inclusive regex pattern is associated with an inclusive accuracy score indicating a count of matches between the second inclusive regex pattern and any token sequence indicated as not associated with the sensitive data label by a second set of feedback messages; and removing the exclusive pattern comprises: determining a confidence score based on the exclusive accuracy score and the inclusive accuracy score; and in response to a determination that the confidence score satisfies a confidence threshold, removing the exclusive pattern.
3. The method of claim 1, wherein generating the inclusive regex patterns comprises: generating a set of vectors based on the candidate token; and providing the set of vectors to a neural network to generate the inclusive regex patterns.
4. The method of claim 1, further comprising: determining, for each respective pattern of the inclusive regex patterns, a respective count of matches between the respective pattern and sets of tokens in a corpus; selecting a regex pattern of the inclusive regex patterns based on a determination that the selected regex pattern has a greatest count of matches; and storing the selected regex pattern in a record of general regex patterns.
5. The method of claim 1, wherein: receiving the feedback message comprises receiving the feedback message from the client-side computing device; the client-side computing device presents, in a user interface, the candidate token, a highlight of the candidate token, and a user interface element; and an interaction with the user interface element causes the client-side computing device to send the feedback message to the client-side computing device.
6. One or more tangible, non-transitory, machine-readable media storing instructions that, when executed by one or more processors, effectuate operations comprising: obtaining a first set of patterns indicating a data label, wherein each respective pattern of the first set of patterns matches with a respective token of a set of test tokens associated with the data label; associating a set of candidate tokens of a text sequence with the data label by removing a first set of tokens from the text sequence based on a match of the first set of tokens with a token of a second set of patterns and selecting the set of candidate tokens from other tokens of the text sequence based on a match between the set of candidate tokens and a pattern of the second set of patterns; in response to receiving an indicator that the set of candidate tokens is incorrectly associated with the data label, updating a token sequence collection to comprise the set of candidate tokens and a context token, wherein the context token is within a pre-determined range of tokens between the context token and a candidate token of the set of candidate tokens; updating the second set of patterns with new patterns that match the set of candidate tokens and the context token; and removing a first pattern from the second set of patterns based on a determination that the first pattern matches with a token sequence associated with the set of test tokens.
7. The media of claim 6, wherein updating the token sequence collection comprises: providing a neural network with the candidate token to determine context position scores for tokens surrounding the candidate token; and selecting the context token based on a determination that a context position score of the context token is greatest.
8. The media of claim 6, wherein updating the second set of patterns comprises: determining whether a count of token sequences of the token sequence collection is greater than ten; and in response to a determination that the count of token sequences is greater than ten, updating the second set of patterns.
9. The media of claim 6, the operations further comprising: obtaining an image; obtaining an orientation direction from a template; determining whether a distance between a first and second token in the orientation direction satisfies each other based on their orientation; and determining that the first and second tokens are part of the set of candidate tokens.
10. The media of claim 6, wherein removing the first pattern of the second set of patterns comprises: determining whether a candidate token that is not labeled with the data label matches with a pattern of the second set of patterns; and in response to a determination that the candidate token that is not labeled with the data label matches with the pattern of the second set of patterns, removing the first pattern of the second set of patterns.
11. The media of claim 6, wherein the pre-determined range is less than or equal to ten.
12. The media of claim 6, wherein updating the second set of patterns comprises using a machine learning model to generate the new patterns based on the context token, wherein the first pattern comprises the context token, and wherein a second pattern of the new patterns does not comprise the context token.
13. A system comprising: one or more processors; and memory storing computer program instructions that, when executed by the one or more processors, cause the one or more processors to effectuate operations comprising: generating a first set of patterns indicating a data label, wherein each pattern of the first set of patterns selects each token of a set of test tokens associated with the data label; associating a candidate token of a text sequence with the data label by removing a first set of tokens from the text sequence based on a match of the first set of tokens with a token of a second set of patterns and selecting the candidate token from other tokens of the text sequence based o
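The feedback-driven update and pruning steps recited in claim 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the applicant's implementation: the claims do not say how new patterns are produced, so `propose_patterns` and `shape_of` are hypothetical stand-ins that pair each preceding context token with the digit shape of the candidate. The threshold value and all sample strings are also invented for the example.

```python
import re

THRESHOLD = 1  # hypothetical: a new pattern needs more matches than this


def shape_of(token):
    """Generalize a token into a regex by replacing each digit with \\d."""
    return "".join(r"\d" if ch.isdigit() else re.escape(ch) for ch in token)


def propose_patterns(context_tokens, candidate):
    """Hypothetical generator: one pattern per preceding context token."""
    return [r"\b" + re.escape(c) + r"\s+" + shape_of(candidate)
            for c in context_tokens]


def update_exclusive(collection, exclusive, sequence, candidate_index,
                     threshold=THRESHOLD):
    """Record a reported false positive with its context, then add new
    exclusive patterns matched by more than `threshold` stored sequences."""
    collection.append(" ".join(sequence))
    context = sequence[:candidate_index]  # simplification: left context only
    for pattern in propose_patterns(context, sequence[candidate_index]):
        support = sum(1 for s in collection if re.search(pattern, s))
        if support > threshold and pattern not in exclusive:
            exclusive.append(pattern)


def prune_exclusive(exclusive, sensitive_sequences):
    """Remove exclusive patterns that also match known sensitive sequences."""
    return [p for p in exclusive
            if not any(re.search(p, s) for s in sensitive_sequences)]


# One earlier false positive is already stored; a second report arrives.
collection = ["order no 123-45-6789 shipped"]
exclusive = []
update_exclusive(collection, exclusive, ["ref", "no", "987-65-4321", "closed"], 2)
print(len(exclusive))  # 1

# The new pattern also matches a sequence known to be sensitive, so it is pruned.
print(prune_exclusive(exclusive, ["ssn no 111-22-3333"]))  # []
```

In this sketch the support check keeps a single report from creating an exclusive pattern on its own, and the pruning pass removes any exclusive pattern that would suppress a token sequence already known to carry the sensitive label.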
Lexical analysis, e.g. tokenisation or collocates · CPC title
Neural networks · CPC title
Phrasal analysis, e.g. finite state techniques or chunking · CPC title
Machine learning · CPC title
Protecting personal data, e.g. for financial or medical purposes · CPC title