Methods and devices for generating sensitive text detectors

US2023214591A1 · US · A1

Patent metadata
Publication number: US-2023214591-A1
Application number: US-202117565770-A
Country: US
Kind code: A1
Filing date: Dec 30, 2021
Priority date: Dec 30, 2021
Publication date: Jul 6, 2023
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to one or more processors, communicative with one or more computer-readable media, are configured to automatically generate a sensitive text detector including a regular expression or keyword. A set of text inputs, including sensitive text, are received. The sensitive text is extracted from the set of text inputs. Based on the extracted sensitive text, one or both of the regular expression and the keyword are generated. The generated regular expression and/or keyword are used to generate a sensitive text detector for sensitive text detection.
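
As an illustration only, the keyword path described in the abstract might be sketched as follows. The sketch borrows three ideas from dependent claims 10, 12, and 13 (a preset distance around the sensitive text, a stop-word list, and instance counts). The stop-word list, the window size, the sample sentences, and every function name here are hypothetical and are not taken from the patent.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the publication does not specify one.
STOP_WORDS = {"the", "is", "my", "a", "to", "and", "please", "use"}

def candidate_keywords(text, sensitive, window=3):
    """Return non-stop words within `window` words before or after `sensitive`."""
    words = re.findall(r"\w+", text.lower())
    target = re.findall(r"\w+", sensitive.lower())
    n = len(target)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            nearby = words[max(0, i - window):i] + words[i + n:i + n + window]
            return [w for w in nearby if w not in STOP_WORDS]
    return []  # sensitive text not found in this input

def generate_keyword(samples, window=3):
    """Pick the candidate keyword with the most instances across all samples."""
    counts = Counter()
    for text, sensitive in samples:
        counts.update(candidate_keywords(text, sensitive, window))
    return counts.most_common(1)[0][0] if counts else None

def make_keyword_detector(keyword):
    """Build a detector that flags text containing the keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return lambda text: bool(pattern.search(text))

# Made-up labelled inputs: (text, sensitive text inside it).
samples = [
    ("My password is hunter2 for the wiki", "hunter2"),
    ("Please use password Tr0ub4dor to log in", "Tr0ub4dor"),
    ("The admin password is s3cret", "s3cret"),
]
keyword = generate_keyword(samples)
detect = make_keyword_detector(keyword)
print(keyword, detect("Reset your PASSWORD here"))
```

This detector only flags that a keyword is present. Extracting the sensitive text itself, and the co-occurrence measure mentioned in claim 13, are left out of the sketch.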

First claim

Opening claim text (preview).

1. A computer-implemented method of generating a sensitive text detector, comprising: receiving a set of text inputs comprising sensitive text; extracting the sensitive text from the set of text inputs; and generating, based on the extracted sensitive text, one or more of: a regular expression; and a keyword; and generating the sensitive text detector based on the generated one or more of the regular expression and the keyword.

2. The method of claim 1, wherein generating the regular expression comprises: converting the extracted sensitive text into one or more regular expressions; generating a population comprising the one or more regular expressions; evolving the population by: transforming at least one of the one or more regular expressions; adding the at least one transformed regular expression to the population; and determining a fitness score for each regular expression in the population; iterating the evolution of the population until a predetermined condition is met; and after iterating the evolution of the population, generating, based on each fitness score, the regular expression.

3. The method of claim 2, wherein determining the fitness score comprises: for each of multiple training samples in a training set of training samples, each training sample either comprising sensitive text or not comprising sensitive text: identifying text within the training sample based on the regular expression; and determining the fitness score based on the identified text.

4. The method of claim 3, wherein determining the fitness score based on the extracted text comprises: determining one or more of: for each training sample comprising sensitive text, a degree of similarity between the identified text and the sensitive text; and for each training sample not comprising sensitive text, an amount of the identified text.

5. The method of claim 3, wherein the fitness score satisfies the following formula: f(r) = f_s(r) + f_char(r) + L_score(r), wherein f_s(r) is based on a degree of similarity between all of the identified text and the sensitive text, f_char(r) is based on a degree of similarity between a portion of the identified text and the sensitive text, and L_score is based on a length of the identified text relative to a length of the sensitive text.

6. The method of claim 2, wherein generating the regular expression comprises: determining that, among the fitness scores, at least one of the fitness scores is in a steady state; and in response to determining that the at least one of the fitness scores is in the steady state, generating the regular expression associated with the at least one of the fitness scores in the steady state.

7. The method of claim 2, wherein transforming at least one of the one or more regular expressions comprises one or more of: randomly modifying a portion of the at least one of the one or more regular expressions; and exchanging a portion of a first one of the one or more regular expressions with a portion of a second one of the one or more regular expressions.

8. The method of claim 1, further comprising: inputting text to the sensitive text detector, wherein the inputted text comprises sensitive text associated with one or more regular expressions; and using the sensitive text detector to extract the sensitive text from the inputted text, based on the one or more regular expressions corresponding to the generated regular expression.

9. The method of claim 1, wherein: extracting the sensitive text from the set of text inputs comprises: extracting a range of text based on a location of the sensitive text within the set of text inputs; and the generating comprises: identifying one or more candidate keywords within the range of text; and generating, based on the one or more candidate keywords, the keyword.

10. The method of claim 9, wherein extracting the range of text comprises: extracting a combination of the sensitive text and text that is one or more of: a preset distance before the sensitive text; and a preset distance after the sensitive text.

11. The method of claim 9, wherein identifying the one or more candidate keywords comprises: filtering the range of text by removing one or more words from the range of text; and identifying the one or more candidate keywords within the filtered range of text.

12. The method of claim 11, wherein filtering the range of text comprises: comparing each word in the range of text to each of multiple words in a list of stop words; and based on the comparison, removing from the range of text any word contained in the list of stop words.

13. The method of claim 9, wherein generating, based on the one or more candidate keywords, the keyword comprises: calculating one or more of: a co-occurrence in the range of text of each candidate keyword with at least one other candidate keyword; and a number of instances of each candidate keyword in the range of text.

14. The method of claim 1, further comprising: inputting text to the sensitive text detector, wherein the inputted text comprises sensitive text; and using the sensitive text detector to extract the sensitive text from the inputted text, based on one or more words in the sensitive text corresponding to the generated keyword.

15. The method of claim 1, further comprising: applying a check function the generated regular expression or the generated keyword.

16. A non-transitory computer-readable medium having stored thereon computer program code configured, when executed by one or more processors, to cause the one or more processors to perform a method comprising: receiving a set of text inputs comprising sensitive text; extracting the sensitive text from the set of text inputs; and generating, based on the extracted sensitive text, one or more of: a regular expression; and a keyword; and generating a sensitive text detector based on the generated one or more of the regular expression and the keyword.

17. A computing device for generating a sensitive text detector, comprising: one or more processors configured to: receive a set of text inputs comprising sensitive text; extract the sensitive text from the set of text inputs; and generate, based on the extracted sensitive text, one or more of: a regular expression; and a keyword; and generate the sensitive text detector based on the generated one or more of the regular expression and the keyword.

18. The computing device of claim 17, wherein: the one or more processors are further configured to: receive text input, wherein the text input comprises sensitive text associated with one or more regular expressions; and use the sensitive text detector to extract the sensitive text from the text input, based on the one or more regular expressions corresponding to the generated regular expression.

19. The computing device of claim 17, wherein: the one or more processors are further configured to: receive text input, wherein the inputted text comprises sensitive text; and use the sensitive text detector to extract the sensitive text from the text input, based on one or more words in the sensitive text corresponding to the generated keyword.

20. The computing device of claim 17, wherein: the one or more processors are further configured to: convert the extracted sensitive text into one or more regular expressions; generate a population comprising the one or more regular expressi
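
The evolutionary loop of claims 2, 6, and 7 can be illustrated with a small sketch, under several assumptions that are not in the patent: regular expressions are held as token lists, the random modification only generalizes a literal into a character class, the fitness is a simplified stand-in (similarity on samples with sensitive text, minus the matched fraction on samples without it) rather than the formula of claim 5, and "steady state" is read as no improvement for `patience` generations. All names, parameters, and sample strings are hypothetical.

```python
import random
import re
from difflib import SequenceMatcher

CLASSES = (r"\d", "[a-z]", "[A-Z]")  # hypothetical generalization targets

def to_tokens(sensitive):
    """Convert sensitive text into a literal regex, one escaped token per character."""
    return [re.escape(ch) for ch in sensitive]

def mutate(tokens, rng):
    """Random modification: generalize one literal token into a character class."""
    tokens = list(tokens)
    i = rng.randrange(len(tokens))
    tok = tokens[i]
    if tok in CLASSES:
        return tokens  # already generalized
    if tok.isdigit():
        tokens[i] = r"\d"
    elif tok.islower():
        tokens[i] = "[a-z]"
    elif tok.isupper():
        tokens[i] = "[A-Z]"
    return tokens

def crossover(a, b, rng):
    """Exchange: join a prefix of one regex with a suffix of another."""
    child = a[:rng.randrange(len(a) + 1)] + b[rng.randrange(len(b) + 1):]
    return child or list(a)  # never return an empty regex

def fitness(tokens, positives, negatives):
    """Simplified score: reward similarity on positives, penalize matches on negatives."""
    pattern = re.compile("".join(tokens))
    score = 0.0
    for text, sensitive in positives:
        m = pattern.search(text)
        score += SequenceMatcher(None, m.group(0) if m else "", sensitive).ratio()
    for text in negatives:
        matched = sum(len(m.group(0)) for m in pattern.finditer(text))
        score -= matched / max(len(text), 1)
    return score / (len(positives) + len(negatives))

def evolve(positives, negatives, generations=100, size=30, patience=30, seed=0):
    """Evolve a population of regexes; stop once the best score stops improving."""
    rng = random.Random(seed)
    score = lambda t: fitness(t, positives, negatives)
    population = [to_tokens(s) for _, s in positives]
    best, stale = max(population, key=score), 0
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(size)]
        children += [crossover(rng.choice(population), rng.choice(population), rng)
                     for _ in range(size)]
        # Stable sort with children listed first, so ties favor newer regexes.
        population = sorted(children + population, key=score, reverse=True)[:size]
        if score(population[0]) > score(best):
            best, stale = population[0], 0
        else:
            stale += 1
        if stale >= patience:
            break  # treated here as the "steady state" condition
    return "".join(best), score(best)

# Made-up training samples: (text, sensitive text) pairs and texts without any.
positives = [("call 555-1234 today", "555-1234"), ("fax: 555-9876", "555-9876")]
negatives = ["order 42 shipped on time"]
regex, best_score = evolve(positives, negatives)
print(regex, round(best_score, 3))
```

Because the best regex is replaced only by a strictly better one, the returned score can never fall below that of the literal regexes the run started from. The specific regex printed depends on the seed and on the sample data.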

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023214591A1 cover?
The present disclosure relates to one or more processors, communicative with one or more computer-readable media, are configured to automatically generate a sensitive text detector including a regular expression or keyword. A set of text inputs, including sensitive text, are received. The sensitive text is extracted from the set of text inputs. Based on the extracted sensitive text, one or both…
Who is the assignee on this patent?
Huawei Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/279. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 6, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).