Removing personal information from text using multiple levels of redaction
US-2021073461-A1 · Mar 11, 2021 · US
US12468778B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12468778-B2 |
| Application number | US-202017119247-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 11, 2020 |
| Priority date | Dec 11, 2020 |
| Publication date | Nov 11, 2025 |
| Grant date | Nov 11, 2025 |
Official abstract text for this publication.
One embodiment provides a computer implemented method, including: obtaining an information document corresponding to an entity, wherein the information document includes redacted information spans; identifying an entity type for each of the redacted information spans, wherein the entity type identifies a relationship between a redacted information span and at least one other entity within the information document; replacing the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information and wherein the replacing includes maintaining relationships of the redacted information spans; and controlling bias within the replacement entities, wherein the controlling includes detecting bias within the replacement entities.
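The replacement step in the abstract can be illustrated with a minimal sketch. It assumes redacted spans appear as placeholders of the form `[TYPE_n]` and that a small per-type frequency table is available; the placeholder format, the `FREQUENCIES` table, and the `fill_redactions` helper are all hypothetical and not taken from the patent. Plain weighted sampling stands in here for the language model the claims describe, and bias detection is not shown.

```python
import random
import re

# Hypothetical frequency table: entity type -> {candidate: relative frequency}.
FREQUENCIES = {
    "PERSON": {"Alex Kim": 0.5, "Sam Lee": 0.3, "Jo Park": 0.2},
    "CITY": {"Springfield": 0.7, "Riverton": 0.3},
}

# Assumed placeholder format for a redacted span, e.g. [PERSON_1].
PLACEHOLDER = re.compile(r"\[([A-Z]+)_(\d+)\]")


def fill_redactions(text, frequencies, seed=0):
    """Replace each placeholder with an entity sampled by frequency.

    The same placeholder always gets the same replacement, and different
    placeholders of one type get different replacements while unused
    candidates remain, so co-reference in the text is preserved.
    """
    rng = random.Random(seed)
    chosen = {}  # (entity type, index) -> replacement entity

    def substitute(match):
        key = (match.group(1), match.group(2))
        if key not in chosen:
            table = frequencies[key[0]]
            used = {v for (t, _), v in chosen.items() if t == key[0]}
            # Fall back to the full table if every candidate is already used.
            candidates = [c for c in table if c not in used] or list(table)
            weights = [table[c] for c in candidates]
            chosen[key] = rng.choices(candidates, weights=weights, k=1)[0]
        return chosen[key]

    return PLACEHOLDER.sub(substitute, text), chosen


redacted = "[PERSON_1] moved to [CITY_1]. Later [PERSON_1] met [PERSON_2]."
filled, mapping = fill_redactions(redacted, FREQUENCIES)
print(filled)
```

Keeping a per-document mapping is one simple way to maintain relationships between spans; a production system would presumably also condition each choice on the surrounding text, as the claims require.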
Opening claim text (preview).
What is claimed is:
1. A computer implemented method, comprising: obtaining an information document corresponding to an entity, wherein the information document comprises redacted information spans which redact sensitive or personal information in the information document; identifying an entity type for each of the redacted information spans, the entity type related to an entire category of the information span, identifying the entity type by generating, using a neural network, at least three vectors for a given of the redacted information spans, one vector representing a left context window, one vector representing an information span, and one vector representing a right context window, the three vectors represented as a token and classified based upon context of the token; replacing the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information in context with other replacement entities within the information document and wherein the replacing comprises maintaining relationships of the redacted information spans through the replacement entities by utilizing at least one language model to predict entity masks which maintain a context of the information document and relationships between information spans and introducing constraints on output of the language model to conform to the frequency distribution; and generating, from at least the information document having the replaced redacted information spans, a training dataset used to train a machine-learning model.
2. The computer implemented method of claim 1, wherein the obtaining comprises obtaining a publicly available information document, converting the publicly available information document into a text-only information document, and redacting personal information included in the text-only information document.
3. The computer implemented method of claim 2, wherein the redacting comprises at least one of: a removal of the personal information and replacing the personal information.
4. The computer implemented method of claim 2, wherein the identifying an entity type comprises performing, using a neural network, fine-grained classification on the personal information before redaction.
5. The computer implemented method of claim 1, wherein the performing a classification on the token comprises performing a coarse grained classification.
6. The computer implemented method of claim 5, wherein the identifying an accuracy of the classifying comprises determining an accuracy of the coarse-grained classification.
7. The computer implemented method of claim 1, wherein the replacing comprises selecting a replacement entity for a redacted information span that maintains a context of surrounding text.
8. The computer implemented method of claim 1, wherein the replacing comprises predicting, using at least one language model and in view of constraints, an entity mask for each of the redacted information spans.
9. The computer implemented method of claim 1, wherein the detecting comprises evaluating the information document comprising the replacement entities, wherein the converting comprises converting unstructured data of the evaluated information document comprising the replacement entities into a structured dataset, and wherein the utilizing a bias detector comprises performing, using a structured data bias detector, bias detection on the structured dataset.
10. An apparatus, comprising: at least one processor; and a computer readable storage medium having computer readable program code embodied therewith and executable by the at least one processor: wherein the computer readable program code is configured to obtain an information document corresponding to an entity, wherein the information document comprises redacted information spans which redact sensitive or personal information in the information document; wherein the computer readable program code is configured to identify an entity type for each of the redacted information spans, the entity type related to an entire category of the information span, the entity type identified by generating, using a neural network, at least one three vectors for a given of the redacted information spans, one vector representing a left context window, one vector representing an information span, and one vector representing a right context window, the three vectors represented as a token and classified based upon context of the token; wherein the computer readable program code is configured to replace the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information in context with other replacement entities within the information document and wherein the replacing comprises utilizing replacement entities that maintain relationships of the redacted information spans through the replacement entities by utilizing at least one language model to predict entity masks which maintain a context of the information document and relationships between information spans and introducing constraints on output of the language model to conform to the frequency distribution; and wherein the computer readable program code is configured to generate, from at least the information document having the replaced redacted information spans, a training dataset used to train a machine-learning model.
11. A computer program product, comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by a processor: wherein the computer readable program code is configured to obtain an information document corresponding to an entity, wherein the information document comprises redacted information spans which redacts sensitive or personal information in the information document; wherein the computer readable program code is configured to identify an entity type for each of the redacted information spans, the entity type related to an entire category of the information span, the entity type identified by generating, using a neural network, at least three vectors for a given of the redacted information spans, one vector representing a left context window, one vector representing an information span, and one vector representing a right context window, the three vectors represented as a token and classified based upon context of the token; wherein the computer readable program code is configured to replace the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information in context with other replacement entities within the information document and wherein the replacing comprises maintaining relationships of the redacted information spans through the replacement entities by utilizing at least one language model to predict entity masks which maintain a context of the information document and relationships between information spans and introducing constraints on output of the language model to conform to the frequency distribution; and wherein the computer readable program code is configured to generate, from at least the information document having the replaced redacted information spans, a training dataset used to train a machine-learning model.
12. The computer program product of claim 11, wherein the obtaining comprises obtaining a publicly ava
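The entity-type identification step in the independent claims works from three views of the text around each redacted span: a left context window, the span itself, and a right context window. The sketch below shows only that segmentation on whitespace tokens. The `context_windows` helper, the `width` parameter, and the `[REDACTED]` marker are hypothetical names chosen for illustration; the neural network that would turn each segment into a vector and classify it is not shown.

```python
def context_windows(tokens, start, end, width=3):
    """Split tokens around the span tokens[start:end].

    Returns (left, span, right), where left and right hold at most
    `width` tokens on either side of the span.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty range inside tokens")
    left = tokens[max(0, start - width):start]
    span = tokens[start:end]
    right = tokens[end:end + width]
    return left, span, right


tokens = "The patient [REDACTED] was admitted to the clinic on Monday".split()
left, span, right = context_windows(tokens, 2, 3, width=2)
print(left, span, right)
# prints ['The', 'patient'] ['[REDACTED]'] ['was', 'admitted']
```

Each of the three segments would then be encoded separately, so a classifier can weigh what precedes the span differently from what follows it.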
Lexical analysis, e.g. tokenisation or collocates · CPC title
by anonymising data, e.g. decorrelating personal data from the owner's identification · CPC title
Neural networks · CPC title
Editing, e.g. inserting or deleting · CPC title
Supervised learning · CPC title