Generation of training data from redacted information

US12468778B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12468778-B2
Application number: US-202017119247-A
Country: US
Kind code: B2
Filing date: Dec 11, 2020
Priority date: Dec 11, 2020
Publication date: Nov 11, 2025
Grant date: Nov 11, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

One embodiment provides a computer implemented method, including: obtaining an information document corresponding to an entity, wherein the information document includes redacted information spans; identifying an entity type for each of the redacted information spans, wherein the entity type identifies a relationship between a redacted information span and at least one other entity within the information document; replacing the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information and wherein the replacing includes maintaining relationships of the redacted information spans; and controlling bias within the replacement entities, wherein the controlling includes detecting bias within the replacement entities.
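The replacement step in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract describes replacement "in view of a frequency distribution" while maintaining relationships, and the sketch below swaps in plain weighted sampling for that. The placeholder format `[TYPE_N]`, the `FREQUENCIES` table and its numbers, and the function name `fill_redactions` are all assumptions made for illustration.

```python
import random
import re

# Hypothetical frequency table per entity type (illustrative numbers, not real data).
FREQUENCIES = {
    "PERSON": {"Alex": 5, "Sam": 3, "Jordan": 2},
    "CITY": {"Springfield": 6, "Riverton": 4},
}

# Assumed placeholder format for a redacted span, e.g. "[PERSON_1]".
PLACEHOLDER = re.compile(r"\[(?P<type>[A-Z]+)_(?P<id>\d+)\]")


def fill_redactions(text, frequencies, rng):
    """Replace each placeholder with a value sampled for its entity type.

    The same placeholder always maps to the same replacement, so references
    to one entity stay consistent across the document.
    """
    chosen = {}  # placeholder string -> replacement value

    def pick(match):
        key = match.group(0)
        if key not in chosen:
            dist = frequencies[match.group("type")]
            used = set(chosen.values())
            # Prefer unused values so distinct entities stay distinct;
            # fall back to the full list if every value is already taken.
            names = [n for n in dist if n not in used] or list(dist)
            weights = [dist[n] for n in names]
            chosen[key] = rng.choices(names, weights=weights, k=1)[0]
        return chosen[key]

    return PLACEHOLDER.sub(pick, text), chosen


text = "[PERSON_1] moved to [CITY_1]. Later [PERSON_1] met [PERSON_2]."
filled, mapping = fill_redactions(text, FREQUENCIES, random.Random(0))
print(filled)
```

Caching the choice per placeholder is what keeps the two mentions of `[PERSON_1]` identical, which is one simple reading of "maintaining relationships"; the claims describe a language model with output constraints for this purpose rather than a lookup table.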

First claim

Opening claim text (preview).

What is claimed is:

1. A computer implemented method, comprising: obtaining an information document corresponding to an entity, wherein the information document comprises redacted information spans which redact sensitive or personal information in the information document; identifying an entity type for each of the redacted information spans, the entity type related to an entire category of the information span, identifying the entity type by generating, using a neural network, at least three vectors for a given of the redacted information spans, one vector representing a left context window, one vector representing an information span, and one vector representing a right context window, the three vectors represented as a token and classified based upon context of the token; replacing the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information in context with other replacement entities within the information document and wherein the replacing comprises maintaining relationships of the redacted information spans through the replacement entities by utilizing at least one language model to predict entity masks which maintain a context of the information document and relationships between information spans and introducing constraints on output of the language model to conform to the frequency distribution; and generating, from at least the information document having the replaced redacted information spans, a training dataset used to train a machine-learning model.

2. The computer implemented method of claim 1, wherein the obtaining comprises obtaining a publicly available information document, converting the publicly available information document into a text-only information document, and redacting personal information included in the text-only information document.

3. The computer implemented method of claim 2, wherein the redacting comprises at least one of: a removal of the personal information and replacing the personal information.

4. The computer implemented method of claim 2, wherein the identifying an entity type comprises performing, using a neural network, fine-grained classification on the personal information before redaction.

5. The computer implemented method of claim 1, wherein the performing a classification on the token comprises performing a coarse grained classification.

6. The computer implemented method of claim 5, wherein the identifying an accuracy of the classifying comprises determining an accuracy of the coarse-grained classification.

7. The computer implemented method of claim 1, wherein the replacing comprises selecting a replacement entity for a redacted information span that maintains a context of surrounding text.

8. The computer implemented method of claim 1, wherein the replacing comprises predicting, using at least one language model and in view of constraints, an entity mask for each of the redacted information spans.

9. The computer implemented method of claim 1, wherein the detecting comprises evaluating the information document comprising the replacement entities, wherein the converting comprises converting unstructured data of the evaluated information document comprising the replacement entities into a structured dataset, and wherein the utilizing a bias detector comprises performing, using a structured data bias detector, bias detection on the structured dataset.

10. An apparatus, comprising: at least one processor; and a computer readable storage medium having computer readable program code embodied therewith and executable by the at least one processor: wherein the computer readable program code is configured to obtain an information document corresponding to an entity, wherein the information document comprises redacted information spans which redact sensitive or personal information in the information document; wherein the computer readable program code is configured to identify an entity type for each of the redacted information spans, the entity type related to an entire category of the information span, the entity type identified by generating, using a neural network, at least one three vectors for a given of the redacted information spans, one vector representing a left context window, one vector representing an information span, and one vector representing a right context window, the three vectors represented as a token and classified based upon context of the token; wherein the computer readable program code is configured to replace the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information in context with other replacement entities within the information document and wherein the replacing comprises utilizing replacement entities that maintain relationships of the redacted information spans through the replacement entities by utilizing at least one language model to predict entity masks which maintain a context of the information document and relationships between information spans and introducing constraints on output of the language model to conform to the frequency distribution; and wherein the computer readable program code is configured to generate, from at least the information document having the replaced redacted information spans, a training dataset used to train a machine-learning model.

11. A computer program product, comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by a processor: wherein the computer readable program code is configured to obtain an information document corresponding to an entity, wherein the information document comprises redacted information spans which redacts sensitive or personal information in the information document; wherein the computer readable program code is configured to identify an entity type for each of the redacted information spans, the entity type related to an entire category of the information span, the entity type identified by generating, using a neural network, at least three vectors for a given of the redacted information spans, one vector representing a left context window, one vector representing an information span, and one vector representing a right context window, the three vectors represented as a token and classified based upon context of the token; wherein the computer readable program code is configured to replace the redacted information spans with replacement entities corresponding to the entity type of a given redacted information span, wherein the replacing is performed in view of a frequency distribution of actual information in context with other replacement entities within the information document and wherein the replacing comprises maintaining relationships of the redacted information spans through the replacement entities by utilizing at least one language model to predict entity masks which maintain a context of the information document and relationships between information spans and introducing constraints on output of the language model to conform to the frequency distribution; and wherein the computer readable program code is configured to generate, from at least the information document having the replaced redacted information spans, a training dataset used to train a machine-learning model.

12. The computer program product of claim 11, wherein the obtaining comprises obtaining a publicly ava
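The independent claims describe typing each redacted span from three pieces of context: a left window, the span itself, and a right window. As a rough sketch of only the windowing part, the helper below splits a token list around a span. The function name `context_windows`, the `width` parameter, and the `[REDACTED]` marker are assumptions for illustration; the claims specify that a neural network produces a vector for each window, which this sketch does not attempt.

```python
def context_windows(tokens, start, end, width=3):
    """Split tokens around the span [start, end) into three windows.

    Returns (left, span, right), where left and right hold at most
    `width` tokens each and are shorter near the document edges.
    """
    left = tokens[max(0, start - width):start]
    span = tokens[start:end]
    right = tokens[end:end + width]
    return left, span, right


# Hypothetical redacted sentence; "[REDACTED]" is an assumed marker.
tokens = "Dr. [REDACTED] works at the clinic in [REDACTED] every Monday".split()

# The first redacted span sits at index 1, the second at index 7.
print(context_windows(tokens, 1, 2, width=2))
print(context_windows(tokens, 7, 8, width=2))
```

In a fuller pipeline each of the three windows would be encoded separately and the results passed to a classifier; here the left window `["Dr."]` already hints that the first span is a person, while `["clinic", "in"]` hints that the second is a location.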

Assignees

IBM

Inventors

Classifications

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • by anonymising data, e.g. decorrelating personal data from the owner's identification · CPC title

  • Neural networks · CPC title

  • G06F40/166 · Primary

    Editing, e.g. inserting or deleting · CPC title

  • Supervised learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12468778B2 cover?
One embodiment provides a computer implemented method, including: obtaining an information document corresponding to an entity, wherein the information document includes redacted information spans; identifying an entity type for each of the redacted information spans, wherein the entity type identifies a relationship between a redacted information span and at least one other entity within the i…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/166. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 11, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).