Method and system for accurately detecting, extracting and representing redacted text blocks in a document

US10733434B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10733434-B2
Application number: US-201816139884-A
Country: US
Kind code: B2
Filing date: Sep 24, 2018
Priority date: Sep 24, 2018
Publication date: Aug 4, 2020
Grant date: Aug 4, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method, system and a computer program product are provided for automatically detecting redaction blocks in an image file document by analyzing the document to identify any redaction block areas and then detecting location information for each redaction block area identified in the document which may be mapped to any associated text fragments in the document based on the location information for each redaction block area and text fragment in the document.
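
The mapping step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a redaction block is judged to be "associated" with a text fragment, so the rule used here (the two boxes share a text line and sit within a small horizontal gap) is an assumption, as are the names `Box`, `Fragment`, `map_redactions`, and the `max_gap` parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


@dataclass(frozen=True)
class Fragment:
    """A piece of OCR text together with its location (hypothetical type)."""
    text: str
    box: Box


def vertical_overlap(a: Box, b: Box) -> int:
    """Number of pixel rows shared by the two boxes (0 if none)."""
    top = max(a.y, b.y)
    bottom = min(a.y + a.h, b.y + b.h)
    return max(0, bottom - top)


def horizontal_gap(a: Box, b: Box) -> int:
    """Horizontal distance between the boxes (0 if they overlap)."""
    return max(0, max(a.x, b.x) - min(a.x + a.w, b.x + b.w))


def map_redactions(redactions, fragments, max_gap=40):
    """Map each redaction index to the text of nearby fragments.

    Assumed rule: a fragment is associated with a redaction block when the
    two share at least one pixel row and are at most max_gap pixels apart.
    """
    return {
        i: [
            f.text
            for f in fragments
            if vertical_overlap(red, f.box) > 0
            and horizontal_gap(red, f.box) <= max_gap
        ]
        for i, red in enumerate(redactions)
    }


redactions = [Box(100, 10, 80, 12)]
fragments = [
    Fragment("Name:", Box(40, 10, 50, 12)),
    Fragment("Address:", Box(40, 60, 70, 12)),
    Fragment("DOB", Box(190, 11, 30, 12)),
]
mapping = map_redactions(redactions, fragments)
```

In this example the redaction block is associated with "Name:" and "DOB", which lie on the same line, and not with "Address:", which lies further down the page. A production system would likely need a more careful association rule for tables and multi-line blocks.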

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for automatically detecting redaction blocks in a document comprising: receiving, by an information handling system comprising a processor and a memory, the document as an image file; analyzing, by the information handling system, the document to identify any redaction block areas in the document; detecting, by the information handling system, location information for each redaction block area identified in the document; applying, by the information handling system, optical character recognition to the document to detect text fragments in the document; detecting, by the information handling system, location information for each text fragment identified in the document; and mapping, by the information handling system, each redaction block area to any associated text fragments in the document based on the location information for each redaction block area and text fragment in the document, wherein the redaction block areas are redacted block areas.
2. The method of claim 1, further comprising classifying, by the information handling system, each identified redaction block area as a type Ti selected from a group consisting of a text block, a table cell, a checkbox, and unknown.
3. The method of claim 2, where the type Ti is the checkbox which classifies a redaction block area located over a group of labels in the document.
4. The method of claim 1, where each redaction block area is a blacked-out area in the document.
5. The method of claim 1, where detecting location information comprises computing, by the information handling system, a geometric shape and x, y coordinates for each redaction block area.
6. The method of claim 1, further comprising inserting, by the information handling system, a sentinel string of predetermined characters into the document for each detected redaction block area.
7. The method of claim 1, where analyzing the document comprises applying a redaction block detection process to scan each line of the image file to identify any redacted text blocks by locating a threshold number T1 of consecutive black pixels that are aligned in a threshold number T1 of consecutive rows.
8. An information handling system comprising: one or more processors; a memory coupled to at least one of the processors; a set of instructions stored in the memory and executed by at least one of the processors to automatically detect redaction blocks in a document, wherein the set of instructions are executable to perform actions of: receiving, by the system, the document as an image file; analyzing, by the system, the document to identify any redaction block areas in the document; detecting, by the system, location information for each redaction block area identified in the document; applying, by the system, optical character recognition to the document to detect text fragments in the document; detecting, by the system, location information for each text fragment identified in the document; and mapping, by the system, each redaction block area to any associated text fragments in the document based on the location information for each redaction block area and text fragment in the document, wherein the redaction block areas are redacted block areas.
9. The information handling system of claim 8, where the set of instructions are executable to classify, by the system, each identified redaction block area as a type Ti selected from a group consisting of a text block, a table cell, a checkbox, and unknown.
10. The information handling system of claim 9, where the type Ti is the checkbox which classifies a redaction block area located over a group of labels in the document.
11. The information handling system of claim 8, where each redaction block area is a blacked-out area in the document.
12. The information handling system of claim 8, where the set of instructions are executable to detect location information by computing a geometric shape and x, y coordinates for each redaction block area.
13. The information handling system of claim 8, where the set of instructions are executable to insert, by the system, a sentinel string of predetermined characters into the document for each detected redaction block area.
14. The information handling system of claim 8, where the set of instructions are executable to analyze the document by applying a redaction block detection process to scan each line of the image file to identify any redacted text blocks by locating a threshold number T1 of consecutive black pixels that are aligned in a threshold number T1 of consecutive rows.
15. A computer program product stored in a computer readable storage medium, comprising computer instructions that, when executed by an information handling system, causes the system to automatically detecting redaction blocks in a document by performing actions comprising: receiving, by the system, the document as an image file; analyzing, by the information handling system, the document to identify any redaction blocks in the document, wherein each redaction block is a redacted block; detecting, by the information handling system, location information for each redaction block identified in the document; applying, by the information handling system, optical character recognition to the document to detect text fragments in the document; detecting, by the information handling system, location information for each text fragment identified in the document; mapping, by the information handling system, each redaction block to any associated text fragments in the document based on the location information for each redaction block and text fragment in the document; classifying, by the system, each identified redaction block as a redaction block type Ti selected from a group consisting of a text block, a table cell, a checkbox, and unknown; and generating, by the system, an output file which identifies, for the document, each text fragment and associated text fragment location information, along with each redaction block and associated redaction block fragment location information and redaction block type Ti.
16. The computer program product of claim 15, further comprising computer instructions that, when executed by the information handling system, causes the system to insert a sentinel string of predetermined characters into the document for each detected redaction block.
17. The computer program product of claim 15, where analyzing the document comprises applying a redaction block detection process to scan each line of the image file to identify any redacted text blocks by locating a threshold number T1 of consecutive black pixels that are aligned in a threshold number T1 of consecutive rows.
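
The row-scanning detection recited in claims 7, 14, and 17 can be sketched as follows. This is an illustrative reading of the claim language, not the patented implementation: the image is assumed to be already binarized into rows of 0 and 1 (1 meaning black), "aligned" is assumed to mean the same start and end columns in every row, and the function names and the `min_run` and `min_rows` parameters are hypothetical. The claim text as shown uses one threshold symbol for both the run length and the row count, so the two defaults are set equal.

```python
def row_runs(row, min_run):
    """Return (start, end_exclusive) for each run of black pixels (value 1)
    in one image row that is at least min_run pixels long."""
    runs, start = [], None
    for x, px in enumerate(row):
        if px and start is None:
            start = x
        elif not px and start is not None:
            if x - start >= min_run:
                runs.append((start, x))
            start = None
    if start is not None and len(row) - start >= min_run:
        runs.append((start, len(row)))
    return runs


def detect_redaction_blocks(image, min_run=3, min_rows=3):
    """Scan the image row by row and return (x, y, width, height) for each
    group of identical black runs stacked in at least min_rows rows."""
    blocks = []
    active = {}  # (start, end) -> index of the first row where the run appeared
    # A trailing empty row flushes blocks that reach the bottom edge.
    for y, row in enumerate(list(image) + [[]]):
        current = set(row_runs(row, min_run))
        for span in list(active):
            if span not in current:
                first = active.pop(span)
                if y - first >= min_rows:
                    blocks.append((span[0], first, span[1] - span[0], y - first))
        for span in current:
            active.setdefault(span, y)
    return blocks


image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
]
blocks = detect_redaction_blocks(image)
```

Here the 4-by-3 black rectangle is reported as one block at x=2, y=1, while the two-pixel run in the last row is ignored because it is shorter than `min_run`. The returned tuples correspond to the "geometric shape and x, y coordinates" of claims 5 and 12 for the rectangular case only; scanned documents with skew or ragged edges would need a tolerance on the column alignment.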

Assignees

IBM

Inventors

Classifications

  • Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors · CPC title

  • Classification of content, e.g. text, photographs or tables · CPC title

  • G06V30/412 · Primary

    Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10733434B2 cover?
A computer-implemented method, system and a computer program product are provided for automatically detecting redaction blocks in an image file document by analyzing the document to identify any redaction block areas and then detecting location information for each redaction block area identified in the document which may be mapped to any associated text fragments in the document based on the l…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06V30/412. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 4, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).