Smart identification of indicator text with full-text search or optimized document analysis

US12417306B2 · US · B2

Patent metadata
Publication number: US-12417306-B2
Application number: US-202218068022-A
Country: US
Kind code: B2
Filing date: Dec 19, 2022
Priority date: Dec 19, 2022
Publication date: Sep 16, 2025
Grant date: Sep 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Several aspects for optimizing unstructured document analysis comprise operating a document system, where the document system comprises a plurality of documents comprising unstructured content and a full-text index; receiving a request to identify documents comprising a type of data elements; selecting a sample out of the plurality of documents; determining data elements of the type in the sample of documents; determining an indicator context expression for the type of data elements out of the determined data elements of the type; determining a query for searching, using a search engine, the full-text index using the indicator context expression; and determining the documents in the document system being compliant to the determined query.
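
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy inverted index, the SSN regular expression standing in for a "type of data elements", and the function names build_index and indicator_expression are all assumptions made for illustration. The indicator context expression is approximated here as the most frequent word immediately preceding a detected data element in the sample.

```python
import re
from collections import Counter

# Assumption: the requested "type of data elements" is a US-style SSN.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def build_index(docs):
    """Toy full-text index: lowercase word -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(token, set()).add(doc_id)
    return index


def indicator_expression(docs, sample_ids, pattern):
    """Most common word directly preceding a detected element in the sample."""
    counts = Counter()
    for doc_id in sample_ids:
        text = docs[doc_id]
        for match in pattern.finditer(text):
            before = re.findall(r"[a-z]+", text[:match.start()].lower())
            if before:
                counts[before[-1]] += 1
    return counts.most_common(1)[0][0] if counts else None


docs = {
    1: "Employee SSN: 123-45-6789 on file",
    2: "Meeting notes for Tuesday",
    3: "Applicant SSN 987-65-4321 verified",
    4: "SSN pending for new hire",
    5: "Lunch menu",
}
index = build_index(docs)

# A real system would draw a representative sample; fixed here for clarity.
sample_ids = [1, 2]
term = indicator_expression(docs, sample_ids, SSN)

# The "query" is a single-term lookup against the full-text index.
candidates = index.get(term, set())
print(term, sorted(candidates))  # prints: ssn [1, 3, 4]
```

Only the two sampled documents are scanned with the detector; the remaining documents are reached through the index lookup. Document 4 contains the indicator text but no actual number, which is why the returned set is best treated as a list of candidates for a subsequent full analysis.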

First claim

Opening claim text (preview).

What is claimed is:

1. A method for optimizing unstructured document analysis, said method comprising: operating a document system, said document system comprising a plurality of documents comprising unstructured content and a full-text index; receiving a request to identify documents comprising a type of data elements; selecting a sample out of said plurality of documents, wherein selecting the sample comprises an existing sampling approach to identify a representative, small subset of the large number of documents in a scope of the request; determining data elements of said type in said sample of said plurality of documents; determining an indicator context expression for said type of data elements out of the determined data elements of said type; determining a query for searching, using a search engine, said full-text index using said indicator context expression; and determining the documents in said document system being compliant to said query.

2. The method according to claim 1, wherein a number of documents in said sample is at least 10 times smaller than a second number of documents in said document system.

3. The method according to claim 1, wherein said determining the data elements of said type in said sample comprises: determining a number of relevant sample documents in said sample; and upon determining that said number of relevant sample documents is below a predefined sample threshold value, selecting a larger sample out of said plurality of documents.

4. The method according to claim 1, wherein said determining the documents in said document system being compliant to said query comprises: applying a full analysis related to the documents in said document system.

5. The method according to claim 1, further comprising: determining a result number of the documents being compliant with said query; and upon said result number being determined to be equal or outside predefined boundaries, adjusting said query and repeating said determining the documents in said document system being compliant to said query.

6. The method according to claim 5, further comprising: upon said result number having a value within predefined boundaries and a quality indicator value being larger than a predefined quality indicator threshold value, wherein said quality indicator value being indicative of a quality criterion of said type of data elements, stopping the repeating.

7. The method according to claim 1, further comprising: repeating said steps of: determining indicator context expressions, determining the query for searching said full-text index, and determining the documents in said document system being compliant to said query, thereby redefining a scope of said indicator context expression.

8. The method according to claim 1, wherein said determining said indicator context expression comprises: selecting an expression to a left of a determined data element as one indicator context expression; and selecting another expression to a right of said determined data element as another indicator context expression.

9. The method according to claim 1, wherein determining said indicator context expression comprises: selecting an expression as said indicator context expression in a surrounding of a determined data element, wherein said expression has another format than other elements in said surrounding of said determined data element.

10. The method according to claim 1, wherein determining said indicator context expression comprises: using a trained machine-learning model that has been trained to determine said indicator context expression for a determined data element in a given document, wherein said machine-learning model has been developed by a training of a machine-learning system with documents with labelled selected data elements and related indicator context expressions.

11. The method according to claim 1, wherein determining said indicator context expressions comprises: using an association model adapted for detecting strong relationship patterns between a determined data element and a potential indicator context expression; and confirming said potential indicator context expression as an actual indicator context expression based on an analysis of other documents comprising said relationship of said potential indicator context expression and said determined data element.

12. A computer-implemented document analysis system for optimizing unstructured document analysis, said system comprising: a processor and a memory operatively coupled to said processor, wherein said memory stored program code portions, which, when executed enable said processor to: operate a document system, said document system comprising a plurality of documents comprising unstructured content and a full-text index; receive a request to identify documents comprising a type of data elements; select a sample out of said plurality of documents, wherein selecting the sample comprises an existing sampling approach to identify a representative, small subset of the large number of documents in a scope of the request; determine data elements of said type in said sample of said plurality of documents; determine an indicator context expression for said type of data elements out of the determined data elements of said type; determine a query for searching, using a search engine, said full-text index using said indicator context expression; and determine the documents in said document system being compliant to said query.

13. The system of claim 12, wherein a number of documents in said sample is at least ten times smaller than a second number of documents in said document system.

14. The system of claim 12, wherein, during said determining said data elements of said type in said sample of documents, said processor is also adapted to: determine a number of relevant sample documents in said sample; and upon determining that said number of relevant sample documents is below a predefined sample threshold value, selecting a larger sample out of said plurality of documents.

15. The system of claim 12, wherein during said determining the documents in said document system, said processor is also adapted to: apply a full analysis system related to said document system.

16. The system of claim 12, wherein said processor is also adapted to: determine a result number of the documents being compliant with said query; and upon a determination that said result number is equal or outside predefined boundaries, adjust said query and execute a repetition of said determining the documents in said document system being compliant to said query.

17. The system according to claim 16, wherein said processor, upon said result number having a value within predefined boundaries, is also adapted to: upon a quality indicator value being larger than a predefined quality indicator threshold value, wherein said quality indicator value being indicative of a quality criterion of said type of determined data element, stop said repetition.

18. The system according to claim 12, wherein said processor is also adapted to: repeat said determining indicator context expressions, said determining said query for searching said full-text index, and said determining the documents in said document system being compliant to said query, thereby redefining a scope of said indicator context expressions.

19. The system according to claim 12, wherein said processor, during said determining said indicator context expression, is also adapted to: selec
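
Two of the dependent claims lend themselves to a short illustration: taking the expressions to the left and right of a detected data element (claim 8) and adjusting the query until the result number falls within predefined boundaries (claim 5). The Python sketch below is a hypothetical reading, not the patented method. In particular, the adjustment strategy (an AND query over the first k ranked terms, narrowed when there are too many hits and widened when there are too few), the function names, and the toy index are assumptions; the claims do not say how the query is adjusted.

```python
import re

# Assumption: the data element type is a US-style SSN.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def left_right_context(text, pattern):
    """Return (left_word, right_word) for each detected data element."""
    pairs = []
    for match in pattern.finditer(text):
        left = re.findall(r"\w+", text[:match.start()])
        right = re.findall(r"\w+", text[match.end():])
        pairs.append((left[-1].lower() if left else None,
                      right[0].lower() if right else None))
    return pairs


def refine_query(ranked_terms, search, lower, upper, max_rounds=10):
    """Adjust an AND query until lower < number of hits < upper.

    Hypothetical strategy: use the first k ranked terms; add a term when
    there are too many hits, drop one when there are too few.
    """
    k = 1
    hits = search(ranked_terms[:k])
    for _ in range(max_rounds):
        n = len(hits)
        if lower < n < upper:
            break  # result number is within the boundaries
        new_k = k + 1 if n >= upper else k - 1
        if not 1 <= new_k <= len(ranked_terms):
            break  # no further adjustment possible
        k = new_k
        hits = search(ranked_terms[:k])
    return ranked_terms[:k], hits


# Toy full-text index: term -> set of document ids.
index = {"ssn": {1, 2, 3, 4, 5, 6}, "social": {1, 2, 3}, "security": {1, 2}}


def and_search(terms):
    """Documents containing every term in the query."""
    return set.intersection(*(index.get(t, set()) for t in terms))


print(left_right_context("Customer SSN: 123-45-6789 (verified)", SSN))
print(refine_query(["ssn", "social", "security"], and_search, lower=0, upper=4))
```

With these toy values, the single-term query returns six documents, which reaches the upper boundary, so a second term is added and the loop stops at three hits. A quality indicator check as in claim 6 could be added to the stopping condition, but is left out here because the page does not define how that value is computed.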

Assignees

Inventors

Classifications

  • Indexing; Data structures therefor; Storage structures

  • Reformulation based on results of preceding query

  • Document management systems

  • Protecting personal data, e.g. for financial or medical purposes

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12417306B2 cover?
Several aspects for optimizing unstructured document analysis comprise operating a document system, where the document system comprises a plurality of documents comprising unstructured content and a full-text index; receiving a request to identify documents comprising a type of data elements; selecting a sample out of the plurality of documents; determining data elements of the type in the samp…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F21/6245. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 16, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).