Text mining based on document structure information extraction

US12277389B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12277389-B2
Application number: US-202117315447-A
Country: US
Kind code: B2
Filing date: May 10, 2021
Priority date: May 10, 2021
Publication date: Apr 15, 2025
Grant date: Apr 15, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Frequent sequences extracted from a set of documents according to a common rule are obtained. Based on comparing occurrence frequencies of various sequences, confidence of the first frequent sequence being a label expression representing a document part in a target document is evaluated. Keywords are extracted from the target document based on evaluation of the confidence.
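
The first step of the abstract, obtaining frequent sequences and comparing their occurrence frequencies against longer sequences, can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the `min_freq` and `max_len` parameters, the sample documents, and the specific maximality test (a sequence is kept only if every one-character extension occurs strictly less often) are all assumptions made for illustration.

```python
from collections import Counter


def count_substrings(documents, max_len):
    """Count every character substring of length 1..max_len in each document."""
    counts = Counter()
    for doc in documents:
        for start in range(len(doc)):
            for length in range(1, max_len + 1):
                if start + length > len(doc):
                    break
                counts[doc[start:start + length]] += 1
    return counts


def maximal_frequent_sequences(documents, min_freq=2, max_len=12):
    """Return {sequence: frequency} for frequent sequences that pass an
    assumed maximality test: no one-character extension is as frequent."""
    # Count one extra character so extensions of max_len sequences are visible.
    counts = count_substrings(documents, max_len + 1)

    # For each sequence, record the highest frequency among its
    # one-character extensions (to the left or to the right).
    best_extension = Counter()
    for seq, freq in counts.items():
        if len(seq) >= 2:
            for shorter in (seq[:-1], seq[1:]):
                best_extension[shorter] = max(best_extension[shorter], freq)

    return {
        seq: freq
        for seq, freq in counts.items()
        if len(seq) <= max_len
        and freq >= min_freq
        and best_extension[seq] < freq
    }


# Hypothetical documents that share a common layout.
documents = [
    "Name: Alice\nAge: 30\n",
    "Name: Bob\nAge: 41\n",
    "Name: Carol\nAge: 25\n",
]
sequences = maximal_frequent_sequences(documents, min_freq=3)
```

With these sample documents, `"Name: "` survives because every longer sequence containing it is rarer, while `"Name:"` is dropped because `"Name: "` occurs just as often. The brute-force counting is quadratic in practice and only suits small inputs.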

First claim

Opening claim text (preview).

What is claimed is:

1. A method for mining text by a computer-based text mining system, comprising:

  obtaining, by the text mining system, a first frequent sequence of characters from a set of documents, the set of documents having structured contents according to a common rule, wherein the first frequent sequence satisfies a condition of maximality and the satisfying of the condition of maximality comprises:

    performing a first comparison, the first comparison comprising comparing a first occurrence frequency of the first frequent sequence to a second occurrence frequency of a second sequence, wherein the second sequence is longer than the first sequence and the second sequence contains the first sequence;

  determining that the first frequent sequence includes a symbol and the symbol comprises formatting data for a target document;

  decomposing, responsive to the determining the first frequent sequence includes a symbol, the first frequent sequence into a symbol part and a remaining part;

  evaluating, by the text mining system and based on the comparing, a first confidence of the first frequent sequence being a label expression, wherein the label expression represents a document part in the target document, the evaluating the first confidence comprises:

    calculating a primary confidence value for the first frequent sequence across the set of documents;

    computing a likelihood of the symbol being contained in the first frequent sequence observed in the target document; and

    adjusting the primary confidence value, resulting in a secondary confidence value for the first frequent sequence within the target document, wherein the adjusting the primary confidence value for the first frequent sequence to obtain the secondary confidence value is based on the likelihood of the symbol being contained in the first frequent sequence;

  determining that the first confidence is above a confidence threshold;

  extracting, in response to the determining the first confidence is above the confidence threshold and by the text mining system, one or more keywords from the target document based on the secondary confidence value of the first frequent sequence, wherein the extracting the one or more keywords further comprises:

    applying keyword extraction to the target document, resulting in a set of keywords;

    identifying that a first keyword included in the set of keywords overlaps with the first frequent sequence;

    removing, based on the determining and on the identifying, the first keyword from the set of keywords; and

    assigning a label relating to the first frequent sequence to a second keyword included in the set of keywords, the assigning based on positions in the target document where the first frequent sequence and the second keyword have appeared; and

  outputting the one or more keywords and the secondary confidence value.

2. The method of claim 1, wherein the obtaining the first frequent sequence further comprises filtering the first frequent sequence based on a filtering condition with respect to at least a number of documents containing the first frequent sequence.

3. The method of claim 1, wherein the obtaining the first frequent sequence further comprises enumerating an additional sequence based on a predetermined enumeration rule.

4. The method of claim 1, wherein obtaining a first frequent sequence further comprises:

  concatenating characters of each of the set of documents into a character array;

  enumerating, based on the character array, each of a set of enumerated sequences observed in the set of documents with an occurrence frequency of each enumerated sequence observed in the set of documents;

  performing a second comparison, the second comparison including comparing a third occurrence frequency of a first enumerated sequence from the set of enumerated sequences to a fourth occurrence frequency of a longer enumerated sequence containing the first enumerated sequence; and

  designating, based on the second comparison, the first enumerated sequence as the first frequent sequence.

5. The method of claim 1, wherein the first confidence is evaluated based on:

  the first occurrence frequency of the first frequent sequence;

  a fifth occurrence frequency of a substring of the first frequent sequence;

  a sixth occurrence frequency of a longer sequence containing the first frequent sequence; and

  the first confidence is calculated by:

$$\mathrm{conf}(s) = \mathrm{conf}_1(s) \cdot \mathrm{conf}_2(s)$$

  where:

$$\mathrm{conf}_1(s) = \frac{\mathrm{freq}(s)}{\mathrm{freq}(x)} \quad \text{and} \quad \mathrm{conf}_2(s) = \left(1 - \frac{\mathrm{freq}(y)}{b \cdot \mathrm{freq}(s)}\right)$$

  where conf(s) is the first confidence, the freq(s) is the first occurrence frequency, the freq(x) is the fifth occurrence frequency, the freq(y) is the sixth occurrence frequency, and the b is a weighting factor.

6. The method of claim 1, wherein: each document is written in a natural language; and each frequent sequence is a frequently occurring character sequence.

7. The method of claim 1, wherein the secondary confidence is increased in response to determining the symbol is included in the first frequent sequence in a target document.

8. The method of claim 1, wherein the keyword comprises a key, a value, and a category label, the category label is associated with the first frequent sequence, and the category label is the remaining part of the first frequent sequence.

9. A computer based text mining system, comprising: a memory; and a processor coupled to the memory, the processor configured to cause the text mining system to: obtain, by the text mining system, a first frequent sequence of characters from a set of documents, the set of documents having structured contents according to a common rule, wherein the first frequent sequence satisfies a condition of maximality and satisfying the condition of maximality comprises: performing a first comparison, the first comparison including comparing a first occurrence frequency of the first frequent sequence to a
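
The confidence formula in claim 5 can be worked through numerically. The sketch below assumes the flattened formula reads conf(s) = conf1(s) · conf2(s), with conf1(s) = freq(s) / freq(x) and conf2(s) = 1 - freq(y) / (b · freq(s)). The function name `label_confidence`, the default `b = 2.0`, and the sample frequencies are hypothetical; the claim does not give a value for the weighting factor.

```python
def label_confidence(freq_s: int, freq_x: int, freq_y: int, b: float = 2.0) -> float:
    """Confidence that sequence s is a label expression, as read from claim 5.

    freq_s: occurrence frequency of the frequent sequence s
    freq_x: occurrence frequency of a substring x of s
    freq_y: occurrence frequency of a longer sequence y containing s
    b:      weighting factor (the default here is an assumption)
    """
    if freq_s <= 0 or freq_x <= 0 or b <= 0:
        raise ValueError("freq_s, freq_x and b must be positive")
    # conf1 falls when the substring x also occurs often outside of s.
    conf1 = freq_s / freq_x
    # conf2 falls when a longer sequence y accounts for many occurrences of s.
    conf2 = 1.0 - freq_y / (b * freq_s)
    return conf1 * conf2


# Hypothetical counts: s occurs 3 times, its substring x 6 times,
# and a longer sequence y containing s once.
example = label_confidence(freq_s=3, freq_x=6, freq_y=1, b=2.0)
```

With these numbers, conf1 = 3/6 and conf2 = 1 - 1/6, so the product is 5/12. A sequence whose substring never occurs outside it and that has no frequent extension scores 1.0.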

Assignees

Inventors

Classifications

  • Document management systems · CPC title

  • Semantic analysis · CPC title

  • Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

  • G06F40/279 (Primary): Recognition of textual entities · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12277389B2 cover?
Frequent sequences extracted from a set of documents according to a common rule are obtained. Based on comparing occurrence frequencies of various sequences, confidence of the first frequent sequence being a label expression representing a document part in a target document is evaluated. Keywords are extracted from the target document based on evaluation of the confidence.
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/279. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 15, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).