Identifying section headings in a document
US-2020320170-A1 · Oct 8, 2020 · US
US12277389B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12277389-B2 |
| Application number | US-202117315447-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 10, 2021 |
| Priority date | May 10, 2021 |
| Publication date | Apr 15, 2025 |
| Grant date | Apr 15, 2025 |
Official abstract text for this publication.
Frequent sequences extracted from a set of documents according to a common rule are obtained. Based on comparing occurrence frequencies of various sequences, confidence of the first frequent sequence being a label expression representing a document part in a target document is evaluated. Keywords are extracted from the target document based on evaluation of the confidence.
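The first step in the abstract, obtaining frequent sequences and comparing their occurrence frequencies, can be illustrated with a small sketch. It is not the patented implementation. It assumes that a sequence counts as maximal when every one-character extension of it occurs strictly less often, and it counts substrings per document rather than over a concatenated character array. The function name `maximal_frequent_sequences` and the parameters `min_freq`, `min_len`, and `max_len` are hypothetical.

```python
from collections import Counter


def maximal_frequent_sequences(docs, min_freq=2, min_len=2, max_len=8):
    """Return {sequence: frequency} for frequent, maximal character sequences.

    Assumption: a sequence is "maximal" when no one-character extension of it
    (to the left or to the right) occurs as often as the sequence itself.
    """
    counts = Counter()
    for doc in docs:
        for start in range(len(doc)):
            # Count one extra character of length so that every reported
            # sequence also has its one-character extensions counted.
            for length in range(min_len, max_len + 2):
                if start + length > len(doc):
                    break
                counts[doc[start:start + length]] += 1

    # For each sequence, record the highest frequency among its extensions.
    best_extension = Counter()
    for seq, freq in counts.items():
        if len(seq) > min_len:
            for shorter in (seq[:-1], seq[1:]):
                best_extension[shorter] = max(best_extension[shorter], freq)

    return {
        seq: freq
        for seq, freq in counts.items()
        if len(seq) <= max_len
        and freq >= min_freq
        and best_extension[seq] < freq
    }


# Three toy documents that follow a common "Label: value" rule.
docs = [
    "Name: Alice\nAge: 30\n",
    "Name: Bob\nAge: 41\n",
    "Name: Carol\nAge: 27\n",
]
labels = maximal_frequent_sequences(docs, min_freq=3)
```

In this toy corpus, `"Name: "` is kept because each longer sequence containing it occurs only once, while `"Name:"` is dropped because `"Name: "` occurs equally often. Checking one-character extensions is enough here, since a longer sequence can never occur more often than a shorter sequence it contains.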
Opening claim text (preview).
What is claimed is:

1. A method for mining text by a computer-based text mining system, comprising:
obtaining, by the text mining system, a first frequent sequence of characters from a set of documents, the set of documents having structured contents according to a common rule, wherein the first frequent sequence satisfies a condition of maximality and the satisfying of the condition of maximality comprises:
performing a first comparison, the first comparison comprising comparing a first occurrence frequency of the first frequent sequence to a second occurrence frequency of a second sequence, wherein the second sequence is longer than the first sequence and the second sequence contains the first sequence;
determining that the first frequent sequence includes a symbol and the symbol comprises formatting data for a target document;
decomposing, responsive to the determining the first frequent sequence includes a symbol, the first frequent sequence into a symbol part and a remaining part;
evaluating, by the text mining system and based on the comparing, a first confidence of the first frequent sequence being a label expression, wherein the label expression represents a document part in the target document, the evaluating the first confidence comprises:
calculating a primary confidence value for the first frequent sequence across the set of documents;
computing a likelihood of the symbol being contained in the first frequent sequence observed in the target document; and
adjusting the primary confidence value, resulting in a secondary confidence value for the first frequent sequence within the target document, wherein the adjusting the primary confidence value for the first frequent sequence to obtain the secondary confidence value is based on the likelihood of the symbol being contained in the first frequent sequence;
determining that the first confidence is above a confidence threshold;
extracting, in response to the determining the first confidence is above the confidence threshold and by the text mining system, one or more keywords from the target document based on the secondary confidence value of the first frequent sequence, wherein the extracting the one or more keywords further comprises:
applying keyword extraction to the target document, resulting in a set of keywords;
identifying that a first keyword included in the set of keywords overlaps with the first frequent sequence;
removing, based on the determining and on the identifying, the first keyword from the set of keywords; and
assigning a label relating to the first frequent sequence to a second keyword included in the set of keywords, the assigning based on positions in the target document where the first frequent sequence and the second keyword have appeared; and
outputting the one or more keywords and the secondary confidence value.

2. The method of claim 1, wherein the obtaining the first frequent sequence further comprises filtering the first frequent sequence based on a filtering condition with respect to at least a number of documents containing the first frequent sequence.

3. The method of claim 1, wherein the obtaining the first frequent sequence further comprises enumerating an additional sequence based on a predetermined enumeration rule.

4. The method of claim 1, wherein obtaining a first frequent sequence further comprises:
concatenating characters of each of the set of documents into a character array;
enumerating, based on the character array, each of a set of enumerated sequences observed in the set of documents with an occurrence frequency of each enumerated sequence observed in the set of documents;
performing a second comparison, the second comparison including comparing a third occurrence frequency of a first enumerated sequence from the set of enumerated sequences to a fourth occurrence frequency of a longer enumerated sequence containing the first enumerated sequence; and
designating, based on the second comparison, the first enumerated sequence as the first frequent sequence.

5. The method of claim 1, wherein the first confidence is evaluated based on:
the first occurrence frequency of the first frequent sequence;
a fifth occurrence frequency of a substring of the first frequent sequence;
a sixth occurrence frequency of a longer sequence containing the first frequent sequence; and
the first confidence is calculated by:

$$\mathrm{conf}(s) = \mathrm{conf}_1(s) \cdot \mathrm{conf}_2(s)$$

where:

$$\mathrm{conf}_1(s) = \frac{\mathrm{freq}(s)}{\mathrm{freq}(x)} \quad \text{and} \quad \mathrm{conf}_2(s) = 1 - \frac{\mathrm{freq}(y)}{b \cdot \mathrm{freq}(s)}$$

where conf(s) is the first confidence, the freq(s) is the first occurrence frequency, the freq(x) is the fifth occurrence frequency, the freq(y) is the sixth occurrence frequency, and the b is a weighting factor.

6. The method of claim 1, wherein: each document is written in a natural language; and each frequent sequence is a frequently occurring character sequence.

7. The method of claim 1, wherein the secondary confidence is increased in response to determining the symbol is included in the first frequent sequence in a target document.

8. The method of claim 1, wherein the keyword comprises a key, a value, and a category label, the category label is associated with the first frequent sequence, and the category label is the remaining part of the first frequent sequence.

9. A computer based text mining system, comprising: a memory; and a processor coupled to the memory, the processor configured to cause the text mining system to:
obtain, by the text mining system, a first frequent sequence of characters from a set of documents, the set of documents having structured contents according to a common rule, wherein the first frequent sequence satisfies a condition of maximality and satisfying the condition of maximality comprises:
performing a first comparison, the first comparison including comparing a first occurrence frequency of the first frequent sequence to a
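The confidence calculation recited in claim 5 can be worked through numerically. The sketch below is illustrative only and reads the flattened second factor as one minus freq(y) divided by (b times freq(s)), which is an interpretation of the extracted text. The function name `label_confidence` and the sample counts are hypothetical.

```python
def label_confidence(freq_s, freq_x, freq_y, b=1.0):
    """Compute conf(s) = conf1(s) * conf2(s) as read from claim 5.

    freq_s: occurrence frequency of the frequent sequence s
    freq_x: occurrence frequency of a substring x of s
    freq_y: occurrence frequency of a longer sequence y containing s
    b:      weighting factor (assumed positive)
    """
    if freq_s <= 0 or freq_x <= 0 or b <= 0:
        raise ValueError("freq_s, freq_x and b must be positive")
    # conf1: how often the substring x appears as part of s.
    conf1 = freq_s / freq_x
    # conf2: penalty when a longer sequence y accounts for much of s.
    conf2 = 1.0 - freq_y / (b * freq_s)
    return conf1 * conf2


# Hypothetical counts: s occurs 8 times, its substring x 16 times,
# and a longer sequence y containing s occurs 2 times.
score = label_confidence(freq_s=8, freq_x=16, freq_y=2, b=1.0)
# conf1 = 8 / 16 = 0.5, conf2 = 1 - 2 / 8 = 0.75, so score = 0.375
```

Under this reading, the score falls when the substring x often appears outside s, and when a longer sequence y explains most occurrences of s. A larger b weakens the second penalty.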
Document management systems · CPC title
Semantic analysis · CPC title
Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title
Recognition of textual entities · CPC title