Oilfield data file classification and information processing systems
US-2021233008-A1 · Jul 29, 2021 · US
US12437570B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12437570-B2 |
| Application number | US-202218260526-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 7, 2022 |
| Priority date | Jan 8, 2021 |
| Publication date | Oct 7, 2025 |
| Grant date | Oct 7, 2025 |
Official abstract text for this publication.
A method involves extracting, from a file comprising an unstructured oilfield document, terms; calculating term frequency inverse document frequency (TF-IDF) of the terms to generate an input vector; executing a document content classification model on the input vector to generate a document content classification of the unstructured oilfield document; and extracting table information from a table in the unstructured oilfield document. The method further involves storing, with the file in storage, the document content classification and the table information.
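The classification step in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the tokenizer, the smoothed IDF formula, the nearest-centroid model, the softmax used to turn similarity scores into per-class probabilities, and the class names and sample texts are all assumptions made for illustration. The publication only states that TF-IDF vectors feed a document content classification model.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep alphabetic runs as terms (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency computed from the training documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse, L2-normalised TF-IDF vector; terms unseen in training are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()}


def train_centroids(docs, labels, idf):
    """Average the TF-IDF vectors of each class (a stand-in for the trained model)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for term, value in tfidf_vector(doc, idf).items():
            sums[label][term] += value
    return {
        label: {t: v / counts[label] for t, v in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, idf, centroids):
    """Return a pseudo-probability per class via softmax over dot-product scores."""
    vec = tfidf_vector(text, idf)
    scores = {
        label: sum(v * centroid.get(t, 0.0) for t, v in vec.items())
        for label, centroid in centroids.items()
    }
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


# Hypothetical training set: two made-up content classes with two documents each.
docs = [
    "daily drilling report bit depth mud weight rop",
    "drilling report casing depth mud pump pressure",
    "well log gamma ray resistivity porosity curve",
    "log header gamma ray neutron density curve",
]
labels = ["drilling_report", "drilling_report", "well_log", "well_log"]

idf = fit_idf(docs)
centroids = train_centroids(docs, labels, idf)
probs = classify("mud weight and bit depth from the drilling rig", idf, centroids)
best = max(probs, key=probs.get)
print(best, probs)
```

The result is one probability per class rather than a single label. A production system would more likely use an established library and a calibrated classifier; the hand-rolled version is used here only to keep the sketch self-contained.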
Opening claim text (preview).
What is claimed is:

1. A method comprising: obtaining, for a plurality of oilfield document content classes, a training set comprising a plurality of documents; calculating an inverse document frequency from the plurality of documents in the training set; calculating term frequency inverse document frequency (TF-IDF) of terms in the training data set to generate a plurality of TF-IDF vector results related to a plurality of document content classes; training the document content type classification model using the plurality of TF-IDF vector results; extracting, from a file comprising an unstructured oilfield document, a plurality of terms; calculating TF-IDF of the plurality of terms to generate an input vector; executing a document content classification model on the input vector to generate a document content classification of unstructured oilfield document; extracting table information from a table in the unstructured oilfield document; and storing, with the file in storage, the document content classification and the table information.

2. The method of claim 1, wherein the document content classification comprises: a plurality of document content classes each associated with a corresponding probability of the unstructured oilfield document being in the document content class.

3. The method of claim 1, wherein extracting table information comprises: detecting a table in the unstructured oilfield document; generating a bounding box around the table; detecting a plurality of rows and a plurality of columns of the table using the bounding box; extracting contents from the plurality of rows and the plurality of columns; interrelating the contents in the plurality of rows to obtain related contents; and storing the related contents in a comma separated value file.

4. The method of claim 3, further comprising: obtaining, from a table control file, a table parameter of the table, wherein the table parameter specifies whether the table comprises a plurality of vertical lines, detecting the plurality of vertical lines in the table based on the table parameter; and wherein detecting the plurality of columns is performed using the plurality of vertical lines.

5. The method of claim 3, further comprising: obtaining, from a table control file, a table parameter of the table, wherein the table parameter specifies whether the table comprises a plurality of horizontal lines, detecting the plurality of horizontal lines in the table based on the table parameter; and wherein detecting the plurality of columns is performed using the plurality of horizontal lines.

6. The method of claim 1, further comprising: obtaining a control file comprising: a model specification of the document type classification model, and a data extraction control file path specifying a location to store the document content classification and the table information.

7. The method of claim 1, further comprising: extracting file metadata of the file; and cataloging the unstructured oilfield document using the file metadata.

8. A system comprising: memory; and a processor for executing computer readable code configured to perform operations comprising: obtaining, for a plurality of oilfield document content classes, a training set comprising a plurality of documents; calculating an inverse document frequency from the plurality of documents in the training set; calculating term frequency inverse document frequency (TF-IDF) of terms in the training data set to generate a plurality of TF-IDF vector results related to a plurality of document content classes; training the document content type classification model using the plurality of TF-IDF vector results; extracting, from a file comprising an unstructured oilfield document, a plurality of terms, calculating TF-IDF of the plurality of terms to generate an input vector, executing a document content classification model on the input vector to generate a document content classification of unstructured oilfield document, extracting table information from a table in the unstructured oilfield document, and storing, with the file in storage, the document content classification and the table information.

9. The system of claim 8, wherein the document content classification comprises: a plurality of document content classes each associated with a corresponding probability of the unstructured oilfield document being in the document content class.

10. The system of claim 8, wherein extracting table information comprises: detecting a table in the unstructured oilfield document; generating a bounding box around the table; detecting a plurality of rows and a plurality of columns of the table using the bounding box; extracting contents from the plurality of rows and the plurality of columns; interrelating the contents in the plurality of rows to obtain related contents; and storing the related contents in a comma separated value file.

11. The system of claim 10, the operations further comprising: obtaining, from a table control file, a table parameter of the table, wherein the table parameter specifies whether the table comprises a plurality of vertical lines, detecting the plurality of vertical lines in the table based on the table parameter; and wherein detecting the plurality of columns is performed using the plurality of vertical lines.

12. The system of claim 10, the operations further comprising: obtaining, from a table control file, a table parameter of the table, wherein the table parameter specifies whether the table comprises a plurality of horizontal lines, detecting the plurality of horizontal lines in the table based on the table parameter; and wherein detecting the plurality of columns is performed using the plurality of horizontal lines.

13. The system of claim 8, the operations further comprising: obtaining a control file comprising: a model specification of the document type classification model, and a data extraction control file path specifying a location to store the document content classification and the table information.

14. The system of claim 8, the operations further comprising: extracting file metadata of the file; and cataloging the unstructured oilfield document using the file metadata.

15. A non-transitory computer readable medium comprising instructions that, when executed by a computer processor, perform operations comprising: obtaining, for a plurality of oilfield document content classes, a training set comprising a plurality of documents; calculating an inverse document frequency from the plurality of documents in the training set; calculating term frequency inverse document frequency (TF-IDF) of terms in the training data set to generate a plurality of TF-IDF vector results related to a plurality of document content classes; training the document content type classification model using the plurality of TF-IDF vector results; extracting, from a file comprising an unstructured oilfield document, a plurality of terms; calculating TF-IDF of the plurality of terms to generate an input vector; executing a document content classification model on the input vector to generate a document content classification of unstructured oilfield document; extracting table information from a table in the unstructured oilfield document; and storing, with the file in storage, the document content classification and the table information.

16. The non-transitory computer readable medium of claim 15, wherein the document content classification comprises: a plurality of document content classes each associated with a corresponding p
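The table-extraction steps recited in claims 3 and 4 (bounding box, column detection from vertical lines, row grouping, CSV output) can be sketched as follows. This is an illustrative reading only, not the patented implementation: the input is assumed to be word centres already produced by some OCR or PDF parser, the vertical line positions are assumed to be already detected, and the function names, the `row_tol` tolerance, and the sample coordinates are all hypothetical. "Interrelating the contents in the plurality of rows" is interpreted here simply as keeping the cells of one row together.

```python
import csv
import io
from bisect import bisect_right


def words_to_table(words, bbox, vertical_lines, row_tol=5.0):
    """Group words into rows and columns.

    words: list of (text, x_center, y_center) tuples.
    bbox: (x0, y0, x1, y1) bounding box around the detected table.
    vertical_lines: x positions of detected vertical ruling lines.
    row_tol: maximum y distance from a row's first word to stay in that row.
    """
    x0, y0, x1, y1 = bbox
    # Keep only the words that fall inside the table's bounding box.
    inside = [w for w in words if x0 <= w[1] <= x1 and y0 <= w[2] <= y1]
    # Lines strictly inside the box act as column separators.
    seps = sorted(x for x in vertical_lines if x0 < x < x1)
    n_cols = len(seps) + 1

    # Pass 1: cluster words into rows by their y coordinate.
    rows = []  # each entry: [anchor_y, [words in the row]]
    for word in sorted(inside, key=lambda w: w[2]):
        if not rows or word[2] - rows[-1][0] > row_tol:
            rows.append([word[2], []])
        rows[-1][1].append(word)

    # Pass 2: within each row, read left to right and assign a column.
    table = []
    for _, row_words in rows:
        cells = [[] for _ in range(n_cols)]
        for text, x, _y in sorted(row_words, key=lambda w: w[1]):
            cells[bisect_right(seps, x)].append(text)
        table.append([" ".join(cell) for cell in cells])
    return table


def to_csv(table):
    """Serialise the rows as comma separated values."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()


# Hypothetical page content: a two-column table plus a footer outside the box.
words = [
    ("Depth", 20, 10), ("Mud", 70, 10.4), ("Weight", 95, 10),
    ("1500", 20, 30), ("9.2", 80, 30),
    ("footer", 20, 200),
]
table = words_to_table(words, bbox=(0, 0, 120, 50), vertical_lines=[0, 50, 120])
csv_text = to_csv(table)
print(csv_text)
```

Sorting by x only after the rows are formed keeps multi-word cells such as "Mud Weight" in reading order even when the words' y coordinates jitter slightly. For a table without vertical lines, which is the case the table parameter in claim 4 distinguishes, the separators would have to be inferred another way, for example from whitespace gaps.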
Recognition assisted with metadata · CPC title
Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text · CPC title
based on the type of document · CPC title
Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables · CPC title
Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title