System and Method for Extracting Table Data from Text Documents Using Machine Learning

US2016104077A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2016104077-A1
Application number  US-201514879349-A
Country             US
Kind code           A1
Filing date         Oct 9, 2015
Priority date       Oct 10, 2014
Publication date    Apr 14, 2016
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for extracting table data from text documents using machine learning are provided. The systems and methods comprise electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features, processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row, processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables, and generating an output of the classified whitespace features and storing the output in a digital file.
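The two-stage flow in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: a whitespace feature is treated as a run of two or more spaces, both "models" are simple hand-written rules standing in for trained classifiers, and the names `whitespace_features`, `classify_row`, `classify_gap`, `extract`, and `MISSING_GAP_WIDTH` are hypothetical.

```python
import json
import re
from typing import Dict, List, Tuple

MISSING_GAP_WIDTH = 10  # hypothetical threshold, not from the patent


def whitespace_features(row: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of each run of 2+ spaces inside a row."""
    return [m.span() for m in re.finditer(r" {2,}", row.rstrip())]


def classify_row(row: str) -> str:
    """Stand-in for the first model: rows without digits are called headers."""
    return "data" if any(ch.isdigit() for ch in row) else "header"


def classify_gap(span: Tuple[int, int], row_label: str) -> str:
    """Stand-in for the second model; its decision depends on the row label."""
    start, end = span
    if row_label == "data" and end - start >= MISSING_GAP_WIDTH:
        return "missing_cell"
    return "column_separator"


def extract(rows: List[str], out_path: str) -> List[Dict]:
    """Classify rows, then gaps conditional on the row label, and save as JSON."""
    records = []
    for index, row in enumerate(rows):
        label = classify_row(row)
        gaps = [
            {"start": s, "end": e, "label": classify_gap((s, e), label)}
            for s, e in whitespace_features(row)
        ]
        records.append({"row": index, "row_label": label, "gaps": gaps})
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    return records
```

Calling `extract(rows, "gaps.json")` on the lines of a plain-text table returns one record per row and writes the same records to the file. In a real system the two rule functions would be replaced by trained models; the point of the sketch is only the ordering, in which each gap is labeled after, and conditional on, the label of its row.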

First claim

Opening claim text (preview).

What is claimed is:

1. A method for electronically extracting table data from text documents using machine learning, comprising: electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features; processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row; processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables; and generating an output of the classified whitespace features and storing the output in a digital file.

2. The method of claim 1, wherein the first computer model comprises a random fields classifier.

3. The method of claim 2, wherein the random fields classifier is trained using a set of training tables.

4. The method of claim 1, wherein the second computer model comprises a multinomial logistic classifier.

5. The method of claim 4, wherein the multinomial logistic classifier is trained using a set of training tables.

6. The method of claim 1, wherein the information missing comprises a missing cell.

7. A non-transitory computer-readable medium having computer-readable instructions stored thereon which, when executed by a computer system, cause the computer system to perform the steps of: electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features; processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row; processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables; and generating an output of the classified whitespace features and storing the output in a digital file.

8. The non-transitory computer-readable medium of claim 7, wherein the first computer model comprises a random fields classifier.

9. The non-transitory computer-readable medium of claim 8, wherein the random fields classifier is trained using a set of training tables.

10. The non-transitory computer-readable medium of claim 7, wherein the second computer model comprises a multinomial logistic classifier.

11. The non-transitory computer-readable medium of claim 10, wherein the multinomial logistic classifier is trained using a set of training tables.

12. The non-transitory computer-readable medium of claim 7, wherein the information missing comprises a missing cell.

13. A system for electronically extracting table data from text documents using machine learning, comprising: a computer system for electronically receiving a document having one or more tables, each table having one or more whitespace features; an engine executed by the computer system, the engine: processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row; processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables; and generating an output of the classified whitespace features and storing the output in a digital file.

14. The system of claim 13, wherein the first computer model comprises a random fields classifier.

15. The system of claim 14, wherein the random fields classifier is trained using a set of training tables.

16. The system of claim 13, wherein the second computer model comprises a multinomial logistic classifier.

17. The system of claim 16, wherein the multinomial logistic classifier is trained using a set of training tables.

18. The system of claim 13, wherein the information missing comprises a missing cell.
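Claims 4, 10, and 16 name a multinomial logistic classifier as the second model. The Python sketch below shows what such a classifier computes for a single whitespace feature: a softmax over per-class linear scores, with the row label from the first model entering as an input feature. The label set `CLASSES`, the feature vector, and the numbers in `WEIGHTS` are hand-set placeholders chosen for illustration; per the claims, a real model would be fitted on a set of training tables, and the patent text on this page does not specify these details.

```python
import math
from typing import List

# Hypothetical label set for a whitespace feature (not taken from the patent).
CLASSES = ["column_separator", "missing_cell", "within_cell"]

# One weight row per class over the features [bias, gap width, is_header].
# Hand-set placeholder values; a trained model would supply fitted weights.
WEIGHTS = [
    [2.0, 0.0, 1.0],    # column_separator
    [-4.0, 0.6, -3.0],  # missing_cell: favored by wide gaps in data rows
    [4.0, -1.5, 0.0],   # within_cell: favored by very narrow gaps
]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities; subtract the max for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gap_features(width: int, row_label: str) -> List[float]:
    """Feature vector for one gap; the row label comes from the first model."""
    return [1.0, float(width), 1.0 if row_label == "header" else 0.0]


def predict_proba(width: int, row_label: str) -> List[float]:
    """Class probabilities for one gap under the placeholder weights."""
    x = gap_features(width, row_label)
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    return softmax(scores)


def classify_gap(width: int, row_label: str) -> str:
    """Return the most probable class label for one gap."""
    probs = predict_proba(width, row_label)
    return CLASSES[probs.index(max(probs))]
```

With these placeholder weights, a 12-character gap is labeled `missing_cell` in a data row but `column_separator` in a header row, which illustrates the "conditional on classification of each row" wording of the claims.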

Assignees

Univ Columbia

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • of tables; using ruled lines · CPC title

  • Handling of whitespace · CPC title

  • G06N99/005 · Primary

    Physics · mapped topic

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016104077A1 cover?
Systems and methods for extracting table data from text documents using machine learning are provided. The systems and methods comprise electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features, processing the document using a first computer model executed by the computer system to classify each row of the one or more t…
Who is the assignee on this patent?
Univ Columbia
What technology area does this patent fall under?
Primary CPC classification G06N99/005. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 14, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).