What technology area does this patent fall under?

Primary CPC classification G06N99/005. Mapped technology areas include Physics.

When was this patent published?

Publication date: Apr 14, 2016 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and Method for Extracting Table Data from Text Documents Using Machine Learning

US2016104077A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2016104077-A1
Application number	US-201514879349-A
Country	US
Kind code	A1
Filing date	Oct 9, 2015
Priority date	Oct 10, 2014
Publication date	Apr 14, 2016
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for extracting table data from text documents using machine learning are provided. The systems and methods comprise electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features, processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row, processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables, and generating an output of the classified whitespace features and storing the output in a digital file.
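The two-stage flow in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the trained models are replaced with simple hand-written heuristics, and every name here (`classify_row`, `classify_gap`, `extract`, `save`, the `separator` and `missing_cell` labels, the sample table) is hypothetical. A "whitespace feature" is assumed to be a run of two or more spaces, which the page does not define.

```python
import json
import re
from typing import Dict, List, Tuple

GAP = re.compile(r" {2,}")   # assumption: a whitespace feature is a run of 2+ spaces
TOKEN = re.compile(r"\S+")   # assumption: header cells are single tokens


def classify_row(row: str) -> str:
    """Stage 1 stand-in: a row without digits is treated as a header row."""
    return "data" if any(ch.isdigit() for ch in row) else "header"


def classify_gap(gap: Tuple[int, int], row_label: str,
                 header_spans: List[Tuple[int, int]]) -> str:
    """Stage 2 stand-in, conditional on the row label from stage 1."""
    if row_label == "header":
        return "separator"
    start, end = gap
    # In a data row, a gap that fully covers a header column suggests a missing cell.
    covers = any(start <= h_start and h_end <= end for h_start, h_end in header_spans)
    return "missing_cell" if covers else "separator"


def extract(table_text: str) -> List[Dict]:
    rows = [r.rstrip() for r in table_text.splitlines() if r.strip()]
    labels = [classify_row(r) for r in rows]
    header_spans = [m.span() for r, lab in zip(rows, labels)
                    if lab == "header" for m in TOKEN.finditer(r)]
    results = []
    for i, (row, label) in enumerate(zip(rows, labels)):
        for m in GAP.finditer(row):
            results.append({
                "row": i,
                "row_label": label,
                "start": m.start(),
                "length": m.end() - m.start(),
                "label": classify_gap(m.span(), label, header_spans),
            })
    return results


def save(results: List[Dict], path: str) -> None:
    """Store the classified whitespace features in a digital file (JSON here)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)


def fmt(a: str, b: str, c: str) -> str:
    """Lay out three cells in fixed-width columns for the sample table."""
    return f"{a:<8}{b:<7}{c}"


# Sample table: the second data row has no value in the Qty column.
TABLE = "\n".join([
    fmt("Name", "Qty", "Price"),
    fmt("Bolt", "10", "0.25"),
    fmt("Nut", "", "0.10"),
])
```

Calling `extract(TABLE)` yields five whitespace features. The wide gap in the last row spans the whole `Qty` column, so the stand-in rule labels it `missing_cell`; the others are labeled `separator`. Passing the row label into `classify_gap` mirrors the "conditional on classification of each row" wording, though the heuristics themselves are only placeholders for trained models.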

First claim

Opening claim text (preview).

What is claimed is:

1. A method for electronically extracting table data from text documents using machine learning, comprising: electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features; processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row; processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables; and generating an output of the classified whitespace features and storing the output in a digital file.
2. The method of claim 1, wherein the first computer model comprises a random fields classifier.
3. The method of claim 2, wherein the random fields classifier is trained using a set of training tables.
4. The method of claim 1, wherein the second computer model comprises a multinomial logistic classifier.
5. The method of claim 4, wherein the multinomial logistic classifier is trained using a set of training tables.
6. The method of claim 1, wherein the information missing comprises a missing cell.
7. A non-transitory computer-readable medium having computer-readable instructions stored thereon which, when executed by a computer system, cause the computer system to perform the steps of: electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features; processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row; processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables; and generating an output of the classified whitespace features and storing the output in a digital file.
8. The non-transitory computer-readable medium of claim 7, wherein the first computer model comprises a random fields classifier.
9. The non-transitory computer-readable medium of claim 8, wherein the random fields classifier is trained using a set of training tables.
10. The non-transitory computer-readable medium of claim 7, wherein the second computer model comprises a multinomial logistic classifier.
11. The non-transitory computer-readable medium of claim 10, wherein the multinomial logistic classifier is trained using a set of training tables.
12. The non-transitory computer-readable medium of claim 7, wherein the information missing comprises a missing cell.
13. A system for electronically extracting table data from text documents using machine learning, comprising: a computer system for electronically receiving a document having one or more tables, each table having one or more whitespace features; an engine executed by the computer system, the engine: processing the document using a first computer model executed by the computer system to classify each row of the one or more tables as a header row or a data row; processing the document using a second computer model executed by the computer system to classify each whitespace feature in each row conditional on classification of each row by the first computer model, the second computer model identifying whether a whitespace feature corresponds to information missing from the one or more tables; and generating an output of the classified whitespace features and storing the output in a digital file.
14. The system of claim 13, wherein the first computer model comprises a random fields classifier.
15. The system of claim 14, wherein the random fields classifier is trained using a set of training tables.
16. The system of claim 13, wherein the second computer model comprises a multinomial logistic classifier.
17. The system of claim 16, wherein the multinomial logistic classifier is trained using a set of training tables.
18. The system of claim 13, wherein the information missing comprises a missing cell.
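The dependent claims name a multinomial logistic classifier as the second model. As a rough illustration of what such a classifier computes, the Python sketch below scores one whitespace feature with a softmax over per-class linear scores. The class names, the three input features (a bias term, the gap length, and whether the row was labeled a data row), and the weights are all assumptions chosen by hand for illustration; the page gives no feature set, and nothing here is trained on training tables as the claims describe.

```python
import math
from typing import Dict, List

# Hypothetical label set for a whitespace feature.
CLASSES = ["separator", "missing_cell", "within_cell"]

# Hypothetical hand-set weights per class over [bias, gap_length, is_data_row].
WEIGHTS: Dict[str, List[float]] = {
    "separator":    [1.0, -0.1, 0.0],
    "missing_cell": [-4.0, 0.3, 2.0],
    "within_cell":  [2.5, -0.7, 0.0],
}


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities; subtracting the max keeps exp() stable."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(gap_length: int, row_label: str) -> Dict[str, float]:
    """Class probabilities for one whitespace feature, given its row's label."""
    x = [1.0, float(gap_length), 1.0 if row_label == "data" else 0.0]
    scores = [sum(w * xi for w, xi in zip(WEIGHTS[c], x)) for c in CLASSES]
    return dict(zip(CLASSES, softmax(scores)))


def predict(gap_length: int, row_label: str) -> str:
    """Most probable class for one whitespace feature."""
    probs = predict_proba(gap_length, row_label)
    return max(probs, key=probs.get)
```

With these illustrative weights, `predict(12, "data")` and `predict(12, "header")` disagree even though the gap length is identical, which is the sense in which the second model's output is conditional on the first model's row classification.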

Assignees

Univ Columbia

Inventors

Classifications

G06N7/01
Probabilistic graphical models, e.g. probabilistic networks · CPC title
G06F40/177
of tables; using ruled lines · CPC title
G06F40/163
Handling of whitespace · CPC title
G06N99/005 · Primary
Physics · mapped topic
G06F17/30011
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 55655673

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016104077A1 cover?: Systems and methods for extracting table data from text documents using machine learning are provided. The systems and methods comprise electronically receiving at a computer system a document having one or more tables, each table having one or more whitespace features, processing the document using a first computer model executed by the computer system to classify each row of the one or more t…
Who is the assignee on this patent?: Univ Columbia