What technology area does this patent fall under?

Primary CPC classification G06F40/154. Mapped technology areas include Physics.

When was this patent published?

Publication date Jun 9, 2015 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Electronic document source ingestion for natural language processing systems

US9053085B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9053085-B2
Application number	US-201213709413-A
Country	US
Kind code	B2
Filing date	Dec 10, 2012
Priority date	Dec 10, 2012
Publication date	Jun 9, 2015
Grant date	Jun 9, 2015

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The data store for a natural-language computing system may include information that originates from a plurality of different data sources—e.g., journals, websites, magazines, reference books, and the like. In one embodiment, the information or text from the data sources are converted into a single, shared format and stored as objects in a data store. In order to ingest the different documents with their respective formats, a natural language processing system may perform preprocessing to change the different formats into a normalized format. When a new text document is received, the text may be correlated to a particular properties file which includes instructions specifying how the preprocessor should interpret the received text. Based on these instructions, a preprocessor identifies relevant portions of the text document and assigns these portions to formatting elements in the normalized format. The text may then be stored in the objects based on this assignment.
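The ingestion flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the dictionary standing in for a properties file, the header strings, and the function name `normalize` are all assumptions made for illustration. The sketch scans a text document, recognizes the headers listed in the properties, and assigns the text under each header to a formatting element of a normalized format.

```python
# Hypothetical properties "file" for one source format, shown as a dict.
# Keys are headers as they appear in the source document; values name the
# normalized formatting element that the following text is assigned to.
JOURNAL_PROPERTIES = {
    "TITLE:": "title",
    "ABSTRACT:": "section",
    "BODY:": "section",
}


def normalize(text, properties):
    """Return (element, text) pairs in document order.

    Text that appears before the first recognized header is dropped.
    """
    normalized = []   # collected (element, text) pairs
    element = None    # formatting element for the header currently in effect
    buffer = []       # non-empty lines seen since that header

    def flush():
        # Emit the text collected under the current header, if any.
        if element is not None and buffer:
            normalized.append((element, " ".join(buffer)))

    for line in text.splitlines():
        stripped = line.strip()
        if stripped in properties:
            flush()
            element = properties[stripped]
            buffer = []
        elif stripped:
            buffer.append(stripped)
    flush()
    return normalized
```

A document from a different source would need only a different properties mapping, which is the point of keeping the format-specific knowledge outside the parsing code.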

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: a computer processor; and a memory containing a program that, when executed on the computer processor, performs an operation for processing data, comprising: receiving a plurality of electronic documents, wherein each electronic document is arranged according to a different, respective format comprising a plurality of headers; identifying a properties file associated with one of the electronic documents, the properties file defining a particular header of the respective format in the one electronic document, an action corresponding to a text portion associated with the particular header, and an extension class; instantiating a preprocessor for parsing the one electronic document based on the extension class, wherein the preprocessor is configured to parse only electronic documents arranged using the respective format; parsing the one electronic document to identify the particular header using the preprocessor; upon identifying the text portion associated with the particular header, performing the action to the text portion by assigning the text portion to a formatting element of a normalized format; and storing the text portion into a natural language processing (NLP) object based on the formatting element of the normalized format, wherein text in the NLP object is arranged based on the normalized format.

2. The system of claim 1, wherein the properties file is one of a plurality of properties files, wherein each properties file is associated with one of the respective formats of the electronic documents.

3. The system of claim 1, wherein the property file includes a plurality of formatting elements of the respective format, the plurality of formatting elements comprises a title and a section in the one electronic document.

4. The system of claim 1, wherein the NLP object comprises text portions retrieved from other ones of the plurality of electronic documents, wherein the text portions are assigned to the formatting element of the normalized format.

5. The system of claim 4, wherein the NLP object is a common analysis system (CAS) data structure.

6. The system of claim 1, further comprising: annotating the text in the NLP object for use in a natural-language computing system where the natural-language computing system uses the annotated text to communicate with a user.

7. The system of claim 1, wherein instantiating the preprocessor comprises: selecting a type of preprocessor based on the extension class, wherein each type of preprocessor corresponds to a different data source transmitting the plurality of electronic documents.

8. A computer program product comprising: a non-transitory computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code comprising computer-readable program code configured to: receive a plurality of electronic documents, wherein each electronic document is arranged according to a different, respective format comprising a plurality of headers; identify a properties file associated with one of the electronic documents, the properties file defining a particular header of the respective format in the one electronic document, an action corresponding to a text portion associated with the particular header, and an extension class; instantiate a preprocessor for parsing the one electronic document based on the extension class, wherein the preprocessor is configured to parse only electronic documents arranged using the respective format; parse the one electronic document to identify the particular header using the preprocessor; upon identifying the text portion associated with the particular header, perform the action to the text portion by assigning the text portion to a formatting element of a normalized format; and store the text portion into a natural language processing (NLP) object based on the formatting element of the normalized format, wherein text in the NLP object is arranged based on the normalized format.

9. The computer program product of claim 8, wherein the properties file is one of a plurality of properties files, wherein each properties file is associated with one of the respective formats of the electronic documents.

10. The computer program product of claim 8, wherein the property file includes a plurality of formatting elements of the respective format, the plurality of formatting elements comprises a title and a section in the one electronic document.

11. The computer program product of claim 8, wherein the NLP object comprises text portions retrieved from other ones of the plurality of electronic documents, wherein the text portions are assigned to the formatting element of the normalized format, and wherein the NLP object is a common analysis system (CAS) data structure.

12. The computer program product of claim 8, further comprising computer-readable program code configured to: annotate the text in the NLP object for use in a natural-language computing system where the natural-language computing system uses the annotated text to communicate with a user.

13. The computer program product of claim 8, wherein instantiating the preprocessor comprises computer-readable program code configured to: select a type of preprocessor based on the extension class, wherein each type of preprocessor corresponds to a different data source transmitting the plurality of electronic documents.
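The steps recited in claims 1 and 7 (a properties file naming a header, an action, and an extension class; a preprocessor instantiated from that extension class; storage into a shared NLP object) might be sketched as follows. Every name here is hypothetical: `Properties`, `NLPObject`, the two preprocessor classes, the `REGISTRY` lookup, and the two toy document formats are assumptions for illustration and do not come from the patent text. The "action" is modeled simply as the target formatting element.

```python
from dataclasses import dataclass, field


@dataclass
class Properties:
    """Hypothetical in-memory form of one properties file."""
    header: str           # header to look for in the source format
    element: str          # normalized formatting element the text is assigned to
    extension_class: str  # name of the preprocessor type to instantiate


@dataclass
class NLPObject:
    """Stand-in for the NLP object: maps each element to its text portions."""
    elements: dict = field(default_factory=dict)

    def store(self, element, text):
        self.elements.setdefault(element, []).append(text)


class PrefixPreprocessor:
    """Yields (element, text) for each line that starts with the header marker."""

    def __init__(self, props):
        self.props = props

    def marker(self):
        raise NotImplementedError

    def parse(self, document):
        marker = self.marker()
        for line in document.splitlines():
            if line.startswith(marker):
                yield self.props.element, line[len(marker):].strip()


class ColonHeaderPreprocessor(PrefixPreprocessor):
    """Handles a made-up format with lines like 'Header: text'."""

    def marker(self):
        return self.props.header + ":"


class BracketHeaderPreprocessor(PrefixPreprocessor):
    """Handles a made-up format with lines like '[Header] text'."""

    def marker(self):
        return "[" + self.props.header + "]"


# The extension class named in the properties selects the preprocessor type.
REGISTRY = {
    "ColonHeaderPreprocessor": ColonHeaderPreprocessor,
    "BracketHeaderPreprocessor": BracketHeaderPreprocessor,
}


def ingest(document, props, nlp_object):
    """Instantiate the preprocessor for this format and store what it finds."""
    preprocessor = REGISTRY[props.extension_class](props)
    for element, text in preprocessor.parse(document):
        nlp_object.store(element, text)
    return nlp_object
```

Calling `ingest` on documents from two sources with the same `NLPObject` places both titles under one normalized element, which loosely mirrors the arrangement in claim 4 where one NLP object holds text portions from several documents.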

Assignees

IBM

Inventors

Dubbels Joel C

Classifications

Code	Role	Description
G06F40/154	Primary	Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets (CPC title)
G06F40/205	Primary	Parsing (CPC title)
G06F17/227		Physics (mapped topic)
G06F17/2705	Primary	Physics (mapped topic)
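The "Physics" topic follows from the structure of a CPC symbol: the leading letter is the section, and section G is Physics. A minimal sketch of splitting a symbol into its parts is shown below. The function name `split_cpc` and the one-entry section table are assumptions for illustration, and the pattern covers only symbols shaped like the ones listed here.

```python
import re

# CPC section letters run A-H plus Y; only the section needed here is listed.
CPC_SECTIONS = {"G": "Physics"}

# section letter, two-digit class, subclass letter, main group, "/", subgroup
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,})$")


def split_cpc(symbol):
    """Split a CPC symbol such as 'G06F40/154' into its hierarchical parts."""
    match = CPC_PATTERN.match(symbol)
    if match is None:
        raise ValueError(f"not a recognized CPC symbol: {symbol!r}")
    section, cls, subclass, main_group, subgroup = match.groups()
    return {
        "section": section,
        "section_title": CPC_SECTIONS.get(section),  # None if not in the table
        "class": section + cls,
        "subclass": section + cls + subclass,
        "main_group": main_group,
        "subgroup": subgroup,
    }
```

For `G06F40/154` this gives section `G`, class `G06`, subclass `G06F`, main group `40`, and subgroup `154`.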

Patent family

Related publications grouped by family.

Patent family ID: 50882150

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9053085B2 cover?: The data store for a natural-language computing system may include information that originates from a plurality of different data sources—e.g., journals, websites, magazines, reference books, and the like. In one embodiment, the information or text from the data sources are converted into a single, shared format and stored as objects in a data store. In order to ingest the different documents w…
Who is the assignee on this patent?: IBM