What technology area does this patent fall under?

Primary CPC classification G06F40/30. Mapped technology areas include Physics.

When was this patent published?

Publication date May 15, 2018 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

Parallelizing semantically split documents for processing

US9971760B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9971760-B2
Application number	US-201414578545-A
Country	US
Kind code	B2
Filing date	Dec 22, 2014
Priority date	Dec 22, 2014
Publication date	May 15, 2018
Grant date	May 15, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an approach for parallelizing document processing in an information handling system, a processor receives a document, wherein the document includes text content. A processor extracts information from the text content, utilizing natural language processing and semantic analysis, to form tokenized semantic partitions, comprising a plurality of sub-documents. A processor schedules a plurality of concurrently executing threads to process the plurality of sub-documents.
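The flow in the abstract (partition, process concurrently, reassemble) can be illustrated with a minimal Python sketch. This is not the patented implementation: splitting on blank lines stands in for the natural language processing and semantic analysis step, and the function names `split_into_sub_documents`, `process`, and `process_document` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_sub_documents(text):
    """Assumed stand-in for semantic partitioning: split on blank lines.

    Each sub-document is returned with its position in the original text.
    """
    parts = [p.strip() for p in text.split("\n\n") if p.strip()]
    return list(enumerate(parts))


def process(sub_document):
    """Hypothetical per-partition work: count whitespace-separated tokens."""
    order, content = sub_document
    return order, len(content.split())


def process_document(text, max_workers=4):
    sub_documents = split_into_sub_documents(text)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process, sd) for sd in sub_documents]
        # Threads may finish in any order; the order annotation restores it.
        for future in as_completed(futures):
            order, token_count = future.result()
            results[order] = token_count
    return [results[i] for i in range(len(sub_documents))]


document = "First paragraph here.\n\nSecond one is a bit longer.\n\nThird."
token_counts = process_document(document)
```

Carrying the position alongside each sub-document is what lets the results be put back in document order even though the threads complete independently.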

First claim

Opening claim text (preview).

What is claimed is:

1. A computer program product for parallelizing document processing in an information handling system, the computer program product comprising: one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, executable by a computer processor that can perform reordering operations when executing program instructions in parallel, the program instructions comprising: program instructions to receive a document, wherein the document includes text content, given a particular granularity scope; program instructions to extract information from the text content, utilizing natural language processing and semantic analysis, to form tokenized semantic partitions, comprising a plurality of sub-documents, wherein: the tokenized semantic partitions each have a particular data type; the plurality of sub-documents are annotated to represent an order of occurrence within the document; and the annotated plurality of sub-documents allows reconstruction to the particular granularity scope at any point during the extraction; program instructions to reconstruct the document by scheduling a process for the annotated plurality of sub-documents, wherein: the scheduling drives each of the annotated sub-documents in parallel by using a memory barrier to enforce an ordering constraint on the annotated plurality of sub-documents based on a data dependent scheduling order using the data types of the sub-documents and a type dependency flow graph for the annotated sub-documents given the particular granularity scope, the dependency flow graph comprises information about which data types are dependent on other data types in a dependency order, and by using the type dependency flow graph, the sub-documents that have data types that do not depend upon each other can be driven in parallel and processed out of order of occurrence, while the sub-documents that have data types that depend on each other are constrained and processed according to the dependency order using the memory barrier; and retrieving, by one or more processors, numbered annotation data within the scheduled and annotated plurality of sub-documents, representing the order of occurrence, wherein the reconstructed document preserves the order of occurrence of previous extractions based on the retrieved numbered annotation data.

2. The computer program product of claim 1, wherein the plurality of sub-documents are separate components of a document.

3. The computer program product of claim 1, wherein the process is a plurality of concurrently executing threads.

4. The computer program product of claim 1, wherein each sub-document is processed using a data dependency workflow, containing annotator metadata, wherein the annotator metadata has a description of input types needed and output types produced; wherein the annotator metadata includes information about the particular granularity scope at which data is expected, wherein the expected data is in sentence, paragraph, and section form; and wherein the annotator metadata includes a scope partition indicator for the particular granularity scope.

5. The computer program product of claim 1, further comprising: program instructions, stored on the one or more computer readable storage media, to annotate each sub-document, wherein the annotation allows the determination of document domain, document layout, and document structural components; program instructions, stored on the one or more computer readable storage media, to store each annotated sub-document; and program instructions, stored on the one or more computer readable storage media, to reconstruct the document to the granularity scope using each sub document, based on information in the annotated sub-document.

6. The computer program product of claim 1, wherein the plurality of sub-documents are partitioned based on data type and scope of the text content.

7. The computer program product of claim 6, wherein scope of the text content includes word, sentence, and paragraph.

8. The computer program product of claim 1, further comprising: program instructions to capture a hierarchy in the extraction, wherein a first data type is dependent on a second data type and the first data type is only extracted when the second data type is available and has been previously extracted.

9. The computer program product of claim 8, wherein the type system: associates the first data type to a first value and the second data type to a second value; examines the flow of the first value and the second value; and prevents an operation from performing that expects the first value as an input when the second value is used as the input.

10. A computer system for parallelizing document processing in an information handling system, the computer system comprising: one or more computer processors, one or more computer readable storage media, and program instructions stored on the computer readable storage media for execution by at least one of the one or more processors, the program instructions comprising: program instructions to receive a document, wherein the document includes text content, given a particular granularity scope; program instructions to extract information from the text content, utilizing natural language processing and semantic analysis, to form tokenized semantic partitions, comprising a plurality of sub-documents, wherein: the tokenized semantic partitions each have a particular data type; the plurality of sub-documents are annotated to represent an order of occurrence within the document; and the annotated plurality of sub-documents allows reconstruction to the particular granularity scope at any point during the extraction; program instructions to reconstruct the document by scheduling a process for the annotated plurality of sub-documents, wherein: the scheduling drives each of the annotated sub-documents in parallel by using a memory barrier to enforce an ordering constraint on the annotated plurality of sub-documents based on a data dependent scheduling order using the data types of the sub-documents and a type dependency flow graph for the annotated sub-documents given the particular granularity scope, the dependency flow graph comprises information about which data types are dependent on other data types in a dependency order, and by using the type dependency flow graph, the sub-documents that have data types that do not depend upon each other can be driven in parallel and processed out of order of occurrence, while the sub-documents that have data types that depend on each other are constrained and processed according to the dependency order using the memory barrier; and retrieving, by one or more processors, numbered annotation data within the scheduled and annotated plurality of sub-documents, representing the order of occurrence, wherein the reconstructed document preserves the order of occurrence of previous extractions based on the retrieved numbered annotation data.

11. The computer system of claim 10, wherein the plurality of sub-documents are separate components of a document.

12. The computer system of claim 10, wherein the process is a plurality of concurrently executing threads.

13. The computer system of claim 10, wherein each sub-document is processed using a data dependency workflow, containing annotator metadata, wherein the annotator metadata has a description of input types needed and output types produced; wherein the annotator metadata includes information about the particular granularity scope at which data is expected, wherein the expected data is in sentence, paragraph, an
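The dependency-ordered scheduling described in claim 1 can be approximated with a short Python sketch. Several things here are assumptions made for illustration only: the data type names, the contents of `TYPE_DEPENDENCIES`, and the `SubDocument`, `process`, and `schedule` names are hypothetical, and waiting for each batch of threads to finish is a simplified stand-in for the memory barrier named in the claim, not an actual hardware or compiler barrier.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class SubDocument:
    order: int      # numbered annotation: position in the original document
    data_type: str  # used to look up dependencies in the flow graph
    text: str


# Hypothetical type dependency flow graph: type -> types it depends on.
TYPE_DEPENDENCIES = {
    "token": set(),
    "layout": set(),
    "sentence": {"token"},
    "entity": {"sentence"},
}


def process(sub_document):
    """Placeholder work so the example has an observable result."""
    return sub_document.order, sub_document.text.upper()


def schedule(sub_documents, max_workers=4):
    sorter = TopologicalSorter(TYPE_DEPENDENCIES)
    sorter.prepare()
    processed, waves = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready_types = sorter.get_ready()  # types with no pending dependency
            waves.append(sorted(ready_types))
            batch = [sd for sd in sub_documents if sd.data_type in ready_types]
            # Independent types run in parallel; draining the map is the
            # synchronization point before dependent types are released.
            for order, result in pool.map(process, batch):
                processed[order] = result
            sorter.done(*ready_types)
    # Reconstruct in order of occurrence using the numbered annotations.
    return [processed[i] for i in sorted(processed)], waves


sub_documents = [
    SubDocument(0, "entity", "ibm"),
    SubDocument(1, "token", "a"),
    SubDocument(2, "sentence", "b c"),
    SubDocument(3, "layout", "h1"),
]
reconstructed, waves = schedule(sub_documents)
```

In this sketch the sub-document at position 0 is processed last because its type depends on the others, yet the reconstructed list still follows the original order, which mirrors the distinction the claim draws between processing order and order of occurrence.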

Assignees

IBM

Inventors

Classifications

G06F40/30 · Primary
Semantic analysis · CPC title
G06F40/284 · Primary
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06F17/2785
Physics · mapped topic
G06F17/277 · Primary
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 56129598

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9971760B2 cover?: In an approach for parallelizing document processing in an information handling system, a processor receives a document, wherein the document includes text content. A processor extracts information from the text content, utilizing natural language processing and semantic analysis, to form tokenized semantic partitions, comprising a plurality of sub-documents. A processor schedules a plurality o…
Who is the assignee on this patent?: IBM