Who is the assignee on this patent?

Ananthanarayanan Rema, Bhamidipaty Anuradha, Kummamuru Krishna, and 5 more

What technology area does this patent fall under?

Primary CPC classification G06F16/355. Mapped technology areas include Physics.

When was this patent published?

Publication date Oct 2, 2018 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method to extract models from semi-structured documents

US10089390B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10089390-B2
Application number	US-89006010-A
Country	US
Kind code	B2
Filing date	Sep 24, 2010
Priority date	Sep 24, 2010
Publication date	Oct 2, 2018
Grant date	Oct 2, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and associated methods for automated and semi-automated building of domain models for documents are described. Embodiments provide an approach to discover an information model by mining documentation about a particular domain captured in the documents. Embodiments classify the documents into one or more types corresponding to concepts using indicative words, identify candidate model elements (concepts) for document types, identify relationships both within and across document types, and consolidate and learn a global model for the domain.
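As a rough illustration of the classification step summarized above, the sketch below scores a document against each type using indicative words. It is not the patented method: the word list, the probabilities, the noisy-OR combination, and the names `INDICATIVE_WORDS` and `classify` are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical indicative words: word -> (document type, probability that a
# document containing the word belongs to that type). All values are invented.
INDICATIVE_WORDS = {
    "invoice": ("billing", 0.9),
    "payment": ("billing", 0.6),
    "server": ("infrastructure", 0.8),
    "network": ("infrastructure", 0.7),
}


def classify(text, indicative_words=INDICATIVE_WORDS, threshold=0.5):
    """Return the set of document types whose combined score meets threshold.

    Scores are combined with a noisy-OR rule, which is an assumption of this
    sketch rather than something the patent text specifies.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # miss[t] = probability that none of the occurring words signals type t
    miss = defaultdict(lambda: 1.0)
    for word, (doc_type, prob) in indicative_words.items():
        if word in tokens:
            miss[doc_type] *= 1.0 - prob
    scores = {doc_type: 1.0 - m for doc_type, m in miss.items()}
    return {doc_type for doc_type, score in scores.items() if score >= threshold}
```

A document can land in more than one type, which matches the abstract's "one or more types"; raising `threshold` makes the assignment stricter.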

First claim

Opening claim text (preview).

What is claimed is:

1. A method for producing a global model describing a collection of documents comprising: executing with one or more processors one or more modules of computer program code configured for accessing a collection of documents, the collection of documents comprising labeled documents and unlabeled documents; receiving input of at least one indicative word, wherein the at least one indicative word comprises a descriptive word for classification and wherein the at least one indicative word indicates a probability of belonging to a classification based upon the indicative word occurring in a document during classification; classifying both labeled documents and unlabeled documents of the collection of documents to produce classified documents of one or more types, wherein the classifying comprises producing a domain sub-model for each document type, wherein the domain sub-model represents a graphical representation of a set of concepts contained within each document type and wherein the domain sub-model is generated using the labeled documents and the at least one indicative word; wherein the producing a domain sub-model for each document type comprises extracting concepts from each of the documents and determining relationships between the concepts, wherein the extracting concepts comprises producing concept pairs by identifying, within the collection of documents, co-occurring candidate concepts and wherein the determining relationships between the concepts comprises identifying relationship links between source and destination candidate concepts, wherein the identifying relationship links comprises extracting, from each of the documents of the collection of documents, a hierarchical structure, searching for adjacent container pairs within the hierarchical structures, and inferring directed relationships between elements within the adjacent container pairs; thereupon generating a global domain model for the documents of the collection of the documents by merging the produced domain sub-models, based on the relationships between the concepts; said generating of a global domain model comprising aggregating identified relationship links and corresponding concepts of each of the domain sub-models across the produced domain sub-models, wherein the relationship links and corresponding concepts selected for aggregation are based upon a strategy identified based upon a level of manual review; thereupon outputting the global model as a graphical representation comprising the aggregated concepts and relationship links between concepts; ascertaining one or more changes to the collection of documents; and generating a new global model based on the one or more changes to the collection of documents by reclassifying the collection of documents and generating a new global model using the new domain sub-models generated during reclassification of the collection of documents.

2. The method according to claim 1, wherein: the documents correspond to a plurality of document types; and said extracting of concepts from the classified documents further comprises identifying concepts and links for each document type.

3. The method according to claim 2, wherein the document types correspond to at least one of: differing domains; and differing classification models.

4. The method according to claim 1, further comprising accepting user input indicating one or more of threshold input and validation input.

5. The method according to claim 1, wherein the global model is output as a graph, the graph comprising a tree structure wherein nodes represent concepts and edges represent relationships between concepts.

6. A computer program product for producing a global model describing a collection of documents comprising: a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code configured to access a collection of documents, the collection of documents comprising labeled documents and unlabeled documents; computer readable program code configured to receive input of at least one indicative word, wherein the at least one indicative word comprises a descriptive word for classification and wherein the at least one indicative word indicates a probability of belonging to a classification based upon the indicative word occurring in a document during classification; computer readable program code configured to classify both labeled documents and unlabeled documents of the collection of documents to produce classified documents of one or more types, via producing a domain sub-model for each document type, wherein the domain sub-model represents a graphical representation of a set of concepts contained within each document type and wherein the domain sub-model is generated using the labeled documents and the at least one indicative word; wherein the producing a domain sub-model for each document type comprises extracting concepts from each of the documents and determine relationships between the concepts, wherein the extracting concepts comprises producing concept pairs by identifying, within the collection of documents, co-occurring candidate concepts and wherein the determining relationships between the concepts comprises identifying relationship links between source and destination candidate concepts, wherein the identifying relationship links comprises extracting, from each of the documents of the collection of documents, a hierarchical structure, searching for adjacent container pairs within the hierarchical structures, and inferring directed relationships between elements within the adjacent container pairs; computer readable program code configured to thereupon generate a global domain model for the documents of the collection of the documents by merging the produced domain sub-models, based on the relationships between the concepts, via aggregating identified relationship links and corresponding concepts of each of the domain sub-models across the produced domain sub-models, wherein the relationship links and corresponding concepts selected for aggregation are based upon a strategy identified based upon a level of manual review; computer readable program code configured to thereupon output the global model as a graphical representation comprising the aggregated concepts and relationship links between concepts; computer readable program code configured to ascertain one or more changes to the collection of documents; and computer readable program code configured to generate a new global model based on the one or more changes to the collection of documents by reclassifying the collection of documents and generating a new global model using the new domain sub-models generated during reclassification of the collection of documents.

7. The computer product according to claim 6, wherein the documents correspond to a plurality of document types, and said computer readable program code is configured to extract concepts from the classified documents via identifying concepts and links for each document type.

8. The computer program product according to claim 6, further comprising computer readable program code configured to ascertain user input indicating one or more of threshold input and validation input.

9. The computer program product according to claim 6, wherein the global model is output as a graph, the graph comprising a tree structure wherein nodes represent concepts and edges represent relationships between concepts.

10. A system for producing a global model describing a collection of documents comprising: one or more processors; and a memory operatively connected to the one or more processors;
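
To make the link-extraction and merging steps of claim 1 more concrete, here is a minimal sketch under stated assumptions. It treats a parent container and its direct child as an "adjacent container pair", and it uses a simple support threshold as a stand-in for the review-dependent aggregation strategy. Both choices, along with the names `extract_links`, `merge_sub_models`, and `min_support`, are hypothetical and are not taken from the patent.

```python
from collections import Counter


def extract_links(node):
    """Yield (source, destination) links from a document hierarchy.

    `node` is a (label, children) tuple. Reading a parent and its direct
    child as an adjacent container pair is an assumption of this sketch.
    """
    label, children = node
    for child in children:
        yield (label, child[0])          # directed link: parent -> child
        yield from extract_links(child)  # recurse into nested containers


def merge_sub_models(sub_models, min_support=1):
    """Aggregate links across sub-models into one adjacency mapping.

    A link is kept if it appears in at least `min_support` sub-models.
    The threshold is a hypothetical stand-in for a strategy chosen by the
    level of manual review (for example, stricter when nobody reviews).
    """
    support = Counter()
    for links in sub_models:
        support.update(set(links))  # count each link once per sub-model
    graph = {}
    for (src, dst), count in sorted(support.items()):
        if count >= min_support:
            graph.setdefault(src, []).append(dst)
            graph.setdefault(dst, [])  # every concept appears as a node
    return graph
```

For example, with `doc_a = ("Server", [("Disk", []), ("Network", [("Port", [])])])` and `doc_b = ("Server", [("Network", []), ("Owner", [])])`, merging `[list(extract_links(doc_a)), list(extract_links(doc_b))]` with `min_support=2` keeps only the Server to Network link, since it is the only one both sub-models agree on.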

Classifications

G06F16/355 · Primary
Creation or modification of classes or clusters · CPC title
G06F17/3071 · Primary
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 45871727
