Extracting complex entities and relationships from unstructured data

US9569733B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9569733-B2
Application number  US-201514627430-A
Country             US
Kind code           B2
Filing date         Feb 20, 2015
Priority date       Feb 20, 2015
Publication date    Feb 14, 2017
Grant date          Feb 14, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

To extract relationships between complex entities from unstructured data, a parser parses, using an existing language model, the unstructured data to generate a parse tree. From the parse tree, a set of tokens is created. A token in the set of tokens includes a set of words found in the unstructured data. The set of tokens is inserted in the existing language model to form an enhanced language model. The unstructured data is re-parsed using the enhanced language model to create a knowledge graph. From the knowledge graph, a relationship between a subset of the set of tokens is extracted.
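The two-pass flow in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "language model" is reduced to a set of known lexical units, `PREDICATES` is a hand-written list, and the helpers `parse`, `create_tokens`, and `build_graph` are hypothetical names. The patent does not specify these data structures or algorithms.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical stand-ins: a "language model" reduced to a set of known units,
# and a fixed predicate list. Neither comes from the patent text.
BASE_MODEL: Set[str] = {"acute", "kidney", "injury", "causes", "fluid", "retention"}
PREDICATES: Set[str] = {"causes"}


def parse(text: str, model: Set[str]) -> List[str]:
    """Segment text into units, preferring the longest unit known to the model."""
    words = text.lower().split()
    units: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest span first; a single word is always accepted.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in model or j == i + 1:
                units.append(candidate)
                i = j
                break
    return units


def create_tokens(units: List[str], predicates: Set[str]) -> Set[str]:
    """Group runs of adjacent non-predicate words into multi-word tokens."""
    tokens: Set[str] = set()
    run: List[str] = []
    sentinel: List[Optional[str]] = [None]  # flushes the final run
    for unit in list(units) + sentinel:
        if unit is None or unit in predicates:
            if len(run) > 1:
                tokens.add(" ".join(run))
            run = []
        else:
            run.append(unit)
    return tokens


def build_graph(units: List[str], predicates: Set[str]) -> List[Tuple[str, str, str]]:
    """Emit (subject, predicate, object) triples around each predicate unit."""
    return [
        (units[k - 1], units[k], units[k + 1])
        for k in range(1, len(units) - 1)
        if units[k] in predicates
    ]


text = "Acute kidney injury causes fluid retention"
first_pass = parse(text, BASE_MODEL)             # word-by-word units
tokens = create_tokens(first_pass, PREDICATES)   # multi-word tokens
enhanced_model = BASE_MODEL | tokens             # "enhanced language model"
second_pass = parse(text, enhanced_model)        # re-parse with the new tokens
relations = build_graph(second_pass, PREDICATES)
print(relations)
```

In this toy run the first pass would relate only "injury" and "fluid", while the re-parse relates the complex entities "acute kidney injury" and "fluid retention" through the predicate "causes".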

First claim

Opening claim text (preview).

What is claimed is:

1. A method for extracting relationships between complex entities from unstructured data, the method comprising: parsing, using a parser application executing using a processor and a memory, using an existing language model, the unstructured data to generate a parse tree; creating, from the parse tree, a set of tokens, wherein a token in the set of tokens comprises a set of words found in the unstructured data; inserting the set of tokens in the existing language model to form an enhanced language model; re-parsing the unstructured data using the enhanced language model to create a knowledge graph; and extracting, from the knowledge graph, a relationship between a subset of the set of tokens.

2. The method of claim 1, wherein the relationship is an expressed relationship, further comprising: identifying, as a branch in the knowledge graph a set of edges between the tokens in the subset, each edge in the set of edges using a corresponding predicate in a set of predicates; collapsing the branch of the knowledge graph such that the subset of tokens become related by a single edge representing the set of predicates; and concluding, as a part of the extracting, that tokens in the subset of tokens are related in the expressed relationship by the set of predicates.

3. The method of claim 2, further comprising: concluding that a first token in the subset of tokens and a second token in a second subset of tokens are related in an inferred relationship, wherein tokens in the second subset are in a second expressed relationship according to collapsing a second branch in the knowledge graph; identifying a common token, wherein the branch leads from the common token to the first token and the second branch leads from the common token to the second token; and making the common token a condition of the inferred relationship.

4. The method of claim 3, further comprising: determining that tokens in the second subset of tokens are related in the second expressed relationship by a second set of predicates.

5. The method of claim 1, further comprising: using, as a part of creating the set of tokens, a knowledge repository, wherein the knowledge repository is related to a subject matter of the unstructured data.

6. The method of claim 1, further comprising: using, as a part of creating the set of tokens, contents of the unstructured data.

7. The method of claim 1, further comprising: using, as a part of creating the set of tokens, contents of a different unstructured data, wherein the unstructured data and the different unstructured data are related to a subject matter.

8. The method of claim 1, wherein the token can be recognized as a single construct according to the enhanced language model during the re-parsing.

9. The method of claim 1, wherein the words in the set of words appear together and refer to a concept identified in a subject matter of the unstructured data.

10. The method of claim 1, wherein the parsing comprises a word-by-word parsing, and wherein the parse tree comprises single word entities related by single predicate edges.

11. The method of claim 1, wherein the existing language model comprises a previously enhanced language model, further comprising: forming the previously enhanced language model by inserting in an original language model a previous set of tokens.

12. The method of claim 11, further comprising: creating the previous set of tokens from parsing a different unstructured data.

13. The method of claim 1, wherein the method is embodied in a computer program product comprising one or more computer-readable storage devices and computer-readable program instructions which are stored on the one or more computer-readable tangible storage devices and executed by one or more processors.

14. The method of claim 1, wherein the method is embodied in a computer system comprising one or more processors, one or more computer-readable memories, one or more computer-readable storage devices and program instructions which are stored on the one or more computer-readable storage devices for execution by the one or more processors via the one or more memories and executed by the one or more processors.

15. A computer program product for extracting relationships between complex entities from unstructured data, the computer program product comprising: one or more computer-readable tangible storage devices; program instructions, stored on at least one of the one or more storage devices, to parse, using a parser application executing using a processor and a memory, using an existing language model, the unstructured data to generate a parse tree; program instructions, stored on at least one of the one or more storage devices, to create, from the parse tree, a set of tokens, wherein a token in the set of tokens comprises a set of words found in the unstructured data; program instructions, stored on at least one of the one or more storage devices, to insert the set of tokens in the existing language model to form an enhanced language model; program instructions, stored on at least one of the one or more storage devices, to re-parse the unstructured data using the enhanced language model to create a knowledge graph; and program instructions, stored on at least one of the one or more storage devices, to extract, from the knowledge graph, a relationship between a subset of the set of tokens.

16. The computer program product of claim 15, wherein the relationship is an expressed relationship, further comprising: program instructions, stored on at least one of the one or more storage devices, to identify, as a branch in the knowledge graph a set of edges between the tokens in the subset, each edge in the set of edges using a corresponding predicate in a set of predicates; program instructions, stored on at least one of the one or more storage devices, to collapse the branch of the knowledge graph such that the subset of tokens become related by a single edge representing the set of predicates; and program instructions, stored on at least one of the one or more storage devices, to conclude, as a part of the extracting, that tokens in the subset of tokens are related in the expressed relationship by the set of predicates.

17. The computer program product of claim 16, further comprising: program instructions, stored on at least one of the one or more storage devices, to conclude that a first token in the subset of tokens and a second token in a second subset of tokens are related in an inferred relationship, wherein tokens in the second subset are in a second expressed relationship according to collapsing a second branch in the knowledge graph; program instructions, stored on at least one of the one or more storage devices, to identify a common token, wherein the branch leads from the common token to the first token and the second branch leads from the common token to the second token; and program instructions, stored on at least one of the one or more storage devices, to make the common token a condition of the inferred relationship.

18. The computer program product of claim 17, further comprising: program instructions, stored on at least one of the one or more storage devices, to determine that tokens in the second subset of tokens are related in the second expressed relationship by a second set of predicates.

19. The computer program product of claim 15, further comprising: program instructions, stored on at least one of the one or more storage devices, to use, as a part of cr
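
The branch collapsing in claim 2 and the inferred relationship in claim 3 can be sketched roughly as follows. This is only one possible reading under stated assumptions: edges are modeled as `(source, predicate, target)` tuples, and `collapse_branch`, `infer_relationship`, and the example tokens are hypothetical names invented for illustration. The claims do not prescribe any particular representation.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, str]                   # (source, predicate, target)
Collapsed = Tuple[str, Tuple[str, ...], str]  # (source, predicates, target)


def collapse_branch(branch: List[Edge]) -> Collapsed:
    """Replace a connected chain of edges with one edge carrying all predicates."""
    if not branch:
        raise ValueError("branch must contain at least one edge")
    # Each edge must start where the previous one ended.
    for (_, _, dst), (src, _, _) in zip(branch, branch[1:]):
        if dst != src:
            raise ValueError("edges do not form a connected branch")
    predicates = tuple(pred for _, pred, _ in branch)
    return (branch[0][0], predicates, branch[-1][2])


def infer_relationship(first: Collapsed, second: Collapsed) -> Optional[Dict[str, str]]:
    """Relate the end tokens of two branches that share a starting token.

    The shared (common) token is recorded as the condition of the inference.
    """
    if first[0] != second[0]:
        return None
    return {"first": first[2], "second": second[2], "condition": first[0]}


# Hypothetical example data: two branches leaving the same common token.
branch_a: List[Edge] = [
    ("type 2 diabetes", "is treated with", "insulin therapy"),
    ("insulin therapy", "lowers", "blood glucose"),
]
branch_b: List[Edge] = [
    ("type 2 diabetes", "increases risk of", "kidney disease"),
]

expressed_a = collapse_branch(branch_a)
expressed_b = collapse_branch(branch_b)
inferred = infer_relationship(expressed_a, expressed_b)
print(expressed_a)
print(inferred)
```

Here each collapsed edge stands for an expressed relationship, and the inferred relationship between "blood glucose" and "kidney disease" holds only under the condition of the common token "type 2 diabetes".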

Assignees

Inventors

Classifications

  • Lexical analysis, e.g. tokenisation or collocates · CPC title

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • Parsing · CPC title

  • Indexing structures · CPC title

  • Creation of semantic tools, e.g. ontology or thesauri · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9569733B2 cover?
To extract relationships between complex entities from unstructured data, a parser parses, using an existing language model, the unstructured data to generate a parse tree. From the parse tree, a set of tokens is created. A token in the set of tokens includes a set of words found in the unstructured data. The set of tokens is inserted in the existing language model to form an enhanced language …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N99/005. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 14, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).