Extracting facts from unstructured text

US9424524B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9424524-B2
Application number  US-201414557802-A
Country             US
Kind code           B2
Filing date         Dec 2, 2014
Priority date       Dec 2, 2013
Publication date    Aug 23, 2016
Grant date          Aug 23, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for extracting facts from unstructured text files are disclosed. Embodiments of the disclosed system and method may receive a text file as input and perform extraction and disambiguation of entities, as well as extract topics and facts. The facts are extracted by comparing against a fact template store and associating facts with events or topics. The extracted facts are stored in a data store.
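The abstract's "comparing against a fact template store" step can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `FactTemplate` class, the sum-of-weights scoring rule, the sentence splitter, and the `threshold` parameter are all hypothetical and are not taken from the patent, which does not publish an implementation on this page.

```python
import re
from dataclasses import dataclass


@dataclass
class FactTemplate:
    """Hypothetical template: a fact label plus weighted keywords."""
    fact_id: str
    keyword_weights: dict[str, float]


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokenizer (an illustrative choice)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_sentence(sentence: str, template: FactTemplate) -> float:
    """Sum the weights of the template keywords present in the sentence."""
    tokens = set(tokenize(sentence))
    return sum(w for kw, w in template.keyword_weights.items() if kw in tokens)


def extract_facts(text: str, templates: list[FactTemplate], threshold: float = 1.0):
    """Return (fact_id, sentence, score) for sentences scoring at or above threshold."""
    # Naive sentence split on terminal punctuation; assumed for illustration only.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    facts = []
    for sentence in sentences:
        for template in templates:
            score = score_sentence(sentence, template)
            if score >= threshold:
                facts.append((template.fact_id, sentence, score))
    return facts


# Made-up templates and input text, used only to exercise the sketch.
templates = [
    FactTemplate("acquisition", {"acquired": 1.0, "acquires": 1.0, "deal": 0.5}),
    FactTemplate("appointment", {"appointed": 1.0, "named": 0.5, "ceo": 0.5}),
]
text = "Acme acquired Widgets Inc in a cash deal. The weather was mild."
facts = extract_facts(text, templates)
```

Here only the first sentence matches the hypothetical "acquisition" template. A real system would presumably also compare sentence structure, as the claims refer to "text string structures" rather than keywords alone.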

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by an entity extraction computer, an electronic document having unstructured text, wherein the electronic document is a text file; extracting, by the entity extraction computer, an entity identifier from the unstructured text in the electronic document; extracting, by a topic extraction computer, a topic identifier from the unstructured text in the electronic document; extracting, by a fact extraction computer, a fact identifier from the unstructured text in the electronic document by comparing text string structures in the unstructured text to a fact template database, wherein the fact template database having stored therein a fact template model identifying keywords pertaining to specific fact identifiers and corresponding keyword weights; and associating, by a fact relatedness estimator computer, the entity identifier with the topic identifier and the fact identifier to determine a confidence score indicative of a degree of accuracy of extraction of the fact identifier, wherein the confidence score is based at least in part on a spatial distance between a part of the unstructured text in the electronic document from where the fact identifier was extracted and a part of the unstructured text from where at least one of the topic identifier or the entity identifier was extracted.

2. The method of claim 1, wherein the distance in the unstructured text is calculated using tokenization.

3. The method of claim 1, wherein the confidence score is further based at least in part on comparing co-occurring entity identifiers in the electronic document.

4. The method of claim 1 wherein the fact template model includes metadata.

5. The method of claim 4 wherein the metadata includes a count of a number of times a sentence structure corresponding to the fact template model is repeated across a plurality of electronic documents.

6. The method of claim 4 wherein the confidence score is stored in the metadata.

7. A system comprising: one or more server computers having one or more processors executing computer readable instructions for a plurality of computer modules including: an entity extraction module which receives an electronic document having unstructured text and extracts an entity identifier from the unstructured text in the electronic document, wherein the electronic document is a text file; a topic extraction module which extracts a topic identifier from the unstructured text in the electronic document; a fact extraction module which extracts a fact identifier from the unstructured text in the electronic document by comparing text string structures in the unstructured text to a fact template database, wherein the fact template database having stored therein a fact template model identifying keywords pertaining to specific fact identifiers and corresponding keyword weights; and a fact relatedness estimator module which associates the entity identifier with the topic identifier and the fact identifier to determine a confidence score indicative of a degree of accuracy of extraction of the fact identifier, wherein the confidence score is based at least in part on a spatial distance between a part of the unstructured text in the electronic document from where the fact identifier was extracted and a part of the unstructured text from where at least one of the topic identifier or the entity identifier was extracted.

8. The system of claim 7, wherein the distance in the unstructured text is calculated using tokenization.

9. The system of claim 7, wherein the confidence score is further based at least in part on comparing co-occurring entity identifiers in the electronic document.

10. The system of claim 7 wherein the fact template model includes metadata.

11. The system of claim 10 wherein the metadata includes a count of a number of times a sentence structure corresponding to the fact template model is repeated across a plurality of electronic documents.

12. The system of claim 10 wherein the confidence score is stored in the metadata.

13. A non-transitory computer readable medium having stored thereon computer executable instructions instructive of a method comprising: receiving, by an entity extraction computer, an electronic document having unstructured text, wherein the electronic document is a text file; extracting, by the entity extraction computer, an entity identifier from the unstructured text in the electronic document; extracting, by a topic extraction computer, a topic identifier from the unstructured text in the electronic document; extracting, by a fact extraction computer, a fact identifier from the unstructured text in the electronic document by comparing text string structures in the unstructured text to a fact template database, the fact template database having stored therein a fact template model identifying keywords pertaining to specific fact identifiers and corresponding keyword weights; and associating, by a fact relatedness estimator computer, the entity identifier with the topic identifier and the fact identifier to determine a confidence score indicative of a degree of accuracy of extraction of the fact identifier, wherein the confidence score is based at least in part on a spatial distance between a part of the unstructured text in the electronic document from where the fact identifier was extracted and a part of the unstructured text from where at least one of the topic identifier or the entity identifier was extracted.

14. The non-transitory computer readable medium of claim 13, wherein the distance in the unstructured text is calculated using tokenization.

15. The non-transitory computer readable medium of claim 13, wherein the confidence score is further based at least in part on comparing co-occurring entity identifiers in the electronic document.

16. The non-transitory computer readable medium of claim 13 wherein the fact template model includes metadata.

17. The non-transitory computer readable medium of claim 16 wherein the metadata includes a count of a number of times a sentence structure corresponding to the fact template model is repeated across a plurality of electronic documents.
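The independent claims tie the confidence score to a "spatial distance" in the text, and claims 2, 8, and 14 say that distance is calculated using tokenization. The Python sketch below shows one way such a score could behave. The decay formula `1 / (1 + distance / scale)`, the `scale` parameter, and all function names are assumptions made for illustration; the claims do not specify a formula.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (illustrative tokenizer)."""
    return re.findall(r"\w+", text.lower())


def first_index(tokens: list[str], phrase: str) -> int:
    """Return the token index where phrase first occurs, or -1 if absent."""
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    if n == 0:
        return -1
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i
    return -1


def distance_confidence(text: str, fact_phrase: str, anchor_phrase: str,
                        scale: float = 10.0) -> float:
    """Hypothetical confidence that decays with the token distance between
    the fact's source text and an entity or topic mention."""
    tokens = tokenize(text)
    fact_pos = first_index(tokens, fact_phrase)
    anchor_pos = first_index(tokens, anchor_phrase)
    if fact_pos < 0 or anchor_pos < 0:
        return 0.0  # one of the phrases does not occur in the text
    distance = abs(fact_pos - anchor_pos)
    return 1.0 / (1.0 + distance / scale)


# Made-up input text, used only to exercise the sketch.
text = "Acme acquired Widgets Inc last year. Analysts in London praised the move."
near = distance_confidence(text, "acquired", "Acme")
far = distance_confidence(text, "acquired", "London")
```

In this example `near` exceeds `far` because "Acme" sits one token from "acquired" while "London" sits seven tokens away. The claims also allow the score to depend on co-occurring entity identifiers, which this sketch leaves out.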

Assignees

Inventors

Classifications

  • G06N7/01 · Primary

    Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • using ranking · CPC title

  • Recognition of textual entities · CPC title

  • Query execution (filtering based on additional data G06F16/335) · CPC title

  • G06F16/35 · Primary

    Clustering; Classification · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9424524B2 cover?
A system and method for extracting facts from unstructured text files are disclosed. Embodiments of the disclosed system and method may receive a text file as input and perform extraction and disambiguation of entities, as well as extract topics and facts. The facts are extracted by comparing against a fact template store and associating facts with events or topics. The extracted facts are stor…
Who is the assignee on this patent?
Qbase LLC
What technology area does this patent fall under?
Primary CPC classification G06N7/01. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 23, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).