Extensible system and method for information extraction in a data processing system

US9418069B2 · US · B2

Patent metadata
Publication number: US-9418069-B2
Application number: US-78814210-A
Country: US
Kind code: B2
Filing date: May 26, 2010
Priority date: May 26, 2010
Publication date: Aug 16, 2016
Grant date: Aug 16, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data mashup system having information extraction capabilities for receiving multiple streams of textual data, at least one of which contains unstructured textual data. A repository stores annotators that describe how to analyze the streams of textual data for specified unstructured data components. The annotators are applied to the data streams to identify and extract the specified data components according to the annotators. The extracted data components are tagged to generate structured data components and the specified unstructured data components in the input data streams are replaced with the tagged data components. The system then combines the tagged data from the multiple streams to form a mashup output data stream.
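
The flow the abstract describes (select annotators from a repository, apply them to each stream, tag the matches in place, then combine the streams) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `Annotator` class, the use of regular expressions as annotation rules, the XML-like tag syntax, and the simple concatenation used to combine streams.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotator:
    """Hypothetical annotator: a name (used as the tag) plus a regex rule."""
    name: str
    pattern: str


def annotate(text, annotators):
    """Wrap every span matched by an annotator in a tag named after it.

    Annotators run one after another, so this toy version assumes their
    patterns do not match text inside tags inserted by an earlier one.
    """
    for a in annotators:
        text = re.sub(
            a.pattern,
            lambda m, tag=a.name: f"<{tag}>{m.group(0)}</{tag}>",
            text,
        )
    return text


def mashup(streams, repository, requested):
    """Apply the requested annotators to every record, then combine streams."""
    selected = [repository[name] for name in requested]
    combined = []
    for stream in streams:
        combined.extend(annotate(record, selected) for record in stream)
    return combined


# Hypothetical repository contents and input streams.
repository = {
    "phone": Annotator("phone", r"\b\d{3}-\d{4}\b"),
    "email": Annotator("email", r"\b[\w.]+@[\w.]+\.\w+\b"),
}
news = ["Call 555-1234 for details."]
mail = ["Write to ann@example.com today."]

result = mashup([news, mail], repository, ["phone", "email"])
for line in result:
    print(line)
# Call <phone>555-1234</phone> for details.
# Write to <email>ann@example.com</email> today.
```

The `requested` list plays the role of the "annotate request" in the first claim: only the named subset of stored annotators is applied.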

First claim

Opening claim text (preview).

We claim:

1. A system for performing a data mashup including information extraction capabilities, comprising: a processor; and a computer-readable storage medium coupled to the processor, wherein the computer-readable storage medium stores computer program instructions, and wherein the computer program instructions are executed by the processor to perform: receiving multiple streams of textual data, wherein at least one stream of textual data contains unstructured textual data; storing in a repository annotators describing how to analyze the multiple streams of textual data for unstructured data components that are identified by information contained in the annotators; receiving an annotate request to apply a specified set of annotators from the stored annotators to the unstructured textual data; applying the specified set of annotators to the unstructured textual data to identify and extract a plurality of unstructured data components according to the specified set of annotators, wherein the plurality of unstructured data components are contained in the unstructured textual data; tagging the extracted, unstructured data components to generate tagged, structured data components; for each portion of the unstructured textual data corresponding to at least one tagged, structured data component of the generated tagged, structured data components, adding the at least one tagged, structured data component to that portion of the unstructured textual data; and combining the multiple streams of textual data, including the at least one stream of textual data having the generated tagged, structured data components, to form a mashup output data stream.

2. The system of claim 1, wherein the computer program instructions are executed by the processor to perform: an annotator development environment allowing the design of new annotators and the uploading of the new annotators to the repository.

3. The system of claim 1, wherein the computer program instructions are executed by the processor to perform: searching the repository for annotators.

4. The system of claim 2, wherein the annotator development environment and the information extraction capabilities comprise a tightly-integrated client-server system.

5. The system of claim 1, wherein the computer program instructions are executed by the processor to perform: storing at least two annotators having identical names and version identifiers in a single active version of the repository; and allowing the extraction of unstructured data components to be extracted according to different versions of the at least two annotators.

6. The system of claim 5, wherein the version identifiers comprise timestamps.

7. The system of claim 5, wherein the computer program instructions are executed by the processor to perform: receiving a search request against the repository to return executable annotation rules that were current as of a version timestamp.

8. The system of claim 1, wherein each different version of an annotator includes a timestamp.

9. The system of claim 1, wherein the computer program instructions are executed by the processor to perform: receiving a search request against the repository to return annotation rules that were current at a time a currently executing data stream was created.

10. The method of claim 1, wherein the computer program instructions are executed by the processor to perform: receiving a search request from a client application against the repository to return a current list of available annotation rules.

11. A computer-readable storage medium containing program code for performing a data mashup including information extraction capabilities, comprising: code for receiving multiple streams of textual data, wherein at least one stream of textual data contains unstructured textual data; code for storing in a repository annotators describing how to analyze the multiple streams of textual data for unstructured data components that are identified by information contained in the annotators; code for receiving an annotate request to apply a specified set of annotators from the stored annotators to the unstructured textual data; code for applying the specified set of annotators to the unstructured textual data to identify and extract a plurality of unstructured data components according to the specified set of annotators, wherein the plurality of unstructured data components are contained in the unstructured textual data; code for tagging the extracted, unstructured data components to generate tagged, structured data components; code for, for each portion of the unstructured textual data corresponding to at least one tagged, structured data component of the generated tagged, structured data components, adding the at least one tagged, structured data component to that portion of the unstructured textual data; and code for combining the multiple streams of textual data, including the at least one stream of textual data having the generated tagged, structured data components, to form a mashup output data stream.

12. The computer-readable storage medium of claim 11, further comprising: code for an annotator development environment for designing new annotators and for uploading the new annotators to the repository.

13. The computer-readable storage medium of claim 12, wherein the stored code tightly integrates the annotator development environment and the information extraction capabilities into a client-server system.

14. The computer-readable storage medium of claim 11, further comprising: code for searching the repository for annotators.

15. The computer-readable storage medium of claim 11, further comprising: code for storing at least two annotators having identical names and version identifiers in a single active version of the repository; and code for allowing the extraction of unstructured data components to be extracted according to different versions of the at least two annotators.

16. The computer-readable storage medium of claim 15, wherein the code for storing version identifiers further comprises: code for storing timestamps.

17. The computer-readable storage medium of claim 15, wherein each different version of an annotator includes a timestamp.

18. The computer-readable storage medium of claim 15, further comprising: code for receiving a search request against the repository to return executable annotation rules that were current as of a version timestamp.

19. The computer-readable storage medium of claim 11, further comprising: code for receiving a search request against the repository to return annotation rules that were current at a time a currently executing data stream was created.

20. The computer-readable storage medium of claim 11, further comprising: code for receiving a search request from a client application against the repository to return a current list of available annotation rules.
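
The dependent claims on versioning describe a repository that holds several timestamped versions of an annotator at once and can answer "which rules were current as of time T?" as well as "which rules are current now?". One way this could look is sketched below. It is only an illustration under stated assumptions: the `AnnotatorRepository` class, its method names, the integer timestamps, and the sorted-list lookup are all hypothetical and are not described in the claims.

```python
from bisect import bisect_right
from collections import defaultdict


class AnnotatorRepository:
    """Hypothetical repository keeping every timestamped version of a rule."""

    def __init__(self):
        # Parallel lists per annotator name, kept sorted by timestamp.
        self._timestamps = defaultdict(list)
        self._rules = defaultdict(list)

    def upload(self, name, timestamp, rule):
        """Store a new version without discarding older ones."""
        i = bisect_right(self._timestamps[name], timestamp)
        self._timestamps[name].insert(i, timestamp)
        self._rules[name].insert(i, rule)

    def as_of(self, name, timestamp):
        """Return the rule that was current at `timestamp`, or None."""
        i = bisect_right(self._timestamps.get(name, []), timestamp)
        return self._rules[name][i - 1] if i else None

    def current(self):
        """Return the latest rule for every annotator name."""
        return {name: rules[-1] for name, rules in self._rules.items()}


repo = AnnotatorRepository()
repo.upload("phone", 100, r"\d{3}-\d{4}")            # older version
repo.upload("phone", 200, r"\(\d{3}\) \d{3}-\d{4}")  # newer version

# A stream created at time 150 would be annotated with the older rule.
rule_for_stream = repo.as_of("phone", 150)
latest = repo.current()
```

Keeping all versions active in one repository, rather than overwriting, is what lets a long-running data stream continue to use the rules that existed when it was created.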

Assignees

Inventors

Classifications

  • Physics · mapped topic

  • G06F16/313 · Primary

    Selection or weighting of terms for indexing · CPC title

  • G06F16/122 · Primary

    using management policies (point-in-time backing up or restoration of persistent data G06F11/1446; file migration policies for HSM systems G06F16/185) · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9418069B2 cover?
A data mashup system having information extraction capabilities for receiving multiple streams of textual data, at least one of which contains unstructured textual data. A repository stores annotators that describe how to analyze the streams of textual data for specified unstructured data components. The annotators are applied to the data streams to identify and extract the specified data compo…
Who is the assignee on this patent?
Li Yunyao, Reiss Frederick Ralph, Simmen David Everett, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06F17/30082. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 16, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).