System and a method for associating contextual structured data with unstructured documents on map-reduce

US10915537B2 · US · B2

Patent metadata
Publication number: US-10915537-B2
Application number: US-201615229485-A
Country: US
Kind code: B2
Filing date: Aug 5, 2016
Priority date: Aug 27, 2015
Publication date: Feb 9, 2021
Grant date: Feb 9, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an approach for integrating documents a processor extracts a first set of keywords from at least one structured document. A processor generates a first batch of keywords from the first set of keywords, wherein each keyword in the first batch of keywords includes a weight. A processor extracts a second set of keywords from at least one unstructured document. A processor compares the first batch of keywords to the second set of keywords. A processor determines that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the first batch of keywords to the second set of keywords. A processor removes the at least one unstructured document from a list of unstructured documents which are to be processed.
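
The flow in the abstract can be illustrated with a minimal in-memory sketch. Everything concrete here is an assumption made for illustration, not something the patent text specifies: the tokenizer, the names `weighted_batch`, `match_score`, and `integrate`, the batch size, and the scoring rule (matched weight divided by total batch weight). The weight of a keyword is taken to be its occurrence count in the structured records, which follows claim 1.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_batch(structured_rows, batch_size):
    """Count keywords in the structured field values and keep the top ones.

    The weight of each keyword is the number of times it appears.
    """
    counts = Counter()
    for row in structured_rows:
        for value in row.values():
            counts.update(tokenize(str(value)))
    return dict(counts.most_common(batch_size))


def match_score(batch, text):
    """Share of the batch's total weight whose keywords occur in the text (assumed rule)."""
    total = sum(batch.values())
    if total == 0:
        return 0.0
    present = set(tokenize(text))
    return sum(weight for keyword, weight in batch.items() if keyword in present) / total


def integrate(structured_rows, pending, batch_size=3, threshold=0.5):
    """Return names of matching documents and remove them from the pending list."""
    batch = weighted_batch(structured_rows, batch_size)
    matched = [name for name, text in pending.items()
               if match_score(batch, text) >= threshold]
    for name in matched:
        del pending[name]  # matched documents need no further processing
    return matched


# Hypothetical data: two structured records and two unstructured documents.
rows = [{"customer": "Acme", "product": "turbine"},
        {"customer": "Acme", "product": "valve"}]
pending = {"memo.txt": "Acme reported a turbine fault",
           "note.txt": "Lunch menu for Friday"}
matched = integrate(rows, pending)
```

With this data the batch is `acme` (weight 2), `turbine` (1), and `valve` (1), so `memo.txt` scores 3/4 and is removed from `pending`, while `note.txt` scores 0 and stays queued.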

First claim

Opening claim text (preview).

What is claimed is:

1. A method for integrating documents, the method comprising: extracting, by one or more processors, and via a first MapReduce script, a first set of keywords from at least one structured document; generating, by one or more processors, a first batch of keywords from the first set of keywords, wherein each given keyword in the first batch of keywords is assigned a respectively corresponding weight, wherein each respectively corresponding weight is based on a number of times the given keyword appears in the structured data; extracting, by one or more processors, and via a second MapReduce script, a second set of keywords from at least one unstructured document; comparing, by one or more processors, the first batch of keywords to the second set of keywords; determining, by one or more processors, that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the first batch of keywords to the second set of keywords, to produce an output dataset; and storing the output dataset in a storage repository in a storage location that is selected based on a keyword in the unstructured document and the respectively corresponding weight assigned to each keyword in the structured data; wherein the at least one structured document, the at least one unstructured document, and the output dataset, are stored in a MapReduce distributed storage system.

2. The method of claim 1, wherein the respectively corresponding weight of each keyword in the first batch of keywords indicates a frequency of appearance in the at least one structured document.

3. The method of claim 1, wherein generating the first batch of keywords from the first set of keywords further comprises: prioritizing, by one or more processors, the first batch of keywords based on the respectively corresponding weight of each keyword in the first batch of keywords.

4. The method of claim 3, wherein comparing the first batch of keywords to the second set of keywords comprises: comparing, by one or more processors, the first batch of keywords to the second set of keywords based on the respectively corresponding weight associated with each keyword in the first batch of keywords.

5. The method of claim 1, wherein extracting the second set of keywords from at least one unstructured document comprises: extracting, by one or more processors, the second set of keywords from the at least one unstructured document based on a presence of at least one keyword of the first batch of keywords in the at least one unstructured document.

6. The method of claim 1, further comprising: processing, by one or more processors, the list of unstructured documents until a minimum number of keywords from the first set of keywords have been processed.

7. The method of claim 1, further comprising: determining, by one or more processors, that a minimum number of the at least one unstructured document from the list of unstructured documents have not been removed; generating, by one or more processors, a second batch of keywords from the first set of keywords, wherein each keyword in the second batch of keywords includes a respectively corresponding weight; extracting, by one or more processors, a third set of keywords from at least one unstructured document; comparing, by one or more processors, the third batch of keywords to the third set of keywords; and determining, by one or more processors, that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the second batch of keywords to the third set of keywords.
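
As a rough illustration of the MapReduce-style extraction in claim 1, the weight-based prioritization in claim 3, and the successive batches in claim 7, the sketch below simulates the map, shuffle, and reduce phases in a single Python process. It is not a Hadoop job and not the patented implementation; the function names, the whitespace tokenization, the batch size, and the alphabetical tie-break are all assumptions.

```python
from collections import defaultdict


def map_keywords(record):
    """Map phase: emit a (keyword, 1) pair for every token in a record's field values."""
    for field_value in record.values():
        for token in str(field_value).lower().split():
            yield token, 1


def shuffle(pairs):
    """Shuffle phase: group the emitted values by keyword."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values per keyword to obtain its weight."""
    return {key: sum(values) for key, values in groups.items()}


def prioritized_batches(weights, batch_size):
    """Yield successive batches of keywords, heaviest first.

    Ties are broken alphabetically (an assumption) so the order is deterministic.
    A later batch would only be used if too few documents matched the earlier one.
    """
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    for start in range(0, len(ranked), batch_size):
        yield dict(ranked[start:start + batch_size])


# Hypothetical structured records.
records = [{"customer": "acme", "part": "valve"},
           {"customer": "acme", "part": "pump"}]

pairs = (pair for record in records for pair in map_keywords(record))
weights = reduce_counts(shuffle(pairs))
batches = list(prioritized_batches(weights, batch_size=2))
```

Here `weights` maps `acme` to 2 and `valve` and `pump` to 1 each. The first batch holds `acme` and `pump`, and the second holds `valve`, which mirrors the fallback to a second batch described in claim 7.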

Assignees

IBM

Inventors

Classifications

  • G06F16/93 · Primary

    Document management systems · CPC title

  • Query processing · CPC title

  • Query execution (filtering based on additional data G06F16/335) · CPC title

  • using ranking · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10915537B2 cover?
In an approach for integrating documents a processor extracts a first set of keywords from at least one structured document. A processor generates a first batch of keywords from the first set of keywords, wherein each keyword in the first batch of keywords includes a weight. A processor extracts a second set of keywords from at least one unstructured document. A processor compares the first bat…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/93. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 9, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).