US10915537B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10915537-B2 |
| Application number | US-201615229485-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 5, 2016 |
| Priority date | Aug 27, 2015 |
| Publication date | Feb 9, 2021 |
| Grant date | Feb 9, 2021 |
Official abstract text for this publication.
In an approach for integrating documents, a processor extracts a first set of keywords from at least one structured document. A processor generates a first batch of keywords from the first set of keywords, wherein each keyword in the first batch of keywords includes a weight. A processor extracts a second set of keywords from at least one unstructured document. A processor compares the first batch of keywords to the second set of keywords. A processor determines that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the first batch of keywords to the second set of keywords. A processor removes the at least one unstructured document from a list of unstructured documents which are to be processed.
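The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the tokenizer, the function names (`extract_keywords`, `weighted_batch`, `match_score`, `filter_pending`), and the scoring rule (share of total batch weight found in the unstructured document) are all assumptions made for illustration. The abstract says only that the comparison is judged against a predetermined threshold.

```python
import re
from collections import Counter

def extract_keywords(text):
    """Lowercase tokens of 3+ letters; a hypothetical stand-in for keyword extraction."""
    return re.findall(r"[a-z]{3,}", text.lower())

def weighted_batch(structured_docs, batch_size):
    """Top `batch_size` keywords, each weighted by its count across the structured docs."""
    counts = Counter()
    for doc in structured_docs:
        counts.update(extract_keywords(doc))
    return dict(counts.most_common(batch_size))

def match_score(batch, unstructured_doc):
    """Assumed scoring rule: fraction of batch weight whose keywords occur in the doc."""
    present = set(extract_keywords(unstructured_doc))
    total = sum(batch.values())
    if total == 0:
        return 0.0
    return sum(w for kw, w in batch.items() if kw in present) / total

def filter_pending(batch, pending, threshold):
    """Split the pending list into matched docs (removed) and docs still to process."""
    matched, remaining = [], []
    for doc in pending:
        (matched if match_score(batch, doc) >= threshold else remaining).append(doc)
    return matched, remaining

# Hypothetical sample data, for illustration only.
structured = ["invoice invoice payment customer", "invoice payment"]
pending = [
    "Please find the invoice and payment details attached.",
    "Team lunch is on Friday.",
]
batch = weighted_batch(structured, batch_size=2)
matched, remaining = filter_pending(batch, pending, threshold=0.5)
```

In this sketch the first message contains both batch keywords, so it is treated as matched and dropped from the pending list, while the second stays queued for a later pass.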
Opening claim text (preview).
What is claimed is:

1. A method for integrating documents, the method comprising: extracting, by one or more processors, and via a first MapReduce script, a first set of keywords from at least one structured document; generating, by one or more processors, a first batch of keywords from the first set of keywords, wherein each given keyword in the first batch of keywords is assigned a respectively corresponding weight, wherein each respectively corresponding weight is based on a number of times the given keyword appears in the structured data; extracting, by one or more processors, and via a second MapReduce script, a second set of keywords from at least one unstructured document; comparing, by one or more processors, the first batch of keywords to the second set of keywords; determining, by one or more processors, that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the first batch of keywords to the second set of keywords, to produce an output dataset; and storing the output dataset in a storage repository in a storage location that is selected based on a keyword in the unstructured document and the respectively corresponding weight assigned to each keyword in the structured data; wherein the at least one structured document, the at least one unstructured document, and the output dataset, are stored in a MapReduce distributed storage system.

2. The method of claim 1, wherein the respectively corresponding weight of each keyword in the first batch of keywords indicates a frequency of appearance in the at least one structured document.

3. The method of claim 1, wherein generating the first batch of keywords from the first set of keywords further comprises: prioritizing, by one or more processors, the first batch of keywords based on the respectively corresponding weight of each keyword in the first batch of keywords.

4. The method of claim 3, wherein comparing the first batch of keywords to the second set of keywords comprises: comparing, by one or more processors, the first batch of keywords to the second set of keywords based on the respectively corresponding weight associated with each keyword in the first batch of keywords.

5. The method of claim 1, wherein extracting the second set of keywords from at least one unstructured document comprises: extracting, by one or more processors, the second set of keywords from the at least one unstructured document based on a presence of at least one keyword of the first batch of keywords in the at least one unstructured document.

6. The method of claim 1, further comprising: processing, by one or more processors, the list of unstructured documents until a minimum number of keywords from the first set of keywords have been processed.

7. The method of claim 1, further comprising: determining, by one or more processors, that a minimum number of the at least one unstructured document from the list of unstructured documents have not been removed; generating, by one or more processors, a second batch of keywords from the first set of keywords, wherein each keyword in the second batch of keywords includes a respectively corresponding weight; extracting, by one or more processors, a third set of keywords from at least one unstructured document; comparing, by one or more processors, the third batch of keywords to the third set of keywords; and determining, by one or more processors, that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the second batch of keywords to the third set of keywords.
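Claims 1 to 3 and 7 describe keyword weights derived from occurrence counts, extraction via MapReduce scripts, and successive keyword batches prioritized by weight. The following is a rough single-process sketch of that pattern in plain Python. It does not use any real MapReduce framework, and the names `map_keywords`, `reduce_counts`, and `prioritized_batches`, the tokenizer, and the alphabetical tie-breaking rule are assumptions introduced here, not details from the claims.

```python
import re
from collections import defaultdict

def map_keywords(document):
    """Map step: emit a (keyword, 1) pair for every token in one document."""
    for token in re.findall(r"[a-z]{3,}", document.lower()):
        yield token, 1

def reduce_counts(pairs):
    """Shuffle and reduce steps: group pairs by keyword and sum the counts."""
    totals = defaultdict(int)
    for keyword, count in pairs:
        totals[keyword] += count
    return dict(totals)

def prioritized_batches(weights, batch_size):
    """Yield batches of (keyword, weight), highest weight first.

    Ties are broken alphabetically (an assumption, to keep the order deterministic).
    """
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    for start in range(0, len(ranked), batch_size):
        yield ranked[start:start + batch_size]

# Hypothetical structured records, for illustration only.
docs = ["order order shipment", "order shipment refund"]
pairs = (pair for doc in docs for pair in map_keywords(doc))
weights = reduce_counts(pairs)
batches = list(prioritized_batches(weights, batch_size=2))
```

Under these assumptions the first batch holds the two most frequent keywords, and a second batch (as in claim 7) is drawn from the remaining keywords only if too few unstructured documents were removed in the first pass.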
Document management systems · CPC title
Query processing · CPC title
Query execution (filtering based on additional data G06F16/335) · CPC title
using ranking · CPC title