System and a method for associating contextual structured data with unstructured documents on map-reduce

US10915537B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10915537-B2
Application number	US-201615229485-A
Country	US
Kind code	B2
Filing date	Aug 5, 2016
Priority date	Aug 27, 2015
Publication date	Feb 9, 2021
Grant date	Feb 9, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In an approach for integrating documents a processor extracts a first set of keywords from at least one structured document. A processor generates a first batch of keywords from the first set of keywords, wherein each keyword in the first batch of keywords includes a weight. A processor extracts a second set of keywords from at least one unstructured document. A processor compares the first batch of keywords to the second set of keywords. A processor determines that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the first batch of keywords to the second set of keywords. A processor removes the at least one unstructured document from a list of unstructured documents which are to be processed.
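The flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the tokenizer, the scoring rule (fraction of batch weight found in the document), the threshold of 0.5, the batch size, and all function names are assumptions chosen for illustration.

```python
import re
from collections import Counter


def extract_keywords(text):
    """Lowercase alphanumeric tokens; a stand-in for real keyword extraction."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_batch(structured_text, batch_size):
    """Top `batch_size` keywords, each weighted by how often it appears."""
    counts = Counter(extract_keywords(structured_text))
    return dict(counts.most_common(batch_size))


def match_score(batch, unstructured_text):
    """Assumed scoring rule: share of the batch's total weight found in the document."""
    found = set(extract_keywords(unstructured_text))
    total = sum(batch.values())
    if total == 0:
        return 0.0
    return sum(weight for kw, weight in batch.items() if kw in found) / total


def filter_pending(structured_text, pending, threshold=0.5, batch_size=3):
    """Split pending documents into matched ones and ones still to be processed."""
    batch = weighted_batch(structured_text, batch_size)
    matched = [name for name, text in pending.items()
               if match_score(batch, text) >= threshold]
    remaining = [name for name in pending if name not in matched]
    return matched, remaining


# Hypothetical sample data.
structured = "invoice 1001 acme invoice 1002 acme invoice 1003 globex"
pending = {
    "email_1": "Please find the Acme invoice attached.",
    "memo_2": "Lunch is at noon on Friday.",
}
matched, remaining = filter_pending(structured, pending)
print(matched, remaining)
```

Matched documents are dropped from the pending list, mirroring the final step of the abstract; anything below the threshold stays queued for a later pass.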

First claim

Opening claim text (preview).

What is claimed is:

1. A method for integrating documents, the method comprising: extracting, by one or more processors, and via a first MapReduce script, a first set of keywords from at least one structured document; generating, by one or more processors, a first batch of keywords from the first set of keywords, wherein each given keyword in the first batch of keywords is assigned a respectively corresponding weight, wherein each respectively corresponding weight is based on a number of times the given keyword appears in the structured data; extracting, by one or more processors, and via a second MapReduce script, a second set of keywords from at least one unstructured document; comparing, by one or more processors, the first batch of keywords to the second set of keywords; determining, by one or more processors, that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the first batch of keywords to the second set of keywords, to produce an output dataset; and storing the output dataset in a storage repository in a storage location that is selected based on a keyword in the unstructured document and the respectively corresponding weight assigned to each keyword in the structured data; wherein the at least one structured document, the at least one unstructured document, and the output dataset, are stored in a MapReduce distributed storage system.

2. The method of claim 1, wherein the respectively corresponding weight of each keyword in the first batch of keywords indicates a frequency of appearance in the at least one structured document.

3. The method of claim 1, wherein generating the first batch of keywords from the first set of keywords further comprises: prioritizing, by one or more processors, the first batch of keywords based on the respectively corresponding weight of each keyword in the first batch of keywords.

4. The method of claim 3, wherein comparing the first batch of keywords to the second set of keywords comprises: comparing, by one or more processors, the first batch of keywords to the second set of keywords based on the respectively corresponding weight associated with each keyword in the first batch of keywords.

5. The method of claim 1, wherein extracting the second set of keywords from at least one unstructured document comprises: extracting, by one or more processors, the second set of keywords from the at least one unstructured document based on a presence of at least one keyword of the first batch of keywords in the at least one unstructured document.

6. The method of claim 1, further comprising: processing, by one or more processors, the list of unstructured documents until a minimum number of keywords from the first set of keywords have been processed.

7. The method of claim 1, further comprising: determining, by one or more processors, that a minimum number of the at least one unstructured document from the list of unstructured documents have not been removed; generating, by one or more processors, a second batch of keywords from the first set of keywords, wherein each keyword in the second batch of keywords includes a respectively corresponding weight; extracting, by one or more processors, a third set of keywords from at least one unstructured document; comparing, by one or more processors, the third batch of keywords to the third set of keywords; and determining, by one or more processors, that the at least one unstructured document matches, based on a predetermined threshold, the at least one structured document, based on the comparison of the second batch of keywords to the third set of keywords.
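Claim 1 refers to MapReduce scripts, frequency-based weights, and a storage location chosen from a keyword and its weight. The sketch below imitates those phases in plain Python, purely as an illustration. It does not use a real MapReduce framework, and the function names, the `/output` root, and the rule of picking the heaviest keyword present in the document are all hypothetical.

```python
import re
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (keyword, 1) pair for every token in one structured record."""
    return [(token, 1) for token in re.findall(r"[a-z0-9]+", record.lower())]


def shuffle(pairs):
    """Shuffle: group the emitted values by keyword."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts per keyword to obtain its weight."""
    return {key: sum(values) for key, values in groups.items()}


def keyword_weights(records):
    """Run map, shuffle, and reduce over all structured records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return reduce_phase(shuffle(pairs))


def storage_location(doc_text, weights, root="/output"):
    """Hypothetical rule: file the document under its heaviest weighted keyword."""
    tokens = set(re.findall(r"[a-z0-9]+", doc_text.lower()))
    present = [t for t in tokens if t in weights]
    if not present:
        return f"{root}/unmatched"
    # Ties on weight are broken alphabetically so the result is deterministic.
    best = max(present, key=lambda k: (weights[k], k))
    return f"{root}/{best}"


# Hypothetical structured rows and one unstructured document.
rows = ["acme invoice", "acme payment", "globex invoice", "acme invoice"]
weights = keyword_weights(rows)
print(weights)
print(storage_location("Reminder: the Globex invoice is overdue", weights))
```

In a real deployment the map and reduce functions would run as distributed jobs over the shared storage system; here they are ordinary function calls so the data flow is easy to follow.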

Assignees

IBM

Inventors

Classifications

G06F16/93 (Primary)
Document management systems · CPC title
G06F16/3331
Query processing · CPC title
G06F16/334
Query execution (filtering based on additional data G06F16/335) · CPC title
G06F16/24578 (Primary)
using ranking · CPC title

Patent family

Related publications grouped by family.

View patent family 58096605

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10915537B2 cover?: In an approach for integrating documents a processor extracts a first set of keywords from at least one structured document. A processor generates a first batch of keywords from the first set of keywords, wherein each keyword in the first batch of keywords includes a weight. A processor extracts a second set of keywords from at least one unstructured document. A processor compares the first bat…
Who is the assignee on this patent?: IBM
What technology area does this patent fall under?: Primary CPC classification G06F16/93. Mapped technology areas include Physics.
When was this patent published?: Publication date Feb 9, 2021 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Apparatus and method for searching information based on wikipedia's contents

System and method for extracting facts from unstructured text

Search engine for information retrieval system