Targeted disambiguation of named entities
US-9594831-B2 · Mar 14, 2017 · US
US10346439B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10346439-B2 |
| Application number | US-201514635709-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 2, 2015 |
| Priority date | Mar 6, 2014 |
| Publication date | Jul 9, 2019 |
| Grant date | Jul 9, 2019 |
Official abstract text for this publication.
The present subject matter relates to entity resolution, and in particular, relates to providing an entity resolution from documents. The method comprises obtaining a plurality of documents corresponding to a plurality of entities, from at least one data source. Upon receiving the plurality of documents, the plurality of documents is blocked into at least one bucket based on textual similarity. Further, a graph including a plurality of record vertices and at least one bucket vertex is created. The plurality of record vertices and the at least one bucket vertex are indicative of the plurality of documents and the at least one bucket, respectively. Subsequently, a notification is provided to a user for selecting one of a Bucket-Centric Parallelization (BCP) technique and a Record-Centric Parallelization (RCP) technique for resolving entities from the plurality of documents. Based on the selection, a resolved entity-document for each entity is created.
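The blocking and graph-construction steps in the abstract can be sketched as follows. This is a minimal single-process illustration, not the patented implementation. The word-token blocking function, the names `blocking_keys`, `build_graph`, and `choose_technique`, and the skew threshold are all assumptions made for the example. The source does not specify how textual similarity is computed or how uniformity of the blocking is measured. Singleton buckets are dropped and buckets hold document IDs only, as the claim language describes.

```python
from collections import defaultdict


def blocking_keys(text):
    """Hypothetical blocking function: one bucket key per lowercase word token."""
    return set(text.lower().split())


def build_graph(documents):
    """Block documents into buckets and build the record/bucket adjacency lists.

    documents: dict mapping a unique document ID to its text.
    Returns (record_adj, bucket_adj), both dicts of sets.
    """
    bucket_adj = defaultdict(set)
    for doc_id, text in documents.items():
        for key in blocking_keys(text):
            bucket_adj[key].add(doc_id)  # buckets store IDs, not documents

    # Discard singleton buckets: they cannot produce any comparison.
    bucket_adj = {key: ids for key, ids in bucket_adj.items() if len(ids) > 1}

    # Edges are bi-directional, so mirror them on the record side.
    record_adj = {doc_id: set() for doc_id in documents}
    for key, ids in bucket_adj.items():
        for doc_id in ids:
            record_adj[doc_id].add(key)
    return record_adj, bucket_adj


def choose_technique(bucket_adj, max_skew=3.0):
    """Assumed heuristic: pick RCP when the largest bucket is far above the mean size."""
    sizes = [len(ids) for ids in bucket_adj.values()]
    if not sizes:
        return "BCP"
    skew = max(sizes) / (sum(sizes) / len(sizes))
    return "RCP" if skew > max_skew else "BCP"


docs = {1: "John Smith Boston", 2: "J Smith Boston", 3: "Mary Jones Austin"}
record_adj, bucket_adj = build_graph(docs)
technique = choose_technique(bucket_adj)
```

In this toy input only the `smith` and `boston` buckets survive, document 3 ends up with an empty adjacency list, and the evenly sized buckets lead the assumed heuristic to pick BCP.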
Opening claim text (preview).
We claim:

1. A method for resolving entities from a plurality of documents, the method comprising:

obtaining, by a processor, the plurality of documents, corresponding to a plurality of entities, from at least one data source, and assigning a unique identification (ID) to each of the plurality of documents;

blocking, by the processor, the plurality of documents into a plurality of buckets based on textual similarity by providing the unique IDs of the plurality of documents to the plurality of buckets instead of blocking the plurality of documents themselves;

discarding one or more singleton buckets having only one document;

creating, by the processor, a graph including a plurality of record vertices and a plurality of bucket vertices, wherein the plurality of record vertices and the plurality of bucket vertices are indicative of the plurality of documents and the plurality of buckets, respectively, wherein each of the plurality of documents and the plurality of buckets are indicated as a vertex in the graph, and the plurality of record vertices and the plurality of bucket vertices are connected to each other by edges, depending on the blocking of the plurality of documents, wherein each of the edges between the record vertices and the bucket vertices are bi-directional;

creating an adjacency list for each record vertex and each bucket vertex, wherein the adjacency list of the record vertex includes information of bucket vertices to which the record vertex hashed to, and an adjacency list of the bucket vertex includes information of record vertices hashed to the bucket vertex,

selecting one of a Bucket-Centric Parallelization (BCP) technique and a Record-Centric Parallelization (RCP) technique for resolving entities from the plurality of documents based on the blocking of the plurality of documents into the plurality of buckets, wherein the Bucket-Centric Parallelization (BCP) technique is selected when the blocking of the plurality of documents into the plurality of buckets is uniform and the Record-Centric Parallelization (RCP) technique is selected when the blocking of the plurality of documents into the plurality of buckets is non-uniform, wherein the RCP technique utilizes less time than the BCP technique for entity resolution in a case of a non-uniform distribution of the plurality of documents in the plurality of buckets, wherein in the BCP and RCP techniques, the record vertices and the bucket vertices are communicating with each other in a distributed computing setting via message passing, and the bucket vertices and the record vertices are distributed across multiple processors, and

wherein the BCP technique comprises:

providing, by the processor, a value of each record vertex to one or more bucket vertices based on the adjacency list of a record vertex, wherein the adjacency list of the record vertex is indicative of a list of bucket vertices the record vertex is blocked to, and the value includes a document content corresponding to each record vertex;

receiving the document content of the record vertices hashed to each bucket vertex at each bucket vertex and creating, by the processor, at each bucket vertex, a merged document for each entity based on an Iterative Match-Merge (IMM) technique, wherein at each bucket vertex, from the plurality of documents available at each bucket vertex, at least one matching pair of documents is identified and the at least one matching pair of documents is merged to create the merged document for each entity termed as a ‘partial entity’ at each bucket vertex, wherein a set of partial entities are created at each bucket vertex;

obtaining a plurality of partial entities from the sets of partial entities, belonging to the same entity from the plurality of bucket vertices, wherein the plurality of partial entities belonging to the same entity share at least one record vertex and thereby the plurality of partial entities are connected to each other, wherein one or more connected record vertices are identified by, selecting, for each partial entity, one of the record vertices as a central record vertex, creating a bi-directional edge between the central record vertex and each of the remaining record vertices of the partial entity, thereby connecting the record vertices involved in each of the partial entity to each other through the central record vertex; and identifying the one or more connected record vertices, wherein the record vertices belonging to two or more partial entities are connected and considered to be belonging to the same entity;

providing a connected component ID (CCID) to each of the connected record vertices, wherein the CCID is indicative of the entity to which the record vertex is resolved; and

generating, by the processor, a resolved entity-document for each entity by consolidating the merged documents corresponding to the connected record vertices pertaining to each entity from each bucket; and

wherein the RCP technique comprises handling the non-uniform distribution of the records at the plurality of buckets by performing the iterative match merge computation for the records mapped to the same bucket back to the record vertices themselves to achieve parallelization of load of IMM computations of the records vertices, wherein the RCP technique comprises:

a) providing, from each bucket vertex, by the processor, a comparison message to each of the plurality of record vertices hashed to the corresponding bucket vertex to schedule comparisons among the plurality of documents corresponding to the record vertices, wherein the comparison message sent to a record vertex includes the IDs of the documents to be compared with a document corresponding to the record vertex, wherein each record vertex becomes active after receiving the comparison message;

b) sending, by the processor, a value of the record vertex to the record vertices whose IDs are received by the record vertex in the comparison message, wherein the value includes the document of the record vertex;

c) delivering, by the processor, a match message to each of a pair of record vertices based on matching of a pair of documents corresponding to the pair of record vertices, wherein the match message includes the IDs of each of the pair of record vertices;

d) consolidating, by the processor, at each record vertex, the IDs of the record vertices received as one or more match messages to create a match set, wherein the match set is indicative of a set including IDs of record vertices belonging to the same entity and sending one or more match sets to the corresponding bucket vertices;

e) upon receiving the one or more match sets of connected record vertices, combining, by the processor, at each bucket vertex, the one or more match sets received from the record vertices blocked in the bucket vertex by consolidating IDs of the match sets to create a new consolidated set, wherein the consolidated sets are created until all of the match sets are disjoint;

f) creating a record vertex for each disjoint consolidated set referred to as partial entity vertices, and creating bi-directional edges between the partial entity vertices and the corresponding buckets vertices and providing a partial-entity ID message to each of the record vertex the partial entity vertex is connected to;

g) upon receiving the partial-entity ID message including the ID of the partial entity vertex, providing the value and the record adjacency list of the record vertex as a message to the partial entity vertex; and upon receiving the values of the connected record vertices, merging the received values to create the value of the partial entity vertex; and creating bi-directional edges between the partial entity vertex and each of the corresponding bucket vertices and deleting the corresponding record vertices;

iterating the steps ‘d’ to ‘g’ until no match messages generated by treating the partial entity vertices as new r
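The BCP branch of the claim (Iterative Match-Merge inside each bucket, then linking partial entities through a central record vertex and assigning connected component IDs) can be illustrated with a rough sequential sketch. It is not the claimed distributed, message-passing implementation. Documents are modeled as token sets, `match` and `merge` are hypothetical stand-ins for whatever comparison and merge functions a real system would use, and a union-find structure replaces the explicit central-vertex edges while computing the same connected components.

```python
def iterative_match_merge(records, match, merge):
    """Merge matching pairs until no pair in the pool matches.

    records: list of (frozenset_of_ids, value) tuples.
    Returns the pool of partial entities for one bucket.
    """
    pool = list(records)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                if match(pool[i][1], pool[j][1]):
                    ids = pool[i][0] | pool[j][0]
                    value = merge(pool[i][1], pool[j][1])
                    pool = [p for k, p in enumerate(pool) if k not in (i, j)]
                    pool.append((ids, value))
                    merged_any = True
                    break
            if merged_any:
                break
    return pool


def resolve_bcp(documents, bucket_adj, match, merge):
    """Return a dict mapping each document ID to its connected component ID (CCID)."""
    parent = {doc_id: doc_id for doc_id in documents}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest ID becomes the CCID

    for ids in bucket_adj.values():
        records = [(frozenset([i]), documents[i]) for i in sorted(ids)]
        for partial_ids, _ in iterative_match_merge(records, match, merge):
            central = min(partial_ids)  # assumed choice of central record vertex
            for other in partial_ids:
                union(central, other)
    return {doc_id: find(doc_id) for doc_id in documents}


# Hypothetical match/merge: two token sets match if they share at least two tokens.
match = lambda a, b: len(a & b) >= 2
merge = lambda a, b: a | b

docs = {
    1: {"john", "smith", "boston"},
    2: {"j", "smith", "boston"},
    3: {"j", "smith", "ma"},
    4: {"mary", "jones"},
}
buckets = {"smith": {1, 2, 3}, "boston": {1, 2}, "j": {2, 3}}
ccid = resolve_bcp(docs, buckets, match, merge)
```

Here documents 1, 2, and 3 resolve to the same CCID because the merged value of 1 and 2 shares two tokens with 3, while document 4 stays in its own component. Consolidating the merged documents per CCID into a resolved entity-document would be a final grouping step on top of this result.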
Indexing; Web crawling techniques · CPC title
Named entity recognition · CPC title
Clustering or classification · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title
Synchronous replication · CPC title