Systems and methods for automatic clustering and canonical designation of related data in various data structures

US10127289B2 · US · B2

Patent metadata
Publication number: US-10127289-B2
Application number: US-201615233149-A
Country: US
Kind code: B2
Filing date: Aug 10, 2016
Priority date: Aug 19, 2015
Publication date: Nov 13, 2018
Grant date: Nov 13, 2018

Abstract

Official abstract text for this publication.

Computer implemented systems and methods are disclosed for automatically clustering and canonically identifying related data in various data structures. Data structures may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying clusters of records associated with a respective entity by grouping the records into pairs, analyzing the respective pairs to determine a probability that both members of the pair relate to a common entity, and identifying a cluster of overlapping pairs to generate a collection of records relating to a common entity. Clusters may further be analyzed to determine canonical names or other properties for the respective entities by analyzing record fields and identifying similarities.
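The pipeline the abstract outlines (pair the records, score each pair, then merge overlapping pairs into clusters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `candidate_pairs`, `match_score`, and `cluster_records` are hypothetical, the grouping step uses an exact match on a single field, the score is a plain string-similarity average standing in for a match probability, and clusters are formed with a union-find over pairs whose score exceeds a threshold.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(records, block_field):
    """Group records by one field, then yield index pairs within each group.

    Assumption: exact equality on `block_field` defines a group.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec.get(block_field)].append(idx)
    for members in groups.values():
        yield from combinations(members, 2)


def match_score(a, b):
    """Hypothetical stand-in for a match probability in [0, 1]:
    the mean string similarity over the fields both records share."""
    fields = sorted(set(a) & set(b))
    if not fields:
        return 0.0
    sims = [
        SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(sims) / len(sims)


def cluster_records(records, block_field, threshold=0.8):
    """Merge pairs scoring above `threshold` into clusters of record indices."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in candidate_pairs(records, block_field):
        if match_score(records[i], records[j]) > threshold:
            parent[find(i)] = find(j)  # pairs sharing a record end up together

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    # Only groups built from at least one matching pair count as clusters.
    return [members for members in clusters.values() if len(members) > 1]
```

Called as `cluster_records(records, "city")` on a list of dictionaries, the function returns lists of record indices believed to describe the same entity. Union-find is one convenient way to realize "overlapping pairs"; a graph library's connected-components routine would serve equally well.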

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a data store configured to store computer-executable instructions and a plurality of records, wherein each record of the plurality of records is associated with a respective entity and comprises one or more fields; a computing device including a processor in communication with the data store, the processor configured to execute the computer-executable instructions to at least: identify, based at least in part on a first field of the one or more fields, a first group of the plurality of records; determine that a distribution of sizes of groups including the first group satisfies a distribution rule; generate one or more record pairs from the first group, each of the one or more record pairs comprising a respective first record and second record, wherein at least one field of the first record differs from a corresponding field in the second record; determine, for each of the one or more record pairs, a respective match score, the respective match scores comprising probabilities that the respective first record and second record of the respective record pair are associated with a respective same entity; identify a plurality of clusters of record pairs, wherein each pair in each cluster has a record in common with at least one other pair in the cluster, and wherein each pair in each cluster has a respective match score above a threshold; determine, for each of the plurality of clusters, that a diameter of the cluster satisfies a diameter criterion; determine, for each of the plurality of clusters, that an entropy of the cluster satisfies an entropy criterion; determine, based at least in part on the distribution of sizes of groups, the respective match scores, the diameter criterion, and the entropy criterion, that each of the plurality of clusters corresponds to a respective entity; determine, for each of the plurality of clusters, a geographical location associated with the cluster, the geographic location corresponding to the respective entity; generate, based at least in part on the geographical location associated with each cluster and a number of record pairs in each cluster, a heat map for display on a client computing device, wherein the heat map enables identification of suitable locations for providing coverage of the geographical locations associated with the clusters, wherein the heat map overlays information regarding the number of record pairs in each cluster on the geographic location associated with the cluster, and wherein the heat map displays information regarding the at least one field of individual records in each cluster as a color, symbol, shading, or other representation; and cause the client computing device to display the heat map.

2. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to at least: determine, based at least in part on a first pair in a first cluster of the plurality of clusters of record pairs, a first candidate name to associate with the cluster; determine, based at least in part on a second pair in the first cluster, a second candidate name based to associate with the cluster; and determine a name to associate with the first cluster based at least in part on the first candidate name and the second candidate name.

3. The system of claim 2, wherein determining the first candidate name is based at least in part on a first field of the first record and a corresponding second field of the second record.

4. The system of claim 3, wherein determining the first candidate name comprises identifying a longest common substring of the first field and the second field.

5. The system of claim 3, wherein determining the first candidate name is based at least in part on calculating a Levenshtein distance between a first field of the first record and a corresponding second field of the second record.

6. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to identify the first group of the plurality of records by at least: accessing a first record, a second record, and a third record of the plurality of records; accessing a blocking model including information indicative of at least a first field and a second field to be compared between candidate pairs of records; comparing a value of the first field of the first record with a value of the first field of the second record to determine first matching fields; comparing a value of the second field of the first record with a value of the second field of the second record to determine second matching fields; in response to determining the first matching fields and the second matching fields, grouping the first record and the second record into the first group; comparing the value of the first field of the second record with a value of the first field of the third record to determine third matching fields; comparing the value of the second field of the second record with a value of the second field of the third record to determine fourth matching fields; and in response to determining the third matching fields and the fourth matching fields, adding the third record to the first group.

7. The system of claim 6, wherein determining at least one of the first, second, third, or fourth matching fields is based at least in part on a soft or fuzzy match.

8. The system of claim 6, wherein determining at least one of the first, second, third, or fourth matching fields is based at least in part on a weighting.

9. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to identify the first group of the plurality of records by at least: accessing a first record, a second record, and a third record of the plurality of records; accessing a blocking model including information indicative of at least a first field to be compared between candidate pairs of records and a second field to be compared between candidate pairs of records; comparing a value of the first field of the first record with a value of the first field of the second record to determine first matching fields; in response to determining the first matching fields, grouping the first record and the second record into the first group; comparing a value of the first field of the second record with a value of the first field of the third record to determine that the fields do not match; comparing the value of the second field of the second record with a value of the second field of the third record to determine second matching fields; in response to determining the second matching fields, adding the third record to the first group.

10. The system of claim 1, wherein the processor is further configured to execute the computer-executable instructions to at least: validate the first group of the plurality of records by at least one of: determining that a size of the first group satisfies a threshold, or determining that a distribution of sizes of groups including the first group satisfies a distribution rule.

11. A method comprising: obtaining a first plurality of records, wherein each record of the first plurality of records is associated with a respective entity and comprises a first one or more fields; obtaining a second plurality of records, wherein each record of the second plurality of records is associated with a respective entity and comprises a second one or more fields, and wherein each record of the second plurality of records is associated with a different entity; identifying, based at least in part on a first field of the first one or more fields, a first subset of the first plurality of records
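Claims 2 through 5 describe deriving a candidate name from each record pair in a cluster and then settling on one name for the cluster. The sketch below illustrates one way this could look, under assumptions that go beyond the claim text: the helper names are hypothetical, each candidate is the longest common substring of the two name fields (as in claim 4), and the final name is the candidate with the smallest total Levenshtein distance to the other candidates. The claims reference Levenshtein distance (claim 5) but do not say how candidates are combined, so that selection rule is purely illustrative.

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def canonical_name(name_pairs):
    """Pick one name for a cluster from the name fields of its record pairs.

    Assumption: the chosen name is the candidate closest, in total edit
    distance, to all other candidates; ties go to the earliest candidate.
    """
    candidates = [longest_common_substring(a, b).strip() for a, b in name_pairs]
    if not candidates:
        raise ValueError("cluster has no record pairs")
    return min(
        candidates,
        key=lambda c: sum(levenshtein(c, other) for other in candidates),
    )
```

For example, the pairs `("Acme Corp", "Acme Corp.")`, `("Acme Corp.", "Acme Corp. Inc")`, and `("Acme Corp", "Acme Co")` produce the candidates "Acme Corp", "Acme Corp.", and "Acme Co", from which this rule selects "Acme Corp".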

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10127289B2 cover?
Computer implemented systems and methods are disclosed for automatically clustering and canonically identifying related data in various data structures. Data structures may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying clusters of records associated with a resp…
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/24578. Mapped technology areas include Physics.
When was this patent published?
Publication date: Nov 13, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).