Systems and methods for automatic clustering and canonical designation of related data in various data structures

US2023297582A1 · US · A1

Patent metadata
Publication number: US-2023297582-A1
Application number: US-202318325616-A
Country: US
Kind code: A1
Filing date: May 30, 2023
Priority date: Aug 19, 2015
Publication date: Sep 21, 2023
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Computer implemented systems and methods are disclosed for automatically clustering and canonically identifying related data in various data structures. Data structures may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying clusters of records associated with a respective entity by grouping the records into pairs, analyzing the respective pairs to determine a probability that both members of the pair relate to a common entity, and identifying a cluster of overlapping pairs to generate a collection of records relating to a common entity. Clusters may further be analyzed to determine canonical names or other properties for the respective entities by analyzing record fields and identifying similarities.
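The abstract describes a pipeline: pair the records, score each pair, merge overlapping pairs into clusters, then pick a canonical name. A minimal Python sketch of that flow follows. It is not the patent's implementation: `match_probability` uses plain string similarity as a stand-in for a trained model, and the sample records, the 0.8 threshold, the union-find clustering, and the majority-vote rule in `canonical_name` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; each has an id and a free-text name field.
RECORDS = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "Acme Corp"},
    {"id": 3, "name": "ACME Corp."},
    {"id": 4, "name": "Globex LLC"},
    {"id": 5, "name": "Globex LLC"},
]


def match_probability(a, b):
    """Stand-in for a trained model: case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def cluster_records(records, threshold=0.8):
    """Link pairs scoring at or above threshold; return connected components."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Overlapping pairs (pairs sharing a record) end up in the same component.
    for a, b in combinations(records, 2):
        if match_probability(a, b) >= threshold:
            parent[find(a["id"])] = find(b["id"])

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r["id"])].append(r)
    return list(clusters.values())


def canonical_name(cluster):
    """Assumed rule: the most frequent name value in the cluster."""
    return Counter(r["name"] for r in cluster).most_common(1)[0][0]


for cluster in cluster_records(RECORDS):
    print(canonical_name(cluster), [r["id"] for r in cluster])
```

Scoring every pair is quadratic in the number of records, so a real system would likely restrict which pairs are generated; the sketch skips that step for brevity.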

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: generating a plurality of record pairs, wherein each record pair in the plurality of record pairs comprises a respective first record from a first plurality of records and a respective second record from a second plurality of records; applying a machine learning model to determine respective probabilities, for each of the plurality of record pairs, that the respective first record and second record of the respective record pairs are associated with a respective same entity; causing a client computing device to present any indeterminate record pairs to a user, wherein indeterminate record pairs are identified based at least in part on the respective determined probabilities for individual record pairs of the plurality of record pairs being below a pre-established threshold; receiving, from the client computing device, user feedback indicating whether the first and second record of an indeterminate record pair are associated with the same entity; retraining the machine learning model and revising the probability of the indeterminate record pair based at least in part on the user feedback; determining, based at least in part on the probabilities, respective entities associated with one or more clusters of record pairs; and outputting the clusters of record pairs and the respective entities associated with each cluster to the client computing device.

2. The computer-implemented method of claim 1 further comprising: identifying, for each cluster of record pairs, respective geographical locations corresponding to the clusters based at least in part on the respective probabilities.

3. The computer-implemented method of claim 2 further comprising: causing the client computing device to display a heat map including, for individual clusters of record pairs, information regarding a size of the cluster at the geographical location corresponding to the cluster.

4. The computer-implemented method of claim 1 further comprising: determining, for the record that is included in each record pair of a first cluster of record pairs, a canonical value for at least one field based at least in part on the probabilities of the record pairs in the first cluster.

5. The computer-implemented method of claim 1 further comprising: filtering the record pairs in a first cluster of record pairs, and wherein the entity associated with the first cluster of record pairs is determined based at least in part on the filtered record pairs.

6. The computer-implemented method of claim 1 further comprising: pruning the record pairs in each cluster of record pairs to produce a bipartite graph.

7. The computer-implemented method of claim 6, wherein the record pairs are pruned based at least in part on the probabilities.

8. A system comprising: one or more processors configured to execute computer-executable instructions to at least: generate a plurality of record pairs, wherein each record pair in the plurality of record pairs comprises a respective first record from a first plurality of records and a respective second record from a second plurality of records; apply a machine learning model to determine respective probabilities, for each of the plurality of record pairs, that the respective first record and second record of the respective record pairs are associated with a respective same entity; cause a client computing device to present any indeterminate record pairs to a user, wherein indeterminate record pairs are identified based at least in part on the respective determined probabilities for individual record pairs of the plurality of record pairs being below a pre-established threshold; receive, from the client computing device, user feedback indicating whether the first and second record of an indeterminate record pair are associated with the same entity; retrain the machine learning model and revising the probability of the indeterminate record pair based at least in part on the user feedback; determine, based at least in part on the probabilities, respective entities associated with one or more clusters of record pairs; and output the clusters of record pairs and the respective entities associated with each cluster to the client computing device.

9. The system of claim 8, wherein the one or more processors are configured to execute computer-executable instructions to further at least: identify, for each cluster of record pairs, respective geographical locations corresponding to the clusters based at least in part on the respective probabilities.

10. The system of claim 9, wherein the one or more processors are configured to execute computer-executable instructions to further at least: cause the client computing device to display a heat map including, for individual clusters of record pairs, information regarding a size of the cluster at the geographical location corresponding to the cluster.

11. The system of claim 8, wherein the one or more processors are configured to execute computer-executable instructions to further at least: determine, for the record that is included in each record pair of a first cluster of record pairs, a canonical value for at least one field based at least in part on the probabilities of the record pairs in the first cluster.

12. The system of claim 8, wherein the one or more processors are configured to execute computer-executable instructions to further at least: filter the record pairs in a first cluster of record pairs, and wherein the entity associated with the first cluster of record pairs is determined based at least in part on the filtered record pairs.

13. The system of claim 8, wherein the one or more processors are configured to execute computer-executable instructions to further at least: prune the record pairs in each cluster of record pairs to produce a bipartite graph.

14. The system of claim 13, wherein the record pairs are pruned based at least in part on the probabilities.

15. A non-transitory computer-readable storage medium including computer-executable instructions that, when executed by one or more processors, cause the one or more processors to: generate a plurality of record pairs, wherein each record pair in the plurality of record pairs comprises a respective first record from a first plurality of records and a respective second record from a second plurality of records; apply a machine learning model to determine respective probabilities, for each of the plurality of record pairs, that the respective first record and second record of the respective record pairs are associated with a respective same entity; cause a client computing device to present any indeterminate record pairs to a user, wherein indeterminate record pairs are identified based at least in part on the respective determined probabilities for individual record pairs of the plurality of record pairs being below a pre-established threshold; receive, from the client computing device, user feedback indicating whether the first and second record of an indeterminate record pair are associated with the same entity; retrain the machine learning model and revising the probability of the indeterminate record pair based at least in part on the user feedback; determine, based at least in part on the probabilities, respective entities associated with one or more clusters of record pairs; and output the clusters of record pairs and the respective entities associated with each cluster to the client computing device.

16. The non-transitory computer-readable sto
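The feedback loop recited in the independent claims (score cross-set record pairs, surface the indeterminate ones to a user, retrain on the answers, revise the probabilities) can be sketched roughly as follows. Everything specific here is an assumption for illustration, not taken from the patent: the token-overlap feature, the toy kernel-weighted `KernelMatcher` model, the `REJECT` and `ACCEPT` thresholds (the claims mention only a single pre-established threshold), the sample names, and the simulated user answer.

```python
import math
from itertools import product


def token_jaccard(a, b):
    """Assumed feature: Jaccard overlap of lower-cased word tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


class KernelMatcher:
    """Toy instance-based model: kernel-weighted average of labelled examples."""

    def __init__(self, examples, bandwidth=0.25):
        self.examples = list(examples)  # (feature, label); label 1 = same entity
        self.bandwidth = bandwidth

    def predict(self, x):
        weights = [math.exp(-((x - f) / self.bandwidth) ** 2)
                   for f, _ in self.examples]
        weighted_labels = sum(w * y for w, (_, y) in zip(weights, self.examples))
        return weighted_labels / sum(weights)

    def retrain(self, new_examples):
        # For an instance-based model, retraining means adding labelled examples.
        self.examples.extend(new_examples)


REJECT, ACCEPT = 0.1, 0.9  # hypothetical thresholds bounding the review band


def score_pairs(model, first, second):
    """Score every (first record, second record) pair across the two sets."""
    return {(a, b): model.predict(token_jaccard(a, b))
            for a, b in product(first, second)}


def indeterminate(scores):
    """Pairs the model is not confident about, to be shown to a reviewer."""
    return [pair for pair, p in scores.items() if REJECT < p < ACCEPT]


first = ["Acme Corp", "Globex LLC"]
second = ["Acme", "Globex LLC", "Initech"]

model = KernelMatcher([(1.0, 1), (0.0, 0)])  # seed labels: exact match, no overlap
scores = score_pairs(model, first, second)
queue = indeterminate(scores)

# Simulated user feedback: the reviewer confirms each queued pair is one entity.
feedback = {pair: 1 for pair in queue}
model.retrain([(token_jaccard(a, b), label) for (a, b), label in feedback.items()])

revised = score_pairs(model, first, second)
matched = [pair for pair, p in revised.items() if p >= ACCEPT]
print(queue, matched)
```

With these inputs only the pair ("Acme Corp", "Acme") lands in the review band before feedback; once the confirmed label is added, its revised probability rises above `ACCEPT` and the pair joins the matched set. A production system would presumably use a richer feature set and a properly trained classifier in place of this toy model.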

Assignees

Palantir Technologies Inc

Inventors

Classifications

  • G06F16/35 · Primary

    Clustering; Classification · CPC title

  • using ranking · CPC title

  • Search customisation based on user profiles and personalisation · CPC title

  • Clustering or classification · CPC title

  • Clustering techniques · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023297582A1 cover?
Computer implemented systems and methods are disclosed for automatically clustering and canonically identifying related data in various data structures. Data structures may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying clusters of records associated with a resp…
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/35. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 21, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).