Systems and methods for resolving entity data across various data structures

US11061874B1 · US · B1

Patent metadata
Publication number: US-11061874-B1
Application number: US-201815955475-A
Country: US
Kind code: B1
Filing date: Apr 17, 2018
Priority date: Dec 14, 2017
Publication date: Jul 13, 2021
Grant date: Jul 13, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Computer implemented systems and methods resolve data entries across multiple lists. The lists may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying a direct field match between two lists, determining updated lists based on the remaining data entries, executing a comparison of the remaining data entries, determining a scoring metric based on the comparison, and determining whether the scoring metric exceeds a threshold. The systems and methods further comprise generating a data distribution curve based on the matched and unmatched data records and adjusting the threshold based on the data distribution curve for the next iteration of comparisons executed on the remaining unresolved entities.
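The flow in the abstract can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names (`direct_match`, `similarity`, `iterative_fuzzy_match`), the use of `difflib` as the scoring metric, the starting threshold, the floor, and the "mean plus one standard deviation" adjustment rule are all hypothetical choices made here. Summary statistics of the unresolved scores stand in for the data distribution curve.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def direct_match(first, second, key):
    """Pair records whose `key` field is identical; return pairs and leftovers."""
    index = {rec[key]: j for j, rec in enumerate(second) if rec.get(key)}
    pairs, used, rest_first = [], set(), []
    for rec in first:
        j = index.get(rec.get(key))
        if j is not None and j not in used:
            pairs.append((rec, second[j]))
            used.add(j)
        else:
            rest_first.append(rec)
    # "Updated lists": everything that was not resolved by the direct match.
    rest_second = [rec for j, rec in enumerate(second) if j not in used]
    return pairs, rest_first, rest_second


def similarity(a, b):
    """Stand-in scoring metric in [0, 1] (difflib ratio, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def iterative_fuzzy_match(first, second, field,
                          threshold=0.9, floor=0.6, max_rounds=5):
    """Match leftovers in rounds, lowering the threshold between rounds."""
    matches = []
    left = list(range(len(first)))
    right = set(range(len(second)))
    for _ in range(max_rounds):
        if not left or not right:
            break
        # Score each remaining left record against its best remaining candidate.
        scored = []
        for i in left:
            j = max(sorted(right),
                    key=lambda k: similarity(first[i][field], second[k][field]))
            scored.append((similarity(first[i][field], second[j][field]), i, j))
        scored.sort(reverse=True)
        unresolved = []
        for score, i, j in scored:
            if score >= threshold and j in right:
                matches.append((first[i], second[j]))
                left.remove(i)
                right.discard(j)
            else:
                unresolved.append(score)
        if not unresolved:
            break
        # Assumed adjustment rule: move the threshold toward the upper part of
        # the unresolved score distribution, never below `floor`.
        adjusted = max(floor, min(threshold, mean(unresolved) + pstdev(unresolved)))
        if adjusted >= threshold:
            break
        threshold = adjusted
    return matches, threshold
```

A caller would run `direct_match` on an exact identifier field first, then pass the two leftover lists to `iterative_fuzzy_match` on a free-text field such as a name. The floor keeps the threshold from drifting low enough to accept unrelated pairs.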

First claim

Opening claim text (preview).

What is claimed is:
1. A system comprising: a plurality of processors; at least one storage device configured to store data; and a network communication interface configured to receive a request from a remote device to perform a processing operation on a first portion of the data; wherein the system is configured to: access first data of a first data store of the storage device, wherein the first data comprises a first plurality of entity entries; access second data from a second data store, wherein the second data comprises a second plurality of entity entries, wherein each entry of the first plurality of entity entries and the second plurality of entity entries comprises a plurality of fields; identify, from the first plurality of entity entries and the second plurality of entity entries, a first set of resolved entities based on a direct field; determine, from the first set of resolved entities, updated first data and updated second data; identify a second set of resolved entities based on the updated first data and the updated second data by performing an iterative fuzzy match, wherein a match threshold is adjusted between iterations of the iterative fuzzy match based on a data distribution curve illustrating match probabilities of unresolved pairs of entities among the updated first data and the updated second data; and transmit the first set of resolved entities and the second set of resolved entities to a computing device.
2. The system of claim 1, wherein the first data is cleansed prior to identifying the direct field.
3. The system of claim 2, wherein the second data is cleansed prior to identifying the direct field.
4. The system of claim 2, wherein cleansing the data comprises formatting at least one field of the plurality of entity entries.
5. The system of claim 1, wherein determining the updated first data comprises removing, from the first data, entries corresponding to the first set of resolved entities.
6. The system of claim 1, wherein determining the updated second data comprises removing, from the second data, entries corresponding to the first set of resolved entities.
7. The system of claim 1, wherein one or more iterations of the iterative fuzzy match comprises: executing a comparison using a plurality of string comparators on a first field from the updated first data and a second field from the updated second data; determining a metric based on the comparison; and determining the metric exceeds the match threshold.
8. The system of claim 7, wherein the match threshold is adjusted based at least in part on a metric obtained from the data distribution curve, wherein the metric obtained from the data distribution curve is based on a standard deviation, mean, or mode of the data distribution curve.
9. The system of claim 7, wherein the string comparators include map comparators and/or string comparators.
10. The system of claim 7, wherein determining the metric is based at least in part on calculating a product of a levenshtein distance calculation, a least common substring calculation, a jaccard similarity calculation, and a cosine similarity ngram calculation.
11. The system of claim 1, wherein the remote device comprises the computing device.
12. A method comprising: accessing first data of a first data store of the storage device, wherein the first data comprises a first plurality of entity entries; accessing second data from a second data store, wherein the second data comprises a second plurality of entity entries, wherein each entry of the first plurality of entity entries and the second plurality of entity entries comprises a plurality of fields; identifying, from the first plurality of entity entries and the second plurality of entity entries, a first set of resolved entities based on a direct field; determining, from the first set of resolved entities, updated first data and updated second data; identifying a second set of resolved entities based on the updated first data and the updated second data by performing an iterative fuzzy match, wherein a match threshold is adjusted between iterations of the iterative fuzzy match based on a data distribution curve illustrating match probabilities of unresolved pairs of entities among the updated first data and the updated second data; and transmitting the first set of resolved entities and the second set of resolved entities to a computing device.
13. The method of claim 12, wherein the first data is cleansed prior to identifying the direct field.
14. The method of claim 12, wherein cleansing the data comprises formatting at least one field of the plurality of entity entries.
15. The method of claim 12, wherein identifying the second set of resolved entities comprises: executing a first comparison using a plurality of string comparators on a first field from the updated first data and a second field from the updated second data; determining a metric based on the comparison; and determining the metric exceeds the match threshold.
16. The method of claim 15, wherein identifying the second set of resolved entities further comprises: determining that one or more metrics did not exceed the match threshold; generating the data distribution curve comprising distribution data based on the one or more metrics; and adjusting the match threshold based on the data distribution curve.
17. The method of claim 16, wherein identifying the second set of resolved entities further comprises: executing a second comparison on the entities with metrics that did not exceed the initial match threshold; determining a new metric based on the second comparison; and determining the new metric exceeds the adjusted match threshold.
18. A non-transitory computer-readable storage medium including computer-executable instructions that, when executed by a processor, cause the processor to: access first data of a first data store of the storage device, wherein the first data comprises a first plurality of entity entries; access second data from a second data store, wherein the second data comprises a second plurality of entity entries, wherein each entry of the first plurality of entity entries and the second plurality of entity entries comprises a plurality of fields; identify, from the first plurality of entity entries and the second plurality of entity entries, a first set of resolved entities based on a direct field; determine, from the first set of resolved entities, updated first data and updated second data; identify a second set of resolved entities based on the updated first data and the updated second data by performing an iterative fuzzy match, wherein a match threshold is adjusted between iterations of the iterative fuzzy match based on a data distribution curve illustrating match probabilities of unresolved pairs of entities among the updated first data and the updated second data; and transmit the first set of resolved entities and the second set of resolved entities to a computing device.
19. The non-transitory computer-readable storage medium of claim 18, wherein the computer-executable instructions further cause the processor to, for one or more iterations of the iterative fuzzy match: execute a comparison using a plurality of string comparators on a first field from the updated first data and a second field from the updated second data; determine a metric based on the comparison; and determine the metric exceeds the match threshold.
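Claim 10 describes a metric built as a product of four string calculations. The sketch below shows one way such a product could be computed in Python. Several points are assumptions made for illustration and are not stated in the claims: each calculation is normalized to the range 0 to 1, "least common substring" is read as the longest common substring, Jaccard and cosine similarity both operate on character bigrams, and the inputs are cleansed by trimming and lowercasing. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def levenshtein_similarity(a, b):
    """1 - edit_distance / longer_length, so identical strings score 1.0."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def common_substring_similarity(a, b):
    """Longest common substring length divided by the longer string length."""
    if not a or not b:
        return 0.0
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, cur[-1])
        prev = cur
    return best / max(len(a), len(b))


def ngrams(s, n=2):
    """Character n-grams; strings shorter than n yield a single gram."""
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]


def jaccard_similarity(a, b, n=2):
    """Size of the shared n-gram set over the size of the combined set."""
    x, y = set(ngrams(a, n)), set(ngrams(b, n))
    return len(x & y) / len(x | y)


def cosine_ngram_similarity(a, b, n=2):
    """Cosine of the angle between the two n-gram count vectors."""
    x, y = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(count * y[gram] for gram, count in x.items())
    norm = (sqrt(sum(v * v for v in x.values()))
            * sqrt(sum(v * v for v in y.values())))
    return dot / norm


def combined_metric(a, b):
    """Product of the four normalized similarities (assumed normalization)."""
    a, b = a.strip().lower(), b.strip().lower()
    return (levenshtein_similarity(a, b)
            * common_substring_similarity(a, b)
            * jaccard_similarity(a, b)
            * cosine_ngram_similarity(a, b))
```

Because the result is a product, a single comparator near zero pulls the whole metric toward zero, which makes the combination stricter than an average of the same four values.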

Assignees

Inventors

Classifications

  • G06F16/258 (Primary)

    Data format conversion from or to a database · CPC title

  • Fuzzy queries · CPC title

  • Ensuring data consistency and integrity · CPC title

  • G06F16/215 (Primary)

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • by using string matching techniques · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11061874B1 cover?
Computer implemented systems and methods resolve data entries across multiple lists. The lists may include a plurality of records, wherein each record is associated with a respective entity. In accordance with some embodiments, the systems and methods further comprise identifying a direct field match between two lists, determining updated lists based on the remaining data entries, executing a c…
Who is the assignee on this patent?
Palantir Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/258. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jul 13, 2021 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).