Multi-pass duplicate identification using sorted neighborhoods and aggregation techniques

US10409788B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10409788-B2
Application number: US-201715413144-A
Country: US
Kind code: B2
Filing date: Jan 23, 2017
Priority date: Jan 23, 2017
Publication date: Sep 10, 2019
Grant date: Sep 10, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair of neighborhood records having different resource identification fields, is determined by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value. Possible duplicate records are identified by evaluating each pair distance against a threshold, each possible duplicate having grouped attributes. Final duplicate records are identified by matching each group to a key.
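The merge-and-pair step in the abstract can be illustrated with a minimal sketch of the classic sorted-neighborhood technique. This is not taken from the patent text: the function name `candidate_pairs`, the `sort_key` and `window` parameters, and the choice to tag input records as well as reference records are all assumptions made for illustration.

```python
def candidate_pairs(data_records, reference_records, sort_key, window=3):
    """Return pairs of nearby records that come from different sources.

    Hypothetical sketch: records are plain dicts, and `resource_id`
    stands in for the resource identification field.
    """
    # Tag every record with its source. The abstract only says reference
    # records receive this field; tagging input records too is an assumption.
    tagged = [dict(r, resource_id="input") for r in data_records]
    tagged += [dict(r, resource_id="reference") for r in reference_records]

    # Sort the merged set so that similar records end up adjacent.
    neighborhood = sorted(
        tagged, key=lambda r: str(r.get(sort_key) or "").lower()
    )

    # Slide a fixed-size window over the sorted list and keep only
    # pairs whose resource identification fields differ.
    pairs = []
    for start in range(len(neighborhood)):
        block = neighborhood[start:start + window]
        first = block[0]
        for other in block[1:]:
            if first["resource_id"] != other["resource_id"]:
                pairs.append((first, other))
    return pairs


incoming = [{"name": "Acme Corp"}, {"name": "Zeta Ltd"}]
reference = [{"name": "ACME Corporation"}, {"name": "Beta GmbH"}]
pairs = candidate_pairs(incoming, reference, sort_key="name", window=2)
```

Restricting comparisons to a window keeps the number of pairs roughly linear in the number of records, which is the usual reason for sorting before comparing.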

First claim

Opening claim text (preview).

What is claimed is:

1. A method for implementation by a database management system, the method comprising: receiving data comprising a plurality of data records; generating a plurality of neighborhood records by merging the plurality of data records with a plurality of reference records stored in a remote data store; assigning a resource identification field to each reference record; determining a pair distance, for each pair of neighborhood records having different resource identification fields, by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value; identifying a plurality of potential duplicate records by evaluating each pair distance against a threshold, each potential duplicate having grouped attributes, the threshold being a product of a fuzzy factor and a maximum string length of an attribute; and identifying a plurality of final duplicate records by matching each group to a key.

2. The method according to claim 1, wherein the method further comprises: receiving an expectation value from a user selection of the plurality of final duplicate records; and identifying a refined plurality of final duplicate records by matching the expectation value to the plurality of final duplicate records.

3. The method according to claim 1, wherein each neighborhood record comprises a plurality of attributes categorized based on a plurality of standardized attributes and are sorted based on a sorting key associated with each attribute.

4. The method according to claim 1, wherein the resource identification field identifies a source location of each reference record.

5. The method according to claim 1, wherein the filled pairs quote value is one more than the ratio of a number of unfilled attributes of the pair of neighboring records to a number of filled for each pair.

6. The method according to claim 1, wherein the key is either a definite key or a field percentage key.

7. The method according to claim 6, wherein the definite key is defined by a user.

8. The method according to claim 6, wherein the field percentage key is based on a percentage of attributes within the group matches predetermined attributes.

9. The method according to claim 1, wherein the fuzzy factor is pre-determined by a user.

10. The method according to claim 1, wherein the plurality of data records and the plurality of neighborhood records are related to business partner screening.

11. The method according to claim 1, wherein the receiving, generating, assigning, determining, and identifying occur in an in-memory database.

12. A non-transitory computer-programmable product including storing instructions which, when executed by at least one data processor forming part of at least one computing system, result in operations comprising: receiving data comprising a plurality of data records; generating a plurality of neighborhood records by merging the plurality of data records with a plurality of reference records stored in a remote data store; assigning a resource identification field to each reference record; determining a pair distance, for each pair of neighborhood records having different resource identification fields, by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value; identifying a plurality of potential duplicate records by evaluating each pair distance against a threshold, each potential duplicate having grouped attributes, the threshold being a product of a fuzzy factor and a maximum string length of an attribute; and identifying a plurality of final duplicate records by matching each group to a key.

13. The non-transitory computer-programmable product according to claim 12, wherein the operations further comprise: receiving an expectation value from a user selection of the displayed plurality of final duplicate records; and identifying a refined plurality of final duplicate records by matching the expectation value to the plurality of final duplicate records.

14. The non-transitory computer-programmable product according to claim 12, wherein each neighborhood record comprises a plurality of attributes categorized based on a plurality of standardized attributes and are sorted based on a sorting key associated with each attribute.

15. The non-transitory computer-programmable product according to claim 12, wherein the resource identification field identifies a source location of each reference record.

16. The non-transitory computer-programmable product according to claim 12, wherein the filled pairs quote value is one more than the ratio of a number of unfilled attributes of the pair of neighboring records to a number of filled for each pair.

17. The non-transitory computer-programmable product according to claim 12, wherein the key is either a definite key or a field percentage key, the definite key is defined by a user, and the field percentage key is based on a percentage of attributes within the group matches predetermined attributes.

18. The non-transitory computer-programmable product according to claim 12, wherein the fuzzy factor is pre-determined by a user.

19. A system comprising: at least one data processor; memory storing instructions which, when executed by the at least one data processor, result in operations comprising: receiving data comprising a plurality of data records; generating a plurality of neighborhood records by merging the plurality of data records with a plurality of reference records stored in a remote data store; assigning a resource identification field to each reference record; determining a pair distance, for each pair of neighborhood records having different resource identification fields, by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value; identifying a plurality of potential duplicate records by evaluating each pair distance against a threshold, each potential duplicate having grouped attributes, the threshold being a product of a fuzzy factor and a maximum string length of an attribute; and identifying a plurality of final duplicate records by matching each group to a key.

20. The system of claim 19, wherein the filled pairs quote value is calculated to minimize a blank attribute impact, wherein the filled pairs quote value can be calculated using: Filled-pairs-quote=(filled pairs)/(all pairs)+1, where filled pairs is a number of pairs of standardized attributes that are filled and all pairs is a total number of filled attributes.
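The pair-distance and threshold steps recited in the claims can be sketched as follows. This is one literal reading under several assumptions that the claim text does not confirm: Levenshtein edit distance as the per-attribute distance, the population standard deviation, the claim 5 wording for the filled pairs quote (claim 20 states a different formula), the longest value in the pair as the "maximum string length", and "at or below the threshold" as the test for a potential duplicate. All function and parameter names are hypothetical.

```python
from statistics import pstdev


def levenshtein(a, b):
    """Edit distance between two strings (assumed attribute distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def filled_pairs_quote(rec_a, rec_b, attributes):
    """Claim 5 reading: one plus unfilled attribute pairs over filled ones."""
    filled = sum(1 for k in attributes if rec_a.get(k) and rec_b.get(k))
    if filled == 0:
        return None  # nothing comparable; how to handle this is an assumption
    return 1 + (len(attributes) - filled) / filled


def pair_distance(rec_a, rec_b, attributes):
    """Standard deviation of attribute distances, scaled by the quote."""
    quote = filled_pairs_quote(rec_a, rec_b, attributes)
    if quote is None:
        return None
    distances = [levenshtein(rec_a[k], rec_b[k])
                 for k in attributes if rec_a.get(k) and rec_b.get(k)]
    return pstdev(distances) * quote


def is_potential_duplicate(rec_a, rec_b, attributes, fuzzy_factor):
    """Compare the pair distance with fuzzy_factor * maximum string length."""
    distance = pair_distance(rec_a, rec_b, attributes)
    if distance is None:
        return False
    max_len = max(len(r.get(k) or "") for r in (rec_a, rec_b) for k in attributes)
    return distance <= fuzzy_factor * max_len


a = {"name": "Acme Corp", "city": "Berlin", "zip": ""}
b = {"name": "Acme Corp.", "city": "Berlin", "zip": "10115"}
fields = ["name", "city", "zip"]
quote = filled_pairs_quote(a, b, fields)
distance = pair_distance(a, b, fields)
flagged = is_potential_duplicate(a, b, fields, fuzzy_factor=0.1)
```

In this example two of the three attribute pairs are filled, so the quote is 1.5 and the blank `zip` value widens the distance instead of being silently ignored. The full patent description may define the distance differently, so treat the numbers here as illustrative only.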

Assignees

Inventors

Classifications

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10409788B2 cover?
Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair o…
Who is the assignee on this patent?
SAP SE
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 10, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).