US10409788B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10409788-B2 |
| Application number | US-201715413144-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 23, 2017 |
| Priority date | Jan 23, 2017 |
| Publication date | Sep 10, 2019 |
| Grant date | Sep 10, 2019 |
Official abstract text for this publication.
Systems and methods are provided herein for multi-pass duplicate identification using sorted neighborhoods. Data comprising a plurality of data records is received. Neighborhood records are generated by merging the plurality of data records with reference records stored in a remote data store. A resource identification field is assigned to each reference record. A pair distance, for each pair of neighborhood records having different resource identification fields, is determined by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value. Possible duplicate records are identified by evaluating each pair distance against a threshold, each possible duplicate having grouped attributes. Final duplicate records are identified by matching each group to a key.
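The pair-distance step in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract names no per-attribute distance metric, so Levenshtein edit distance is swapped in; "scaled by" is read as multiplication; the population standard deviation is used; a pair counts as a possible duplicate when its distance is at or below the threshold; and every pair is compared rather than only neighbors in a sorted window. The names `edit_distance`, `pair_distance`, `candidate_pairs`, and the `source_id` field are hypothetical.

```python
from itertools import combinations
from statistics import pstdev
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, str]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric; the abstract does not name one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def pair_distance(rec_a: Record, rec_b: Record,
                  attributes: Sequence[str], quote: float) -> float:
    """Standard deviation of the per-attribute distances, multiplied by the
    filled pairs quote value (multiplication is an assumption)."""
    distances = [edit_distance(rec_a.get(k, ""), rec_b.get(k, ""))
                 for k in attributes]
    return pstdev(distances) * quote


def candidate_pairs(records: Sequence[Record], attributes: Sequence[str],
                    quote: float, threshold: float
                    ) -> List[Tuple[Record, Record, float]]:
    """Return pairs from different sources whose distance is <= threshold."""
    found = []
    for a, b in combinations(records, 2):
        # Only pairs with different resource identification fields are scored.
        if a["source_id"] == b["source_id"]:
            continue
        d = pair_distance(a, b, attributes, quote)
        if d <= threshold:  # comparison direction is an assumption
            found.append((a, b, d))
    return found
```

For example, calling `candidate_pairs(records, ["name", "city"], quote=1.0, threshold=0.5)` on a list of dictionaries that each carry a `source_id` returns only cross-source pairs, each with its computed distance.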
Opening claim text (preview).
What is claimed is:

1. A method for implementation by a database management system, the method comprising: receiving data comprising a plurality of data records; generating a plurality of neighborhood records by merging the plurality of data records with a plurality of reference records stored in a remote data store; assigning a resource identification field to each reference record; determining a pair distance, for each pair of neighborhood records having different resource identification fields, by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value; identifying a plurality of potential duplicate records by evaluating each pair distance against a threshold, each potential duplicate having grouped attributes, the threshold being a product of a fuzzy factor and a maximum string length of an attribute; and identifying a plurality of final duplicate records by matching each group to a key.

2. The method according to claim 1, wherein the method further comprises: receiving an expectation value from a user selection of the plurality of final duplicate records; and identifying a refined plurality of final duplicate records by matching the expectation value to the plurality of final duplicate records.

3. The method according to claim 1, wherein each neighborhood record comprises a plurality of attributes categorized based on a plurality of standardized attributes and are sorted based on a sorting key associated with each attribute.

4. The method according to claim 1, wherein the resource identification field identifies a source location of each reference record.

5. The method according to claim 1, wherein the filled pairs quote value is one more than the ratio of a number of unfilled attributes of the pair of neighboring records to a number of filled for each pair.

6. The method according to claim 1, wherein the key is either a definite key or a field percentage key.

7. The method according to claim 6, wherein the definite key is defined by a user.

8. The method according to claim 6, wherein the field percentage key is based on a percentage of attributes within the group matches predetermined attributes.

9. The method according to claim 1, wherein the fuzzy factor is pre-determined by a user.

10. The method according to claim 1, wherein the plurality of data records and the plurality of neighborhood records are related to business partner screening.

11. The method according to claim 1, wherein the receiving, generating, assigning, determining, and identifying occur in an in-memory database.

12. A non-transitory computer-programmable product including storing instructions which, when executed by at least one data processor forming part of at least one computing system, result in operations comprising: receiving data comprising a plurality of data records; generating a plurality of neighborhood records by merging the plurality of data records with a plurality of reference records stored in a remote data store; assigning a resource identification field to each reference record; determining a pair distance, for each pair of neighborhood records having different resource identification fields, by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value; identifying a plurality of potential duplicate records by evaluating each pair distance against a threshold, each potential duplicate having grouped attributes, the threshold being a product of a fuzzy factor and a maximum string length of an attribute; and identifying a plurality of final duplicate records by matching each group to a key.

13. The non-transitory computer-programmable product according to claim 12, wherein the operations further comprise: receiving an expectation value from a user selection of the displayed plurality of final duplicate records; and identifying a refined plurality of final duplicate records by matching the expectation value to the plurality of final duplicate records.

14. The non-transitory computer-programmable product according to claim 12, wherein each neighborhood record comprises a plurality of attributes categorized based on a plurality of standardized attributes and are sorted based on a sorting key associated with each attribute.

15. The non-transitory computer-programmable product according to claim 12, wherein the resource identification field identifies a source location of each reference record.

16. The non-transitory computer-programmable product according to claim 12, wherein the filled pairs quote value is one more than the ratio of a number of unfilled attributes of the pair of neighboring records to a number of filled for each pair.

17. The non-transitory computer-programmable product according to claim 12, wherein the key is either a definite key or a field percentage key, the definite key is defined by a user, and the field percentage key is based on a percentage of attributes within the group matches predetermined attributes.

18. The non-transitory computer-programmable product according to claim 12, wherein the fuzzy factor is pre-determined by a user.

19. A system comprising: at least one data processor; memory storing instructions which, when executed by the at least one data processor, result in operations comprising: receiving data comprising a plurality of data records; generating a plurality of neighborhood records by merging the plurality of data records with a plurality of reference records stored in a remote data store; assigning a resource identification field to each reference record; determining a pair distance, for each pair of neighborhood records having different resource identification fields, by calculating a standard deviation of distances between each attribute of the pair scaled by a filled pairs quote value; identifying a plurality of potential duplicate records by evaluating each pair distance against a threshold, each potential duplicate having grouped attributes, the threshold being a product of a fuzzy factor and a maximum string length of an attribute; and identifying a plurality of final duplicate records by matching each group to a key.

20. The system of claim 19, wherein the filled pairs quote value is calculated to minimize a blank attribute impact, wherein the filled pairs quote value can be calculated using: Filled-pairs-quote=(filled pairs)/(all pairs)+1, where filled pairs is a number of pairs of standardized attributes that are filled and all pairs is a total number of filled attributes.
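The claims state the filled pairs quote value in two differently worded forms (claims 5 and 16 versus claim 20) and define the threshold as a product of a fuzzy factor and a maximum string length. The sketch below writes each formula out literally, side by side, without choosing between the two quote definitions. All function and parameter names are hypothetical, and the zero-denominator checks are an added assumption that the claims do not address.

```python
from typing import Sequence


def filled_pairs_quote_ratio(unfilled: int, filled: int) -> float:
    """Claims 5 and 16: one more than the ratio of unfilled to filled."""
    if filled == 0:
        raise ValueError("no filled attributes; ratio is undefined")
    return 1 + unfilled / filled


def filled_pairs_quote_fraction(filled_pairs: int, all_pairs: int) -> float:
    """Claim 20: (filled pairs) / (all pairs) + 1."""
    if all_pairs == 0:
        raise ValueError("no attribute pairs; fraction is undefined")
    return filled_pairs / all_pairs + 1


def duplicate_threshold(fuzzy_factor: float, values: Sequence[str]) -> float:
    """Claim 1: fuzzy factor times the maximum string length of an attribute.

    `values` is assumed to hold the observed values of one attribute.
    """
    return fuzzy_factor * max(len(v) for v in values)
```

With one unfilled and four filled attributes, `filled_pairs_quote_ratio(1, 4)` gives 1.25; with three filled pairs out of four, `filled_pairs_quote_fraction(3, 4)` gives 1.75. A fuzzy factor of 0.25 applied to values whose longest string has 8 characters yields a threshold of 2.0.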
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title