Identifying entity mappings across data assets

US10025846B2 · US · B2

Patent metadata
Publication number: US-10025846-B2
Application number: US-201514853823-A
Country: US
Kind code: B2
Filing date: Sep 14, 2015
Priority date: Sep 14, 2015
Publication date: Jul 17, 2018
Grant date: Jul 17, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Entity mappings that produce matching entities for a first data asset having attributes and a second data asset having attributes are generated by: generating entity mappings that produce matching entities for a first data asset having attributes with attribute values and a second data asset having attributes with attribute values by: matching the attribute values of the attributes of the first data asset with the attribute values of the attributes of the second data asset, using the matching attribute values to generate matching attribute pairs, and using the matching attribute pairs to identify entity mappings; computing an entity mapping score for each of the entity mappings based on a combination of factors; ranking the entity mappings based on each entity mapping score; and using some of the ranked entity mappings to determine whether a same real-world entity is described by the first data asset and the second data asset.
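The steps in the abstract can be illustrated with a minimal sketch. It assumes each data asset is a list of flat records, uses exact value matching, and treats each single matching attribute pair as one candidate entity mapping. The function names and the two coverage factors in the score are hypothetical choices for illustration and are not taken from the patent.

```python
from collections import defaultdict


def matching_attribute_pairs(asset_a, asset_b):
    """Return {(attr_a, attr_b): shared values} for two lists of records."""
    values_a = defaultdict(set)
    values_b = defaultdict(set)
    for record in asset_a:
        for attr, value in record.items():
            values_a[attr].add(value)
    for record in asset_b:
        for attr, value in record.items():
            values_b[attr].add(value)

    pairs = {}
    for attr_a, vals_a in values_a.items():
        for attr_b, vals_b in values_b.items():
            shared = vals_a & vals_b  # exact value matches only
            if shared:
                pairs[(attr_a, attr_b)] = shared
    return pairs


def rank_entity_mappings(asset_a, asset_b):
    """Score each single-pair candidate mapping and return them best first."""
    scored = []
    for (attr_a, attr_b), shared in matching_attribute_pairs(asset_a, asset_b).items():
        distinct_a = {r[attr_a] for r in asset_a if attr_a in r}
        distinct_b = {r[attr_b] for r in asset_b if attr_b in r}
        # Hypothetical combination of factors: how much of each side's
        # distinct values is covered by the shared values.
        score = len(shared) / len(distinct_a) + len(shared) / len(distinct_b)
        scored.append(((attr_a, attr_b), score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative data, not from the patent.
customers = [
    {"cust_id": "c1", "email": "ann@example.com", "city": "Austin"},
    {"cust_id": "c2", "email": "bob@example.com", "city": "Boston"},
]
orders = [
    {"order_id": "o1", "contact": "ann@example.com", "ship_city": "Austin"},
    {"order_id": "o2", "contact": "bob@example.com", "ship_city": "Austin"},
]

ranked = rank_entity_mappings(customers, orders)
for mapping, score in ranked:
    print(mapping, score)
```

In this toy data the email/contact pair ranks above city/ship_city because its shared values cover both sides completely. A top-ranked mapping would then serve as the key for deciding whether two records describe the same real-world entity.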

First claim

Opening claim text (preview).

What is claimed is:

1. A computer program product, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by at least one processor to perform: generating entity mappings that produce matching entities for a first data asset having attributes with attribute values and a second data asset having attributes with attribute values by: matching the attribute values of the attributes of the first data asset with the attribute values of the attributes of the second data asset; using the matching attribute values to generate matching attribute pairs; and using the matching attribute pairs to identify entity mappings; computing an entity mapping score for each of the entity mappings based on a combination of factors; ranking the entity mappings based on each entity mapping score; and using the ranked entity mappings to determine which of the entity mappings are to be used to determine whether a same real-world entity is described by the first data asset and the second data asset.

2. The computer program product of claim 1, wherein the program code is executable by the at least one processor to perform: generating a first inverted index of entity identifier pairs for the first data asset; generating a second inverted index of entity identifier pairs for the second data asset; and using the first inverted index and the second inverted index to generate the matching attribute pairs based on matching attribute values that form the entity mappings.

3. The computer program product of claim 1, wherein values match fuzzily for the matching entities.

4. The computer program product of claim 1, wherein, for computing the entity mapping score for each of the entity mappings comprises, the program code is executable by the at least one processor to perform wherein: generating an entity mapping score for factors selected from: a number of attributes involved in an entity mapping, a cardinality of that individual entity mapping, support of that entity mapping, a probability of one to one matching for that entity mapping, a join utility measure for that entity mapping, and a probability of previous user selections for that entity mapping; and adding the entity mapping score for each of the factors to generate the entity mapping score for that entity mapping.

5. The computer program product of claim 1, wherein one of the first data asset and the second data asset is semi-structured data having hierarchical data that is flattened.

6. The computer program product of claim 1, wherein one of the first data asset and the second data asset is an unstructured data asset formed by a collection of documents and is modelled based one of a bag of words and annotated words.

7. The computer program product of claim 1, wherein the program code is executable by the at least one processor to perform: integrating the first data asset and the second data asset using ranked entity mappings by performing one of a join operation, a merge operation, and a union operation.

8. The computer program product of claim 1, wherein a Software as a Service (SaaS) is configured to perform computer program product operations.

9. A computer system, comprising: one or more processors, one or more computer-readable memories and one or more computer-readable, tangible storage devices; and program instructions, stored on at least one of the one or more computer-readable, tangible storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, to perform operations comprising: generating entity mappings that produce matching entities for a first data asset having attributes with attribute values and a second data asset having attributes with attribute values by: matching the attribute values of the attributes of the first data asset with the attribute values of the attributes of the second data asset; using the matching attribute values to generate matching attribute pairs; and using the matching attribute pairs to identify entity mappings; computing an entity mapping score for each of the entity mappings based on a combination of factors; ranking the entity mappings based on each entity mapping score; and using the ranked entity mappings to determine which of the entity mappings are to be used to determine whether a same real-world entity is described by the first data asset and the second data asset.

10. The computer system of claim 9, wherein the operations further comprise: generating a first inverted index of entity identifier pairs for the first data asset; generating a second inverted index of entity identifier pairs for the second data asset; and using the first inverted index and the second inverted index to generate the matching attribute pairs based on matching attribute values that form the entity mappings.

11. The computer system of claim 9, wherein values match fuzzily for the matching entities.

12. The computer system of claim 9, wherein the operations for computing the entity mapping score for each of the entity mappings further comprise: generating an entity mapping score for factors selected from: a number of attributes involved in an entity mapping, a cardinality of that individual entity mapping, support of that entity mapping, a probability of one to one matching for that entity mapping, a join utility measure for that entity mapping, and a probability of previous user selections for that entity mapping; and adding the entity mapping score for each of the factors to generate the entity mapping score for that entity mapping.

13. The computer system of claim 9, wherein one of the first data asset and the second data asset is semi-structured data having hierarchical data that is flattened.

14. The computer system of claim 9, wherein one of the first data asset and the second data asset is an unstructured data asset formed by a collection of documents and is modelled based one of a bag of words and annotated words.

15. The computer system of claim 9, wherein the operations further comprise: integrating the first data asset and the second data asset using ranked entity mappings by performing one of a join operation, a merge operation, and a union operation.

16. The computer system of claim 9, wherein a Software as a Service (SaaS) is configured to perform computer system operations.
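The dependent claims on inverted indexes, flattened hierarchical data, and additive scoring can be sketched together. This is one possible reading, not the patented implementation: the index here maps each attribute value to the (entity identifier, attribute) pairs that hold it, nested records are flattened into dotted attribute names, and the per-factor scores are made-up numbers. All function names are hypothetical.

```python
from collections import defaultdict


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted attribute names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def build_inverted_index(asset):
    """Map each attribute value to the set of (entity id, attribute) holding it.

    `asset` is assumed to be a dict of entity id -> (possibly nested) record.
    """
    index = defaultdict(set)
    for entity_id, record in asset.items():
        for attr, value in flatten(record).items():
            index[value].add((entity_id, attr))
    return index


def attribute_pairs_from_indexes(index_a, index_b):
    """For each attribute pair, collect the entity id pairs sharing a value."""
    pairs = defaultdict(set)
    for value in index_a.keys() & index_b.keys():
        for id_a, attr_a in index_a[value]:
            for id_b, attr_b in index_b[value]:
                pairs[(attr_a, attr_b)].add((id_a, id_b))
    return dict(pairs)


def additive_score(factor_scores):
    """Add the per-factor scores to get one entity mapping score."""
    return sum(factor_scores.values())


# Illustrative data, not from the patent.
people = {
    "p1": {"name": "Ann", "contact": {"email": "ann@example.com"}},
    "p2": {"name": "Bob", "contact": {"email": "bob@example.com"}},
}
users = {"u9": {"mail": "ann@example.com"}}

pairs = attribute_pairs_from_indexes(
    build_inverted_index(people), build_inverted_index(users)
)
total = additive_score({"support": 0.5, "one_to_one": 0.25})
```

Intersecting the two indexes on their keys avoids comparing every record of one asset against every record of the other, which is the usual motivation for an inverted index in this kind of matching.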

Assignees

IBM

Inventors

Classifications

  • Inverted lists · CPC title

  • G06F16/80 · Primary

    of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML (content-based retrieval of web data G06F16/95) · CPC title

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Mapping to a database · CPC title

  • Search customisation based on user profiles and personalisation · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10025846B2 cover?
Entity mappings that produce matching entities for a first data asset having attributes and a second data asset having attributes are generated by: generating entity mappings that produce matching entities for a first data asset having attributes with attribute values and a second data asset having attributes with attribute values by: matching the attribute values of the attributes of the first…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/80. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 17, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).