Cross-organization data instance matching

US11688494B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11688494-B2
Application number: US-201816139678-A
Country: US
Kind code: B2
Filing date: Sep 24, 2018
Priority date: Sep 24, 2018
Publication date: Jun 27, 2023
Grant date: Jun 27, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosure provides a method for data instance processing. The method includes obtaining a set of data instances collected from a plurality of organizations. Each of the data instances includes at least one record formed in an organization that stores values of a plurality of attributes of the data instance. The method also includes dividing the set of data instances into groups, wherein data instances with conflicting values for the same attribute are divided into different groups. The method further includes subdividing data instances in each of the groups into clusters.
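
The grouping step described above can be sketched as a graph-labeling problem: instances are nodes, conflicting pairs are joined by an edge, and nodes sharing an edge must get different group labels. The following is a minimal illustration only, not the patented implementation. The names `conflicts` and `group_by_conflict`, the dictionary record layout, and the rule that a missing value never counts as a conflict are all assumptions made for the example. It uses a greedy labeling heuristic, which keeps conflicting instances apart but does not guarantee the smallest possible number of labels.

```python
from itertools import combinations


def conflicts(a, b, attributes):
    """Assumed rule: two instances conflict when both carry a value for the
    same attribute and those values differ. Missing values never conflict."""
    return any(
        a.get(k) is not None and b.get(k) is not None and a[k] != b[k]
        for k in attributes
    )


def group_by_conflict(instances, attributes):
    """Greedy labeling of a conflict graph (hypothetical helper).

    Returns a dict mapping group label -> list of instance indices, such that
    no two conflicting instances share a label.
    """
    n = len(instances)
    neighbours = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if conflicts(instances[i], instances[j], attributes):
            neighbours[i].add(j)
            neighbours[j].add(i)

    labels = {}
    # Heuristic: label the most-conflicted nodes first.
    for node in sorted(range(n), key=lambda i: -len(neighbours[i])):
        used = {labels[m] for m in neighbours[node] if m in labels}
        label = 0
        while label in used:
            label += 1
        labels[node] = label

    groups = {}
    for node in range(n):
        groups.setdefault(labels[node], []).append(node)
    return groups


# Made-up example records; attribute names are illustrative only.
instances = [
    {"sex": "F", "birth_year": 1980},
    {"sex": "F", "birth_year": 1980},
    {"sex": "M", "birth_year": 1980},
    {"sex": None, "birth_year": 1975},
]
groups = group_by_conflict(instances, ["sex", "birth_year"])
print(sorted(groups.values()))
```

Here the first two records agree on every attribute and can share a group, while the third and fourth each conflict with all the others and end up in groups of their own.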

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for two-stage data instance processing, the method comprising: obtaining, by one or more processors, a set of data instances collected from a plurality of organizations over a network, wherein each of the data instances includes at least one record formed in an organization that stores values of a plurality of attributes of the data instance; preliminarily dividing, in stage one, by conflict detection using one or more processors, the set of data instances into groups, wherein data instances with conflicting values for the same attribute are divided into different groups and stored on a remote server in a distributed cloud computing environment; subdividing, in stage two, by one or more processors, data instances in each of the groups into clusters to increase precision of the dividing, the subdividing comprising constructing, by one or more processors, for each of the data instances in each of the groups, a feature vector based on the at least one record of the data instance for further processing, with the values of the plurality of attributes being transformed into binary values for increased efficiency of the processing and minimization of bandwidth and memory requirements; and creating a new electronic record comprising the transformed binary values of the plurality of attributes of the data instances in the cluster by combining the data instances in each cluster, the electronic record being updated on a local machine and stored on the remote server in the distributed cloud computing environment to minimize network bandwidth and memory requirements for efficient iterative accessing and maintaining of the record.

2. The method according to claim 1, wherein dividing the set of data instances into groups further includes: for each data instance in the set of data instances, forming, by one or more processors, a value sequence based on the at least one record of the data instance, wherein the value sequence includes selected attributes as its elements; constructing, by one or more processors, a conflict network in which a data instance is represented by a node and there is an edge between two nodes if two data instances represented by the two nodes have at least one conflicting element in their value sequences; and assigning, by one or more processors, labels to nodes of the conflict network with minimum labels, wherein nodes directly connected by an edge in the conflict network have different labels, and data instances represented by nodes with a same label are divided into a same group.

3. The method according to claim 1, wherein subdividing data instances in each of the groups into clusters further includes: calculating, by one or more processors, distances between every two feature vectors; and clustering, by one or more processors, the data instances in each of the groups based on the calculated distances.

4. The method according to claim 3, wherein constructing the feature vector further includes at least one of: splitting, by one or more processors, a first attribute of the plurality of attributes into a plurality of second attributes, wherein each of the plurality of second attributes is used to construct an element of the feature vector.

5. The method according to claim 4, wherein the first attribute is a clinical attribute.

6. The method according to claim 1, wherein the set of data instances are patient instances.

7. The method according to claim 2, wherein the plurality of attributes include at least one demographic attribute and at least one clinical attribute.

8. A system for two-stage data instance processing, the system comprising: one or more processors; a memory coupled to at least one of the one or more processors; a network interface coupling the one or more processors to processors of a plurality of organizations over a network; a set of computer program instructions stored in the memory and executed by at least one of the one or more processors in order to perform actions of: obtaining a set of data instances collected over the network from the plurality of organizations, wherein each of the data instances includes at least one record formed in an organization of the plurality of organizations that stores values of a plurality of attributes of the data instance; preliminarily dividing, in stage one, conflict detection using the one or more processors, the set of data instances into groups within the memory, wherein data instances with conflicting values for the same attribute are divided into different groups and stored on a remote server in a distributed cloud computing environment; subdividing, in stage two, data instances in each of the groups into clusters to increase precision of the dividing, the subdividing comprising constructing, by one or more processors, for each of the data instances in each of the groups, a feature vector based on the at least one record of the data instance for further processing, with the values of the plurality of attributes being transformed into binary values for increased efficiency of the processing and minimization of bandwidth and memory requirements; and creating a new electronic record having the transformed binary values of the plurality of attributes of the data instances in the cluster by combining the data instances in each cluster, the electronic record being updated on a local machine and stored on the remote server in the distributed cloud computing environment to minimize network bandwidth and local machine memory requirements for efficient iterative accessing and maintaining of the record.

9. The system according to claim 8, wherein dividing the set of data instances into groups further includes: for each data instance in the set of data instances, forming a value sequence based on the at least one record of the data instance, wherein the value sequence includes selected attributes as its elements; constructing a conflict network in which a data instance is represented by a node and there is an edge between two nodes if two data instances represented by the two nodes have at least one conflicting element in their value sequences; and assigning labels to nodes of the conflict network with minimum labels, wherein nodes directly connected by an edge in the conflict network have different labels, and data instances represented by nodes with a same label are divided into a same group.

10. The system according to claim 8, wherein subdividing data instances in each of the groups into clusters further includes: calculating distances between every two feature vectors; and clustering the data instances in each of the groups based on the calculated distances.

11. The system according to claim 10, wherein constructing the feature vector further includes at least one of: splitting a first attribute of the plurality of attributes into a plurality of second attributes, wherein each of the plurality of second attributes is used to construct an element of the feature vector.

12. The system according to claim 11, wherein the first attribute is a clinical attribute.

13. The system according to claim 8, wherein the set of data instances are patient instances.

14. The system according to claim 9, wherein the plurality of attributes include at least one demographic attribute and at least one clinical attribute.

15. A computer program product for data instance processing, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the program instructions being executable by one or more processor devices to cause t
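
As a rough illustration of the stage-two steps recited in claims 1, 3, and 4 (splitting one attribute into several binary elements, computing pairwise distances, and clustering), the sketch below may help. It is an assumption-laden example, not the claimed implementation: the claim text shown here does not name a distance metric or a clustering algorithm, so Jaccard distance and threshold-based single-linkage clustering are stand-ins chosen for brevity. The attribute name `diagnoses`, the `threshold` parameter, and all function names are hypothetical.

```python
from itertools import combinations


def build_vocabulary(instances, attribute):
    """Split one multi-valued attribute into one binary element per distinct value."""
    return sorted({v for inst in instances for v in inst.get(attribute, [])})


def to_binary_vector(instance, attribute, vocabulary):
    """Encode an instance as 0/1 flags over the vocabulary."""
    present = set(instance.get(attribute, []))
    return [1 if v in present else 0 for v in vocabulary]


def jaccard_distance(u, v):
    """Assumed metric: 1 - |intersection| / |union| over the set bits."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def cluster(vectors, threshold):
    """Assumed algorithm: single-linkage clustering via union-find.
    Any two vectors at distance <= threshold end up in the same cluster."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if jaccard_distance(vectors[i], vectors[j]) <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up instances belonging to a single stage-one group.
group = [
    {"diagnoses": ["diabetes", "hypertension"]},
    {"diagnoses": ["diabetes", "hypertension", "asthma"]},
    {"diagnoses": ["fracture"]},
]
vocabulary = build_vocabulary(group, "diagnoses")
vectors = [to_binary_vector(inst, "diagnoses", vocabulary) for inst in group]
print(cluster(vectors, threshold=0.5))
```

With these made-up records, the first two instances share two of three diagnoses and fall into one cluster, while the third shares none and stays separate. A real system would also need the record-combining step from claim 1, which is omitted here.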

Assignees

Inventors

Classifications

  • G16H10/60 · Primary

    for patient-specific data, e.g. for electronic patient records · CPC title

  • Clustering or classification · CPC title

  • ICT specially adapted for medical reports, e.g. generation or transmission thereof · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11688494B2 cover?
The disclosure provides a method for data instance processing. The method includes obtaining a set of data instances collected from a plurality of organizations. Each of the data instances includes at least one record formed in an organization that stores values of a plurality of attributes of the data instance. The method also includes dividing the set of data instances into groups, wherein da…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G16H10/60. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 27, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).