Data difference evaluation via model comparison

US2025117443A1 · US · A1

Patent metadata
Field | Value
Publication number | US-2025117443-A1
Application number | US-202318482975-A
Country | US
Kind code | A1
Filing date | Oct 9, 2023
Priority date | Oct 9, 2023
Publication date | Apr 10, 2025
Grant date | (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for performing data difference evaluation is provided. Aspects include obtaining a first data set and a second data set, creating a first plurality of feature vectors by inputting the first data set into each of a plurality of models, and creating a second plurality of feature vectors by inputting the second data set into each of the plurality of models. Aspects also include identifying a mapping between elements of the first plurality of vectors and elements the second plurality of feature vectors created by a same model of the plurality of models, calculating, for each of the plurality of models based at least in part on the mapping, a model distance between the first data set and the second data set, and calculating, based at least in part on the model distances, an ensemble distance between first data set and the second data set.
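The pipeline in the abstract can be illustrated with a minimal sketch. It assumes, as the claims suggest, that the models are K-means clusterers with different cluster counts and that each model's feature vectors are its cluster centroids. Everything else is an assumption made for illustration and is not taken from the patent: the function names (`kmeans`, `map_centroids`, `model_distance`, `ensemble_distance`), clustering each data set independently, the brute-force minimum-cost mapping, the mean centroid distance used as the model distance, and the plain average used as the ensemble distance.

```python
import math
import random
from itertools import permutations


def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's-algorithm K-means; returns k centroids as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct positions
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def map_centroids(a, b):
    """Pair centroids of a with centroids of b, minimising total distance.

    Brute force over permutations (an assumption; only sensible for small k).
    """
    best = min(
        permutations(range(len(b))),
        key=lambda perm: sum(math.dist(a[i], b[j]) for i, j in enumerate(perm)),
    )
    return list(enumerate(best))


def model_distance(a, b):
    """Assumed model distance: mean distance between mapped centroids."""
    pairs = map_centroids(a, b)
    return sum(math.dist(a[i], b[j]) for i, j in pairs) / len(pairs)


def ensemble_distance(data1, data2, ks=(2, 3, 4)):
    """Assumed ensemble distance: plain average of the per-model distances."""
    dists = [model_distance(kmeans(data1, k), kmeans(data2, k)) for k in ks]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    rng = random.Random(42)
    first = [(rng.random(), rng.random()) for _ in range(50)]
    second = [(x + 5.0, y) for x, y in first]  # same shape, shifted along x
    print(ensemble_distance(first, first))   # identical data sets
    print(ensemble_distance(first, second))  # shifted data set
```

Using several cluster counts means no single choice of k decides the result, which appears to be the motivation for combining per-model distances into one ensemble figure.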

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for data difference evaluation, the computer-implemented method comprising:
    obtaining a first data set and a second data set;
    inputting the first data set into each of a plurality of clustering models, wherein each of the plurality of clustering models separates the first data set into a different number of clusters;
    storing an output of each of the plurality of clustering models corresponding to the first data set into a first plurality of cluster vectors, where each of the first plurality of cluster vectors has a dimension that corresponds to the number of clusters;
    inputting the second data set into each of the plurality of clustering models, wherein each of the plurality of clustering models separates the second data set into a different number of clusters;
    storing the output of each of the plurality of clustering models corresponding to the second data set into a second plurality of cluster vectors, where each of the second plurality of cluster vectors has a dimension that corresponds to the number of clusters;
    identifying a mapping between elements of the first plurality of cluster vectors and elements the second plurality of cluster vectors having a same dimension;
    calculating, for each dimension based at least in part on the mapping, a dimensional distance between the first data set and the second data set; and
    calculating, based at least in part on the dimensional distances, an ensemble distance between first data set and the second data set.

2. The computer-implemented method of claim 1, wherein each of the elements of the first plurality of cluster vectors and the elements of the second plurality of cluster vectors each include a data cluster and wherein the mapping is identified based on a centroid for each data cluster.

3. The computer-implemented method of claim 2, wherein the dimensional distance between the first data set and the second data set for each dimension is calculated based on a size of the first data set, a size of the second data set, and a distance between the centroid of mapped elements of the first plurality of cluster vectors and elements of the second plurality of cluster vectors.

4. The computer-implemented method of claim 1, wherein the ensemble distance between first data set and the second data set is calculated as an average of the dimensional distance for each dimension.

5. The computer-implemented method of claim 1, wherein the ensemble distance between first data set and the second data set is calculated as a weighted average of the dimensional distance for each dimension, where a weight applied to each dimensional distance is based on a cluster quality associated with each dimension.

6. The computer-implemented method of claim 1, further comprising removing cluster vectors from the first plurality of cluster vectors and the second plurality of cluster vectors having a cluster quality below a threshold value.

7. The computer-implemented method of claim 1, wherein the plurality of clustering models are K-means clustering models.

8. A computer program product having one or more computer readable storage media having computer readable program code collectively stored on the one or more computer readable storage media, the computer readable program code being executed by a processor of a computer system to cause the computer system to perform operations comprising:
    obtaining a first data set and a second data set;
    inputting the first data set into each of a plurality of clustering models, wherein each of the plurality of clustering models separates the first data set into a different number of clusters;
    storing an output of each of the plurality of clustering models corresponding to the first data set into a first plurality of cluster vectors, where each of the first plurality of cluster vectors has a dimension that corresponds to the number of clusters;
    inputting the second data set into each of the plurality of clustering models, wherein each of the plurality of clustering models separates the second data set into a different number of clusters;
    storing the output of each of the plurality of clustering models corresponding to the second data set into a second plurality of cluster vectors, where each of the second plurality of cluster vectors has a dimension that corresponds to the number of clusters;
    identifying a mapping between elements of the first plurality of cluster vectors and elements the second plurality of cluster vectors having a same dimension;
    calculating, for each dimension based at least in part on the mapping, a dimensional distance between the first data set and the second data set; and
    calculating, based at least in part on the dimensional distances, an ensemble distance between first data set and the second data set.

9. The computer program product of claim 8, wherein each of the elements of the first plurality of cluster vectors and the elements of the second plurality of cluster vectors each include a data cluster and wherein the mapping is identified based on a centroid for each data cluster.

10. The computer program product of claim 9, wherein the dimensional distance between the first data set and the second data set for each dimension is calculated based on a size of the first data set, a size of the second data set, and a distance between the centroid of mapped elements of the first plurality of cluster vectors and elements of the second plurality of cluster vectors.

11. The computer program product of claim 8, wherein the ensemble distance between first data set and the second data set is calculated as an average of the dimensional distance for each dimension.

12. The computer program product of claim 8, wherein the ensemble distance between first data set and the second data set is calculated as a weighted average of the dimensional distance for each dimension, where a weight applied to each dimensional distance is based on a cluster quality associated with each dimension.

13. The computer program product of claim 8, wherein the operations further comprise removing cluster vectors from the first plurality of cluster vectors and the second plurality of cluster vectors having a cluster quality below a threshold value.

14. The computer program product of claim 8, wherein the plurality of clustering models are K-means clustering models.

15. A computing system comprising: a processor; a memory coupled to the processor; and one or more computer readable storage media coupled to the processor, the one or more computer readable storage media collectively containing instructions that are executed by the processor via the memory to cause the processor to perform operations comprising:
    obtaining a first data set and a second data set;
    inputting the first data set into each of a plurality of clustering models, wherein each of the plurality of clustering models separates the first data set into a different number of clusters;
    storing an output of each of the plurality of clustering models corresponding to the first data set into a first plurality of cluster vectors, where each of the first plurality of cluster vectors has a dimension that corresponds to the number of clusters;
    inputting the second data set into each of the plurality of clustering models, wherein each of the plurality of clustering models separates the second data set into a different number of clusters;
    storing the output of each of the plurality of clustering models corresponding to the second data set into a second plurality of cluster vectors, where each of the second pl
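The dependent claims on dimensional distance, quality weighting, and quality filtering (claims 3, 5, and 6 and their counterparts) name their inputs but this preview gives no formulas. The sketch below is therefore one hypothetical reading, not the patent's method. Its assumptions are the `MappedPair` structure, the use of per-cluster point counts normalised by data set size as the way data set sizes enter the distance, a non-negative quality score per dimension (for example a clustering score rescaled to the range 0 to 1), and the parameter name `min_quality`.

```python
import math
from dataclasses import dataclass


@dataclass
class MappedPair:
    """One mapped pair of clusters for a given cluster count (hypothetical)."""
    centroid_a: tuple  # centroid of a cluster from the first data set
    centroid_b: tuple  # centroid of its mapped cluster from the second data set
    count_a: int       # points in that cluster of the first data set
    count_b: int       # points in that cluster of the second data set


def dimensional_distance(pairs, size_a, size_b):
    """Assumed formula: centroid distances weighted by each pair's mean
    share of its data set (count divided by data set size)."""
    total = 0.0
    for p in pairs:
        share = 0.5 * (p.count_a / size_a + p.count_b / size_b)
        total += share * math.dist(p.centroid_a, p.centroid_b)
    return total


def ensemble_distance(distances, qualities, min_quality=0.0):
    """Quality-weighted average of dimensional distances.

    Dimensions whose quality is below min_quality are dropped first.
    Qualities are assumed to be non-negative.
    """
    kept = [(d, q) for d, q in zip(distances, qualities) if q >= min_quality]
    if not kept:
        raise ValueError("no dimension meets the quality threshold")
    weight_sum = sum(q for _, q in kept)
    if weight_sum == 0:
        # All remaining weights are zero: fall back to a plain average.
        return sum(d for d, _ in kept) / len(kept)
    return sum(d * q for d, q in kept) / weight_sum


if __name__ == "__main__":
    pairs = [
        MappedPair((0.0, 0.0), (3.0, 4.0), count_a=5, count_b=5),
        MappedPair((1.0, 1.0), (1.0, 1.0), count_a=5, count_b=5),
    ]
    print(dimensional_distance(pairs, size_a=10, size_b=10))
    print(ensemble_distance([2.0, 4.0], [1.0, 3.0]))
```

With equal qualities the weighted average reduces to the plain average of claim 4, so a single function can serve both variants.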

Assignees

IBM

Inventors

Classifications

  • G06F18/2325 · using vector quantisation

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025117443A1 cover?
A computer-implemented method for performing data difference evaluation is provided. Aspects include obtaining a first data set and a second data set, creating a first plurality of feature vectors by inputting the first data set into each of a plurality of models, and creating a second plurality of feature vectors by inputting the second data set into each of the plurality of models. Aspects al…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F18/2325. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 10, 2025 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).