Classifying an unmanaged dataset

US10592481B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10592481-B2
Application number  US-201715480501-A
Country             US
Kind code           B2
Filing date         Apr 6, 2017
Priority date       Oct 14, 2015
Publication date    Mar 17, 2020
Grant date          Mar 17, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer implemented method for classifying at least one source dataset of a computer system. The method may include providing a plurality of associated reference tables organized and associated in accordance with a reference storage model in the computer system. The method may also include calculating, by a data classifier application of the computer system, a first similarity score between the source dataset and a first reference table of the reference tables based on common attributes in the source dataset and a join of the first reference table with at least one further reference table of the reference tables having a relationship with the first reference table. The method may further include classifying, by the data classifier application, the source dataset by determining using at least the calculated first similarity score whether the source dataset is organized as the first reference table in accordance to the reference storage model.
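The "common attributes" step in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: attributes are matched by exact name, the join is approximated as the union of the attribute names of the first reference table and its directly related tables, and all table names, attribute names, and function names (`joined_attributes`, `common_attributes`) are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def joined_attributes(
    first: str,
    tables: Dict[str, Set[str]],
    related: Dict[str, List[str]],
) -> Set[str]:
    """Attribute names of `first` plus those of its directly related tables.

    Assumption: the join is modeled as a union of attribute names only.
    """
    attrs = set(tables[first])
    for other in related.get(first, []):
        attrs |= tables[other]
    return attrs


def common_attributes(
    source_attrs: Iterable[str],
    first: str,
    tables: Dict[str, Set[str]],
    related: Dict[str, List[str]],
) -> Set[str]:
    """Attributes shared by the source dataset and the joined reference tables."""
    return set(source_attrs) & joined_attributes(first, tables, related)


# Hypothetical reference storage model with one foreign key relationship.
tables = {
    "customer": {"id", "name"},
    "address": {"customer_id", "city"},
}
related = {"customer": ["address"]}

shared = common_attributes({"name", "city", "foo"}, "customer", tables, related)
```

Here the source dataset shares `name` with the first reference table and `city` with the related table, while `foo` matches nothing. A real implementation might match on data types or value profiles instead of names; the abstract leaves this open.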

First claim

Opening claim text (preview).

What is claimed is:

1. A computer program product for classifying at least one source dataset, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to provide a plurality of associated reference tables organized and associated in accordance with a reference storage model; calculate a first similarity score between the source dataset and a first reference table of the reference tables based on common attributes in the source dataset and a join of the first reference table with at least one further reference table of the reference tables having a relationship with the first reference table, wherein the first similarity score is calculated with the following formula,

$$\mathrm{Score} = \sum_{dist=0}^{n} \frac{S_{dist}}{1 + dist}, \quad \text{where} \quad S_{dist} = \sum_{d=0}^{ds} \frac{2 \cdot W_d}{N_{DS} + N_{DT}},$$

ds is the number of common attributes, $W_d = 1/\mathrm{card}(D)$ where card(D) is the cardinality of an attribute D in the reference tables, $N_{DS}$ is the number of attributes in the source dataset, $N_{DT}$ is the number of attributes in the reference tables, n is the number of the at least one further reference table plus the first reference table, and dist is the distance in terms of number of foreign key relationships between the first reference table and the at least one further reference table; classify the source dataset by determining using at least the calculated first similarity score whether the source dataset is organized as the first reference table in accordance the reference storage model.

2. A computer system for classifying at least one source dataset, the computer system being configured for: providing a plurality of associated reference tables organized and associated in accordance with a reference storage model; calculating a first similarity score between the source dataset and a first reference table of the reference tables based on common attributes in the source dataset and a join of the first reference table with at least one further reference table of the reference tables having a relationship with the first reference table, wherein the first similarity score is calculated with the following formula,

$$\mathrm{Score} = \sum_{dist=0}^{n} \frac{S_{dist}}{1 + dist}, \quad \text{where} \quad S_{dist} = \sum_{d=0}^{ds} \frac{2 \cdot W_d}{N_{DS} + N_{DT}},$$

ds is the number of common attributes, $W_d = 1/\mathrm{card}(D)$ where card(D) is the cardinality of an attribute D in the reference tables, $N_{DS}$ is the number of attributes in the source dataset, $N_{DT}$ is the number of attributes in the reference tables, n is the number of the at least one further reference table plus the first reference table, and dist is the distance in terms of number of foreign key relationships between the first reference table and the at least one further reference table; and classifying the source dataset by determining using at least the calculated first similarity score whether the source dataset is organized as the first reference table in accordance the reference storage model.

3. The computer system of claim 2, further comprising: repeating the step of calculating for a second reference table of the reference tables, wherein determining comprises comparing the first and second similarity scores for determining whether the source dataset is organized as the first reference table or as the second reference table in accordance with the reference storage model.

4. The computer system of claim 2, wherein the repeating is performed in response to determining that the first similarity score is smaller than a predefined similarity threshold.

5. The computer system of claim 2, wherein the at least one further reference table is selected based on at least one of the further reference table has a direct relationship with the first reference table, the further reference table has an indirect relationship with the first reference table, and the number of common attributes between the source dataset and the further reference table is smaller than the number of common attributes between the source dataset and the first reference table.

6. The computer system of claim 2, wherein the source dataset
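The score formula in claims 1 and 2 can be sketched in Python as follows. This is one possible reading, not the patented implementation. It assumes that common attributes are grouped by the foreign key distance of the reference table in which they are found, that N_DT counts the distinct attribute names across all reference tables in the join, and that card(D) is supplied by the caller as a positive integer per attribute. The names `similarity_score`, `classify`, `tables_by_dist`, and `card` are hypothetical, and `classify` is a simplified stand-in for the score comparison and threshold mentioned in claims 3 and 4.

```python
from typing import Dict, Optional, Set


def similarity_score(
    source_attrs: Set[str],
    tables_by_dist: Dict[int, Set[str]],
    card: Dict[str, int],
) -> float:
    """Score = sum over dist of S_dist / (1 + dist).

    tables_by_dist maps a foreign key distance (0 = first reference table)
    to the attribute names found at that distance (assumed grouping).
    """
    n_ds = len(source_attrs)
    # Assumption: N_DT is the number of distinct reference attributes.
    n_dt = len(set().union(*tables_by_dist.values()))
    denom = n_ds + n_dt
    if denom == 0:
        return 0.0
    score = 0.0
    for dist, ref_attrs in tables_by_dist.items():
        common = source_attrs & ref_attrs
        # S_dist = sum over common attributes of 2 * W_d / (N_DS + N_DT),
        # with W_d = 1 / card(D).
        s_dist = sum(2.0 * (1.0 / card[d]) / denom for d in common)
        score += s_dist / (1 + dist)
    return score


def classify(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-scoring reference table, or None if below threshold."""
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < threshold:
        return None
    return best


# Hypothetical example: first table at distance 0, one related table at 1.
example_score = similarity_score(
    {"name", "city", "foo"},
    {0: {"id", "name"}, 1: {"customer_id", "city"}},
    {"id": 1, "name": 1, "customer_id": 1, "city": 2},
)
```

The division by `1 + dist` means an attribute matched in a distant related table contributes less than the same attribute matched in the first reference table, and a high card(D) further reduces the weight of an attribute.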

Classifications

  • G06F16/211 · Primary

    Schema design and management · CPC title

  • Tablespace storage structures; Management thereof · CPC title

  • Clustering or classification · CPC title

  • Comparing separate sets of record carriers arranged in the same sequence to determine whether at least some of the data in one set is identical with that in the other set or sets · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10592481B2 cover?
A computer implemented method for classifying at least one source dataset of a computer system. The method may include providing a plurality of associated reference tables organized and associated in accordance with a reference storage model in the computer system. The method may also include calculating, by a data classifier application of the computer system, a first similarity score between …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/211. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 17, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).