Techniques for relationship discovery between datasets

US10650000B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10650000-B2
Application number  US-201715705160-A
Country             US
Kind code           B2
Filing date         Sep 14, 2017
Priority date       Sep 15, 2016
Publication date    May 12, 2020
Grant date          May 12, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to techniques for analyzing data from multiple different data sources to determine a relationship between the data (also referred to herein as “data relationship discovery”). The relationships between any two compared datasets may be used to determine one or more recommendations for merging (e.g., joining), or “blending,” the datasets together. Relationship discovery may include determining a relationship between a subset of data, such as a relationship between a pair of columns, or column pair, each column in a different dataset of the datasets that are compared. Given two datasets to process for relationship discovery, relationship discovery may identify and recommend a ranked subset of column pairs between the two compared datasets. The ranked column pairs identified as a relationship may be useful for blending the datasets with respect to those column pairs.
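
To make the idea of a ranked subset of column pairs concrete, here is a minimal Python sketch. It is an illustration only, not the method the patent claims: it assumes each dataset is a plain dict mapping column names to lists of values, and it uses Jaccard overlap of distinct values as a stand-in ranking measure. The names `jaccard`, `rank_column_pairs`, `orders`, and `customers` are hypothetical.

```python
from itertools import product

def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def rank_column_pairs(left, right, top_k=3):
    """Rank every (left column, right column) pair by value overlap.

    `left` and `right` map column names to lists of values.
    Jaccard overlap is an assumed measure chosen for illustration.
    """
    scored = [
        (jaccard(left[lc], right[rc]), lc, rc)
        for lc, rc in product(left, right)
    ]
    # Highest overlap first; ties keep their enumeration order.
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]

# Hypothetical sample data.
orders = {"cust_id": [1, 2, 3, 4], "city": ["Oslo", "Rome", "Oslo", "Lima"]}
customers = {"id": [2, 3, 4, 5], "town": ["Rome", "Lima", "Kyiv", "Oslo"]}

ranked = rank_column_pairs(orders, customers)
for score, left_col, right_col in ranked:
    print(f"{left_col} <-> {right_col}: {score:.2f}")
```

In this sample the `city`/`town` pair ranks first and `cust_id`/`id` second, so either could be offered as a candidate for blending the two datasets.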

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising, at a computer system: generating first profile metadata for each column of a first plurality of columns in a first dataset stored a first data source; generating second profile metadata for each column of a second plurality of columns in a second dataset stored a second data source; identifying a plurality of column pairs between the first dataset and the second dataset, wherein each column pair in the plurality of column pairs includes a different one of the first plurality of columns and a different one of the second plurality of columns; determining one or more column pairs from the plurality of identified column pairs to exclude; excluding at least one column pair from the one or more determined column pairs; for each of the one or more column pairs remaining after the excluding step: based on a type of join specified via a graphical interface, computing a plurality of scores for the column pair, each of the plurality of scores computed based on a different one of a plurality of scoring functions, the score indicating a measure for joining columns in the column pair; computing a plurality of weighted scores, each of the plurality of weighted scores computed for a different one of the plurality of scores based on applying one of a plurality of weights to the different one of the plurality of scores; and determining a pair score for the column pair, the pair score being a summation of the plurality of weighted scores; based on the pair score for each of the one or more column pairs, selecting a first column pair from the one or more column pairs; generating a third dataset based on merging, according to the type of join, the first dataset at a first column within the first column pair with the second dataset at a second column in the first column pair; and generating the graphical interface to display the generated third dataset.

2. The method of claim 1, wherein the type of join is a left join, a right join, or an outer join.

3. The method of claim 1, wherein the one or more column pairs is a set of column pairs, and wherein the method further comprises: determining a highest pair score from the pair score of each of the set of column pairs, wherein the first column pair is selected based on having the highest pair score.

4. The method of claim 1, wherein excluding the at least one column pair includes determining, based on the first profile metadata and the second profile metadata for columns in each of the at least one column pair, that the columns in each of the at least one column pair do not match each other based on a semantic category.

5. The method of claim 1, wherein excluding the at least one column pair includes determining, based on the first profile metadata and the second profile metadata for columns in each of the at least one column pair, that the columns in each of the at least one column pair do not have character sequence overlap.

6. The method of claim 1, wherein excluding the at least one column pair includes determining, based on the first profile metadata and the second profile metadata for columns in each of the at least one column pair, that the columns in each of the at least one column pair do not have population overlap based on text length range of each of the columns.

7. The method of claim 1, wherein excluding the at least one column pair includes determining, based on the first profile metadata and the second profile metadata for columns in each of the at least one column pair, that the columns in each of the at least one column pair do not have numerical overlap.

8. The method of claim 1, wherein the plurality of scoring functions include a first scoring function that is based on comparing example data for the columns in the column pair of the at least the one or more column pairs.

9. The method of claim 1, wherein the plurality of scoring functions includes a first scoring function based on column type.

10. The method of claim 1, wherein the plurality of scoring functions includes a first scoring function based on numerical data from the first profile metadata and the second profile metadata for the columns in the column pair of the at least the one or more column pairs.

11. A system comprising: one or more processors; and a memory accessible to the one or more processors, the memory comprising instructions that, when executed by the one or more processors, cause the one or more processors to: generate first profile metadata for each column of a first plurality of columns in a first dataset stored a first data source; generate second profile metadata for each column of a second plurality of columns in a second dataset stored a second data source; identify a plurality of column pairs between the first dataset and the second dataset, wherein each column pair in the plurality of column pairs includes a different one of the first plurality of columns and a different one of the second plurality of columns; determine one or more column pairs from the plurality of identified column pairs to exclude; exclude at least one column pair from the one or more determined column pairs; for each of the one or more column pairs remaining after the excluding step: based on a type of join specified via a graphical interface, compute a plurality of scores for the column pair, each of the plurality of scores computed based on a different one of a plurality of scoring functions, the score indicating a measure for joining columns in the column pair; compute a plurality of weighted scores, each of the plurality of weighted scores computed for a different one of the plurality of scores based on applying one of a plurality of weights to the different one of the plurality of scores; and determine a pair score for the column pair, the pair score being a summation of the plurality of weighted scores; based on the pair score for each of the one or more column pairs, select a first column pair from the one or more column pairs; generate a third dataset based on merging, according to the type of join, the first dataset at a first column within the first column pair with the second dataset at a second column in the first column pair; and generate the graphical interface to display the generated third dataset.

12. The system of claim 11, wherein the type of join is a left join, a right join, or an outer join.

13. The system of claim 11, wherein the one or more column pairs is a set of column pairs, and wherein the instructions, when executed by the one or more processors, further cause the one or more processors to: determine a highest pair score from the pair score of each of the set of column pairs, wherein the first column pair is selected based on having the highest pair score.

14. The system of claim 11, wherein the plurality of scoring functions include a first scoring function that is based on comparing example data for the columns in the column pair of the at least the one or more column pairs.

15. The system of claim 11, wherein the plurality of scoring functions includes a first scoring function based on column type.

16. The system of claim 11, wherein the plurality of scoring functions includes a first scoring function based on numerical data from the first profile metadata and the second profile metadata for the columns in the column pair of the at least the one or more column pairs.

17. A non-transitory computer readable medium storing one or more instructions that are executable by one or more processors to cause the one or more processors to: generate first profile metada
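
The steps recited in claim 1 (profile each column, exclude implausible pairs, score the rest with weighted scoring functions, pick the best pair, merge) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the profile fields, the two scoring functions, the weights 0.2 and 0.8, and all function names are hypothetical, the join type is fixed to a left join, every column is assumed to be non-empty, and the two datasets are assumed not to share column names.

```python
from itertools import product

def profile(values):
    """Build small per-column profile metadata (hypothetical fields)."""
    texts = [str(v) for v in values]
    is_num = all(isinstance(v, (int, float)) for v in values)
    return {
        "type": "number" if is_num else "text",
        "min_len": min(len(t) for t in texts),
        "max_len": max(len(t) for t in texts),
        "distinct": set(values),
    }

def should_exclude(p1, p2):
    """Pre-filter: drop pairs whose types or text-length ranges cannot match."""
    if p1["type"] != p2["type"]:
        return True
    return p1["max_len"] < p2["min_len"] or p2["max_len"] < p1["min_len"]

def type_score(p1, p2):
    """1.0 when both columns have the same profiled type."""
    return 1.0 if p1["type"] == p2["type"] else 0.0

def overlap_score(p1, p2):
    """Fraction of the left column's distinct values found on the right
    (an assumed measure that suits a left join)."""
    return len(p1["distinct"] & p2["distinct"]) / len(p1["distinct"])

# (scoring function, weight) pairs; the weights are illustrative only.
SCORERS = [(type_score, 0.2), (overlap_score, 0.8)]

def pair_score(p1, p2):
    """Pair score as the sum of the weighted individual scores."""
    return sum(weight * fn(p1, p2) for fn, weight in SCORERS)

def best_pair(left, right):
    """Return the (left column, right column) pair with the highest pair score."""
    lp = {c: profile(v) for c, v in left.items()}
    rp = {c: profile(v) for c, v in right.items()}
    candidates = [
        (lc, rc) for lc, rc in product(lp, rp)
        if not should_exclude(lp[lc], rp[rc])
    ]
    # Raises ValueError if every pair was excluded.
    return max(candidates, key=lambda pr: pair_score(lp[pr[0]], rp[pr[1]]))

def left_join(left, right, lc, rc):
    """Left join two column-oriented tables; returns a list of row dicts."""
    index = {}
    for j, key in enumerate(right[rc]):
        index.setdefault(key, []).append(j)
    rows = []
    for i, key in enumerate(left[lc]):
        base = {c: col[i] for c, col in left.items()}
        # Unmatched left rows keep a single row with None on the right side.
        for j in index.get(key, [None]):
            extra = {
                c: (col[j] if j is not None else None)
                for c, col in right.items() if c != rc
            }
            rows.append({**base, **extra})
    return rows

# Hypothetical sample data.
orders = {"order_id": [10, 11, 12], "cust": ["a1", "b2", "z9"]}
customers = {"cust_code": ["a1", "b2", "c3"], "age": [31, 45, 27]}

chosen = best_pair(orders, customers)
joined = left_join(orders, customers, *chosen)
```

With this sample data, the pairs that mix a numeric column with a text column are excluded before scoring, `cust`/`cust_code` receives the highest pair score, and the order with customer `z9` survives the left join with `age` set to `None`.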

Assignees

Oracle Int Corp

Inventors

Classifications

  • Column-oriented storage; Management thereof · CPC title

  • Presentation of query results · CPC title

  • Join operations · CPC title

  • G06F16/254 · Primary

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10650000B2 cover?
The present disclosure relates to techniques for analyzing data from multiple different data sources to determine a relationship between the data (also referred to herein as “data relationship discovery”). The relationships between any two compared datasets may be used to determine one or more recommendations for merging (e.g., joining), or “blending,” the datasets together. Relationship discov…
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/2456. Mapped technology areas include Physics.
When was this patent published?
Publication date May 12, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).