Duplicative data detection

US10789240B2 · US · B2

Patent metadata
Publication number: US-10789240-B2
Application number: US-201715805047-A
Country: US
Kind code: B2
Filing date: Nov 6, 2017
Priority date: Nov 6, 2017
Publication date: Sep 29, 2020
Grant date: Sep 29, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In some implementations, a computer-implemented method includes analyzing first data from a first data source to determine a first schema of the first data source, and analyzing second data from a second data source to determine a second schema of the second data source. The method can further include generating a first two-dimensional aggregation of a first time data series having a time dimension and a dimension corresponding to aggregated values of a first metric, and generating a second two-dimensional aggregation of a second time data series having a time dimension and a dimension corresponding to aggregated values of a second metric. The method can also include computing a correlation value between the first two-dimensional aggregation and the second two-dimensional aggregation, and providing an indication of duplicated data between the first data source and the second data source if the correlation value meets a threshold.
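The flow in the abstract (aggregate each source over time, correlate the two aggregations, compare against a threshold) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not taken from the patent: the function names, summing as the aggregation, day-level time buckets, Pearson correlation as the correlation measure, and the 0.99 threshold. The abstract does not say which aggregation function or correlation technique is used.

```python
from collections import defaultdict
from datetime import date
from math import sqrt


def aggregate_by_day(rows):
    """Sum metric values per day, giving a (time, aggregated value) mapping."""
    totals = defaultdict(float)
    for day, value in rows:
        totals[day] += value
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def is_probable_duplicate(rows_a, rows_b, threshold=0.99):
    """Return (flag, correlation) for two sources of (day, value) rows.

    The threshold is a hypothetical default chosen for this sketch.
    """
    agg_a, agg_b = aggregate_by_day(rows_a), aggregate_by_day(rows_b)
    shared_days = sorted(agg_a.keys() & agg_b.keys())
    r = pearson([agg_a[d] for d in shared_days], [agg_b[d] for d in shared_days])
    return r >= threshold, r


# Hypothetical data: source A stores several rows per day, source B stores daily totals.
rows_a = [
    (date(2017, 11, 1), 10.0), (date(2017, 11, 1), 5.0),
    (date(2017, 11, 2), 20.0),
    (date(2017, 11, 3), 7.0), (date(2017, 11, 3), 1.0),
]
rows_b = [(date(2017, 11, 1), 15.0), (date(2017, 11, 2), 20.0), (date(2017, 11, 3), 8.0)]
rows_c = [(date(2017, 11, 1), 3.0), (date(2017, 11, 2), 1.0), (date(2017, 11, 3), 9.0)]

flagged, r = is_probable_duplicate(rows_a, rows_b)
print(flagged, r)
```

Here rows_a and rows_b aggregate to the same daily totals, so they are flagged, while rows_c moves in the opposite direction and is not. A real system would need many more points per series before a correlation value means much.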

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: programmatically analyzing first data from a first data source to determine a first schema of the first data source, the first schema including one or more dimensions of the first data from the first data source, at least one of the one or more dimensions of the first data from the first data source being a first time dimension; programmatically analyzing second data from a second data source to determine a second schema of the second data source, the second schema including one or more dimensions of the second data from the second data source, at least one of the one or more dimensions of the second data of the second data source being a second time dimension; sampling a first metric based on a first time dimension of the first data source to obtain a plurality of values for the first metric that form a first time data series, wherein the plurality of values for the first metric are sampled based on the first time dimension; sampling a second metric based on the second time dimension of the second data source to obtain a plurality of values for the second metric that form a second time data series, wherein the plurality of values for the second metric are sampled based on the second time dimension; generating a first two-dimensional aggregation of the first time data series having a time dimension and a dimension corresponding to aggregated values of the first metric; generating a second two-dimensional aggregation of the second time data series having a time dimension and a dimension corresponding to aggregated values of the second metric; computing a correlation value between the first two-dimensional aggregation and the second two-dimensional aggregation; and when the correlation value meets a threshold, providing an indication of duplicated data between the first data source and the second data source.

2. The computer-implemented method of claim 1, wherein programmatically analyzing the first data source to determine the first schema of the first data source is performed using a named entity recognition technique.

3. The computer-implemented method of claim 2, further comprising identifying, using the named entity recognition technique, one or more of: at least one dimension of the first schema of the first data source that is similar to at least one dimension of the second schema of the second data source, and at least one dimension of the first schema of the first data source and at least one dimension of the second schema of the second data source that provide different levels of granularity of a common dimension.

4. The computer-implemented method of claim 1, wherein computing the correlation value is performed using k-means clustering.

5. The computer-implemented method of claim 1, further comprising: repeating the sampling and generating for the first data source and the second data source using respective other metrics different from the first metric and the second metric to generate respective additional pairs of two-dimensional aggregations corresponding to the first data source and the second data source, respectively; computing respective correlation values between each of the respective additional pairs of two-dimensional aggregations; and when one or more of the respective correlation values meet the threshold, providing one or more additional indications of the duplicated data between the first data source and the second data source.

6. The computer-implemented method of claim 1, wherein sampling the first metric based on the first time dimension of the first data source includes sampling each value of the first metric; and wherein sampling the second metric based on the second time dimension of the second data source includes sampling each value of the second metric.

7. The computer-implemented method of claim 1, wherein providing the indication of the duplicated data includes providing a recommendation of a level of granularity of data to store.

8. The computer-implemented method of claim 1, further comprising: identifying one or more entity to entity relationships based on the first schema and the second schema; storing the one or more entity to entity relationships in a library of relationships; and using the library of relationships to perform a duplication check for a third data source wherein the duplication check for the third data source comprises a check for duplicated data between the third data source and the first data source or the third data source and the second data source.

9. The computer-implemented method of claim 1, wherein providing the indication of the duplicated data includes providing a user interface that includes a user interface element that, when selected, causes the duplicated data to be deleted from at least one of the first data source and the second data source.

10. The computer-implemented method of claim 9, further comprising: upon selection of the user interface element, deleting the duplicated data from the at least one of the first data source and the second data source, wherein the deletion causes storage space utilized for storage of the first data to be lower than prior to the deletion.

11. The computer-implemented method of claim 1, wherein providing the indication of the duplicated data between the first data source and the second data source comprises: automatically deleting the duplicated data; and providing a user interface that indicates that the duplicated data was deleted.

12. The computer-implemented method of claim 11, wherein the user interface includes an element that indicates an amount of the duplicated data.

13. The computer-implemented method of claim 1, wherein providing the indication of the duplicated data between the first data source and the second data source comprises providing a confidence value for the duplicated data.

14. A non-transitory computer-readable medium storing instructions, the instructions when executed by one or more processors, cause the one or more processors to: programmatically analyze first data from a first data source to determine a first schema of the first data source, the first schema including one or more dimensions of the first data from the first data source, the first data of the first data source further comprising a first time dimension; programmatically analyze second data from a second data source to determine a second schema of the second data source, the second schema including one or more dimensions of the second data from the second data source, the second data of the second data source further comprising a second time dimension; obtain first sample data from the first data source wherein the first sample data includes a plurality of values for a first metric and a respective first time value having a first time dimension; obtain second sample data from the second data source wherein the second sample data includes a plurality of values for the first metric and a respective second time value having a second time dimension, wherein the second time dimension is less granular than the first time dimension; aggregate the first sample data to generate aggregated first sample data comprising a plurality of values for the first metric, wherein the aggregated first sample data includes grouping respective subsets of the plurality of values that are within a respective particular time interval; calculate a correlation value between the aggregated first sample data and the second sample data; and when the correlation value meets a threshold, provide an indication of duplicated data between the first data source and the
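
The granularity case described in claim 14, where one source is finer-grained in time than the other, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the claimed implementation: the names roll_up_to_days and check_duplication, the use of calendar days as the coarser interval, summation as the grouping operation, Pearson correlation, and the 0.95 threshold are all hypothetical choices.

```python
from collections import defaultdict
from datetime import date, datetime
from math import sqrt


def roll_up_to_days(samples):
    """Group (timestamp, value) samples into one summed value per calendar day."""
    buckets = defaultdict(float)
    for ts, value in samples:
        buckets[ts.date()] += value
    return dict(buckets)


def correlation(xs, ys):
    """Pearson correlation, or None when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return None
    return cov / (spread_x * spread_y)


def check_duplication(fine_samples, coarse_by_day, threshold=0.95):
    """Compare a fine-grained source against a day-level source.

    The fine-grained samples are first rolled up to the coarser interval so
    that both series share the same time dimension before correlating.
    """
    rolled = roll_up_to_days(fine_samples)
    days = sorted(rolled.keys() & coarse_by_day.keys())
    r = None
    if len(days) >= 2:
        r = correlation([rolled[d] for d in days], [coarse_by_day[d] for d in days])
    return {
        "duplicate": r is not None and r >= threshold,
        "correlation": r,
        "days_compared": len(days),
    }


# Hypothetical data: timestamped events in one source, daily totals in the other.
events = [
    (datetime(2017, 11, 6, 9), 4.0), (datetime(2017, 11, 6, 15), 6.0),
    (datetime(2017, 11, 7, 10), 3.0),
    (datetime(2017, 11, 8, 8), 12.0), (datetime(2017, 11, 8, 20), 1.0),
]
daily_totals = {date(2017, 11, 6): 10.0, date(2017, 11, 7): 3.0, date(2017, 11, 8): 13.0}

result = check_duplication(events, daily_totals)
print(result)
```

Rolling the finer source up to the coarser interval, rather than trying to split the coarse values, keeps the comparison lossless on the coarse side. The returned correlation could also serve as a rough stand-in for the confidence value mentioned in claim 13, although the claims do not say how that value is derived.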

Assignees

Google LLC

Inventors

Classifications

  • G06F16/254 (Primary)

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Named entity recognition · CPC title

  • Ensuring data consistency and integrity · CPC title

  • Updates performed during online database operations; commit processing · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10789240B2 cover?
In some implementations, a computer-implemented method includes analyzing first data from a first data source to determine a first schema of the first data source, and analyzing second data from a second data source to determine a second schema of the second data source. The method can further include generating a first two-dimensional aggregation of a first time data series having a time dimen…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/254. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 29, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).