Scalable analysis platform for semi-structured data

US9613068B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9613068-B2
Application number  US-201414213941-A
Country             US
Kind code           B2
Filing date         Mar 14, 2014
Priority date       Mar 15, 2013
Publication date    Apr 4, 2017
Grant date          Apr 4, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data transformation system includes a schema inference module and an export module. The schema inference module is configured to dynamically create a cumulative schema for objects retrieved from a first data source. Each of the retrieved objects includes (i) data and (ii) metadata describing the data. Dynamically creating the cumulative schema includes, for each object of the retrieved objects, (i) inferring a schema from the object and (ii) selectively updating the cumulative schema to describe the object according to the inferred schema. The export module is configured to output the data of the retrieved objects to a data destination system according to the cumulative schema.
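
The per-object loop described in the abstract (infer a schema from each object, then selectively update the cumulative schema) can be illustrated with a minimal sketch. It assumes the retrieved objects are JSON documents and that a schema is simply a mapping from field path to the set of type names seen so far. The names `type_name`, `infer_schema`, and `update_cumulative` are hypothetical and do not come from the patent, which does not prescribe this representation.

```python
import json
from typing import Any, Dict, Set

# Assumed representation: field path -> set of observed type names.
Schema = Dict[str, Set[str]]


def type_name(value: Any) -> str:
    """Map a JSON scalar to a coarse type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must precede the int check: bool subclasses int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")


def _walk(value: Any, path: str, schema: Schema) -> None:
    """Record the type of every scalar reachable from value."""
    if isinstance(value, dict):
        for key, child in value.items():
            _walk(child, f"{path}.{key}" if path else key, schema)
    elif isinstance(value, list):
        for item in value:
            _walk(item, path + "[]", schema)
    else:
        schema.setdefault(path, set()).add(type_name(value))


def infer_schema(obj: Any) -> Schema:
    """Infer a schema from one object."""
    schema: Schema = {}
    _walk(obj, "", schema)
    return schema


def update_cumulative(cumulative: Schema, inferred: Schema) -> bool:
    """Update the cumulative schema only where the object adds something new.

    Returns True if the cumulative schema changed.
    """
    changed = False
    for path, types in inferred.items():
        known = cumulative.setdefault(path, set())
        if not types <= known:
            known |= types
            changed = True
    return changed


# Hypothetical input objects, invented for illustration.
raw_objects = [
    '{"id": 1, "user": {"name": "a"}}',
    '{"id": "2", "tags": ["x"]}',
    '{"id": 3, "user": {"name": "b"}}',
]

cumulative: Schema = {}
changes = [update_cumulative(cumulative, infer_schema(json.loads(raw)))
           for raw in raw_objects]
```

In this sketch the third object leaves the cumulative schema untouched because it introduces no new field or type, which is one plausible reading of "selectively updating". A field such as `id`, seen as both a number and a string, ends up with two type names.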

First claim

Opening claim text (preview).

The invention claimed is:

1. A data transformation system comprising: one or more computing devices comprising one or more hardware processors and memory and configured to implement: a schema inference module configured to dynamically create a cumulative schema for objects retrieved from a first data source, wherein: each of the retrieved objects includes (i) data and (ii) metadata describing the data; and dynamically creating the cumulative schema includes, for each object of the retrieved objects, (i) inferring a schema from the object and (ii) selectively updating the cumulative schema to describe the object according to the inferred schema; collect statistics on the data types of the retrieved objects; and based on the statistics on the data types, determine whether the data of the retrieved objects is typed correctly; and an export module configured to output the data of the retrieved objects to a data destination system according to the cumulative schema.
2. The data transformation system of claim 1, wherein the data destination system includes a data warehouse.
3. The data transformation system of claim 2, wherein the data warehouse stores relational data.
4. The data transformation system of claim 3, wherein the export module is configured to convert the cumulative schema into a relational schema and output the data of the retrieved objects to the data warehouse according to the relational schema.
5. The data transformation system of claim 4, wherein the export module is configured to generate commands for the data warehouse that update a schema of the data warehouse to reflect any changes made to the relational schema.
6. The data transformation system of claim 4, wherein the export module is configured to create at least one intermediate file from the data of the retrieved objects according to the relational schema, wherein the at least one intermediate file has a predefined data warehouse format.
7. The data transformation system of claim 6, wherein the export module is configured to bulk load the at least one intermediate file into the data warehouse.
8. The data transformation system of claim 1, further comprising an index store configured to store the data from the retrieved objects in columnar form.
9. The data transformation system of claim 8, wherein the export module is configured to generate row-based data from the stored data in the index store.
10. The data transformation system of claim 8, wherein the schema inference module is configured to create a time index in the index store that maps time values to identifiers of the retrieved objects.
11. The data transformation system of claim 10, wherein, for each retrieved object of the retrieved objects, the time value denotes at least one of (i) a transaction time corresponding to creation of the retrieved object or (ii) a valid time corresponding to the retrieved object.
12. The data transformation system of claim 8, further comprising a write-optimized store configured to (i) cache additional objects for later storage in the index store and (ii) in response to a size of the cache reaching a threshold, package the additional objects together for bulk loading into the index store.
13. The data transformation system of claim 1, wherein the schema inference module is configured to collect statistics on the metadata of the retrieved objects.
14. The data transformation system of claim 1, wherein dynamically creating the cumulative schema further includes: determining that a particular field has a different data type for at least two of the retrieved objects; and selectively updating the cumulative schema to indicate that the particular field is type polymorphic.
15. The data transformation system of claim 1, wherein the schema inference module is configured to, in response to the statistics on data types, recast the data of some of the retrieved objects.
16. The data transformation system of claim 1, wherein the schema inference module is configured to, in response to a determination that the data of at least one retrieved object is typed incorrectly, report the data of some of the retrieved objects to a user as potentially being typed incorrectly.
17. The data transformation system of claim 1, wherein the schema inference module is configured to collect statistics on the data of the retrieved objects.
18. The data transformation system of claim 17, wherein the statistics includes at least one of minimum, maximum, average, and standard deviation.
19. The data transformation system of claim 1, further comprising a data collector module configured to receive relational data from the first data source and generate the objects for use by the schema inference module.
20. The data transformation system of claim 19, wherein the data collector module is configured to eventize the relational data by creating (i) a first column indicating a table from which each item of the relational data is retrieved and (ii) a second column indicating a timestamp associated with each item of the relational data.
21. The data transformation system of claim 1, further comprising a scheduling module configured to assign processing jobs to the schema inference module and the export module according to predetermined dependency information.
22. The data transformation system of claim 1, wherein the export module is configured to partition the cumulative schema into multiple tables, wherein each of the multiple tables includes columns that appear together in the retrieved objects.
23. The data transformation system of claim 22, wherein the export module is configured to partition the cumulative schema according to columns found in corresponding groups of the retrieved objects that each have a different value for an identifier element.
24. The data transformation system of claim 1, wherein the schema inference module records a source identifier for each of the retrieved objects.
25. The data transformation system of claim 24, wherein, for each object of the retrieved objects, the source identifier includes a unique identifier of the first data source and a position of the object within the first data source.
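
Claims 1, 15, and 16 describe collecting statistics on data types and using them to decide whether data is typed correctly, then recasting or reporting it. The sketch below shows one way such a check could work. Everything in it is an assumption made for illustration: the flat objects, the `min_share` threshold, and the helper names `collect_type_stats`, `suspect_fields`, and `recast` are hypothetical, and the patent does not state this particular heuristic.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def type_name(value: Any) -> str:
    """Map a scalar to a coarse type name (same convention as a JSON parser)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool subclasses int, so test it first
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string" if isinstance(value, str) else type(value).__name__


def collect_type_stats(objects: Iterable[Dict[str, Any]]) -> Dict[str, Counter]:
    """Count how often each type occurs per top-level field."""
    stats: Dict[str, Counter] = defaultdict(Counter)
    for obj in objects:
        for field, value in obj.items():
            stats[field][type_name(value)] += 1
    return stats


def suspect_fields(stats: Dict[str, Counter], min_share: float = 0.8) -> Dict[str, str]:
    """Flag fields where one type dominates but a few values disagree.

    Assumed heuristic: a rare minority type hints at mistyped data, whereas
    an even mix suggests a genuinely type-polymorphic field.
    """
    suspects = {}
    for field, counts in stats.items():
        dominant, hits = counts.most_common(1)[0]
        if len(counts) > 1 and hits / sum(counts.values()) >= min_share:
            suspects[field] = dominant
    return suspects


def recast(value: Any, target: str) -> Any:
    """Try to recast a string to a number; leave the value alone otherwise."""
    if target == "number" and isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return value
    return value


# Hypothetical data: "price" is numeric in 9 of 10 objects, "note" is evenly mixed.
objects = [{"price": float(i), "note": i if i % 2 else str(i)} for i in range(9)]
objects.append({"price": "12.5", "note": 7})

stats = collect_type_stats(objects)
suspects = suspect_fields(stats)
fixed = [recast(o["price"], suspects["price"]) for o in objects]
```

Here only `price` is flagged, since nine of its ten values are numbers, and the lone string `"12.5"` is recast to `12.5`. The `note` field splits evenly between strings and numbers, so under this heuristic it would be treated as type polymorphic (compare claim 14) rather than reported.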

Assignees

Amiato Inc, Amazon Tech Inc

Inventors

Classifications

  • G06F16/211 (Primary)

    Schema design and management

  • G06F16/86 (Primary)

    Mapping to a database

  • Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

  • Update request formulation

  • Indexing; Data structures therefor; Storage structures

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9613068B2 cover?
A data transformation system includes a schema inference module and an export module. The schema inference module is configured to dynamically create a cumulative schema for objects retrieved from a first data source. Each of the retrieved objects includes (i) data and (ii) metadata describing the data. Dynamically creating the cumulative schema includes, for each object of the retrieved object…
Who is the assignee on this patent?
Amiato Inc, Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/211. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 4, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).