Efficient multi-part upload for a data warehouse

US9426219B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9426219-B1
Application number  US-201314098912-A
Country             US
Kind code           B1
Filing date         Dec 6, 2013
Priority date       Dec 6, 2013
Publication date    Aug 23, 2016
Grant date          Aug 23, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Data may be partitioned and uploaded in multiple parts in parallel to a data warehouse cluster in a data warehouse system. Data to be uploaded may be identified, and the partitions for the data may be determined at the storage client. The data may then be partitioned at the storage client. In various embodiments, no local partitions of the data may be maintained in persistent storage at the storage client. The partitioned data may then be sent in parallel to a data warehouse staging area in another network-based service that is implemented as part of a same network-based service implementing the data warehouse system. A request may then be sent to the data warehouse cluster to perform a multi-part upload from the staging area to the data warehouse cluster.
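
The flow in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: `StagingArea`, `partition_records`, and `multi_part_upload` are hypothetical names, the staging area is an in-memory stand-in rather than a real network-based service, and round-robin assignment is assumed in place of whatever data distribution scheme a real cluster uses. The sketch partitions records in memory, sends the parts in parallel, and builds the upload request only after every part is staged.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class StagingArea:
    """Hypothetical in-memory stand-in for the upload staging area."""

    def __init__(self):
        self._parts = {}
        self._lock = Lock()

    def put(self, key, payload):
        with self._lock:
            self._parts[key] = payload

    def keys(self):
        with self._lock:
            return sorted(self._parts)


def partition_records(records, num_partitions):
    """Split records round-robin into num_partitions lists, in memory only.

    Round-robin is an assumption made for illustration.
    """
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        parts[i % num_partitions].append(record)
    return parts


def multi_part_upload(records, num_partitions, staging, prefix="upload/part-"):
    """Stage every partition in parallel, then return one upload request."""
    parts = partition_records(records, num_partitions)

    def send(indexed_part):
        index, part = indexed_part
        key = f"{prefix}{index:04d}"
        # Nothing is written to local disk; the payload goes straight out.
        staging.put(key, "\n".join(part).encode("utf-8"))
        return key

    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        keys = list(pool.map(send, enumerate(parts)))

    # The request is built only after all parts have been staged.
    return {"action": "multi_part_upload", "parts": keys}
```

In a real deployment the returned request would be sent to the data warehouse cluster, which would then pull each staged part in parallel.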

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: a plurality of compute nodes implementing a network-based services platform; at least some compute nodes of the plurality of compute nodes configured to implement a data warehouse cluster as part of a data warehouse service provided by the network-based services platform, wherein the data warehouse cluster provides data storage among the at least some compute nodes according to a data distribution scheme; another one or more compute nodes of the plurality of compute nodes configured to implement an upload staging area for the data warehouse cluster, wherein the upload staging area is accessible to the data warehouse cluster as part of the network-based services platform; at least one other compute node of the plurality of compute nodes configured to provide a dynamic, multi-part upload module from the network-based services platform to a storage client of the data warehouse cluster; the dynamic, multi-part upload module, configured to: determine, at the storage client, a plurality of partitions for data maintained at the storage client to be uploaded to the data warehouse cluster according to the data distribution scheme for the at least some compute nodes in the data warehouse cluster; dynamically partition the data at the storage client according to the determined plurality of partitions; send the dynamically partitioned data from the storage client to the upload staging area for the data warehouse cluster; and subsequent to said sending the partitioned data, send, from the storage client, an upload request to the data warehouse cluster in order to upload the plurality of partitions of the data from the upload staging area to respective ones of the at least some compute nodes in the data warehouse cluster; at least some compute nodes of the plurality of compute nodes of the network-based services platform, in response to receipt of the upload request, upload respective partitions of the plurality of partitions of the data in parallel from the upload staging area to respective ones of the at least some compute nodes in the data warehouse cluster.

2. The system of claim 1, wherein to determine the plurality of partitions for the data maintained at the storage client to be uploaded to the data warehouse cluster according to the data distribution scheme for the at least some compute nodes in the data warehouse cluster, the dynamic, multi-part upload module is configured to: based, at least in part, on the data distribution scheme, identify a number of partitions for the data; and evaluate the data in order to determine partition boundaries corresponding to the number of partitions such that data objects maintained in each of the plurality of partitions remain intact.

3. The system of claim 1, wherein said dynamically partitioning the data at the storage client according to the determined plurality of partitions is performed in system memory of the storage client.

4. The system of claim 1, wherein the at least one other compute node is further configured to: receive a multi-part upload request for the data from the storage client; wherein said providing the dynamic, multi-part upload module to the storage client of the data warehouse cluster is performed in response to receiving the multi-part upload request.

5. The system of claim 1, wherein the dynamic, multi-part upload module is provided to the storage client asynchronously.

6. A method, comprising: performing, by one or more computing devices implementing a storage client: identifying, at the storage client, data to be uploaded to a data warehouse cluster from the storage client, wherein the data warehouse cluster provides data storage among a plurality of compute nodes according to a data distribution scheme; determining, at the storage client, a plurality of partitions for the data according to the data distribution scheme for the plurality of compute nodes in the data warehouse cluster; partitioning, at the storage client, the data according to the determined plurality of partitions; sending, from the storage client, the partitioned data in parallel to an upload staging area that is accessible to the plurality of compute nodes of the data warehouse cluster; and subsequent to said sending the partitioned data, sending, from the storage client, an upload request to the data warehouse cluster for a multi-part upload of the partitioned data from the upload staging area to respective ones of the plurality of compute nodes in the data warehouse cluster.

7. The method of claim 6, wherein said determining the plurality of partitions for the data according to the data distribution scheme for the plurality of compute nodes in the data warehouse cluster comprises evaluating the data to be uploaded in order to determine partition boundaries for the plurality of partitions such that data objects maintained in each of the plurality of partitions remain intact.

8. The method of claim 6, wherein said partitioning the data according to the determined plurality of partitions comprises generating a compressed version of each of the plurality of partitions.

9. The method of claim 6, wherein said partitioning the data according to the determined plurality of partitions comprises generating an encrypted version of each of the plurality of partitions.

10. The method of claim 6, wherein said partitioning the data according to the determined plurality of partitions is performed in system memory at the storage client such that the plurality of partitions of the data are generated without creating local copies of the plurality of partitions in persistent storage at the storage client.

11. The method of claim 6, wherein the data warehouse cluster is one of a plurality of data warehouse clusters that together implement a data warehouse service; wherein the method further comprises: performing, by another one or more computing devices implementing a control interface for the data warehouse service: receiving a multi-part upload request for the data from the storage client, wherein the multi-part upload request specifies the data warehouse cluster out of the plurality of data warehouse clusters; and in response to receiving the multi-part upload request, sending a dynamic, multi-part upload module to the storage client that is configured to perform said identifying the data to be uploaded, said determining the plurality of partitions for the data, said partitioning the data, said sending the partitioned data, and said sending the upload request.

12. The method of claim 6, wherein said sending the partitioned data in parallel to the upload staging area comprises: sending the partitioned data to one or more additional network-based services that do not implement the upload staging area for further processing of the partitioned data, and wherein the partitioned data is further sent from the one or more additional network-based services to the upload staging area accessible to the plurality of compute nodes of the data warehouse cluster.

13. The method of claim 6, wherein one of the plurality of compute nodes is a leader node configured to store data received at the leader node among the plurality of compute nodes in the data warehouse cluster according to the data distribution scheme, and wherein the method further comprises: identifying other data to be uploaded to the data warehouse cluster; and sending the other data to the leader node of the data warehouse cluster for storage among the plurality of compute nodes.

14. A non-transitory, computer-readable storage medium, comprising program instructions that implement a client
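
Claims 2 and 7 require partition boundaries that keep data objects intact, claim 8 adds a compressed version of each partition, and claim 10 keeps the work in system memory. A minimal sketch of how those three ideas could fit together follows. The function names are hypothetical, and two choices are assumptions that the claim text does not specify: a data object is treated as one newline-terminated record, and gzip is used for compression.

```python
import gzip


def partition_boundaries(data: bytes, num_partitions: int, delimiter: bytes = b"\n"):
    """Return the end offset of each partition so that no record is split.

    Each boundary starts at an even byte split and is pushed forward to the
    end of the record it falls in (assumption: records end with `delimiter`).
    """
    size = len(data)
    bounds = []
    start = 0
    for i in range(1, num_partitions):
        target = max(start, size * i // num_partitions)
        pos = data.find(delimiter, target)
        end = size if pos == -1 else pos + len(delimiter)
        bounds.append(end)
        start = end
    bounds.append(size)
    return bounds


def compressed_partitions(data: bytes, num_partitions: int):
    """Slice and gzip each partition in memory; nothing is written to disk."""
    parts = []
    start = 0
    for end in partition_boundaries(data, num_partitions):
        parts.append(gzip.compress(data[start:end]))
        start = end
    return parts
```

Because boundaries only move forward to the next delimiter, trailing partitions can be empty when the data holds fewer records than partitions; a caller could drop those before sending.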

Assignees

Inventors

Classifications

  • Physics · mapped topic

  • for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS] · CPC title

  • Electricity · mapped topic

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9426219B1 cover?
Data may be partitioned and uploaded in multiple parts in parallel to a data warehouse cluster in a data warehouse system. Data to be uploaded may be identified, and the partitions for the data may be determined at the storage client. The data may then be partitioned at the storage client. In various embodiments, no local partitions of the data may be maintained in persistent storage at the sto…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification H04L67/1097. Mapped technology areas include Electricity.
When was this patent published?
Publication date Aug 23, 2016 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).