Database partitioning scheme evaluation and comparison
US-9779117-B1 · Oct 3, 2017 · US
US10223437B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10223437-B2 |
| Application number | US-201514634199-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 27, 2015 |
| Priority date | Feb 27, 2015 |
| Publication date | Mar 5, 2019 |
| Grant date | Mar 5, 2019 |
Official abstract text for this publication.
A method and apparatus for adaptive data repartitioning and adaptive data replication is provided. A data set stored in a distributed data processing system is partitioned by a first partitioning key. A live workload comprising a plurality of data processing commands is processed. While processing the live workload, statistical properties of the live workload are maintained. Based on the statistical properties of the live workload with respect to the data set, it is determined to replicate and/or repartition the data set by a second partitioning key. The replicated and/or repartitioned data set is partitioned by the second partitioning key.
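The abstract's loop of maintaining workload statistics and then deciding whether to repartition can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and does not come from the patent: the `WorkloadStats` class, the `recommend_key` function, the `threshold` parameter, and the simple rule of switching keys when one access key dominates the observed commands. The patent does not state its decision rule in this form.

```python
from collections import Counter, defaultdict


class WorkloadStats:
    """Running counts of which key each table is accessed by (hypothetical helper)."""

    def __init__(self):
        # table name -> Counter mapping access key -> number of commands observed
        self.key_counts = defaultdict(Counter)

    def record(self, table, access_key):
        """Record one data processing command that accessed `table` by `access_key`."""
        self.key_counts[table][access_key] += 1

    def key_frequency(self, table, key):
        """Fraction of observed commands on `table` that accessed it by `key`."""
        total = sum(self.key_counts[table].values())
        return self.key_counts[table][key] / total if total else 0.0


def recommend_key(stats, table, current_key, threshold=0.6):
    """Return a different partitioning key if it dominates the workload, else None.

    `threshold` is an assumed tuning knob, not a value taken from the patent.
    """
    counts = stats.key_counts[table]
    if not counts:
        return None
    candidate, _ = counts.most_common(1)[0]
    if candidate != current_key and stats.key_frequency(table, candidate) >= threshold:
        return candidate
    return None
```

For example, if seven of ten observed commands access a hypothetical `orders` table by `customer_id` while it is partitioned by `order_id`, `recommend_key` returns `customer_id`; if the table is already partitioned by the dominant key, it returns `None`.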
Opening claim text (preview).
What is claimed is:
1. A method comprising: partitioning a data set stored in a distributed data processing system by a first partitioning key; processing a live workload comprising a plurality of data processing commands; while processing the live workload, maintaining statistical properties of the live workload and receiving a query; and while executing the query as part of the live workload: based on the statistical properties of the live workload with respect to the data set, determining to repartition the data set by a second partitioning key that is selected based on a percentage of the data set to transfer to repartition the data set by the second partitioning key; and repartitioning the data set in the distributed data processing system by the second partitioning key; wherein the method is performed by one or more computing devices.
2. The method of claim 1, further comprising: partitioning a second data set stored in the distributed data processing system by a third partitioning key; based on the statistical properties of the live workload with respect to the second data set, determining to replicate the second data set in the distributed data processing system based on a fourth partitioning key; storing an additional copy of the second data set in the distributed data processing system, wherein the additional copy of the second data set is partitioned by the fourth partitioning key.
3. The method of claim 1, wherein the first partitioning key is selected based on initial workload statistic values.
4. The method of claim 1, wherein the first partitioning key is selected based on statistical properties of a sample workload.
5. The method of claim 1, wherein determining to repartition the data set by a second partitioning key is further based on an association strength between the first partitioning key and the second partitioning key.
6. The method of claim 1, wherein the statistical properties comprise a selectivity metric for one or more data processing commands in the live workload, wherein the selectivity metric is based on an average amount of the data set required by the one or more data processing commands.
7. The method of claim 1, wherein the statistical properties comprise a projection metric for one or more data processing commands in the live workload, wherein the projection metric is based on an average amount of the data set required by the one or more data processing commands.
8. The method of claim 1, wherein the statistical properties comprise a key frequency metric for a particular partitioning key of the data set, wherein the key frequency metric is based on a frequency of access of the data set by the particular partitioning key in the live workload.
9. The method of claim 1, wherein the statistical properties comprise a table frequency metric for the data set, wherein the table frequency metric is based on a frequency of access of the data set in the live workload.
10. A method comprising: partitioning a data set stored in a distributed data processing system by a first partitioning key; processing a live workload comprising a plurality of data processing commands; while processing the live workload, maintaining statistical properties of the live workload and receiving a query; and while executing the query as part of the live workload: based on the statistical properties of the live workload with respect to the data set, determining to replicate the data set in the distributed data processing system based on a second partitioning key that is selected based on a percentage of the data set to transfer to repartition the data set by the second partitioning key; and storing an additional copy of the data set in the distributed data processing system, wherein the additional copy of the data set is partitioned by the second partitioning key; wherein the method is performed by one or more computing devices.
11. The method of claim 10, wherein said determining to replicate the data set is further based on an available amount of memory.
12. A non-transitory computer-readable medium storing instructions which, when executed by one or more processors, cause: partitioning a data set stored in a distributed data processing system by a first partitioning key; processing a live workload comprising a plurality of data processing commands; while processing the live workload, maintaining statistical properties of the live workload and receiving a query; and while executing the query as part of the live workload: based on a query plan of the query and the statistical properties of the live workload with respect to the data set, determining to repartition the data set by a second partitioning key that is selected based on a percentage of the data set to transfer to repartition the data set by the second partitioning key; and repartitioning the data set in the distributed data processing system by the second partitioning key.
13. The non-transitory computer-readable medium of claim 12, wherein the instructions further cause: partitioning a second data set stored in the distributed data processing system by a third partitioning key; based on the statistical properties of the live workload with respect to the second data set, determining to replicate the second data set in the distributed data processing system based on a fourth partitioning key; storing an additional copy of the second data set in the distributed data processing system, wherein the additional copy of the second data set is partitioned by the fourth partitioning key.
14. The non-transitory computer-readable medium of claim 12, wherein the first partitioning key is selected based on initial workload statistic values.
15. The non-transitory computer-readable medium of claim 12, wherein the first partitioning key is selected based on statistical properties of a sample workload.
16. The non-transitory computer-readable medium of claim 12, wherein determining to repartition the data set by a second partitioning key is further based on an association strength between the first partitioning key and the second partitioning key.
17. The non-transitory computer-readable medium of claim 12, wherein the statistical properties comprise a selectivity metric for one or more data processing commands in the live workload, wherein the selectivity metric is based on an average amount of the data set required by the one or more data processing commands.
18. The non-transitory computer-readable medium of claim 12, wherein the statistical properties comprise a projection metric for one or more data processing commands in the live workload, wherein the projection metric is based on an average amount of the data set required by the one or more data processing commands.
19. The non-transitory computer-readable medium of claim 12, wherein the statistical properties comprise a key frequency metric for a particular partitioning key of the data set, wherein the key frequency metric is based on a frequency of access of the data set by the particular partitioning key in the live workload.
20. The non-transitory computer-readable medium of claim 12, wherein the statistical properties comprise a table frequency metric for the data set, wherein the table frequency metric is based on a frequency of access of the data set in the live workload.
21. A non-transitory computer-readable medium storing instructions which, when executed by one or more processors, cause: partitioning a data set stored in a distributed
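The independent claims select the second partitioning key "based on a percentage of the data set to transfer". One way to picture that quantity is sketched below, under assumptions that are not drawn from the patent: rows are hash-partitioned across a fixed number of nodes, a row counts as transferred when its node under the candidate key differs from its node under the current key, and the candidate with the smallest transfer fraction is preferred. The function names `node_for`, `transfer_fraction`, and `cheapest_key` are hypothetical.

```python
import zlib


def node_for(value, num_nodes):
    """Map a key value to a node with a deterministic hash (assumed scheme)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_nodes


def transfer_fraction(rows, current_key, candidate_key, num_nodes):
    """Fraction of rows that would change node if repartitioned by candidate_key."""
    if not rows:
        return 0.0
    moved = sum(
        1
        for row in rows
        if node_for(row[current_key], num_nodes) != node_for(row[candidate_key], num_nodes)
    )
    return moved / len(rows)


def cheapest_key(rows, current_key, candidates, num_nodes):
    """Pick the candidate key that moves the smallest fraction of rows.

    Ties go to the earliest candidate, because min() keeps the first minimum.
    """
    return min(
        candidates,
        key=lambda k: transfer_fraction(rows, current_key, k, num_nodes),
    )
```

Under this sketch, a candidate key whose values track the current key closely moves few rows, which gives one informal reading of why the association strength of claims 5 and 16 could matter. A real system would more likely estimate the fraction from samples or statistics than scan every row.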
Data partitioning, e.g. horizontal or vertical partitioning · CPC title
Physics · mapped topic