Incrementally improving clustering of cross partition data in a distributed data system
US-2021365407-A1 · Nov 25, 2021 · US
US11789902B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11789902-B2 |
| Application number | US-202218058331-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 23, 2022 |
| Priority date | May 22, 2020 |
| Publication date | Oct 17, 2023 |
| Grant date | Oct 17, 2023 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Methods and systems are provided for improved access to rows of data in a distributed data system. Each data row is associated with a partition. Data rows are distributed in one or more files and an impure file includes data rows associated multiple partitions. A clustering set is generated from a plurality of impure files by selecting a candidate impure file based on file access activity metrics and one or more neighbor impure files. Data rows of the impure files included in the clustering set are sorted according to their respective associated partitions. A set of disjoint partition range files are generated based on the sorted data rows of the impure files included in the clustering set. Each file of the set of disjoint partition range files is transferred to a respective target partition.
Opening claim text (preview).
What is claimed is: 1. A system for improved access to rows of data, each data row associated with a partition of a plurality of partitions, the data rows distributed in one or more files, wherein a file including data rows associated with different partitions of the plurality of partitions is an impure file, the system comprising: at least one processor; a first compute node that includes first program code structured to cause the at least one processor to: select a subset of impure files from a plurality of impure files; and schedule a clustering task in a queue for sorting data rows of the subset of impure files; and a second compute node that includes second program code structured to cause the at least one processor to, independent of the first compute node: retrieve the clustering task from the queue; execute the sorting of the data rows of the subset of the impure files according to a respective associated partition of each of the data rows; generate a set of disjoint partition range files based on the sorting; and transfer each file of the disjoint partition range files to a respective target pure partition. 2. The system of claim 1 , wherein the first compute node and the second compute node each respond to user queries, wherein the selection and scheduling of the first compute node are executed in background threads of the first compute node and the retrieval, execution, generation, and transfer of the second compute node are executed in background threads of the second compute node. 3. The system of claim 1 , comprising a third compute node that includes third program code structured to cause the at least one processor to respond to user queries independent of the first compute node and the second compute node, wherein the first compute node and the second compute node do not respond to user queries. 4. The system of claim 1 , wherein execution of the clustering task is cancelled and rescheduled in response to interference of user queries involving the impure files present in the clustering set. 5. The system of claim 1 , wherein the first compute node, to select the subset of impure files, is configured to: select a candidate file from the plurality of impure files for inclusion in the subset, and select one or more neighbor files from the plurality of impure files for inclusion in the subset; wherein the candidate file is selected independent of the selection of the one or more neighbor files. 6. The system of claim 5 , wherein the first compute node is configured to select the candidate file based on: file access activity metrics for the plurality of impure files, analysis of a number of partitions associated with the plurality of impure files, or analysis of a number of impure files associated with a partition. 7. The system of claim 1 , wherein the first compute node is configured to determine a number of the plurality of impure files to include in the subset based on at least one of: a system load metric; memory constraints, or a predicted number of sorting iterations needed to reach zero remaining impure files or another convergence state. 8. A system for improved access to rows of data, each data row associated with a partition of a plurality of partitions, the data rows distributed in one or more files, wherein a file including data rows associated with different partitions of the plurality of partitions is an impure file, the system comprising: a processor; and a compute node that includes program code structured to cause the processor to execute a plurality of threads comprising: a selector thread that: selects a subset of impure files from a plurality of impure files; and schedules a clustering task in a queue for sorting data rows of the subset of impure files; and a clusterer thread that independent of the selector thread: retrieves the clustering task from the queue; executes the sorting of the data rows of the subset of the impure files according to a respective associated partition of each of the data rows; generates a set of disjoint partition range files based on the sorting; and transfers each file of the disjoint partition range files to a respective target pure partition. 9. The system of claim 8 , wherein a number of selector threads and a number of clusterer threads are configurable independently. 10. The system of claim 8 , further comprising: a query response compute node that responds to user queries by accessing data in the pure partition. 11. The system of claim 8 , wherein execution of the clustering task is cancelled and rescheduled in response to interference of user queries involving the impure files present in the clustering set. 12. The system of claim 8 , wherein the selector thread, to select the subset of impure files, is configured to: select a candidate file from the plurality of impure files for inclusion in the subset, and select one or more neighbor files from the plurality of impure files for inclusion in the subset; wherein the candidate file is selected independent of the selection of the one or more neighbor files. 13. The system of claim 12 , wherein the selector thread is configured to select the candidate file based on: file access activity metrics for the plurality of impure files, analysis of a number of partitions associated with the plurality of impure files, or analysis of a number of impure files associated with a partition. 14. The system of claim 8 , wherein the selector thread is configured to determine a number of the plurality of impure files to include in the subset based on at least one of: a system load metric; memory constraints, or a predicted number of sorting iterations needed to reach zero remaining impure files or another convergence state. 15. A method for improved access to rows of data, each data row associated with a partition of a plurality of partitions, the data rows distributed in one or more files, wherein a file including data rows associated with different partitions of the plurality of partitions is an impure file, the method comprising: selecting a subset of impure files from a plurality of impure files; scheduling a clustering task in a queue for sorting data rows of the subset of impure files; retrieving the clustering task from the queue; executing the clustering task to sort the data rows of the subset of the impure files according to a respective associated partition of each of the data rows; generating a set of disjoint partition range files based on the sorting; and transferring each file of the disjoint partition range files to a respective target pure partition. 16. The method of claim 15 , wherein said selecting and scheduling are executed in background threads of a first compute node and said retrieving, executing, generating, and transferring are executed in background threads of a second compute node. 17. The method of claim 16 , further comprising: responding, in a third compute node, to user queries independent of the first compute node and the second compute node, wherein the first compute node and the second compute node do not respond to user queries. 18. The method of claim 15 , wherein said executing the clustering task is cancelled and rescheduled in response to interference of user queries involving the impure files present in the clustering set. 19. The method of claim 15 , wherein said selecting comprises: selecting a candidate file from the plurality of impure files for inclusion in the subset, and selecting one or more neighbor files from th
File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06) · CPC title
Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title
Clustering or classification · CPC title
Details of de-fragmentation performed by the file system (saving storage space on storage systems G06F3/0608; management of blocks in storage devices G06F3/064) · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.