Who is the assignee on this patent?

Microsoft Technology Licensing Llc

What technology area does this patent fall under?

Primary CPC classification G06F16/13. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Oct 17 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Incrementally improving clustering of cross partition data in a distributed data system

US11789902B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11789902-B2
Application number	US-202218058331-A
Country	US
Kind code	B2
Filing date	Nov 23, 2022
Priority date	May 22, 2020
Publication date	Oct 17, 2023
Grant date	Oct 17, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems are provided for improved access to rows of data in a distributed data system. Each data row is associated with a partition. Data rows are distributed in one or more files and an impure file includes data rows associated multiple partitions. A clustering set is generated from a plurality of impure files by selecting a candidate impure file based on file access activity metrics and one or more neighbor impure files. Data rows of the impure files included in the clustering set are sorted according to their respective associated partitions. A set of disjoint partition range files are generated based on the sorted data rows of the impure files included in the clustering set. Each file of the set of disjoint partition range files is transferred to a respective target partition.

First claim

Opening claim text (preview).

What is claimed is: 1. A system for improved access to rows of data, each data row associated with a partition of a plurality of partitions, the data rows distributed in one or more files, wherein a file including data rows associated with different partitions of the plurality of partitions is an impure file, the system comprising: at least one processor; a first compute node that includes first program code structured to cause the at least one processor to: select a subset of impure files from a plurality of impure files; and schedule a clustering task in a queue for sorting data rows of the subset of impure files; and a second compute node that includes second program code structured to cause the at least one processor to, independent of the first compute node: retrieve the clustering task from the queue; execute the sorting of the data rows of the subset of the impure files according to a respective associated partition of each of the data rows; generate a set of disjoint partition range files based on the sorting; and transfer each file of the disjoint partition range files to a respective target pure partition. 2. The system of claim 1 , wherein the first compute node and the second compute node each respond to user queries, wherein the selection and scheduling of the first compute node are executed in background threads of the first compute node and the retrieval, execution, generation, and transfer of the second compute node are executed in background threads of the second compute node. 3. The system of claim 1 , comprising a third compute node that includes third program code structured to cause the at least one processor to respond to user queries independent of the first compute node and the second compute node, wherein the first compute node and the second compute node do not respond to user queries. 4. The system of claim 1 , wherein execution of the clustering task is cancelled and rescheduled in response to interference of user queries involving the impure files present in the clustering set. 5. The system of claim 1 , wherein the first compute node, to select the subset of impure files, is configured to: select a candidate file from the plurality of impure files for inclusion in the subset, and select one or more neighbor files from the plurality of impure files for inclusion in the subset; wherein the candidate file is selected independent of the selection of the one or more neighbor files. 6. The system of claim 5 , wherein the first compute node is configured to select the candidate file based on: file access activity metrics for the plurality of impure files, analysis of a number of partitions associated with the plurality of impure files, or analysis of a number of impure files associated with a partition. 7. The system of claim 1 , wherein the first compute node is configured to determine a number of the plurality of impure files to include in the subset based on at least one of: a system load metric; memory constraints, or a predicted number of sorting iterations needed to reach zero remaining impure files or another convergence state. 8. A system for improved access to rows of data, each data row associated with a partition of a plurality of partitions, the data rows distributed in one or more files, wherein a file including data rows associated with different partitions of the plurality of partitions is an impure file, the system comprising: a processor; and a compute node that includes program code structured to cause the processor to execute a plurality of threads comprising: a selector thread that: selects a subset of impure files from a plurality of impure files; and schedules a clustering task in a queue for sorting data rows of the subset of impure files; and a clusterer thread that independent of the selector thread: retrieves the clustering task from the queue; executes the sorting of the data rows of the subset of the impure files according to a respective associated partition of each of the data rows; generates a set of disjoint partition range files based on the sorting; and transfers each file of the disjoint partition range files to a respective target pure partition. 9. The system of claim 8 , wherein a number of selector threads and a number of clusterer threads are configurable independently. 10. The system of claim 8 , further comprising: a query response compute node that responds to user queries by accessing data in the pure partition. 11. The system of claim 8 , wherein execution of the clustering task is cancelled and rescheduled in response to interference of user queries involving the impure files present in the clustering set. 12. The system of claim 8 , wherein the selector thread, to select the subset of impure files, is configured to: select a candidate file from the plurality of impure files for inclusion in the subset, and select one or more neighbor files from the plurality of impure files for inclusion in the subset; wherein the candidate file is selected independent of the selection of the one or more neighbor files. 13. The system of claim 12 , wherein the selector thread is configured to select the candidate file based on: file access activity metrics for the plurality of impure files, analysis of a number of partitions associated with the plurality of impure files, or analysis of a number of impure files associated with a partition. 14. The system of claim 8 , wherein the selector thread is configured to determine a number of the plurality of impure files to include in the subset based on at least one of: a system load metric; memory constraints, or a predicted number of sorting iterations needed to reach zero remaining impure files or another convergence state. 15. A method for improved access to rows of data, each data row associated with a partition of a plurality of partitions, the data rows distributed in one or more files, wherein a file including data rows associated with different partitions of the plurality of partitions is an impure file, the method comprising: selecting a subset of impure files from a plurality of impure files; scheduling a clustering task in a queue for sorting data rows of the subset of impure files; retrieving the clustering task from the queue; executing the clustering task to sort the data rows of the subset of the impure files according to a respective associated partition of each of the data rows; generating a set of disjoint partition range files based on the sorting; and transferring each file of the disjoint partition range files to a respective target pure partition. 16. The method of claim 15 , wherein said selecting and scheduling are executed in background threads of a first compute node and said retrieving, executing, generating, and transferring are executed in background threads of a second compute node. 17. The method of claim 16 , further comprising: responding, in a third compute node, to user queries independent of the first compute node and the second compute node, wherein the first compute node and the second compute node do not respond to user queries. 18. The method of claim 15 , wherein said executing the clustering task is cancelled and rescheduled in response to interference of user queries involving the impure files present in the clustering set. 19. The method of claim 15 , wherein said selecting comprises: selecting a candidate file from the plurality of impure files for inclusion in the subset, and selecting one or more neighbor files from th

Assignees

Microsoft Technology Licensing Llc

Inventors

Classifications

G06F16/13Primary
File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06) · CPC title
G06F16/27
Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title
G06F16/285
Clustering or classification · CPC title
G06F16/1724Primary
Details of de-fragmentation performed by the file system (saving storage space on storage systems G06F3/0608; management of blocks in storage devices G06F3/064) · CPC title

Patent family

Related publications grouped by family.

View patent family 75870780

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11789902B2 cover?: Methods and systems are provided for improved access to rows of data in a distributed data system. Each data row is associated with a partition. Data rows are distributed in one or more files and an impure file includes data rows associated multiple partitions. A clustering set is generated from a plurality of impure files by selecting a candidate impure file based on file access activity metri…
Who is the assignee on this patent?: Microsoft Technology Licensing Llc
What technology area does this patent fall under?: Primary CPC classification G06F16/13. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Oct 17 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).