Data driven parallel sorting system and method

US9990412B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9990412-B2
Application numberUS-201414263138-A
CountryUS
Kind codeB2
Filing dateApr 28, 2014
Priority dateApr 28, 2013
Publication dateJun 5, 2018
Grant dateJun 5, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data driven parallel sorting method includes distributing input data records to n partitions one by one in a circular manner. Each partition corresponds to a parallel sorting process with an allocated memory chunk sized to store m data records. The method also includes sorting, in parallel, current data records in respective memory chunks in respective partitions. The method also includes in response to distribution of data records of └m/n┘ rounds, circularly controlling one of the n partitions, and writing data records that have been sorted in the memory chunk of the partition into a mass storage as an ordered data chunk, and emptying the memory chunk. The method also includes in response to all data records being distributed, writing data chunks that have been sorted in respective memory chunks into the mass storage, and performing a merge sort on all ordered data chunks in the mass storage.

First claim

Opening claim text (preview).

What is claimed is: 1. A data driven parallel sorting method utilizing a plurality of parallel sorting processes, the method comprising: distributing input data records to n partitions one by one in a circular manner, each partition corresponding to one of the parallel sorting processes and having an allocated memory chunk, said memory chunk sized to store m data records where n is an integer larger than 1 and m is a positive integer; sorting, in parallel, current data records in respective memory chunks in respective partitions; in response to data records of └m/n┘ rounds being distributed, circularly controlling one of said n partitions, and writing, at a calculated time, data records that have been sorted in the memory chunk of the one of said n partitions into a mass storage as an ordered data chunk even when the memory chunk is not full and emptying the memory chunk, wherein the calculated time produces a time difference based on when data is written to mass storage and computation intensity of one or more processors handling at least one of the plurality of parallel sorting processes; and in response to all data records being distributed, writing data chunks that have been sorted in respective memory chunks into the mass storage, and performing a merge sort on all ordered data chunks in the mass storage. 2. The method according to claim 1 , wherein the input data records correspond to streaming data. 3. The method according to claim 1 , wherein the circularly controlling one of said n partitions comprises: in response to data records of └m/n┘*k-th round being distributed, controlling the i=(k mod n)th partition, writing data records that have been sorted in the memory chunk of the partition into the mass storage as an ordered data chunk, where i is the number of partitions, 1≤i≤n, and k is a positive integer. 4. The method according to claim 3 , wherein for the 1st to the (n−1)th partitions, sizes of the ordered data chunks that are written into the mass storage for the first time are less than m data records; and for the nth partition, the size of the ordered data chunk that is written into the mass storage for the first time is less than or equal to m data records. 5. The method according to claim 3 , wherein when k is less than or equal to n, sizes of the ordered data chunks that are written into the mass storage are └m/n┘*k data records. 6. The method according to claim 3 , wherein when k is larger than n, sizes of the ordered data chunks that are written into the mass storage are m data records. 7. The method according to claim 1 , further comprising not writing last data chunks in respective partitions into the mass storage. 8. The method according to claim 1 , wherein at least two of the sorting processes compete for resources from the same processor of the one or more processors. 9. The method according to claim 1 , wherein the writing of ordered data chunks from respective partitions competes for a same I/O resource. 10. The method according to claim 1 , wherein the time difference is on sizes of the ordered data chunks written into the mass storage for the first time being of unequal size. 11. A data driven parallel sorting system utilizing a plurality of parallel sorting processes, the system comprising: one or more processors in communication with one or more types of memory, the one or more processors configured to: facilitate execution of a data distributor configured to circularly distribute input data records to n partitions one by one, each partition corresponding to one of the parallel sorting processes and having an allocated memory chunk, said memory chunk sized to store m data records, where n is an integer larger than 1 and m is a positive integer; facilitate execution of an in-partition sorter configured to sort current data records in respective memory chunks in parallel in respective partitions; facilitate execution of a controlled data dumper configured to, in response to data records of └m/n┘ rounds being distributed, circularly control one of said n partitions, write, at a calculated time, data records that have been sorted in the memory chunk of the one of said n partitions into a mass storage as an ordered data chunk even when the memory chunk is not full and empty the memory chunk, wherein the calculated time produces a time difference based on when data is written to mass storage and computation intensity of the one or more processors handling at least one of the plurality of parallel sorting processes; and facilitate execution of a merge sorter configured to, in response to distributing of all data records being completed, write data chunks that have been sorted in respective memory chunks into the mass storage, and to apply a merge sorting to all ordered data chunks in the mass storage. 12. The system according to claim 11 , wherein the one or more processors are further configured to facilitate the execution of the controlled data dumper to: in response to data records of └m/n┘*k-th round being distributed, control the i=(k mod n)th partition, writing data records that have been sorted in the memory chunk of the partition into the mass storage as an ordered data chunk, where i is the number of partitions, 1≤i≤n, and k is a positive integer. 13. The system according to claim 12 , wherein for the 1st to the (n−1)th partitions, sizes of the ordered data chunks that are written into the mass storage for the first time are less than m data records; and for the nth partition, the size of the ordered data chunk that is written into the mass storage for the first time is less than or equal to m data records. 14. The system according to claim 12 , wherein when k is less than or equal to n, sizes of the ordered data chunks that are written into the mass storage are └m/n┘*k data records. 15. The system according to claim 12 , when k is larger than n, sizes of the ordered data chunks that are written into the mass storage are m data records. 16. The system according to claim 11 , wherein the last data chunks in respective partitions are not written into the mass storage. 17. The system according to claim 11 , wherein at least two of sorting processes corresponding to respective partitions compete for resources from the same processor of the one or more processors. 18. The system according to claim 11 , wherein the time difference is on sizes of the ordered data chunks written into the mass storage for the first time being of unequal size. 19. A computer program product for parallel sorting, the computer program product comprising a non-transitory computer readable storage medium having program code embodied therewith, the program code executable by one or more processors to perform a method, the method comprising: circularly distribute input data records to n partitions one by one, each partition corresponding to one of a plurality of parallel sorting processes and having an allocated memory chunk, said memory chunk sized to store m data records where n is an integer larger than 1 and m is a positive integer; sorting, in parallel, current data records in respective memory chunks in respective partitions; in response to data records of └m/n┘ rounds being distributed, circularly controlling one of said n partitions, and writing, at a calculated time, data records that have been sorted in the memory chunk of the one of said n partitions into a mass storage as an ordered data chunk even when the memory chunk is not full and emptying the memory chunk, wherein the calculated time produces a time differe

Assignees

Inventors

Classifications

  • Combined merging and sorting · CPC title

  • G06F16/278Primary

    Data partitioning, e.g. horizontal or vertical partitioning · CPC title

  • Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers {sorting methods in general}(G06F7/36 takes precedence) · CPC title

  • Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

  • Unary operations; Data partitioning operations · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9990412B2 cover?
A data driven parallel sorting method includes distributing input data records to n partitions one by one in a circular manner. Each partition corresponds to a parallel sorting process with an allocated memory chunk sized to store m data records. The method also includes sorting, in parallel, current data records in respective memory chunks in respective partitions. The method also includes in …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/278. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jun 05 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).