Efficient sorting of large data set with duplicate values

US2016378801A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2016378801-A1
Application numberUS-201514750385-A
CountryUS
Kind codeA1
Filing dateJun 25, 2015
Priority dateJun 25, 2015
Publication dateDec 29, 2016
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are disclosed for sorting an input data set. A sort tool determines a distribution of values of a data set that includes a plurality of data records. The sort tool partitions the data set into a plurality of subsets based on the distribution. Each of the data records is inserted into one of the subsets based on a corresponding sort value of the data record. The sort tool identifies one or more of the subsets that contain at least two distinct sort values. In each of the identified subsets, the data records are sorted by a corresponding sort value of the data record.

First claim

Opening claim text (preview).

1 - 7 . (canceled) 8 . A computer program product, comprising: a computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code configured to perform an operation comprising: determining a distribution of values of a data set that includes a plurality of data records, partitioning the data set into a plurality of subsets based on the distribution, wherein each of the data records is inserted into one of the subsets based on a corresponding sort value of the data record, identifying one or more of the subsets that contain at least two distinct sort values, and sorting, in each of the identified subsets, the data records by a corresponding sort value of the data record. 9 . The computer program product of claim 8 , wherein the operation further comprises: merging the plurality of subsets. 10 . The computer program product of claim 8 , wherein determining the distribution of values of the data set comprises: sampling the data set; and generating an equi-depth histogram of the plurality of data records, wherein the equi-depth histogram determines the distribution of the values of the data records, and wherein the equi-depth histogram determines an amount of the subsets. 11 . The computer program product of claim 8 , wherein partitioning the data records into the plurality of subsets based on the distribution comprises, for each subset: initializing a single value flag of the subset as true, wherein the single value flag set as true indicates that the corresponding sort values of the data records of the subset do not differ from one another; and for each data record inserted in the subset: evaluating the data record to determine whether a corresponding value of the data record is equal to a start key value of the subset, wherein the start key value is a corresponding value of an data record initially inserted into the subset. 12 . The computer program product of claim 11 , wherein identifying the one or more of the subsets comprises, for each data record inserted in the subset: upon determining that the corresponding value of the data record is not equal to the start key value, setting the single value flag of the subset as false. 13 . The computer program product of claim 8 , wherein the operation further comprises: partitioning the data records of the subset into a majority partition and a minority partition, wherein the majority partition contains the data records of the subset that correspond to a single sort value and wherein the minority partition contains the data records of the subset having a sort value that is not equal to the single sort value; and identifying the majority partition as having a single sort value. 14 . The computer program product of claim 8 , wherein a set for the sampling is determined based on database statistics. 15 . A system, comprising: a processor; and a memory storing a program, which, when executed on the processor, performs an operation comprising: determining a distribution of values of a data set that includes a plurality of data records, partitioning the data set into a plurality of subsets based on the distribution, wherein each of the data records is inserted into one of the subsets based on a corresponding sort value of the data record, identifying one or more of the subsets that contain at least two distinct sort values, and sorting, in each of the identified subsets, the data records by a corresponding sort value of the data record. 16 . The system of claim 15 , wherein the operation further comprises: merging the plurality of subsets. 17 . The system of claim 15 , wherein determining the distribution of values of the data set comprises: sampling the data set; and generating an equi-depth histogram of the plurality of data records, wherein the equi-depth histogram determines the distribution of the values of the data records, and wherein the equi-depth histogram determines an amount of the subsets. 18 . The computer program product of claim 8 , wherein partitioning the data records into the plurality of subsets based on the distribution comprises, for each subset: initializing a single value flag of the subset as true, wherein the single value flag set as true indicates that the corresponding sort values of the data records of the subset do not differ from one another; and for each data record inserted in the subset: evaluating the data record to determine whether a corresponding value of the data record is equal to a start key value of the subset, wherein the start key value is a corresponding value of an data record initially inserted into the subset. 19 . The system of claim 18 , wherein identifying the one or more of the subsets comprises, for each data record inserted in the subset: upon determining that the corresponding value of the data record is not equal to the start key value, setting the single value flag of the subset as false. 20 . The system of claim 15 , wherein the operation further comprises: partitioning the data records of the subset into a majority partition and a minority partition, wherein the majority partition contains the data records of the subset that correspond to a single sort value and wherein the minority partition contains the data records of the subset having a sort value that is not equal to the single sort value; and identifying the majority partition as having a single sort value.

Assignees

Inventors

Classifications

  • G06F7/22Primary

    Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc · CPC title

  • Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers {sorting methods in general}(G06F7/36 takes precedence) · CPC title

  • G06F16/214Primary

    Database migration support · CPC title

  • using data annotations, e.g. user-defined metadata · CPC title

  • Large Object storage; Management thereof · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016378801A1 cover?
Techniques are disclosed for sorting an input data set. A sort tool determines a distribution of values of a data set that includes a plurality of data records. The sort tool partitions the data set into a plurality of subsets based on the distribution. Each of the data records is inserted into one of the subsets based on a corresponding sort value of the data record. The sort tool identifies o…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F7/22. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu Dec 29 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).