Parallelizing SQL on distributed file systems

US10534770B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10534770-B2
Application numberUS-201415114328-A
CountryUS
Kind codeB2
Filing dateMar 31, 2014
Priority dateMar 31, 2014
Publication dateJan 14, 2020
Grant dateJan 14, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Example embodiments relate to parallelizing structured query language (SQL) on distributed file systems. In example embodiments, a subquery of a distributed file system is received from a query engine, where the subquery is one of multiple subqueries that are scheduled to execute on a cluster of server nodes. At this stage, a user defined function that comprises local, role-based functionality is executed, where the partitioned magic table triggers parallel execution of the user defined function. The execution of the UDF determines a sequence number based on a quantity of the cluster of server nodes and retrieve nonconsecutive chunks from a file of the distributed file system, where each of the nonconsecutive chunks is offset by the sequence number.

First claim

Opening claim text (preview).

We claim: 1. A first computing device for parallelizing structured query language (SQL) on distributed file systems, the first computing device comprising: a processor; and a memory storing instructions that when executed cause the processor to: receive a subquery from a query engine that requests data stored in a local file system of the first computing device, wherein the subquery is one of a plurality of subqueries of a query that are sent from the query engine to a plurality of computing devices, including the first computing device, and scheduled to execute on the plurality of computing devices; extract a parameter from the subquery; execute a User Defined Function (UDF) using the extracted parameter to determine a reference number that references a partitioned table, wherein the partitioned table is created and maintained by the query engine and comprises a plurality of reference numbers that are used to trigger the plurality of computing devices to execute the subqueries in parallel, wherein the UDF works independently of other UDFs without knowing a total size of an input stream; based on the reference number that references the partitioned table, execute the subquery in parallel with other computing devices of the plurality of computing devices; determine a sequence number for the first computing device based on a quantity of the plurality of computing devices; retrieve a plurality of nonconsecutive chunks from a file of the local file system of the first computing device, wherein each time one of the plurality of nonconsecutive chunks is retrieved from the file, the retrieved chunk is offset by the sequence number; and send the nonconsecutive chunks from the first computing device to the query engine, wherein the nonconsecutive chunks from the first computing device are combined with chunks from the other computing devices and provided to a requester as a result of the query. 2. The first computing device of claim 1 , wherein the instructions to cause the processor to retrieve the plurality of nonconsecutive chunks comprise instructions to cause the processor to: in response to determining that a first chunk of the plurality of nonconsecutive chunks includes a first new line character, retrieve a portion of the first chunk that begins after the first new line character. 3. The first computing device of claim 2 , wherein the instructions to cause the processor to retrieve the plurality of nonconsecutive chunks further comprise instructions to cause the processor to: in response to determining that a second chunk of the plurality of nonconsecutive chunks ends with an incomplete record, retrieve a portion of a consecutive chunk that follows the second chunk, wherein the portion of the consecutive chunk ends at a second new line character. 4. The first computing device of claim 1 , wherein a quantity of the reference numbers in the partitioned table equals the quantity of the plurality of computing devices. 5. The first computing device of claim 1 , wherein the sequence number for the first computing device is further determined based on the reference number. 6. The first computing device of claim 1 , wherein the computing devices are parts of a distributed file system and are external to the query engine. 7. A method for parallelizing structured query language (SQL) on distributed file systems, the method comprising: creating, by a processor of a query engine, a partitioned table to include a plurality of reference numbers, wherein the reference numbers are used to cause a plurality of computing devices in a distributed file system to execute in parallel with each other; dividing, by the processor of the query engine, a query into a plurality of subqueries, wherein each of the plurality of subqueries includes one of the reference numbers from the partitioned table, wherein the partitioned table simulates partition metadata for parallelizing execution of the plurality of subqueries on the plurality of computing devices; sending, by the processor, the plurality of subqueries to the plurality of computing devices, wherein the plurality of computing devices are to extract parameters from the plurality of subqueries, execute User Defined Functions (UDFs) using the extracted parameters to determine the reference numbers wherein the UDFs work independently of each other without knowing a total size of an input stream, execute the plurality of subqueries in parallel based on the reference numbers, retrieve a plurality of nonconsecutive chunks from local file systems of the respective computing devices, and send the nonconsecutive chunks to the query engine, wherein the nonconsecutive chunks are offset by a sequence number; and in response to receiving the plurality of chunks from the plurality of computing devices, combining, by the processor of the query engine, the plurality of chunks into a query result of the query. 8. The method of claim 7 , wherein a quantity of the reference numbers in the partitioned table equals a quantity of the plurality of computing devices. 9. The method of claim 7 , wherein the sequence number is determined based on the reference numbers and a quantity of plurality of computing devices. 10. The method of claim 7 , wherein the plurality of computing devices are external to the query engine. 11. A non-transitory machine-readable storage medium storing instructions executable by a processor of a first computing device to cause the first computing device to: receive a subquery from a query engine that requests data stored in a local file system of the first computing device, wherein the subquery is one of a plurality of subqueries of a query that are sent from the query engine to a plurality of computing devices, including the first computing device, and scheduled to execute on the plurality of computing devices; extract a parameter from the subquery; execute a User Defined Function (UDF) using the extracted parameter to determine a reference number that references a partitioned table, wherein the partitioned table is created and maintained by the query engine and comprises a plurality of reference numbers that are used to trigger the plurality of computing devices to execute the subqueries in parallel, wherein the UDF works independently of other UDFs without knowing a total size of an input stream; based on the reference number that references the partitioned table, execute the subquery in parallel with other computing devices of the plurality of computing devices; determine a sequence number based on the reference number and a quantity of the plurality of computing devices; retrieve a plurality of nonconsecutive chunks from a file in the local file system of the first computing device, wherein each of the plurality of nonconsecutive chunks is offset by the sequence number; and send the nonconsecutive chunks from the first computing device to the query engine, wherein the nonconsecutive chunks from the first computing device are combined with chunks from the other computing devices and provided to a requester as a result of the query. 12. The non-transitory machine-readable storage medium of claim 11 , wherein the instructions to cause the first computing device to retrieve the plurality of nonconsecutive chunks: in response to determining that a first chunk of the plurality of nonconsecutive chunks includes a first new line character, retrieve a portion of the first chunk that begins after the first new line character. 13. The non-transitory machine-readable storage medium of claim 12 , wherein the instructions to cause the first computing device to retrieve the plurality of nonconsecutive chunks further comp

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10534770B2 cover?
Example embodiments relate to parallelizing structured query language (SQL) on distributed file systems. In example embodiments, a subquery of a distributed file system is received from a query engine, where the subquery is one of multiple subqueries that are scheduled to execute on a cluster of server nodes. At this stage, a user defined function that comprises local, role-based functionality …
Who is the assignee on this patent?
Hewlett Packard Entpr Dev Lp, Micro Focus Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/24532. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 14 2020 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).