Parallel Processing Of Data
US-2024338235-A1 · Oct 10, 2024 · US
US10534770B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10534770-B2 |
| Application number | US-201415114328-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 31, 2014 |
| Priority date | Mar 31, 2014 |
| Publication date | Jan 14, 2020 |
| Grant date | Jan 14, 2020 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Example embodiments relate to parallelizing structured query language (SQL) on distributed file systems. In example embodiments, a subquery of a distributed file system is received from a query engine, where the subquery is one of multiple subqueries that are scheduled to execute on a cluster of server nodes. At this stage, a user defined function that comprises local, role-based functionality is executed, where the partitioned magic table triggers parallel execution of the user defined function. The execution of the UDF determines a sequence number based on a quantity of the cluster of server nodes and retrieve nonconsecutive chunks from a file of the distributed file system, where each of the nonconsecutive chunks is offset by the sequence number.
Opening claim text (preview).
We claim: 1. A first computing device for parallelizing structured query language (SQL) on distributed file systems, the first computing device comprising: a processor; and a memory storing instructions that when executed cause the processor to: receive a subquery from a query engine that requests data stored in a local file system of the first computing device, wherein the subquery is one of a plurality of subqueries of a query that are sent from the query engine to a plurality of computing devices, including the first computing device, and scheduled to execute on the plurality of computing devices; extract a parameter from the subquery; execute a User Defined Function (UDF) using the extracted parameter to determine a reference number that references a partitioned table, wherein the partitioned table is created and maintained by the query engine and comprises a plurality of reference numbers that are used to trigger the plurality of computing devices to execute the subqueries in parallel, wherein the UDF works independently of other UDFs without knowing a total size of an input stream; based on the reference number that references the partitioned table, execute the subquery in parallel with other computing devices of the plurality of computing devices; determine a sequence number for the first computing device based on a quantity of the plurality of computing devices; retrieve a plurality of nonconsecutive chunks from a file of the local file system of the first computing device, wherein each time one of the plurality of nonconsecutive chunks is retrieved from the file, the retrieved chunk is offset by the sequence number; and send the nonconsecutive chunks from the first computing device to the query engine, wherein the nonconsecutive chunks from the first computing device are combined with chunks from the other computing devices and provided to a requester as a result of the query. 2. The first computing device of claim 1 , wherein the instructions to cause the processor to retrieve the plurality of nonconsecutive chunks comprise instructions to cause the processor to: in response to determining that a first chunk of the plurality of nonconsecutive chunks includes a first new line character, retrieve a portion of the first chunk that begins after the first new line character. 3. The first computing device of claim 2 , wherein the instructions to cause the processor to retrieve the plurality of nonconsecutive chunks further comprise instructions to cause the processor to: in response to determining that a second chunk of the plurality of nonconsecutive chunks ends with an incomplete record, retrieve a portion of a consecutive chunk that follows the second chunk, wherein the portion of the consecutive chunk ends at a second new line character. 4. The first computing device of claim 1 , wherein a quantity of the reference numbers in the partitioned table equals the quantity of the plurality of computing devices. 5. The first computing device of claim 1 , wherein the sequence number for the first computing device is further determined based on the reference number. 6. The first computing device of claim 1 , wherein the computing devices are parts of a distributed file system and are external to the query engine. 7. A method for parallelizing structured query language (SQL) on distributed file systems, the method comprising: creating, by a processor of a query engine, a partitioned table to include a plurality of reference numbers, wherein the reference numbers are used to cause a plurality of computing devices in a distributed file system to execute in parallel with each other; dividing, by the processor of the query engine, a query into a plurality of subqueries, wherein each of the plurality of subqueries includes one of the reference numbers from the partitioned table, wherein the partitioned table simulates partition metadata for parallelizing execution of the plurality of subqueries on the plurality of computing devices; sending, by the processor, the plurality of subqueries to the plurality of computing devices, wherein the plurality of computing devices are to extract parameters from the plurality of subqueries, execute User Defined Functions (UDFs) using the extracted parameters to determine the reference numbers wherein the UDFs work independently of each other without knowing a total size of an input stream, execute the plurality of subqueries in parallel based on the reference numbers, retrieve a plurality of nonconsecutive chunks from local file systems of the respective computing devices, and send the nonconsecutive chunks to the query engine, wherein the nonconsecutive chunks are offset by a sequence number; and in response to receiving the plurality of chunks from the plurality of computing devices, combining, by the processor of the query engine, the plurality of chunks into a query result of the query. 8. The method of claim 7 , wherein a quantity of the reference numbers in the partitioned table equals a quantity of the plurality of computing devices. 9. The method of claim 7 , wherein the sequence number is determined based on the reference numbers and a quantity of plurality of computing devices. 10. The method of claim 7 , wherein the plurality of computing devices are external to the query engine. 11. A non-transitory machine-readable storage medium storing instructions executable by a processor of a first computing device to cause the first computing device to: receive a subquery from a query engine that requests data stored in a local file system of the first computing device, wherein the subquery is one of a plurality of subqueries of a query that are sent from the query engine to a plurality of computing devices, including the first computing device, and scheduled to execute on the plurality of computing devices; extract a parameter from the subquery; execute a User Defined Function (UDF) using the extracted parameter to determine a reference number that references a partitioned table, wherein the partitioned table is created and maintained by the query engine and comprises a plurality of reference numbers that are used to trigger the plurality of computing devices to execute the subqueries in parallel, wherein the UDF works independently of other UDFs without knowing a total size of an input stream; based on the reference number that references the partitioned table, execute the subquery in parallel with other computing devices of the plurality of computing devices; determine a sequence number based on the reference number and a quantity of the plurality of computing devices; retrieve a plurality of nonconsecutive chunks from a file in the local file system of the first computing device, wherein each of the plurality of nonconsecutive chunks is offset by the sequence number; and send the nonconsecutive chunks from the first computing device to the query engine, wherein the nonconsecutive chunks from the first computing device are combined with chunks from the other computing devices and provided to a requester as a result of the query. 12. The non-transitory machine-readable storage medium of claim 11 , wherein the instructions to cause the first computing device to retrieve the plurality of nonconsecutive chunks: in response to determining that a first chunk of the plurality of nonconsecutive chunks includes a first new line character, retrieve a portion of the first chunk that begins after the first new line character. 13. The non-transitory machine-readable storage medium of claim 12 , wherein the instructions to cause the first computing device to retrieve the plurality of nonconsecutive chunks further comp
of parallel queries · CPC title
Distributed file systems · CPC title
Join operations · CPC title
Unary operations; Data partitioning operations · CPC title
Query execution · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.