Method and system for processing data queries
US-9639575-B2 · May 2, 2017 · US
US2018373755A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2018373755-A1 |
| Application number | US-201715821361-A |
| Country | US |
| Kind code | A1 |
| Filing date | Nov 22, 2017 |
| Priority date | Feb 25, 2013 |
| Publication date | Dec 27, 2018 |
| Grant date | — |
Official abstract text for this publication.
Performing data analytics processing in the context of a large scale distributed system that includes a massively parallel processing (MPP) database and a distributed storage layer is disclosed. In various embodiments, a data analytics request is received. A plan is created to generate a response to the request. A corresponding portion of the plan is assigned to each of a plurality of distributed processing segments, including by invoking as indicated in the assignment one or more data analytical functions embedded in the processing segment.
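The flow in the abstract (build a plan, assign a portion to each segment, have each segment invoke an embedded analytical function, combine the results) can be illustrated with a minimal Python sketch. It is not the patented implementation. A thread pool stands in for the distributed segments, and every name in it (`STORAGE`, `EMBEDDED_FUNCTIONS`, `PlanPortion`, `run_segment`, `master_average`) is a hypothetical chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the distributed storage layer:
# maps a storage location to the rows stored there.
STORAGE = {
    "/data/part-0": [1.0, 2.0, 3.0],
    "/data/part-1": [4.0, 5.0],
    "/data/part-2": [6.0],
}

# Hypothetical analytical functions assumed to be embedded in every segment.
EMBEDDED_FUNCTIONS = {
    "sum_count": lambda rows: (sum(rows), len(rows)),
}


@dataclass(frozen=True)
class PlanPortion:
    """One segment's share of the plan, plus the metadata it needs."""
    function_name: str  # which embedded function the segment should invoke
    location: str       # metadata: where this segment's subset of data lives


def run_segment(portion: PlanPortion) -> tuple:
    """Segment side: locate the data via metadata, then run the named function."""
    rows = STORAGE[portion.location]
    return EMBEDDED_FUNCTIONS[portion.function_name](rows)


def master_average(locations: list) -> float:
    """Master side: plan, dispatch one portion per location, combine results."""
    portions = [PlanPortion("sum_count", loc) for loc in locations]
    with ThreadPoolExecutor(max_workers=len(portions)) as pool:
        partials = list(pool.map(run_segment, portions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(master_average(sorted(STORAGE)))
```

The design point this sketch tries to show is that the master ships only a function name and a data location; the analytical code is assumed to already be present on each segment.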
Opening claim text (preview).
What is claimed is:
1. A method, comprising: creating, by a master node, a plan to process a query; selecting one or more selected processing segments from among a plurality of distributed processing segments; sending, by the master node, to each of the one or more selected processing segments, a corresponding portion of the plan to be processed by that corresponding processing segment and metadata associated with the plan, wherein the metadata is used in connection with locating or accessing a subset of data on which the corresponding selected processing segment is to perform an indicated processing; receiving, from at least one of the one or more selected processing segments, a corresponding result of the portion of the plan processed by the corresponding processing segment; and generating, a master response to the query based at least in part on the corresponding result of the portion of the plan received from the at least one of the one or more selected processing segments.
2. The method of claim 1, wherein the one or more selected processing segments correspondingly invoke one or more data analytical functions respectively embedded in the corresponding one or more selected processing segments.
3. The method of claim 2, wherein the one or more data analytical functions that are invoked are included in an assignment of the plan to the one or more selected processing segments by the master node.
4. The method of claim 1, wherein the metadata identifies a location data corresponding to one or more portions of the plan and at least a part of one or more data analytic processing to be performed in connection with processing the corresponding one or more portions of the plan.
5. The method of claim 1, wherein a request to process the query comprises one or more SQL statements.
6. The method of claim 1, wherein a request to process the query comprises one or more SQL statements to compute one or more of the following: Logistic Regression, Multinomial Logistic Regression, K-means clustering, Association Rules based market basket analysis, and Latent Dirichlet based topic modeling.
7. The method of claim 1, wherein a request to process the query is received at the master node, wherein the master node corresponds to a master node of a large scale distributed system.
8. The method of claim 1, wherein the creating of the plan to process the query includes creating a query plan, slicing the query plan into a plurality of slices, and identifying for each slice a group of processing segments to perform tasks comprising that slice of the query plan.
9. The method of claim 1, wherein each of the selected one or more processing segments is configured to use the metadata to access said data to be processed by that segment.
10. The method of claim 1, wherein a request to process the query is received at the master node, wherein the master node corresponds to a master node of a large scale distributed system, and the large scale distributed system comprises a distributed data storage layer comprising data stored in an instance of a Hadoop Distributed File System (HDFS) and the metadata indicates a location within the HDFS of data to be processed by the corresponding processing segment of the one or more selected processing segments.
11. The method of claim 1, further comprising: embedding in each of the plurality of distributed processing segments a library or other shared object comprising one or more data analytical functions, wherein the library or other shared object is included in the processing segments as deployed.
12. The method of claim 11, wherein the library or other shared object embodies the one or more data analytical functions in the form of one or more of the following: compiled C++ code, compiled Java, compiled Fortran, or other compiled code.
13. The method of claim 1, wherein the plurality of distributed processing segments comprise a subset of parallel processing segments comprising a massively parallel processing (MPP) database system.
14. The method of claim 1, wherein the metadata sent to each of the plurality of distributed processing segments for which a portion of the plan is assigned includes an identification of an embedded data analytics function to be used to process the portion of the plan.
15. The method of claim 14, wherein the embedded data analytics function includes a User-Defined function, a step function, or a final function of a User-Defined Aggregator.
16. The method of claim 1, wherein the metadata sent to each of the one or more selected processing segments is sent in conjunction with the corresponding portion of the plan to be performed by that selected processing segment.
17. The method of claim 16, wherein the metadata sent to each of the one or more selected processing segments is sent as part of the corresponding portion of the plan to be performed by that selected processing segment.
18. The method of claim 1, wherein at least a portion of the metadata sent to each of the one or more selected processing segments is obtained from a central metadata store.
19. A system, comprising: a communication interface; and one or more processors coupled to the communication interface and configured to: create a plan to process a query; select one or more of selected processing segments from among a plurality of distributed processing segments; send, to each of the one or more selected processing segments, a corresponding portion of the plan to be processed by that corresponding processing segment and metadata associated with the plan, wherein the metadata is used in connection with locating or accessing a subset of data on which the corresponding selected processing segment is to perform an indicated processing; receive, from at least one of the one or more selected processing segments, a corresponding result of the portion of the plan processed by the corresponding processing segment; and generate, a master response to the query based at least in part on the corresponding result of the portion of the plan received from the at least one of the one or more selected processing segments.
20. A computer program product embodied in a tangible, non-transitory computer readable storage medium, comprising computer instructions for: creating a plan to processing a query; selecting one or more selected processing segments from among a plurality of distributed processing segments; sending, to each of the one or more selected processing segments, a corresponding portion of the plan to be processed by that corresponding processing segment and metadata, wherein the metadata is used in connection with locating or accessing a subset of data on which the corresponding selected processing segment is to perform an indicated processing; receiving, from at least one of the one or more selected processing segments, a corresponding result of processing the portion of the plan; and generating, a master response to the query based at least in part on the corresponding result of the portion of the plan received from the at least one of the one or more selected processing segments.
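The claims refer to a step function and a final function of a User-Defined Aggregator without defining them. As a rough illustration only, the Python sketch below shows how such a pair might compute a mean and population variance over data split across segments. The `merge` function, the `(count, sum, sum of squares)` state layout, and all function names are assumptions made for this example and do not come from the claims.

```python
from functools import reduce

# Aggregation state (an assumption of this sketch): (count, sum, sum of squares).
INITIAL = (0, 0.0, 0.0)


def step(state, x):
    """Step function: fold one input value into the running state."""
    n, s, ss = state
    return (n + 1, s + x, ss + x * x)


def merge(a, b):
    """Combine two partial states; assumed here so per-segment states can be joined."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def final(state):
    """Final function: turn the accumulated state into (mean, population variance)."""
    n, s, ss = state
    mean = s / n
    return mean, ss / n - mean * mean


def segment_aggregate(rows):
    """What one segment might run over its own subset of the data."""
    return reduce(step, rows, INITIAL)


def master_aggregate(per_segment_rows):
    """What the master might do: merge the segment states, then finalize once."""
    states = [segment_aggregate(rows) for rows in per_segment_rows]
    return final(reduce(merge, states, INITIAL))


if __name__ == "__main__":
    mean, variance = master_aggregate([[1.0, 2.0], [3.0, 4.0]])
    print(mean, variance)
```

Splitting the aggregate this way lets each segment do the per-row work locally, so only a small fixed-size state travels back to the master. The sum-of-squares formula is used for brevity; it can lose precision on large or poorly scaled inputs.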
File system administration, e.g. details of archiving or snapshots (error detection or correction of the data by redundancy in operations G06F11/14) · CPC title
Distributed queries · CPC title
Details of archiving (lifecycle management in storage systems G06F3/0649; point-in-time backing up or restoration of persistent data G06F11/1446) · CPC title
Distributed file systems · CPC title
Query execution · CPC title