What technology area does this patent fall under?

Primary CPC classification G06F16/2471. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Jun 07 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

System and method for distributed database query engines

US9361344B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9361344-B2
Application number	US-201514728966-A
Country	US
Kind code	B2
Filing date	Jun 2, 2015
Priority date	Jan 7, 2013
Publication date	Jun 7, 2016
Grant date	Jun 7, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for a system capable of performing low-latency database query processing are disclosed herein. The system includes a gateway server and a plurality of worker nodes. The gateway server is configured to divide a database query, for a database containing data stored in a distributed storage cluster having a plurality of data nodes, into a plurality of partial queries and construct a query result based on a plurality of intermediate results. Each worker node of the plurality of worker nodes is configured to process a respective partial query of the plurality of partial queries by scanning data related to the respective partial query that stored on at least one data node of the distributed storage cluster and generate an intermediate result of the plurality of intermediate results that is stored in a memory of that worker node.

First claim

Opening claim text (preview).

What is claimed is: 1. A system, comprising: a gateway server configured to generate a plurality of partial queries from a database query for a database containing data stored in a distributed storage cluster that has a plurality of data nodes, and to construct a query result based on a plurality of intermediate results; and a plurality of worker nodes, the worker nodes being separate from the data nodes, wherein each worker node of the plurality of worker nodes is configured to process a respective partial query of the plurality of partial queries by scanning data related to the respective partial query and stored on at least one data node of the distributed storage cluster, and wherein each worker node of the plurality of worker nodes is further configured to generate an intermediate result of the plurality of intermediate results that is stored in a memory of that worker node; wherein the gateway server is further configured to identify a straggling worker node and further divide a partial query that is assigned to the straggling worker node into a plurality of subordinate partial queries and assign the plurality of subordinate partial queries to some of the plurality of worker nodes, and wherein the partial query is divided into the subordinate partial queries based on quantity and location information of input file blocks of the query. 2. The system of claim 1 , wherein each worker node of the plurality of worker nodes is further configured to process the respective partial query of the plurality of partial queries by scanning a portion of the data related to the respective partial query that is stored on the at least one data node of the distributed storage cluster and to generate an approximate intermediate result that is stored in the memory of that worker node. 3. The system of claim 2 , wherein the gateway server is further configured to construct an approximate query result based on at least one approximate intermediate result. 4. The system of claim 1 , wherein the gateway server is further configured to construct an approximate query result based on a portion of the plurality of intermediate results. 5. The system of claim 1 , wherein the straggling worker node is a worker node that either fails to report a rate of progress to the gateway server or reports the rate of progress below a specified value after a specified time period to the gateway server. 6. The system of claim 1 , wherein each worker node of the plurality of the worker nodes is a service running a respective data node within the distributed storage cluster. 7. The system of claim 1 , further comprising: a metadata cache configured to cache table level metadata of the database and file level metadata of the distributed storage cluster. 8. The system of claim 7 , wherein the metadata cache is configured to retain cached metadata from a previous database query for the database query. 9. The system of claim 1 , wherein each worker node of the plurality of the worker nodes periodically sends heartbeat messages to the gateway server to report status of a partial query processing by that worker node. 10. The system of claim 1 , wherein the gateway server is further configured to receive an instruction from a client device to return an approximate query result or terminate a processing of the database query. 11. The system of claim 1 , wherein the gateway server is further configured to instruct the worker nodes to immediately return approximate intermediate results, and to return an approximate query result based on the approximate intermediate results to a client device. 12. The system of claim 1 , wherein the database query includes a request for an approximate query result. 13. The system of claim 1 , wherein the query result is accompanied by an indication of a portion of related data stored in the data nodes that has been scanned for the query result. 14. The system of claim 1 , wherein the database is a Hive data warehouse system and the distributed storage cluster is a Hadoop cluster. 15. A method, comprising: receiving a database query from a client device, for a database containing data stored in a distributed storage cluster having a plurality of cluster nodes; dividing the database query into a plurality of partial queries; sending each of the partial queries to a respective worker node of a plurality of worker nodes, wherein each worker node is a service running on a memory of a cluster node of the distributed storage cluster; identifying a straggling worker node, dividing a partial query that is assigned to the straggling worker node into a plurality of subordinate partial queries, and assigning the plurality of subordinate partial queries to some of the plurality of worker nodes; retrieving a plurality of intermediate results for the partial queries from the worker nodes, wherein each intermediate result is processed by a respective worker node of the worker nodes by scanning related data stored in a cluster node on which the perspective worker node runs; and generating a query result based on the plurality of intermediate results:, wherein the partial query is divided into the subordinate partial queries based on quantity and location information of input file blocks of the query. 16. The method of claim 15 , wherein the step of identifying comprises: identifying a straggling worker node by monitoring heartbeat messages that the worker nodes periodically send, wherein the straggling worker node is identified when heartbeat messages from the straggling worker node are not received for a predetermined time period, or when a heartbeat message from the straggling worker node is received and the heartbeat message including a number representing a status of a partial query processing by the straggling worker node that is below a threshold value. 17. The method of claim 15 , further comprising: caching data associated with previous database queries for the database in a cache; retrieving a real-time feed of audit logs of the database to invalidate entries in the cached data stored in the cache that have been changed by the previous database queries; and purging entries in the cached data from the cache that have not been queried for a specified time period.

Assignees

Facebook Inc

Inventors

Classifications

G06F16/9535
Search customisation based on user profiles and personalisation · CPC title
G06F16/2471Primary
Distributed queries · CPC title
G06F16/951
Indexing; Web crawling techniques · CPC title
G06F16/24539
using cached or materialised query results · CPC title
G06F16/24552
Database cache management · CPC title

Patent family

Related publications grouped by family.

View patent family 49886706

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9361344B2 cover?: Techniques for a system capable of performing low-latency database query processing are disclosed herein. The system includes a gateway server and a plurality of worker nodes. The gateway server is configured to divide a database query, for a database containing data stored in a distributed storage cluster having a plurality of data nodes, into a plurality of partial queries and construct a que…
Who is the assignee on this patent?: Facebook Inc
What technology area does this patent fall under?: Primary CPC classification G06F16/2471. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Jun 07 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).