Systems and methods to build and utilize a search infrastructure
US-9607049-B2 · Mar 28, 2017 · US
US10521396B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10521396-B2 |
| Application number | US-201614996627-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 15, 2016 |
| Priority date | Dec 31, 2012 |
| Publication date | Dec 31, 2019 |
| Grant date | Dec 31, 2019 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A region-based placement policy that can be used to achieve a better distribution of data in a clustered storage system is disclosed herein. The clustered storage system includes a master module to implement the region-based placement policy for storing one or more copies of a received data across many data nodes of the clustered storage system. When implementing the region-based placement policy, the master module splits the received data into one or more regions, where each region includes a contiguous portion of the received data. Further, for each of the plurality of regions, the master module stores complete copies of the region in a subset of the data nodes.
Opening claim text (preview).
What is claimed is: 1. A clustered storage system comprising: a memory; one or more processors; and a master module that is in communication with one or more of multiple data nodes and that facilitates storage of data in the multiple data nodes, wherein the master module is configured to, when executed by the one or more processors: receive client data from a client system, wherein the client data comprises a data table including multiple rows and multiple columns; divide at least a portion of the data table, comprising at least two consecutive rows of the multiple rows and including less than all of the multiple rows, into two or more data files such that each data item in the portion of the data table with a common first column identifier is in a first of the two or more data files and each data item in the portion of the data table with a common second column identifier is in a second of the two or more data files; store the two or more data files in a primary data node by sending first file creation requests corresponding to each of the two or more data files; and store a replica of the portion of the data table, including replicas of the two or more data files, in a secondary data node by sending second file creation requests corresponding to each of the two or more data files, wherein the replica of the portion of the data table is created by: determining two or more of the multiple data nodes to be used to store the replica of the portion of the data table, wherein each of two or more data nodes comprises a region server that manages data access requests from clients to a set of regions managed by the region server and that communicates with a distributed file system; and for each set of regions, creating data locality between congruent regions within the set of regions and an associated region server that manages data access to the congruent regions by co-locating the congruent regions on a same data node on which the region server is running. 2. The clustered storage system of claim 1 , wherein processes for selecting the primary data node and selecting the secondary data node are each performed according to a placement policy maintained in association with the master module; wherein the first file creation requests each identify the selected primary data node; and wherein the second file creation requests each identify the selected secondary data node. 3. The clustered storage system of claim 1 , wherein the primary data node is stored on a first rack and the secondary data node is stored on a second rack different from the first rack; and wherein at least the first rack includes at least two data nodes. 4. The clustered storage system of claim 1 further comprising a metadata node that stores a mapping of each of the two or more data files to a corresponding set of consecutive rows of the multiple rows. 5. The clustered storage system of claim 1 , wherein each of the two or more data files are HFiles that are configured to have a file format for use with a scalable storage and processing system distributed across multiple clusters of computing systems. 6. The clustered storage system of claim 1 , wherein storing the replica of the portion of the data table includes: processing, by the master module, a plurality of write requests, each write request corresponding to one or more of the data files associated with the portion of the data table. 7. The clustered storage system of claim 1 , wherein data in at least one data file of the two or more data files is stored as two or more data blocks in an associated data node. 8. The clustered storage system of claim 7 further comprising a name node configured to store an index that includes a mapping of each of the two or more data blocks to a corresponding one of the two or more data files. 9. The clustered storage system of claim 1 , wherein the primary data node and the secondary data node constitute at least a portion of a scalable storage and processing system distributed across multiple clusters of computing systems. 10. A non-transitory computer-readable storage medium storing instructions of a master module that, when executed by a computing system, cause the computing system to perform operations for storing data in a clustered storage system, the clustered storage system including a plurality of storage nodes, the operations comprising: receiving data to be stored in the clustered storage system; and for at least one selected region of a plurality of regions of the data, each region comprising consecutive rows of the received data, dividing the selected region into two or more data files such that each of multiple data item in the selected region with a common first column identifier is in a first of the two or more data files and each of the multiple data item in the selected region with a common second column identifier different from the common first column identifier is in a second of the two or more data files; storing a first replica of the selected region, including first replicas of the two or more data files, in a first data node; and storing a second replica of the selected region, including second replicas of the two or more data files, in a second data node different from the first data node; wherein each of the plurality of storage nodes comprises a region server that manages data access requests from clients to a region managed by the region server and that communicates with a distributed file system; and wherein storing the first replica comprises creating data locality between the selected region, a region that is congruent with the selected region, and an associated region server that manages data access to the selected region and the congruent region by co-locating the selected region and the congruent region on a same data node on which the region server is running. 11. The non-transitory computer-readable storage medium of claim 10 , wherein storing the first replica of the selected region is performed by sending a file creation request corresponding to each of the two or more data files, the file creation request configured for a scalable storage and processing system distributed across multiple clusters of computing systems. 12. The non-transitory computer-readable storage medium of claim 10 , wherein the operations further comprise storing a mapping of each of the two or more data files to a corresponding region. 13. The non-transitory computer-readable storage medium of claim 10 , wherein the operations further comprise receiving a status report indicating the success of storing the selected region in a corresponding storage node. 14. The non-transitory computer-readable storage medium of claim 10 , wherein storing the first replica of the selected region and storing the second replica of the selected region includes processing a plurality of sub-write requests, wherein each sub-write request corresponds to one of the two or more of the data files. 15. The non-transitory computer-readable storage medium of claim 10 , wherein each of the two or more data files is stored as a set of one or more data blocks in a storage node, wherein the set of one or more data blocks corresponding to each of the two or more data files does not overlap with the one or more data blocks corresponding to any other of the two or more data files. 16. A method comprising: receiving, via a master module executing on a processing unit, data comprising a data table to be stored; and for at least one selected part of the data table, dividing, based on data columns of the data table, the selected part of the data table into tw
Provision of network file services by network file servers, e.g. by using NFS, CIFS (network file access protocols H04L67/1097) · CPC title
Distributed indices · CPC title
Techniques for file synchronisation in file systems · CPC title
based on file chunks · CPC title
Indexing; Data structures therefor; Storage structures · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.