Sharing a partitioned data set across parallel applications
US-2016070608-A1 · Mar 10, 2016 · US
US11210279B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11210279-B2 |
| Application number | US-201715487391-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 13, 2017 |
| Priority date | Apr 15, 2016 |
| Publication date | Dec 28, 2021 |
| Grant date | Dec 28, 2021 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A distributed offline indexing system uses a set of data processing systems in a distributed computing environment to create an index database that can store, for example, data about geographical or geospatial areas. The index database can be distributed across a plurality of database shards such that each shard includes an index file and a DB file. The index files include keys that refer to values in their corresponding DB files. The keys are used to look-up their corresponding values at search time. At indexing time, the keys are hashed, with an entropy creating hash, to distribute the keys across the shards.
Opening claim text (preview).
What is claimed is: 1. A non-transitory machine readable medium storing executable instructions which when executed by one or more data processing systems that implement a partitioned database in a distributed computing environment, cause the one or more data processing systems to perform a method comprising: receiving a plurality of inputs, each input comprising a key and value pair to be stored in a partition of a plurality, N, of partitions of the partitioned database, where in each partition comprises a database shard having a Bloom filter that stores values representing geospatial tiles or regions in a map; computing, for each key, a single hash value using a hash function modulo N, the hash function configured to provide entropy across hash values for different keys to distribute the keys substantially evenly across the plurality of N partitions; determining, based on the hash value of each key, modulo N, a partition identifier and mapping the corresponding key to the determined partition identifier; for each partition identifier having one or more keys mapped to the partition having the partition identifier: distributing the one or more keys and their paired values to the partition having the partition identifier; performing, separately within each partition on a data processing system dedicated to processing the partition; sorting the one or more keys distributed to the partition into an ordering to provide a sorted set of keys mapped to the partition; storing the sorted keys into an index file and storing their paired values in a database file, wherein each key refers to its paired value in the database file within the partition; wherein the processing operations for the partition further include processing one or more queries directed to one or more keys that, when hashed modulo N, map to the partition; wherein a key and the value associated with the key are accessed by hashing the key modulo N to determine the partition associated with the key and sending the key and an access request to the determined partition; and in response to determining that the key is not present in a tile of the Bloom filter for the partition, determining that the key is not present in the partition, such that no central index or central mapping of keys to partitions is maintained in the distributed computing system. 2. The medium as in claim 1 wherein each database shard is an append-only file. 3. The medium as in claim 1 wherein the keys represent different geospatial areas and keys for at least one city or geospatial region are dispersed across most, or all shards, due to the entropy. 4. The medium as in claim 3 wherein the entropy balances an indexing workload and storage usage across all of the data processing systems that process all of the shards and wherein the entropy disperses keys for one geospatial area across a plurality of shards. 5. The medium as in claim 1 further comprising: receiving a query against the partitioned database; determining one or more keys derived from the query; for each key of the one or more keys: determining a partition identifier corresponding to the key by hashing the key, modulo N; transmitting the key to the partition having the partition identifier to query the partition for a value corresponding to the key; and returning a value corresponding to the key, in response to determining that the key matches a key in the index file stored in the partition. 6. The medium as in claim 1 wherein each partition has a set of one or more data processing systems dedicated to sorting and storing within the partition and wherein determining partition identifiers is performed by a single data processing system. 7. The medium as in claim 1 wherein accessing a value having a key in a determined partition includes: determining whether the key is present in a tile or region of a Bloom filter for the partition, and in response to the key being present in the tile or region of the Bloom filter, searching the index file of the partition for the key, and returning the value corresponding to the key, otherwise: returning a null value. 8. The medium as in claim 1 wherein the values stored in the Bloom filter are quadtree keys representing tiles in a quadtree layout that represents a geospatial area. 9. The medium as in claim 8 wherein the quadtree keys are represented in base 10 values with a shift offset to provide unicity across all nodes in the quadtree layout. 10. The medium of claim 1 , wherein the hash function is a secure hash function (“SHA”). 11. The medium of claim 1 , wherein generating and indexing the partitioned database are performed offline. 12. A method practiced one or more data processing systems that implement a partitioned database in a distributed computing environment, the method comprising: receiving a plurality of inputs, each input comprising a key and value pair to be stored in a partition of a plurality, N, of partitions of the partitioned database, wherein each partition comprises a database shard having a Bloom filter that stores values representing geospatial tiles or regions in a map; computing, for each key, a single hash value using a hash function, modulo N, the hash function configured to provide entropy across hash values for different keys to distribute the keys substantially evenly across the plurality of N partitions; determining, based on the hash value of each key modulo N, a partition identifier and mapping the corresponding key to the determined partition identifier; for each partition identifier having one or more keys mapped to the partition identifier: distributing the one or more keys and their paired values to the partition having the partition identifier; performing, separately within each partition on a data processing system dedicated to processing the partition; sorting the one or more keys distributed to the partition into an ordering to provide a sorted set of keys mapped to the partition; storing the sorted keys into an index file and storing their paired values, in database file in the partition, wherein each key refers to its paired value in the database file within the partition; wherein the processing operations for the partition further include processing one or more queries directed to one or more keys that, when hashed modulo N, map to the partition; wherein a key and the value associated with the key are accessed by hashing the key modulo N to determine the partition associated with the key and sending the key and an access request to the determined partition; and in response to determining that the key is not present in a tile of the Bloom filter for the partition, determining that the key is not present in the partition, and no central index or mapping of keys to partitions is maintained in the distributed computing system. 13. The method as in claim 12 wherein each database shard is an append-only file. 14. The method as in claim 12 wherein the keys represent different geospatial areas and keys for at least one city or geospatial region are dispersed across most, or all, shards due to the entropy. 15. The method as in claim 14 wherein the entropy balances an indexing workload and storage usage across all of the data processing systems that process all of the shards and wherein the entropy disperses keys for one geospatial area across a plurality of shards. 16. The method as in claim 12 further comprising: receiving a query against the partitioned database; determining one or more keys derived from the query; for each key of the one or more keys: determining a partition i
Data partitioning, e.g. horizontal or vertical partitioning · CPC title
Hash tables · CPC title
Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers {sorting methods in general}(G06F7/36 takes precedence) · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.