Content-aware task assignment in distributed computing systems using de-duplicating cache
US-2016179581-A1 · Jun 23, 2016 · US
US9684689B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9684689-B2 |
| Application number | US-201514612802-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 3, 2015 |
| Priority date | Feb 3, 2015 |
| Publication date | Jun 20, 2017 |
| Grant date | Jun 20, 2017 |
Official abstract text for this publication.
The system and method of the present disclosure relate to data stored in a common database of a network for parallel processing by multiple processors or processing centers. As consumers and businesses continue to generate more and more data, the amount of data being stored across networks and computing environments increases. To monitor and process increasingly large amounts of data, the system and method of the present disclosure utilize the atomicity of certain databases and storage devices to efficiently identify and mark data for processing by a designated processor or processing center, such that the designated processor or processing center is responsible for processing the identified and marked data. Consequently, the system ensures that no two processors or processing centers are processing the same data at the same time without the use of schedulers, queues or other conventional techniques.
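The mark-then-match idea in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the SQLite table `items`, its `marker` column, and the helpers `make_store` and `claim_rows` are all hypothetical names chosen for illustration, and SQLite merely stands in for a common store with atomic updates.

```python
import sqlite3
import uuid


def make_store(n_rows):
    """Create an in-memory table with one extra 'marker' column per row (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, marker TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (payload) VALUES (?)",
        [(f"record-{i}",) for i in range(n_rows)],
    )
    conn.commit()
    return conn


def claim_rows(conn, batch_size):
    """Mark up to batch_size unmarked rows with a fresh random token, then
    return the token and the ids of the rows whose marker matches it."""
    # Random value; it does not identify the node that generated it.
    token = uuid.uuid4().hex
    with conn:  # one transaction, so the conditional UPDATE is applied atomically
        conn.execute(
            "UPDATE items SET marker = ? WHERE id IN "
            "(SELECT id FROM items WHERE marker IS NULL ORDER BY id LIMIT ?)",
            (token, batch_size),
        )
    # Match step: only rows carrying this token belong to the caller.
    rows = conn.execute(
        "SELECT id FROM items WHERE marker = ? ORDER BY id", (token,)
    ).fetchall()
    return token, [r[0] for r in rows]
```

Two successive calls to `claim_rows` on the same store receive disjoint sets of rows, because the second call only sees rows whose `marker` is still empty. No scheduler or queue is involved; the store's atomic update is what arbitrates between callers.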
Opening claim text (preview).
What is claimed is:
1. A method of distributing processing jobs to multiple processing nodes of a distributed parallel processing system, comprising: accessing subsets of data from a data set stored in a common storage system by a respective one of the processing nodes; generating unique identifiers for each of the subsets of data, the unique identifiers generated by the respective one of the processing nodes having accessed the respective subset of data, the unique identifiers do not identify any of the multiple processing nodes; marking each of the subsets of data with the respective one of the unique identifiers by the respective one of the processing nodes, the marked subsets of data provided to the common storage system for updating the respective subsets of data and the respective one of the unique identifiers preventing other processing nodes from accessing the respective subset of data from the common storage system during processing; and individually identifying the updated subsets of data to be processed by the respective one of the processing nodes by matching the unique identifiers in the updated subsets of data to the unique identifiers in the marked subsets of data stored in the common storage system and generated by the respective one of the processing nodes, wherein the unique identifier is a random value that authenticates the marking in the subsets of data and wherein each of the respective processing nodes having successfully matched the unique identifiers performs distributed parallel processing jobs on the subsets of data.
2. The method of claim 1, further comprising: fetching the updated subsets of data by each of the respective processing nodes when the unique identifiers have been successfully matched; processing the updated subsets of data by each of the respective processing nodes until completion of processing or release of the updated subsets of data; and releasing the updated subsets of data by each of the respective processing nodes when the unique identifier in the updated subsets of data respectively fail to match the unique identifier of the marked subsets of data generated by the respective one of the processing nodes, such that another one of the processing nodes may process the updated subsets of data.
3. The method of claim 2, further comprising: updating the unique identifier in the updated subsets of data to indicate completion of the processing when processing of the updated subsets of data has been completed, and otherwise updating the unique identifier to indicate a current status of processing by the respective one of the processing nodes.
4. The method of claim 2, wherein the common storage system is one of a server, database, data repository, memory, storage, data source and datastore.
5. The method of claim 2, wherein, after a predetermined amount of time of inactivity by the one processing node, the subset of data is released to the another one of the processing nodes for processing.
6. The method of claim 1, wherein the processing nodes are each a processor, processing entity or any combination thereof.
7. The method of claim 1, wherein the individually identifying the updated subsets of data comprises: searching the subsets of data in the common storage system until the unique identifiers have been matched, and after identifying the matching unique identifiers, fetching the subsets of data for processing by the respective one of the processing nodes, the unique identifier is a random value generated by one of the processing nodes.
8. The method of claim 1, wherein each subset of data stored in the common storage system is formatted as a table of data including rows and columns with one column added to each row including the unique identifier.
9. The method of claim 1, wherein the common storage system has atomic characteristics.
10. The method of claim 1, wherein the generated unique identifier is an attribute indicative of a type of data of the subset of data, such that the one processing node processes the subset of data when the attribute matches an attribute of the one processing node indicative of a type of data processing.
11. An apparatus to distribute and process data in a distributed and parallel processing environment, comprising: a common data source to store a dataset for parallel processing; a plurality of processing entities, including one or more processors, configured to receive different portions of the data set from the common data source for parallel processing in order to perform parallel jobs on the different portions of the dataset; and a first processing entity of the processing entities configured to receive a first portion of the dataset, the first processing entity configured to generate a first distinct identifier and a first status indicator, the first processing entity configured to mark the first portion of the dataset with the first distinct identifier and the first status indicator, the first processing entity configured to provide the marked first portion of the dataset to the common data source for updating the first portion of the dataset to reflect the marking added to the first portion of the dataset, the first processing entity configured to identify the updated first portion of the dataset stored in the common data source by comparing the marking to the generated first distinct identifier and the first status indicator, wherein the first identifier is a random value to authenticate the marking in the updated first portion of the dataset, and the first distinct identifier not identifying any of the processing entities, the first processing entity configured to process the updated first portion of the dataset by the first processing entity when the generated first distinct identifier and the first status indicator stored in the common data source respectively match the first distinct identifier and the first status indicator of the updated first portion of the dataset reflecting the marking such that no other of the processing entities processes the same updated first portion of the dataset at the same time.
12. The apparatus of claim 11, the first processing entity configured to fetch the updated first portion of the dataset identified as having been successfully matched; the first processing entity configured to process the updated first portion of the dataset until completion of processing or release of the portion of the dataset; and the first processing entity configured to release the updated first portion of the dataset when the generated first distinct identifier and the first status indicator respectively fail to match the first distinct identifier and the first status indicator of the updated first portion of the dataset reflecting the marking, such that another of the processing entities processes the updated first portion of the dataset.
13. The apparatus of claim 12, the first processing entity configured to update the first status indicator in the updated portion of the dataset to indicate completion of the processing when processing has been completed, and otherwise updating the status indicator to indicate a current status of processing.
14. The apparatus of claim 12, wherein, after a predetermined amount of time of inactivity by the first processing entity, the first portion of the dataset is released to the another processing entity for processing.
15. The apparatus of claim 11, wherein the first processing entity is a processor, processing center or any combination thereof.
16. The apparatus of claim 11, wherein the first portion of the dataset stored in the com
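The release and status behaviour recited in the dependent claims (release on a failed match, a completion status, release after a period of inactivity) might be sketched as follows. This is an illustrative, single-process model only: the `Subset` record, its field names, the status strings, and the helpers `release_stale` and `finish` are assumptions, and a real common store would have to apply these updates atomically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Subset:
    """Hypothetical record for one subset of data in the common store."""
    payload: str
    marker: Optional[str] = None  # random token of the current holder, if any
    status: str = "pending"       # assumed values: "pending", "processing", "done"
    last_touched: float = 0.0     # time of the holder's last activity


def release_stale(store, now, timeout):
    """Clear the marker on subsets whose holder has been inactive longer than timeout."""
    released = []
    for key, subset in store.items():
        if subset.status == "processing" and now - subset.last_touched > timeout:
            subset.marker, subset.status = None, "pending"
            released.append(key)
    return released


def finish(store, key, token, now):
    """Record completion, but only if the subset still carries the caller's token."""
    subset = store[key]
    if subset.marker != token:
        return False  # mismatch: the caller gives up the subset
    subset.status, subset.last_touched = "done", now
    return True
```

In this sketch a node that lost its marker to the inactivity timeout finds out at `finish` time, when its token no longer matches, and leaves the subset for another node to pick up.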
Techniques for rebalancing the load in a distributed system · CPC title
the resource being the memory · CPC title
Updates performed during online database operations; commit processing · CPC title
Indexing; Data structures therefor; Storage structures · CPC title
the resource being a machine, e.g. CPUs, Servers, Terminals · CPC title