Distributed, multi-model, self-learning platform for machine learning
US-2016132787-A1 · May 12, 2016 · US
US11276013B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11276013-B2 |
| Application number | US-201816146907-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 28, 2018 |
| Priority date | Mar 31, 2016 |
| Publication date | Mar 15, 2022 |
| Grant date | Mar 15, 2022 |
Official abstract text for this publication.
Methods and apparatuses for training a model based on random forest are provided. The method includes: dividing worker nodes into one or more groups; performing random sampling, by the worker nodes in each group, on preset sample data to obtain target sample data; and training, by the worker nodes in each group, one or more decision tree objects using the target sample data. Example embodiments of the present disclosure do not need to perform a full scan of the complete sample data, which greatly reduces the amount of data to be read, the time cost, and in turn the iterative update time of the model. Training efficiency is thereby improved.
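The grouping and random-sampling steps in the abstract can be sketched roughly as follows. This is a minimal single-process illustration, not the patented implementation: the function names `partition` and `distribute`, the `keep_prob` parameter, and the choice of an independent per-record coin flip for each receiving node are all assumptions made for the example.

```python
import random


def partition(samples, n_first):
    """Give each first-group node a disjoint slice of the sample data,
    so no single node has to read the complete data set."""
    return [samples[i::n_first] for i in range(n_first)]


def distribute(slices, n_second, keep_prob, seed=0):
    """Each first-group node offers every record it holds to every
    second-group node and sends it with probability keep_prob
    (hypothetical sampling rule, chosen only for illustration)."""
    rng = random.Random(seed)
    targets = [[] for _ in range(n_second)]
    for node_slice in slices:
        for record in node_slice:
            for r in range(n_second):
                if rng.random() < keep_prob:
                    targets[r].append(record)
    return targets


if __name__ == "__main__":
    data = list(range(20))
    slices = partition(data, n_first=4)
    targets = distribute(slices, n_second=3, keep_prob=0.5, seed=1)
    for idx, target in enumerate(targets):
        # Each second-group node would train its decision tree(s) on `target`.
        print(f"second-group node {idx}: {len(target)} records")
```

Each list in `targets` plays the role of the target sample data for one second-group node; in a real cluster the inner loops would run on separate machines or cores rather than in one process.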
Opening claim text (preview).
What is claimed is:
1. A method for training a model based on random forest, comprising: dividing, by a computer system, computer worker nodes of the computer system into a first group of first computer worker nodes and a second group of second computer worker nodes; obtaining, by each of the first computer worker nodes, a subset of sample data; distributing, by each of the first computer worker nodes, the obtained subset of sample data to the second computer worker nodes based on random sampling; obtaining, by each of one or more of the second computer worker nodes, target sample data from the random sampling; and training, by each of the one or more second computer worker nodes, one or more decision tree objects of the model based on random forest using the target sample data, wherein the training comprises: calculating a frequency of a value of attribute information of each sample object of the target sample data with respect to a classification column, wherein the classification column of the attribute information is dichotomous; normalizing the frequency to obtain a weight of the value of the attribute information of each sample object of the target sample data; sorting the value of the attribute information according to the weight; calculating a Gini coefficient using the sorted value of the attribute information; and performing a splitting process on a tree node of a decision tree object of the one or more decision tree objects according to the Gini coefficient.
2. The method according to claim 1, wherein the first group of first computer worker nodes are Map nodes and the second group of second computer worker nodes are Reduce nodes.
3. The method according to claim 1, wherein distributing the obtained subset of sample data to the second computer worker nodes based on random sampling comprises: for each of the second group of second computer worker nodes, distributing or not distributing the obtained subset of sample data in a random manner.
4. The method according to claim 1, wherein the training, by each of the one or more second computer worker nodes, one or more decision tree objects of the model based on random forest using the target sample data includes: training, by each of the one or more second computer worker nodes, a decision tree of a random forest of the model using the target sample data.
5. The method according to claim 1, wherein the calculating the Gini coefficient using the sorted value of the attribute information includes: dividing the sorted value of the attribute information into two attribute subsets according to an order of the sorting; and calculating the Gini coefficient using the two attribute subsets in sequence.
6. A computer system, comprising: one or more processors; and one or more memories storing thereon computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: dividing computer worker nodes of the computer system into a first group of first computer worker nodes and a second group of second computer worker nodes; obtaining, at each of the first computer worker nodes, a subset of sample data; distributing, at each of the first computer worker nodes, the obtained subset of sample data to the second computer worker nodes based on random sampling; obtaining, at each of one or more of the second computer worker nodes, target sample data from the random sampling; and training, at each of the one or more second computer worker nodes, one or more decision tree objects of a model based on random forest using the target sample data, wherein the training comprises: calculating a frequency of a value of attribute information of each sample object of the target sample data with respect to a classification column, wherein the classification column of the attribute information is dichotomous; normalizing the frequency to obtain a weight of the value of the attribute information of each sample object of the target sample data; sorting the value of the attribute information according to the weight; calculating a Gini coefficient using the sorted value of the attribute information; and performing a splitting process on a tree node of a decision tree object of the one or more decision tree objects according to the Gini coefficient.
7. The computer system according to claim 6, wherein the first group of first computer worker nodes are Map nodes and the second group of second computer worker nodes are Reduce nodes.
8. The computer system according to claim 6, wherein distributing the obtained subset of sample data to the second computer worker nodes based on random sampling comprises: for each of the second group of second computer worker nodes, distributing or not distributing the obtained subset of sample data in a random manner.
9. The computer system according to claim 6, wherein: the computer system is a single computer, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a core of a Central Processing Unit; or the computer system is a computer cluster, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a single computer.
10. The method according to claim 1, wherein: the computer system is a single computer, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a core of a Central Processing Unit; or the computer system is a computer cluster, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a single computer.
11. One or more non-transitory computer-readable storage media of a computer system, the non-transitory computer-readable storage media storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: dividing computer worker nodes of the computer system into a first group of first computer worker nodes and a second group of second computer worker nodes; obtaining, at each of the first computer worker nodes, a subset of sample data; distributing, at each of the first computer worker nodes, the obtained subset of sample data to the second computer worker nodes based on random sampling; obtaining, at each of one or more of the second computer worker nodes, target sample data from the random sampling; and training, at each of the one or more second computer worker nodes, one or more decision tree objects of a model based on random forest using the target sample data, wherein the training comprises: calculating a frequency of a value of attribute information of each sample object of the target sample data with respect to a classification column, wherein the classification column of the attribute information is dichotomous; normalizing the frequency to obtain a weight of the value of the attribute information of each sample object of the target sample data; sorting the value of the attribute information according to the weight; calculating a Gini coefficient using the sorted value of the attribute information; and performing a splitting process on a tree node of a decision tree object of the one or more decision tree objects according to the Gini coefficient.
12. The one or more non-transitory computer-readable storage media according to claim 11, wherein the first group of first computer worker nodes are Map nodes and the second group of second computer worker nodes are Reduce nodes.
13. The one or more non-transitory computer-readable storage media acc
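The splitting steps recited in claims 1 and 5 (frequency per attribute value against a two-class column, normalization to a weight, sorting by weight, then evaluating two-subset splits in sorted order) might look roughly like the sketch below. It rests on several assumptions: the "weight" is taken to be the fraction of positive-class samples for each attribute value, the "Gini coefficient" is computed as weighted Gini impurity, and the names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(pos, total):
    """Gini impurity of a two-class subset: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (score, left_subset) for one categorical attribute.

    values: attribute value of each sample; labels: 0/1 class of each sample.
    Returns (inf, None) when the attribute has fewer than two distinct values.
    """
    total = Counter(values)                                    # frequency of each value
    pos = Counter(v for v, y in zip(values, labels) if y == 1) # positive-class frequency
    weight = {v: pos[v] / total[v] for v in total}             # normalized frequency
    ordered = sorted(total, key=lambda v: (weight[v], str(v))) # sort values by weight

    n, n_pos = len(values), sum(pos.values())
    best_score, best_left = float("inf"), None
    left_n = left_pos = 0
    # Walk the sorted values; each prefix/suffix pair is one candidate split.
    for i in range(len(ordered) - 1):
        left_n += total[ordered[i]]
        left_pos += pos[ordered[i]]
        right_n, right_pos = n - left_n, n_pos - left_pos
        score = (left_n * gini(left_pos, left_n)
                 + right_n * gini(right_pos, right_n)) / n
        if score < best_score:
            best_score, best_left = score, set(ordered[: i + 1])
    return best_score, best_left


if __name__ == "__main__":
    values = ["a", "a", "b", "b", "c", "c"]
    labels = [0, 0, 1, 1, 1, 1]
    print(best_split(values, labels))
```

Sorting by weight means only k - 1 prefix splits need to be scored for k distinct values, instead of every possible two-way grouping of the values; that is the usual motivation for this ordering when the class column has exactly two classes.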
Tree-organised classifiers · CPC title
in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title
Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mapping at compile time, see G06F8/451) · CPC title
Trees, e.g. B+trees · CPC title
Ensemble learning · CPC title