Method and apparatus for training model based on random forest

US11276013B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11276013-B2
Application number: US-201816146907-A
Country: US
Kind code: B2
Filing date: Sep 28, 2018
Priority date: Mar 31, 2016
Publication date: Mar 15, 2022
Grant date: Mar 15, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and apparatuses for training model based on random forest are provided. The method includes: dividing worker nodes into one or more groups; performing random sampling, by worker nodes in each group, in the preset sample data to obtain the target sample data; and training, by the worker nodes in each group, one or more decision tree objects using the target sample data. Example embodiments of the present disclosure do not need to scan the complete sample data for once, thereby greatly reducing the amount of data to be read, the time cost, and further the iterative update time of the model. The efficiency of training is improved.
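The sampling step in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes one group of "reader" nodes that each scan only their own shard of the sample data, and a second group of "trainer" nodes that receive randomly forwarded records as their target sample data. The function names, the `rate` parameter, and the seeded generator are hypothetical and chosen only for the example.

```python
import random


def split_into_shards(samples, num_readers):
    """Give each reader node a disjoint slice, so no node scans the complete data."""
    return [samples[i::num_readers] for i in range(num_readers)]


def distribute(shard, num_trainers, rate, rng):
    """For every record and every trainer, randomly decide whether to forward it.

    `rate` is an assumed forwarding probability; the source does not specify one.
    """
    outboxes = [[] for _ in range(num_trainers)]
    for record in shard:
        for t in range(num_trainers):
            if rng.random() < rate:
                outboxes[t].append(record)
    return outboxes


def sample_for_trainers(samples, num_readers=2, num_trainers=3, rate=0.5, seed=0):
    """Collect, per trainer node, the target sample data forwarded by all readers."""
    rng = random.Random(seed)
    inboxes = [[] for _ in range(num_trainers)]
    for shard in split_into_shards(samples, num_readers):
        for t, batch in enumerate(distribute(shard, num_trainers, rate, rng)):
            inboxes[t].extend(batch)
    return inboxes


if __name__ == "__main__":
    data = list(range(20))
    for t, target in enumerate(sample_for_trainers(data)):
        print(f"trainer {t}: {len(target)} records")
```

Each trainer would then fit its own decision tree on the records it received. Because a record can be forwarded to several trainers or to none, the per-trainer samples overlap in the way bagging-style sampling usually does.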

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training a model based on random forest, comprising: dividing, by a computer system, computer worker nodes of the computer system into a first group of first computer worker nodes and a second group of second computer worker nodes; obtaining, by each of the first computer worker nodes, a subset of sample data; distributing, by each of the first computer worker nodes, the obtained subset of sample data to the second computer worker nodes based on random sampling; obtaining, by each of one or more of the second computer worker nodes, target sample data from the random sampling; and training, by each of the one or more second computer worker nodes, one or more decision tree objects of the model based on random forest using the target sample data, wherein the training comprises: calculating a frequency of a value of attribute information of each sample object of the target sample data with respect to a classification column, wherein the classification column of the attribute information is dichotomous; normalizing the frequency to obtain a weight of the value of the attribute information of each sample object of the target sample data; sorting the value of the attribute information according to the weight; calculating a Gini coefficient using the sorted value of the attribute information; and performing a splitting process on a tree node of a decision tree object of the one or more decision tree objects according to the Gini coefficient.

2. The method according to claim 1, wherein the first group of first computer worker nodes are Map nodes and the second group of second computer worker nodes are Reduce nodes.

3. The method according to claim 1, wherein distributing the obtained subset of sample data to the second computer worker nodes based on random sampling comprises: for each of the second group of second computer worker nodes, distributing or not distributing the obtained subset of sample data in a random manner.

4. The method according to claim 1, wherein the training, by each of the one or more second computer worker nodes, one or more decision tree objects of the model based on random forest using the target sample data includes: training, by each of the one or more second computer worker nodes, a decision tree of a random forest of the model using the target sample data.

5. The method according to claim 1, wherein the calculating the Gini coefficient using the sorted value of the attribute information includes: dividing the sorted value of the attribute information into two attribute subsets according to an order of the sorting; and calculating the Gini coefficient using the two attribute subsets in sequence.

6. A computer system, comprising: one or more processors; and one or more memories storing thereon computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: dividing computer worker nodes of the computer system into a first group of first computer worker nodes and a second group of second computer worker nodes; obtaining, at each of the first computer worker nodes, a subset of sample data; distributing, at each of the first computer worker nodes, the obtained subset of sample data to the second computer worker nodes based on random sampling; obtaining, at each of one or more of the second computer worker nodes, target sample data from the random sampling; and training, at each of the one or more second computer worker nodes, one or more decision tree objects of a model based on random forest using the target sample data, wherein the training comprises: calculating a frequency of a value of attribute information of each sample object of the target sample data with respect to a classification column, wherein the classification column of the attribute information is dichotomous; normalizing the frequency to obtain a weight of the value of the attribute information of each sample object of the target sample data; sorting the value of the attribute information according to the weight; calculating a Gini coefficient using the sorted value of the attribute information; and performing a splitting process on a tree node of a decision tree object of the one or more decision tree objects according to the Gini coefficient.

7. The computer system according to claim 6, wherein the first group of first computer worker nodes are Map nodes and the second group of second computer worker nodes are Reduce nodes.

8. The computer system according to claim 6, wherein distributing the obtained subset of sample data to the second computer worker nodes based on random sampling comprises: for each of the second group of second computer worker nodes, distributing or not distributing the obtained subset of sample data in a random manner.

9. The computer system according to claim 6, wherein: the computer system is a single computer, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a core of a Central Processing Unit; or the computer system is a computer cluster, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a single computer.

10. The method according to claim 1, wherein: the computer system is a single computer, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a core of a Central Processing Unit; or the computer system is a computer cluster, and each of the first group of first computer worker nodes and the second group of second computer worker nodes is a single computer.

11. One or more non-transitory computer-readable storage media of a computer system, the non-transitory computer-readable storage media storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: dividing computer worker nodes of the computer system into a first group of first computer worker nodes and a second group of second computer worker nodes; obtaining, at each of the first computer worker nodes, a subset of sample data; distributing, at each of the first computer worker nodes, the obtained subset of sample data to the second computer worker nodes based on random sampling; obtaining, at each of one or more of the second computer worker nodes, target sample data from the random sampling; and training, at each of the one or more second computer worker nodes, one or more decision tree objects of a model based on random forest using the target sample data, wherein the training comprises: calculating a frequency of a value of attribute information of each sample object of the target sample data with respect to a classification column, wherein the classification column of the attribute information is dichotomous; normalizing the frequency to obtain a weight of the value of the attribute information of each sample object of the target sample data; sorting the value of the attribute information according to the weight; calculating a Gini coefficient using the sorted value of the attribute information; and performing a splitting process on a tree node of a decision tree object of the one or more decision tree objects according to the Gini coefficient.

12. The one or more non-transitory computer-readable storage media according to claim 11, wherein the first group of first computer worker nodes are Map nodes and the second group of second computer worker nodes are Reduce nodes.

13. The one or more non-transitory computer-readable storage media acc
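
The training steps recited in claim 1 (frequency, weight, sort, Gini, split) together with the two-subset scan of claim 5 can be illustrated with a short sketch. It is only one plausible reading of the claim language: the weight of an attribute value is assumed to be the share of positive-class samples among the samples carrying that value, the "Gini coefficient" is taken to mean the usual two-class Gini impurity, and the function names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(pos, total):
    """Gini impurity of a two-class group with `pos` positives out of `total`."""
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Pick a two-subset split of a categorical attribute for a 0/1 label.

    Returns (weighted_gini, left_values, right_values), or None when the
    attribute has fewer than two distinct values.
    """
    totals = Counter(values)
    positives = Counter(v for v, y in zip(values, labels) if y == 1)
    # Assumed normalisation: frequency of the positive class per attribute value.
    weights = {v: positives[v] / totals[v] for v in totals}
    # Sort attribute values by weight; ties are broken by name for determinism.
    ordered = sorted(totals, key=lambda v: (weights[v], str(v)))

    n = len(values)
    n_pos = sum(positives.values())
    best = None
    left_n = left_pos = 0
    # Walk the sorted values, cutting them into a left and a right subset in order.
    for i in range(len(ordered) - 1):
        left_n += totals[ordered[i]]
        left_pos += positives[ordered[i]]
        right_n = n - left_n
        right_pos = n_pos - left_pos
        score = (left_n * gini(left_pos, left_n)
                 + right_n * gini(right_pos, right_n)) / n
        if best is None or score < best[0]:
            best = (score, set(ordered[: i + 1]), set(ordered[i + 1:]))
    return best


if __name__ == "__main__":
    colours = ["red", "red", "green", "blue", "blue"]
    clicked = [0, 0, 0, 1, 1]
    print(best_split(colours, clicked))
```

Sorting by weight means that only k - 1 ordered cuts need to be scored for an attribute with k distinct values, rather than every possible two-way grouping of those values, which is the usual motivation for this ordering when the label is dichotomous.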

Assignees

  • Alibaba Group Holding Ltd

Inventors

Classifications

  • Tree-organised classifiers · CPC title

  • H04L67/10 · Primary

    in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title

  • Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mapping at compile time, see G06F8/451) · CPC title

  • Trees, e.g. B+trees · CPC title

  • Ensemble learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11276013B2 cover?
Methods and apparatuses for training model based on random forest are provided. The method includes: dividing worker nodes into one or more groups; performing random sampling, by worker nodes in each group, in the preset sample data to obtain the target sample data; and training, by the worker nodes in each group, one or more decision tree objects using the target sample data. Example embodimen…
Who is the assignee on this patent?
Alibaba Group Holding Ltd
What technology area does this patent fall under?
Primary CPC classification G06F18/24323. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 15, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).