Who is the assignee on this patent?

Microsoft Technology Licensing Llc

What technology area does this patent fall under?

Primary CPC classification G06N20/00. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Mar 07 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Pre-statistics of data for node of decision tree

US11600005B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11600005-B2
Application number	US-201816479536-A
Country	US
Kind code	B2
Filing date	Jan 16, 2018
Priority date	Jan 20, 2017
Publication date	Mar 7, 2023
Grant date	Mar 7, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the subject matter described herein relate to generating a decision tree based on data pre-statistics. A plurality of data samples for a node of the decision tree are obtained, and the plurality of data samples have corresponding feature values with respect to a first feature. A target range is determined from a plurality of predefined numerical ranges so that the number of feature values falling into the target range is greater than a predetermined threshold number. Then, the remaining of the feature values other than the feature values falling into the target range are assigned to the respective numerical ranges, and the feature values falling into all the numerical ranges are counted based on the assignment of the remaining of the feature values, for allocation of the plurality of data samples to child nodes of the node. Accordingly, the data processing efficiency is substantially improved.

First claim

Opening claim text (preview).

The invention claimed is: 1. A method of data processing based on a decision tree, comprising: determining that a number of samples in a set of data samples for a node of the decision tree is below a first threshold number; selecting all data samples in the set of data samples as a plurality of data samples, the plurality of data samples having corresponding feature values with respect to a first feature; determining, from a plurality of numerical ranges, a target range, with a number of feature values falling into the target range being greater than a second threshold number; assigning the remaining of the feature values, other than the feature values falling into the target range, to the plurality of numerical ranges; and counting, based on the assignment of the remaining of the feature values, the feature values falling into the plurality of numerical ranges, for allocation of the plurality of data samples to child nodes of the node. 2. The method according to claim 1 , wherein counting the feature values comprises: subtracting, by the number of feature values falling into the remaining of the numerical ranges other than the target range, from a total number of the plurality of data samples, as the number of feature values falling into the target range. 3. The method according to claim 1 , wherein the data processing is performed in distribution on a plurality of machines, the first threshold number being determined at least in part based on a first product of a number of features in the set of features and a number of the plurality of numerical ranges. 4. The method according to claim 3 , wherein the first threshold number is further determined at least in part based on a second product of a number of the plurality of machines and the first product. 5. The method according to claim 3 , wherein the child nodes include at least a left child node and a right child node, and the method further comprises: obtaining, at a first machine of the plurality of machines, a second feature and a threshold feature value of the second feature; selecting a subset of data samples from the set of data samples; and for each data sample in the subset of data samples, comparing a feature value of the data sample with respect to the second feature with the threshold feature value, determining, based on the comparison, whether the data sample is allocated to the left child node or the right child node, and sending, to a second machine of the plurality of machines, an indicator of one bit for indicating the determination. 6. The method according to claim 3 , further comprising: in response to the number of samples being greater than the first threshold number, selecting, at the first machine of the plurality of machines, a subset of the set of data samples as the plurality of data samples. 7. The method according to claim 1 , further comprising: allocating, based on the counting of the feature values, the data samples from the node to the child nodes; determining whether a plurality of leaf node candidates of the decision tree have been obtained, a number of the plurality of leaf node candidates being greater than a third threshold number; in response to determining that the plurality of leaf node candidates have been obtained, obtaining a bottom-layer sub-tree of the decision tree, the bottom-layer sub-tree including the leaf node candidates having a common parent node and the parent node; and in response to differences between data samples for at least one of the plurality of leaf node candidates in the bottom-layer sub-tree being below a threshold difference, removing the at least one of the plurality of leaf node candidates in the bottom-layer sub-tree. 8. The method according to claim 1 , wherein a feature value refers to a value indicating relevance between a data sample and a feature. 9. The method according to claim 1 , wherein the feature values are stored in storage areas allocated to respective ones of the plurality of numerical ranges with the feature values falling into the target range not being stored. 10. An electronic device, comprising: a processing unit; and a memory coupled to the processing unit and storing instructions, which, when executed by the processing unit, perform data processing based on a decision tree, comprising actions: determining that a number of samples in a set of data samples for a node of the decision tree is below a first threshold number; selecting all data samples in the set of data samples as a plurality of data samples, the plurality of data samples having corresponding feature values with respect to a first feature; determining, from a plurality of numerical ranges, a target range, with a number of feature values falling into the target range being greater than a second threshold number; assigning the remaining of the feature values, other than the feature values falling into the target range, to the plurality of numerical ranges; and counting, based on the assignment of the remaining of the feature values, the feature values falling into the plurality of numerical ranges, for allocation of the plurality of data samples to child nodes of the node. 11. The device according to claim 10 , wherein the data processing is performed in distribution on a plurality of machines, the second first threshold number being determined at least in part based on a first product of a number of features in the set of features and a number of the plurality of numerical ranges. 12. The device according to claim 11 , wherein the first threshold number is further determined at least in part based on a second product of a number of the plurality of machines and the first product. 13. The device according to claim 11 , wherein the child nodes include at least a left child node and a right child node, and the actions further comprise: obtaining, at a first machine of the plurality of machines, a second feature and a threshold feature value of the second feature; selecting a subset of data samples from the set of data samples; and for each data sample in the subset of data samples, comparing a feature value of the data sample with respect to the second feature with the threshold feature value, determining, based on the comparison, whether the data sample is allocated to the left child node or the right child node, and sending, to a second machine of the plurality of machines, an indicator of one bit for indicating the determination. 14. The device according to claim 11 , wherein the actions further comprise: in response to the number of samples being greater than the first threshold number, selecting a subset of the set of data samples as the plurality of data samples. 15. The device according to claim 10 , wherein the actions further comprise: allocating, based on the counting of the feature values, the data samples from the node to the child nodes; determining whether a plurality of leaf node candidates of the decision tree have been obtained, a number of the plurality of leaf node candidates being greater than a third threshold number; in response to determining that the plurality of leaf node candidates have been obtained, obtaining a bottom-layer sub-tree of the decision tree, the bottom-layer sub-tree including the leaf node candidates having a common parent node and the parent node; and in response to differences between data samples for at least one of the plurality of leaf node candidates in the bottom-layer sub-tree being below a threshold difference, removing the at least one of the plurality of leaf node candidates in the bottom-layer sub-tree.

Assignees

Microsoft Technology Licensing Llc

Inventors

Classifications

G06N5/01
Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title
G06N20/00Primary
Machine learning · CPC title
G06F16/9027
Trees · CPC title
G06F18/24323
Tree-organised classifiers · CPC title
G06T7/162Primary
involving graph-based methods · CPC title

Patent family

Related publications grouped by family.

View patent family 61163800

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11600005B2 cover?: Embodiments of the subject matter described herein relate to generating a decision tree based on data pre-statistics. A plurality of data samples for a node of the decision tree are obtained, and the plurality of data samples have corresponding feature values with respect to a first feature. A target range is determined from a plurality of predefined numerical ranges so that the number of featu…
Who is the assignee on this patent?: Microsoft Technology Licensing Llc
What technology area does this patent fall under?: Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Mar 07 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).