Decision tree learning with missing data

US11934928B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11934928-B2
Application numberUS-202318131638-A
CountryUS
Kind codeB2
Filing dateApr 6, 2023
Priority dateFeb 19, 2020
Publication dateMar 19, 2024
Grant dateMar 19, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are various embodiments for using decision trees for machine-learning when data is missing from a data set. A first attribute for splitting a plurality of records is identified. Then, the plurality of records are split into a first subset of records and a second subset of records. The first subset of records can include each of the plurality of records where a value is present for the first attribute and the second subset of records comprising each of the plurality of records where the value is absent for the first attribute. Finally, a node can be added to a decision tree that reflects the split of the plurality of records into the first subset of records and the second subset of records.

First claim

Opening claim text (preview).

Therefore, the following is claimed: 1. A system, comprising: a computing device comprising a processor and a memory; and machine-readable instructions stored in the memory that, when executed by the processor, cause the computing device to at least: receive a plurality of records; generate a set of binary encodings for each record in the plurality of records; identify a first attribute for splitting the plurality of records based at least in part on a selection criterion; identify a value for the first attribute in the plurality of records based at least in part on the set of binary encodings; split the plurality of records into a first subset of records and a second subset of records, wherein the first subset of records comprises the plurality of records where the value is present for the first attribute and the second subset of records comprises the plurality of records where the value is absent for the first attribute; and add a node to a decision tree that reflects the split of the plurality of records. 2. The system of claim 1 , wherein the selection criterion is at least one of a Gini impurity, a variance reduction, or an amount of information gain. 3. The system of 1 , wherein the node is a first node and the machine-readable instructions, when executed, further cause the computing device to at least: split the first subset of records into a third subset of records and a fourth subset of records based at least in part on the first attribute; and add a second node to the decision tree that reflects the split of the first subset of records into the third subset of records and the fourth subset of records. 4. The system of claim 3 , wherein the machine-readable instructions, when executed by the processor, further cause the computing device to at least: identify a second attribute to be a second unique identifier for splitting the second subset of records based at least in part on the selection criterion; and split the second subset of records into a fifth subset of records and a sixth subset of records, the fifth subset of records comprising individual ones of the plurality of records where the value is present for the second attribute and the sixth subset of records comprising individual ones of the plurality of records where the value is absent for the second attribute. 5. The system of claim 1 , wherein the machine-readable instructions that cause the computing device to identify the first attribute for splitting the plurality of records further cause the computing device to at least: calculate a Gini impurity for each attribute of a plurality of attributes associated with each of the plurality of records; and identify the first attribute based at least in part on the Gini impurity. 6. The system of claim 1 , wherein the machine-readable instructions that cause the computing device to identify the first attribute for splitting the plurality of records further cause the computing device to at least: calculate an amount of information gain for each attribute of a plurality of attributes associated with each of the plurality of records; and identify the first attribute based at least in part on the calculated amount of information gain. 7. The system of claim 1 , wherein the machine-readable instructions that cause the computing device to identify the first attribute for splitting the plurality of records further cause the computing device to at least: calculate a variance reduction for each attribute of a plurality of attributes associated with each of the plurality of records; and identify the first attribute based at least in part on the calculated variance reduction. 8. The system of claim 1 , wherein the machine-readable instructions further cause the computing device to determine a stopping criterion, wherein the stopping criterion is based at least in part on one or more of: a predefined number of splits; or a predefined number of levels in the decision tree. 9. A computer-implemented method, comprising: receiving a plurality of records; generating a set of binary encodings for each record in the plurality of records; identifying a first attribute for splitting the plurality of records based at least in part on a selection criterion; identifying a value for the first attribute in the plurality of records based at least in part on the set of binary encodings; splitting the plurality of records into a first subset of records and a second subset of records, wherein the first subset of records comprises the plurality of records where the value is present for the first attribute and the second subset of records comprises the plurality of records where the value is absent for the first attribute; and adding a node to a decision tree that reflects the split of the plurality of records. 10. The computer-implemented method of claim 9 , wherein the selection criterion is at least one of a Gini impurity, a variance reduction, or an amount of information gain. 11. The computer-implemented method of claim 9 , wherein the node is a first node and the method further comprises: splitting the first subset of records into a third subset of records and a fourth subset of records based at least in part on the first attribute; and adding a second node to the decision tree that reflects the split of the first subset of records into the third subset of records and the fourth subset of records. 12. The computer-implemented method of claim 11 , further comprising: identifying a second attribute to be a second unique identifier for splitting the second subset of records based at least in part on the selection criterion; and splitting the second subset of records into a fifth subset of records and a sixth subset of records, the fifth subset of records comprising individual ones of the plurality of records where the value is present for the second attribute and the sixth subset of records comprising individual ones of the plurality of records where the value is absent for the second attribute. 13. The computer-implemented method of claim 9 , wherein identifying the first attribute for splitting the plurality of records further comprises: calculating a Gini impurity for each attribute of a plurality of attributes associated with each of the plurality of records; and identifying the first attribute based at least in part on the Gini impurity. 14. The computer-implemented method of claim 9 , wherein identifying the first attribute for splitting the plurality of records further comprises: calculating an amount of information gain for each attribute of a plurality of attributes associated with each of the plurality of records; and identifying the first attribute based at least in part on the calculated amount of information gain. 15. The computer-implemented method of claim 9 , wherein identifying the first attribute for splitting the plurality of records further comprises: calculating a variance reduction for each attribute of a plurality of attributes associated with each of the plurality of records; and identifying the first attribute based at least in part on the calculated variance reduction. 16. The computer-implemented method of claim 9 , further comprising determining a stopping criterion, wherein the stopping criterion is based at least in part on one or more of a predefined number of splits or a predefined number of levels in the decision tree. 17. A non-transitory, computer-readable medium, comprising machine-readable instructions stored in a memory that, when executed by a processor of a computing device, cause the computing devic

Assignees

Inventors

Classifications

  • G06N20/00Primary

    Machine learning · CPC title

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11934928B2 cover?
Disclosed are various embodiments for using decision trees for machine-learning when data is missing from a data set. A first attribute for splitting a plurality of records is identified. Then, the plurality of records are split into a first subset of records and a second subset of records. The first subset of records can include each of the plurality of records where a value is present for the…
Who is the assignee on this patent?
American Express Travel Related Services Co Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Mar 19 2024 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).