Interactive feature engineering in automatic machine learning with domain knowledge

US12412102B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12412102-B2
Application number: US-202117317242-A
Country: US
Kind code: B2
Filing date: May 11, 2021
Priority date: May 11, 2021
Publication date: Sep 9, 2025
Grant date: Sep 9, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A dataset including features and values associated with the features can be received. Each of the features in the dataset can be mapped to a corresponding node in a knowledge graph based on the concept represented by the corresponding node. The knowledge graph can be traversed to find a candidate node connected to at least one mapped node, the candidate node not being mapped to a feature in the dataset. A concept associated with the candidate node can be identified as a new feature. A machine learning model pipeline can use the features in the dataset and the new feature to select a subset of features for training a machine learning model.
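The steps in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the toy graph `KNOWLEDGE_GRAPH`, the name-based matcher in `map_features`, and the helper `find_candidates` do not come from the patent, which does not specify how concepts are matched or how the graph is stored. The sketch only looks at nodes directly connected to a mapped node.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge graph: concept -> directly connected concepts.
KNOWLEDGE_GRAPH: Dict[str, Set[str]] = {
    "height": {"body mass index"},
    "weight": {"body mass index"},
    "body mass index": {"height", "weight", "obesity risk"},
    "obesity risk": {"body mass index"},
    "age": set(),
}


def map_features(features: List[str], graph: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each dataset feature to a graph node with a matching concept name.

    Assumption: a simple normalized-name match stands in for whatever
    concept matching a real system would use.
    """
    mapping: Dict[str, str] = {}
    for feature in features:
        concept = feature.lower().replace("_", " ")
        if concept in graph:
            mapping[feature] = concept
    return mapping


def find_candidates(mapping: Dict[str, str], graph: Dict[str, Set[str]]) -> List[str]:
    """Return unmapped nodes directly connected to at least one mapped node."""
    mapped_nodes = set(mapping.values())
    candidates: Set[str] = set()
    for node in mapped_nodes:
        # Keep only neighbors that are not themselves mapped to a feature.
        candidates.update(graph[node] - mapped_nodes)
    return sorted(candidates)


features = ["Height", "Weight", "Age"]
mapping = map_features(features, KNOWLEDGE_GRAPH)
new_features = find_candidates(mapping, KNOWLEDGE_GRAPH)
print(new_features)
```

In this toy example the only unmapped neighbor of the mapped nodes is "body mass index", so it is the concept proposed as a new feature. A downstream pipeline would then consider it alongside the original features during feature selection.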

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: receiving, by at least one processor performing feature engineering in an automatic machine learning pipeline, a dataset including features; mapping, by the at least one processor, the features in the dataset to nodes of a knowledge graph, the nodes representing respective concepts, the knowledge graph further including edges connecting the nodes and representing respective relationships between respective two of the nodes that a respective edge connects, wherein the mapping occurs based on the concept represented by the corresponding node; traversing, by the at least one processor, the knowledge graph to find a candidate node existing in the knowledge graph and connected to at least one mapped node that is mapped to at least one feature in the dataset, the candidate node not being mapped to any of the features in the dataset; identifying, by the at least one processor, a concept associated with the candidate node as a new feature; presenting a user interface that includes the new feature and the respective concepts mapped to the features as user-engageable elements that are changeable via user input; and receiving, via user input into one or more of the user-engageable elements of the user interface, a change for the new feature, wherein the automatic machine learning pipeline uses the features in the dataset and the changed new feature to select a subset of features for training a machine learning model.

2. The method of claim 1, wherein the candidate node identified in the knowledge graph is a first distance away from the at least one mapped node, the first distance being within a pre-determined threshold.

3. The method of claim 1, wherein the candidate node identified in the knowledge graph includes a formula for deriving a concept associated with the candidate node.

4. The method of claim 1, wherein the candidate node identified in the knowledge graph includes a formula for deriving, using the feature in the dataset mapped to the candidate node, a concept associated with the candidate node.

5. The method of claim 1, wherein the at least one mapped node includes two or more nodes.

6. The method of claim 1, further including causing presenting of each of the features in the dataset with a concept of the corresponding mapped node.

7. The method of claim 6, further including allowing a user to change the presented concept.

8. The method of claim 6, wherein said each of the features in the dataset with a concept of the corresponding mapped node is visualized as a table of features.

9. The method of claim 1, further including training the machine learning model using the subset.

10. A system comprising: a processor; and a memory device coupled with the processor; the processor configured at least to: receive a dataset including features and values associated with the features in performing feature engineering in an automatic machine learning pipeline; map the features in the dataset to nodes of a knowledge graph, the nodes representing respective concepts, the knowledge graph further including edges connecting the nodes and representing respective relationships between respective two of the nodes that a respective edge connects, wherein the mapping occurs based on the concept represented by the corresponding node; traverse the knowledge graph to find a candidate node existing in the knowledge graph and connected to at least one mapped node that is mapped to at least one feature in the dataset, the candidate node not being mapped to any of the features in the dataset; identify a concept associated with the candidate node as a new feature; present a user interface that includes the new feature and the respective concepts mapped to the features as user-engageable elements that are changeable via user input; and receive, via user input into one or more of the user-engageable elements of the user interface, a change for the new feature, wherein the automatic machine learning pipeline uses the features in the dataset and the changed new feature to select a subset of features for training a machine learning model.

11. The system of claim 10, wherein the candidate node identified in the knowledge graph is a first distance away from the at least one mapped node, the first distance being within a pre-determined threshold.

12. The system of claim 10, wherein the candidate node identified in the knowledge graph includes a formula for deriving the new feature associated with the candidate node.

13. The system of claim 10, wherein the candidate node identified in the knowledge graph includes a formula for deriving the new feature, using the feature in the dataset mapped to the candidate node.

14. The system of claim 10, wherein the at least one mapped node includes two or more nodes.

15. The system of claim 10, wherein the user interface is configured to present each of the features in the dataset with a concept of the corresponding mapped node in a row and column format.

16. The system of claim 15, wherein the user interface allows a user to change the presented concept.

17. The system of claim 15, wherein the user interface is configured to allow a user to modify the new feature.

18. The system of claim 10, wherein the processor is further configured to train the machine learning model using the subset.

19. A computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions readable by a device to cause the device to: receive a dataset including features and values associated with the features in performing feature engineering in an automatic machine learning pipeline; map the features in the dataset to nodes of a knowledge graph, the nodes representing respective concepts, the knowledge graph further including edges connecting the nodes and representing respective relationships between respective two of the nodes that a respective edge connects, wherein the mapping occurs based on the concept represented by the corresponding node; traverse the knowledge graph to find a candidate node existing in the knowledge graph and connected to at least one mapped node that is mapped to at least one feature in the dataset, the candidate node not being mapped to any of the features in the dataset; identify a concept associated with the candidate node as a new feature; present a user interface that includes the new feature and the respective concepts mapped to the features as user-engageable elements that are changeable via user input; and receive, via user input into one or more of the user-engageable elements of the user interface, a change for the new feature, wherein the automatic machine learning pipeline uses the features in the dataset and the changed new feature to select a subset of features for training a machine learning model.

20. A system comprising: a processor; and a user interface; the processor configured at least to: receive a dataset including features and values associated with the features in performing feature engineering in an automatic machine learning pipeline; map the features in the dataset to nodes of a knowledge graph, the nodes representing respective concepts, the knowledge graph further including edges connecting the nodes and representing respective relationships between respective two of the nodes that a respective edge connects, wherein the mapping occurs based on the concept represented by the corresponding node; travers
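The dependent claims add a distance threshold for candidate nodes, a formula carried by the candidate node, and a user change to the new feature. The Python sketch below illustrates these three ideas under stated assumptions. The graph contents, the `FORMULAS` table, the function names, and the choice to model the user change as a simple rename are all hypothetical and are not taken from the patent text.

```python
from collections import deque
from typing import Callable, Dict, List, Set

Row = Dict[str, float]

# Hypothetical toy knowledge graph: concept -> directly connected concepts.
GRAPH: Dict[str, Set[str]] = {
    "height": {"body mass index"},
    "weight": {"body mass index"},
    "body mass index": {"height", "weight", "obesity risk"},
    "obesity risk": {"body mass index", "diabetes risk"},
    "diabetes risk": {"obesity risk"},
}

# Hypothetical formulas attached to candidate nodes.
FORMULAS: Dict[str, Callable[[Row], float]] = {
    "body mass index": lambda row: row["weight"] / row["height"] ** 2,
}


def candidates_within(
    graph: Dict[str, Set[str]], mapped_nodes: Set[str], max_distance: int
) -> Dict[str, int]:
    """Multi-source breadth-first search from the mapped nodes.

    Returns each unmapped node whose distance to the nearest mapped node
    is at most max_distance, together with that distance.
    """
    distance = {node: 0 for node in mapped_nodes}
    queue = deque(mapped_nodes)
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand past the threshold
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return {node: d for node, d in distance.items() if d > 0}


def derive_feature(
    rows: List[Row],
    concept: str,
    formulas: Dict[str, Callable[[Row], float]],
    user_changes: Dict[str, str],
) -> List[Row]:
    """Add the derived feature to every row, using the user's name if changed."""
    name = user_changes.get(concept, concept)
    formula = formulas[concept]
    return [{**row, name: formula(row)} for row in rows]


nearby = candidates_within(GRAPH, {"height", "weight"}, max_distance=2)
rows = [{"height": 2.0, "weight": 80.0}]
# Assumed user input: the user renames the suggested feature to "bmi".
augmented = derive_feature(rows, "body mass index", FORMULAS, {"body mass index": "bmi"})
print(nearby)
print(augmented)
```

With a threshold of 2, "body mass index" (distance 1) and "obesity risk" (distance 2) qualify, while "diabetes risk" (distance 3) is excluded. Only the first has a formula in this toy setup, so it is the one that can be computed from the existing columns.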

Assignees

Inventors

Classifications

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Selection of the most significant subset of features · CPC title

  • Machine learning · CPC title

  • G06N5/022 · Primary

    Knowledge engineering; Knowledge acquisition · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12412102B2 cover?
A dataset including features and values associated with the features can be received. Each of the features in the dataset can be mapped to a corresponding node in a knowledge graph based on the concept represented by the corresponding node. The knowledge graph can be traversed to find a candidate node connected to at least one mapped node, the candidate node not being mapped to a feature in the…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N5/022. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 9, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).