Feature engineering using interactive learning between structured and unstructured data

US2023029218A1 · US · A1

Patent metadata
Publication number: US-2023029218-A1
Application number: US-202117380189-A
Country: US
Kind code: A1
Filing date: Jul 20, 2021
Priority date: Jul 20, 2021
Publication date: Jan 26, 2023
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A concept associated with a feature used in a first machine learning model can be determined, the feature extracted from a first data source. A second data source containing the concept can be identified. An additional feature can be generated by performing natural language processing on the second data source. The feature and the additional feature can be merged. A second machine learning model can be generated, which uses the merged feature. A prediction result of the first machine learning model can be compared with a prediction result of the second machine learning model relative to ground truth data, to evaluate the effectiveness of the merged feature. Based on the evaluated effectiveness, the feature can be augmented with the merged feature in machine learning.
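
As a rough illustration of the first two steps in the abstract (finding concepts associated with a feature, then finding a second data source that mentions one of them), the Python sketch below walks a small concept graph breadth-first and filters a set of text notes. The graph contents, the `related_concepts` and `find_sources` helpers, the depth limit, and the sample notes are all hypothetical; the publication does not specify a graph format or a matching rule, and plain substring matching here merely stands in for whatever natural language processing an implementation would use.

```python
from collections import deque

# Hypothetical concept graph: each concept maps to its neighbouring concepts.
CONCEPT_GRAPH = {
    "blood_pressure": ["hypertension", "cardiovascular"],
    "hypertension": ["stroke risk"],
    "cardiovascular": [],
    "stroke risk": [],
}


def related_concepts(graph, feature, max_depth=2):
    """Breadth-first traversal: concepts reachable within max_depth hops of feature."""
    seen = {feature}
    order = []
    queue = deque([(feature, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append((neighbour, depth + 1))
    return order


def find_sources(documents, concepts):
    """Return ids of documents whose text mentions any concept (case-insensitive)."""
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if any(concept.lower() in lowered for concept in concepts):
            hits.append(doc_id)
    return hits


# Hypothetical unstructured data source: free-text notes keyed by id.
NOTES = {
    "note_1": "Patient has a history of hypertension.",
    "note_2": "Routine dental check-up, no findings.",
}

concepts = related_concepts(CONCEPT_GRAPH, "blood_pressure")
matching = find_sources(NOTES, concepts)
print(concepts)
print(matching)
```

Breadth-first order is only one possible choice; it visits the concepts closest to the feature first, which makes a depth limit easy to apply.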

First claim

Opening claim text (preview).

What is claimed is:

1. A system for feature engineering in a machine learning pipeline, comprising: a processor; a memory device coupled with the processor; the processor configured to at least: receive a feature extracted from a first data source, the feature used in a first machine learning model, the first machine learning model built to predict an outcome; determine a concept associated with the feature by traversing a concept graph; identify a second data source containing the concept associated with the feature; generate an additional feature by performing a natural language processing on the second data source; merge the feature and the additional feature; generate a second machine learning model for predicting the outcome using the merged feature; run the second machine learning model; compare a prediction result of the first machine learning model with a prediction result of the second machine learning model relative to ground truth data, to evaluate effective of the merged feature; and based on the evaluated effectiveness, augment the feature using the merged feature in machine learning.

2. The system of claim 1, wherein, to merge the feature and the additional feature, the processor is configured to fill-in a missing information associated with the feature based on the additional feature.

3. The system of claim 1, wherein, to merge the feature and the additional feature, the processor is configured to replace the feature with the additional feature.

4. The system of claim 1, wherein, to merge the feature and the additional feature, the processor is configured to add the additional feature to the feature.

5. The system of claim 1, wherein the processor is further configured to automatically generate the concept graph responsive to determining that the concept graph associated with the feature is not available.

6. The system of claim 1, wherein the first data source includes structured data source.

7. The system of claim 1, wherein the second data source includes unstructured data source.

8. The system of claim 1, wherein, to generate a second machine learning model, the processor is configured to train the second machine learning model based on a training set augmented with a plurality of data values associated with the merged feature.

9. The system of claim 1, wherein the processor is configured to determine the feature extracted from a first data source based on importance of the feature to the first machine learning model's outcome.

10. A method of feature engineering in a machine learning pipeline, comprising: receiving a feature extracted from a first data source, the feature used in a first machine learning model, the first machine learning model built to predict an outcome; determining a concept associated with the feature by traversing a concept graph; identifying a second data source containing the concept associated with the feature; generating an additional feature by performing a natural language processing on the second data source; merging the feature and the additional feature; generating a second machine learning model for predicting the outcome using the merged feature; running the second machine learning model; comparing a prediction result of the first machine learning model with a prediction result of the second machine learning model relative to ground truth data, to evaluate effective of the merged feature; and based on the evaluated effectiveness, augmenting the feature using the merged feature in machine learning.

11. The method of claim 10, wherein the merging of the feature and the additional feature includes filling-in a missing information associated with the feature based on the additional feature.

12. The method of claim 10, wherein the merging of the feature and the additional feature includes replacing the feature with the additional feature.

13. The method of claim 10, wherein the merging of the feature and the additional feature includes adding the additional feature to the feature.

14. The method of claim 10, further including automatically generating the concept graph responsive to determining that the concept graph associated with the feature is not available.

15. The method of claim 10, wherein the first data source includes structured data source.

16. The method of claim 10, wherein the second data source includes unstructured data source.

17. The method of claim 10, wherein generating a second machine learning model include training the second machine learning model based on a training set augmented with a plurality of data values associated with the merged feature.

18. The method of claim 10, wherein the feature extracted from a first data source is determined based on importance of the feature to the first machine learning model's prediction outcome.

19. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions readable by a device to cause the device to: receive a feature extracted from a first data source, the feature used in a first machine learning model, the first machine learning model built to predict an outcome; determine a concept associated with the feature by traversing a concept graph; identify a second data source containing the concept associated with the feature; generate an additional feature by performing a natural language processing on the second data source; merge the feature and the additional feature; generate a second machine learning model for predicting the outcome using the merged feature; run the second machine learning model; compare a prediction result of the first machine learning model with a prediction result of the second machine learning model relative to ground truth data, to evaluate effective of the merged feature; and based on the evaluated effectiveness, augment the feature using the merged feature in machine learning.

20. The computer program product of claim 19, wherein the device is caused to automatically generate the concept graph responsive to determining that the concept graph associated with the feature is not available.
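
The three merge variants in the dependent claims (fill in missing values, replace, add) and the final comparison against ground truth can be pictured with the minimal Python sketch below. Everything in it is an assumption made for illustration: the function names, the use of `None` to mark missing values, the sample numbers, accuracy as the comparison metric, and the "strictly better" rule for keeping the merged feature. The claims only require comparing the two models' predictions relative to ground truth data and do not name a metric or threshold.

```python
def merge_fill_in(feature, additional):
    """Fill-in variant: take the additional value only where the feature is missing (None)."""
    return [a if f is None else f for f, a in zip(feature, additional)]


def merge_replace(feature, additional):
    """Replace variant: use the additional feature in place of the original one."""
    return list(additional)


def merge_add(feature, additional):
    """Add variant: keep both values side by side for each record."""
    return list(zip(feature, additional))


def accuracy(predictions, ground_truth):
    """Fraction of predictions that equal the ground truth labels."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


def keep_merged_feature(first_preds, second_preds, ground_truth):
    """Assumed rule: keep the merged feature only if the second model scores strictly higher."""
    return accuracy(second_preds, ground_truth) > accuracy(first_preds, ground_truth)


# Hypothetical values: a structured column with gaps, and values extracted from text.
structured = [120, None, 135, None]
from_text = [118, 142, 131, 127]
filled = merge_fill_in(structured, from_text)

# Hypothetical predictions from the first and second models, plus ground truth labels.
truth = [1, 0, 1, 1]
first_model_preds = [1, 1, 0, 1]
second_model_preds = [1, 0, 1, 1]
decision = keep_merged_feature(first_model_preds, second_model_preds, truth)
print(filled, decision)
```

In practice the metric would depend on the prediction task (for example, an error measure for regression), so `accuracy` should be read as a placeholder for that choice.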

Classifications

  • Physics · mapped topic

  • Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title

  • G06N20/20 · Primary

    Ensemble learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023029218A1 cover?
A concept associated with a feature used in a first machine learning model can be determined, the feature extracted from a first data source. A second data source containing the concept can be identified. An additional feature can be generated by performing natural language processing on the second data source. The feature and the additional feature can be merged. A second machine learning model can …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N20/20. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 26, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).