US2023029218A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2023029218-A1 |
| Application number | US-202117380189-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 20, 2021 |
| Priority date | Jul 20, 2021 |
| Publication date | Jan 26, 2023 |
| Grant date | — |
Official abstract text for this publication.
A concept associated with a feature used in a first machine learning model can be determined, the feature extracted from a first data source. A second data source containing the concept can be identified. An additional feature can be generated by performing natural language processing on the second data source. The feature and the additional feature can be merged. A second machine learning model can be generated, which uses the merged feature. A prediction result of the first machine learning model can be compared with a prediction result of the second machine learning model relative to ground truth data, to evaluate the effectiveness of the merged feature. Based on the evaluated effectiveness, the feature can be augmented with the merged feature in machine learning.
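The comparison step in the abstract can be illustrated with a minimal sketch. It assumes classification labels and uses plain accuracy as the effectiveness measure; the publication does not name a metric, so `accuracy`, `should_augment`, and the `min_gain` threshold are hypothetical names chosen for illustration, and the prediction lists are made-up toy data.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == t for p, t in zip(predictions, ground_truth))
    return hits / len(ground_truth)


def should_augment(
    first_preds: Sequence[int],
    second_preds: Sequence[int],
    ground_truth: Sequence[int],
    min_gain: float = 0.0,
) -> bool:
    """Return True when the second model (merged feature) beats the first
    model by more than min_gain. Accuracy is an assumed metric."""
    gain = accuracy(second_preds, ground_truth) - accuracy(first_preds, ground_truth)
    return gain > min_gain


# Toy data, invented for this example only.
ground_truth = [1, 0, 1, 1, 0, 1]
first_model_preds = [1, 0, 0, 1, 1, 1]   # model trained on the original feature
second_model_preds = [1, 0, 1, 1, 1, 1]  # model trained on the merged feature

decision = should_augment(first_model_preds, second_model_preds, ground_truth)
```

Here `decision` indicates whether the merged feature would be kept. A real pipeline would likely use a held-out evaluation set and a metric suited to the prediction task.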
Opening claim text (preview).
What is claimed is:
1. A system for feature engineering in a machine learning pipeline, comprising: a processor; a memory device coupled with the processor; the processor configured to at least: receive a feature extracted from a first data source, the feature used in a first machine learning model, the first machine learning model built to predict an outcome; determine a concept associated with the feature by traversing a concept graph; identify a second data source containing the concept associated with the feature; generate an additional feature by performing a natural language processing on the second data source; merge the feature and the additional feature; generate a second machine learning model for predicting the outcome using the merged feature; run the second machine learning model; compare a prediction result of the first machine learning model with a prediction result of the second machine learning model relative to ground truth data, to evaluate effective of the merged feature; and based on the evaluated effectiveness, augment the feature using the merged feature in machine learning.
2. The system of claim 1, wherein, to merge the feature and the additional feature, the processor is configured to fill-in a missing information associated with the feature based on the additional feature.
3. The system of claim 1, wherein, to merge the feature and the additional feature, the processor is configured to replace the feature with the additional feature.
4. The system of claim 1, wherein, to merge the feature and the additional feature, the processor is configured to add the additional feature to the feature.
5. The system of claim 1, wherein the processor is further configured to automatically generate the concept graph responsive to determining that the concept graph associated with the feature is not available.
6. The system of claim 1, wherein the first data source includes structured data source.
7. The system of claim 1, wherein the second data source includes unstructured data source.
8. The system of claim 1, wherein, to generate a second machine learning model, the processor is configured to train the second machine learning model based on a training set augmented with a plurality of data values associated with the merged feature.
9. The system of claim 1, wherein the processor is configured to determine the feature extracted from a first data source based on importance of the feature to the first machine learning model's outcome.
10. A method of feature engineering in a machine learning pipeline, comprising: receiving a feature extracted from a first data source, the feature used in a first machine learning model, the first machine learning model built to predict an outcome; determining a concept associated with the feature by traversing a concept graph; identifying a second data source containing the concept associated with the feature; generating an additional feature by performing a natural language processing on the second data source; merging the feature and the additional feature; generating a second machine learning model for predicting the outcome using the merged feature; running the second machine learning model; comparing a prediction result of the first machine learning model with a prediction result of the second machine learning model relative to ground truth data, to evaluate effective of the merged feature; and based on the evaluated effectiveness, augmenting the feature using the merged feature in machine learning.
11. The method of claim 10, wherein the merging of the feature and the additional feature includes filling-in a missing information associated with the feature based on the additional feature.
12. The method of claim 10, wherein the merging of the feature and the additional feature includes replacing the feature with the additional feature.
13. The method of claim 10, wherein the merging of the feature and the additional feature includes adding the additional feature to the feature.
14. The method of claim 10, further including automatically generating the concept graph responsive to determining that the concept graph associated with the feature is not available.
15. The method of claim 10, wherein the first data source includes structured data source.
16. The method of claim 10, wherein the second data source includes unstructured data source.
17. The method of claim 10, wherein generating a second machine learning model include training the second machine learning model based on a training set augmented with a plurality of data values associated with the merged feature.
18. The method of claim 10, wherein the feature extracted from a first data source is determined based on importance of the feature to the first machine learning model's prediction outcome.
19. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions readable by a device to cause the device to: receive a feature extracted from a first data source, the feature used in a first machine learning model, the first machine learning model built to predict an outcome; determine a concept associated with the feature by traversing a concept graph; identify a second data source containing the concept associated with the feature; generate an additional feature by performing a natural language processing on the second data source; merge the feature and the additional feature; generate a second machine learning model for predicting the outcome using the merged feature; run the second machine learning model; compare a prediction result of the first machine learning model with a prediction result of the second machine learning model relative to ground truth data, to evaluate effective of the merged feature; and based on the evaluated effectiveness, augment the feature using the merged feature in machine learning.
20. The computer program product of claim 19, wherein the device is caused to automatically generate the concept graph responsive to determining that the concept graph associated with the feature is not available.
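The dependent claims name three ways of merging the feature with the additional feature: filling in missing information, replacing the feature, and adding the additional feature alongside it. The sketch below shows one possible reading of each. It assumes a feature is a mapping from a record id to a value, with `None` marking a missing value; that representation, the function names, and the sample values are all hypothetical and are not taken from the publication.

```python
from typing import Dict, Optional

# Assumed representation: record id -> feature value (None means missing).
Column = Dict[str, Optional[float]]


def merge_fill_in(feature: Column, additional: Column) -> Column:
    """Keep existing values; fill only the missing ones from the additional feature."""
    return {
        key: (value if value is not None else additional.get(key))
        for key, value in feature.items()
    }


def merge_replace(feature: Column, additional: Column) -> Column:
    """Replace every value of the feature with the additional feature's value."""
    return {key: additional.get(key) for key in feature}


def merge_add(feature: Column, additional: Column) -> Dict[str, Column]:
    """Keep both features side by side as separate columns."""
    return {"feature": dict(feature), "additional_feature": dict(additional)}


# Invented sample values: a structured-source feature with a gap, and an
# additional feature as might be produced by NLP over an unstructured source.
feature = {"r1": 120.0, "r2": None, "r3": 135.0}
additional = {"r1": 118.0, "r2": 142.0, "r3": None}

filled = merge_fill_in(feature, additional)
replaced = merge_replace(feature, additional)
added = merge_add(feature, additional)
```

Fill-in preserves the structured values and only uses the text-derived values where the structured source has gaps, while replace trusts the text-derived values entirely. Which mode suits a given feature would be decided by the evaluation step recited in the independent claims.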
Physics · mapped topic
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
Ensemble learning · CPC title