Automated feature engineering for machine learning models
US-2022076164-A1 · Mar 10, 2022 · US
US11620550B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11620550-B2 |
| Application number | US-202016989876-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 10, 2020 |
| Priority date | Aug 10, 2020 |
| Publication date | Apr 4, 2023 |
| Grant date | Apr 4, 2023 |
Official abstract text for this publication.
Embodiments relate to a system, program product, and method for leveraging cognitive systems to facilitate the automated data table discovery for automated machine learning, and, more specifically, to leveraging a trained cognitive system to automatically search for additional data in an external data source that may be merged with an initial user-selected data table to generate a more robust machine learning model. Manual efforts to find and validate data appropriate for building and training a particular model for a particular task are significantly reduced. Specifically, a learning-based approach to leverage with machine learning models to automatically discover related datasets and join the datasets for a given initial dataset is disclosed herein. Operations that include dataset selection facilitate continued reinforcement learning of the systems.
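The join step mentioned in the abstract can be illustrated with a minimal sketch. It assumes both tables are lists of dictionaries and share a key column; the function name `left_join`, the `customer_id` key, and the sample rows are all hypothetical and do not come from the patent. A left join is only one possible choice, since the publication does not fix a join type here.

```python
from typing import Dict, List

Row = Dict[str, object]


def left_join(first: List[Row], second: List[Row], key: str) -> List[Row]:
    """Left-join `second` onto `first` by `key`, keeping every row of `first`."""
    # Index the second table by key; on duplicate keys the first row wins
    # (an assumption made for this sketch).
    index: Dict[object, Row] = {}
    for row in second:
        index.setdefault(row[key], row)

    # Columns the second table contributes beyond the join key.
    extra_cols = sorted({col for row in second for col in row if col != key})

    joined: List[Row] = []
    for row in first:
        match = index.get(row[key], {})
        merged = dict(row)
        for col in extra_cols:
            # Never overwrite a column that already exists in the first table.
            if col not in merged:
                merged[col] = match.get(col)  # None when there is no match
        joined.append(merged)
    return joined


# Hypothetical initial (user-selected) table and a discovered table.
first_dataset = [
    {"customer_id": 1, "churned": 0},
    {"customer_id": 2, "churned": 1},
    {"customer_id": 3, "churned": 0},
]
second_dataset = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
]

joined_dataset = left_join(first_dataset, second_dataset, key="customer_id")
```

Every row of the initial table survives the join, and rows without a match receive `None` in the added columns, so the joined table can still be passed to a model-building step that handles missing values.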
Opening claim text (preview).
What is claimed is:
1. A computer system comprising: a server comprising at least one processing device and at least one memory device operably coupled to the at least one processing device; and a data repository in operable communication with the server, the server configured to: receive a first dataset; access, automatically, the data repository, wherein the data repository includes a plurality of stored datasets; determine, automatically, from the plurality of stored datasets, one or more candidate datasets to further generate the one or more machine learning models, the one or more candidate datasets including resident data therein, at least a portion of the resident data complementary with the first dataset; facilitate selection of, from the one or more candidate datasets, one or more second datasets; join at least a portion of the complementary resident data from the one or more second datasets with the first dataset, thereby generating a joined dataset; and generate, automatically, through the joined dataset, one or more machine learning models.
2. The system of claim 1, wherein the computer system is a cognitive system.
3. The system of claim 2, wherein the cognitive system is an artificial intelligence (AI) platform, the method further comprising: the AI platform resident within the server, the AI platform in operable communication with the data repository, the AI platform comprising: a data manager comprising an autoAI table discovery module configured to facilitate execution of one or more operations by the server comprising the automatic determination of the one or more candidate datasets comprising: execute a first vector analysis, comprising: convert the first dataset into a first dataset vector; convert each stored dataset of the plurality of stored datasets into a respective stored dataset vector; compare the first dataset vector with each stored dataset vector, thereby generate one or more dataset vectors' comparisons; score each dataset vectors' comparison; and determine, subject to the one or more dataset vectors' comparisons, the respective score of the one or more dataset vectors' comparisons exceeds a predetermined first threshold, wherein the predetermined first threshold is at least partially established through training one or more machine learning dataset vector comparison algorithms.
4. The system of claim 3, further comprising one or more user interfaces, wherein the autoAI table discovery module is further configured to facilitate the execution of one or more operations by the server comprising the selection of the one or more second datasets comprising: present, via the one or more user interfaces, the one or more candidate datasets; and facilitate the selection of, from the one or more candidate datasets, the one or more second datasets.
5. The system of claim 4, wherein the autoAI table discovery module is further configured to facilitate the execution of one or more operations by the server comprising the training of the one or more machine learning dataset vector comparison algorithms comprising: analysis of historical selections of the one or more second datasets as a function of application of the one or more second datasets to respective machine learning models.
6. The system of claim 3, wherein: the first dataset is configured into a tabular arrangement including a plurality of first columns and first rows of first data; and each second dataset of the one or more second datasets is configured into a tabular arrangement including a plurality of second columns and second rows of second data.
7. The system of claim 6, wherein the autoAI table discovery module is further configured to facilitate the execution of one or more operations by the server comprising: execute a second vector analysis, comprising: convert each second column of the plurality of second columns into a respective second column vector; compare the first dataset vector with each respective second column vector, thereby generating one or more first dataset-second column vectors' comparisons; score each first dataset-second column vectors' comparison; determine, subject to the first dataset-second column vectors' comparisons, the respective score of each first dataset-second column vectors' comparison exceeds a predetermined second threshold, wherein the predetermined second threshold is at least partially established through training the one or more machine learning dataset vector comparison algorithms; determine, automatically, one or more suggested second columns for joining with the first dataset; select, from the one or more suggested second columns, one or more second columns for joining with the first dataset; and join the selected one or more second columns with the first dataset.
8. A computer program product, comprising: one or more computer readable storage media; and program instructions collectively stored on the one or more computer storage media, the program instructions comprising: program instructions to receive a first dataset; program instructions to access, automatically, the data repository, wherein the data repository includes a plurality of stored datasets; program instructions to determine, automatically, from the plurality of stored datasets, one or more candidate datasets to further generate the one or more machine learning models, the one or more candidate datasets including resident data therein, at least a portion of the resident data complementary with the first dataset; program instructions to facilitate selection of, from the one or more candidate datasets, one or more second datasets; program instructions to join at least a portion of the complementary resident data from the one or more second datasets with the first dataset, thereby generating a joined dataset; and program instructions to generate, automatically, through the joined dataset, one or more machine learning models.
9. The computer program product of claim 8, further comprising: program instructions to execute a first vector analysis, comprising: program instructions to convert the first dataset into a first dataset vector; program instructions to convert each stored dataset of the plurality of stored datasets into a respective stored dataset vector; program instructions to compare the first dataset vector with each stored dataset vector, thereby generate one or more dataset vectors' comparisons; program instructions to score each dataset vectors' comparison; and program instructions to determine, subject to the one or more dataset vectors' comparisons, the respective score of the one or more dataset vectors' comparisons exceeds a predetermined first threshold, wherein the predetermined first threshold is at least partially established through training one or more machine learning dataset vector comparison algorithms.
10. The computer program product of claim 9, further comprising: program instructions to present, via the one or more user interfaces, the one or more candidate datasets.
11. The computer program product of claim 10, further comprising: program instructions to analyze historical selections of the one or more second datasets as a function of application of the one or more second datasets to respective machine learning models.
12. The computer program product of claim 9, further comprising: program instructions to configure the first dataset into a tabular arrangement including a plurality of first columns and first rows of first data; and program instructions to configure each second dataset of the one or more second datasets into a tabular arrangement including a plurality of second
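The first vector analysis recited in claims 3 and 9 (convert each dataset to a vector, compare, score, keep scores above a threshold) might look roughly like the sketch below. The claims do not say how the vectors are built or how the comparison is scored, so the token-count vectors, the cosine similarity score, the function names, the sample tables, and the fixed threshold of 0.5 are all assumptions chosen for illustration. In the claims the threshold is at least partially established through training, which this sketch does not attempt.

```python
import math
import re
from collections import Counter
from typing import Dict, List

Row = Dict[str, object]


def dataset_vector(rows: List[Row]) -> Counter:
    """Turn a table into a sparse token-count vector (illustrative only)."""
    tokens: Counter = Counter()
    for row in rows:
        for col, value in row.items():
            # Column names and cell values both contribute tokens.
            tokens.update(re.findall(r"[a-z0-9]+", f"{col} {value}".lower()))
    return tokens


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def candidate_datasets(
    first: List[Row], stored: Dict[str, List[Row]], threshold: float
) -> Dict[str, float]:
    """Score every stored dataset against `first`; keep scores above threshold."""
    first_vec = dataset_vector(first)
    scores = {
        name: cosine(first_vec, dataset_vector(rows)) for name, rows in stored.items()
    }
    return {name: score for name, score in scores.items() if score > threshold}


# Hypothetical first dataset and repository contents.
first_dataset = [
    {"customer_id": 1, "churned": 0},
    {"customer_id": 2, "churned": 1},
]
repository = {
    "customer_regions": [
        {"customer_id": 1, "region": "north"},
        {"customer_id": 2, "region": "south"},
    ],
    "weather": [{"station": "KSEA", "temp_c": 11}],
}

# 0.5 is a placeholder threshold, not a value from the patent.
candidates = candidate_datasets(first_dataset, repository, threshold=0.5)
```

With these sample tables only `customer_regions` passes the threshold, because it shares column-name and key tokens with the first dataset while `weather` shares none. The second vector analysis of claim 7 could reuse the same scoring by building one vector per column of a selected dataset instead of one per dataset.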
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Supervised learning · CPC title
Reinforcement learning · CPC title
Convolutional networks [CNN, ConvNet] · CPC title
Join operations · CPC title