Automated data table discovery for automated machine learning

US11620550B2 · US · B2

Patent metadata
Publication number: US-11620550-B2
Application number: US-202016989876-A
Country: US
Kind code: B2
Filing date: Aug 10, 2020
Priority date: Aug 10, 2020
Publication date: Apr 4, 2023
Grant date: Apr 4, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments relate to a system, program product, and method for leveraging cognitive systems to facilitate the automated data table discovery for automated machine learning, and, more specifically, to leveraging a trained cognitive system to automatically search for additional data in an external data source that may be merged with an initial user-selected data table to generate a more robust machine learning model. Manual efforts to find and validate data appropriate for building and training a particular model for a particular task are significantly reduced. Specifically, a learning-based approach to leverage with machine learning models to automatically discover related datasets and join the datasets for a given initial dataset is disclosed herein. Operations that include dataset selection facilitate continued reinforcement learning of the systems.
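The merge step mentioned in the abstract can be illustrated with a minimal sketch. It assumes both tables are lists of row dictionaries that share a key column; the function name `left_join`, the column names, and the sample rows are hypothetical and do not come from the patent, which does not prescribe a particular join implementation.

```python
def left_join(first_rows, second_rows, key, columns):
    """Left-join selected columns of second_rows onto first_rows by key.

    Assumption: each table is a list of dicts. If second_rows repeats a
    key, the last matching row wins.
    """
    lookup = {row[key]: row for row in second_rows}
    joined = []
    for row in first_rows:
        match = lookup.get(row[key], {})
        # Rows with no match keep their data and get None for new columns.
        extra = {col: match.get(col) for col in columns}
        joined.append({**row, **extra})
    return joined


# Hypothetical initial (user-selected) table and a discovered table.
first = [
    {"customer_id": 1, "purchase_amount": 9.5},
    {"customer_id": 2, "purchase_amount": 20.0},
    {"customer_id": 3, "purchase_amount": 4.0},
]
second = [
    {"customer_id": 1, "customer_age": 34},
    {"customer_id": 2, "customer_age": 51},
]

joined = left_join(first, second, key="customer_id", columns=["customer_age"])
```

A left join is used here so that no row of the initial table is lost; unmatched rows simply carry `None` in the added column, which a downstream model-building step would need to handle.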

First claim

Opening claim text (preview).

What is claimed is:

1. A computer system comprising: a server comprising at least one processing device and at least one memory device operably coupled to the at least one processing device; and a data repository in operable communication with the server, the server configured to: receive a first dataset; access, automatically, the data repository, wherein the data repository includes a plurality of stored datasets; determine, automatically, from the plurality of stored datasets, one or more candidate datasets to further generate the one or more machine learning models, the one or more candidate datasets including resident data therein, at least a portion of the resident data complementary with the first dataset; facilitate selection of, from the one or more candidate datasets, one or more second datasets; join at least a portion of the complementary resident data from the one or more second datasets with the first dataset, thereby generating a joined dataset; and generate, automatically, through the joined dataset, one or more machine learning models.

2. The system of claim 1, wherein the computer system is a cognitive system.

3. The system of claim 2, wherein the cognitive system is an artificial intelligence (AI) platform, the method further comprising: the AI platform resident within the server, the AI platform in operable communication with the data repository, the AI platform comprising: a data manager comprising an autoAI table discovery module configured to facilitate execution of one or more operations by the server comprising the automatic determination of the one or more candidate datasets comprising: execute a first vector analysis, comprising: convert the first dataset into a first dataset vector; convert each stored dataset of the plurality of stored datasets into a respective stored dataset vector; compare the first dataset vector with each stored dataset vector, thereby generate one or more dataset vectors' comparisons; score each dataset vectors' comparison; and determine, subject to the one or more dataset vectors' comparisons, the respective score of the one or more dataset vectors' comparisons exceeds a predetermined first threshold, wherein the predetermined first threshold is at least partially established through training one or more machine learning dataset vector comparison algorithms.

4. The system of claim 3, further comprising one or more user interfaces, wherein the autoAI table discovery module is further configured to facilitate the execution of one or more operations by the server comprising the selection of the one or more second datasets comprising: present, via the one or more user interfaces, the one or more candidate datasets; and facilitate the selection of, from the one or more candidate datasets, the one or more second datasets.

5. The system of claim 4, wherein the autoAI table discovery module is further configured to facilitate the execution of one or more operations by the server comprising the training of the one or more machine learning dataset vector comparison algorithms comprising: analysis of historical selections of the one or more second datasets as a function of application of the one or more second datasets to respective machine learning models.

6. The system of claim 3, wherein: the first dataset is configured into a tabular arrangement including a plurality of first columns and first rows of first data; and each second dataset of the one or more second datasets is configured into a tabular arrangement including a plurality of second columns and second rows of second data.

7. The system of claim 6, wherein the autoAI table discovery module is further configured to facilitate the execution of one or more operations by the server comprising: execute a second vector analysis, comprising: convert each second column of the plurality of second columns into a respective second column vector; compare the first dataset vector with each respective second column vector, thereby generating one or more first dataset-second column vectors' comparisons; score each first dataset-second column vectors' comparison; determine, subject to the first dataset-second column vectors' comparisons, the respective score of each first dataset-second column vectors' comparison exceeds a predetermined second threshold, wherein the predetermined second threshold is at least partially established through training the one or more machine learning dataset vector comparison algorithms; determine, automatically, one or more suggested second columns for joining with the first dataset; select, from the one or more suggested second columns, one or more second columns for joining with the first dataset; and join the selected one or more second columns with the first dataset.

8. A computer program product, comprising: one or more computer readable storage media; and program instructions collectively stored on the one or more computer storage media, the program instructions comprising: program instructions to receive a first dataset; program instructions to access, automatically, the data repository, wherein the data repository includes a plurality of stored datasets; program instructions to determine, automatically, from the plurality of stored datasets, one or more candidate datasets to further generate the one or more machine learning models, the one or more candidate datasets including resident data therein, at least a portion of the resident data complementary with the first dataset; program instructions to facilitate selection of, from the one or more candidate datasets, one or more second datasets; program instructions to join at least a portion of the complementary resident data from the one or more second datasets with the first dataset, thereby generating a joined dataset; and program instructions to generate, automatically, through the joined dataset, one or more machine learning models.

9. The computer program product of claim 8, further comprising: program instructions to execute a first vector analysis, comprising: program instructions to convert the first dataset into a first dataset vector; program instructions to convert each stored dataset of the plurality of stored datasets into a respective stored dataset vector; program instructions to compare the first dataset vector with each stored dataset vector, thereby generate one or more dataset vectors' comparisons; program instructions to score each dataset vectors' comparison; and program instructions to determine, subject to the one or more dataset vectors' comparisons, the respective score of the one or more dataset vectors' comparisons exceeds a predetermined first threshold, wherein the predetermined first threshold is at least partially established through training one or more machine learning dataset vector comparison algorithms.

10. The computer program product of claim 9, further comprising: program instructions to present, via the one or more user interfaces, the one or more candidate datasets.

11. The computer program product of claim 10, further comprising: program instructions to analyze historical selections of the one or more second datasets as a function of application of the one or more second datasets to respective machine learning models.

12. The computer program product of claim 9, further comprising: program instructions to configure the first dataset into a tabular arrangement including a plurality of first columns and first rows of first data; and program instructions to configure each second dataset of the one or more second datasets into a tabular arrangement including a plurality of second
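The first vector analysis recited in the claims (convert each dataset to a vector, compare, score, keep scores above a threshold) can be sketched as follows. This is only an illustration under stated assumptions: the claims do not say how a dataset is converted into a vector or how comparisons are scored, so the token-count embedding of column names, the cosine-similarity score, the fixed threshold of 0.3, and every function and table name below are hypothetical. In the claims, the threshold is established at least partly through training rather than set by hand.

```python
import math
from collections import Counter


def dataset_vector(table):
    """Hypothetical embedding: count the tokens found in column names.

    `table` maps column name -> list of values (an assumed layout).
    """
    tokens = []
    for name in table:
        tokens.extend(name.lower().replace("_", " ").split())
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_datasets(first, repository, threshold):
    """Score every stored dataset against the first dataset and keep
    those whose score exceeds the threshold, best first."""
    first_vec = dataset_vector(first)
    scored = []
    for name, table in repository.items():
        score = cosine(first_vec, dataset_vector(table))
        if score > threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical first dataset and repository of stored datasets.
first = {"customer_id": [1, 2], "purchase_amount": [9.5, 20.0]}
repository = {
    "customer_profiles": {"customer_id": [1, 2], "customer_age": [34, 51]},
    "weather": {"station": ["A"], "rainfall_mm": [3.2]},
}

# 0.3 is a placeholder; the claims describe a trained threshold.
candidates = candidate_datasets(first, repository, threshold=0.3)
```

With these sample tables only `customer_profiles` passes the threshold, because it shares the `customer` and `id` tokens with the first dataset while `weather` shares none. The same scoring function could be applied per column to mirror the second, column-level vector analysis of claim 7.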

Assignees

IBM

Inventors

Classifications

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Supervised learning

  • Reinforcement learning

  • Convolutional networks [CNN, ConvNet]

  • Join operations

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11620550B2 cover?
Embodiments relate to a system, program product, and method for leveraging cognitive systems to facilitate the automated data table discovery for automated machine learning, and, more specifically, to leveraging a trained cognitive system to automatically search for additional data in an external data source that may be merged with an initial user-selected data table to generate a more robust m…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/258. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 4, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).