Data set generation for testing of machine learning pipelines

US11537936B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11537936-B2
Application number: US-201916250770-A
Country: US
Kind code: B2
Filing date: Jan 17, 2019
Priority date: Jan 17, 2019
Publication date: Dec 27, 2022
Grant date: Dec 27, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system may include memory containing: (i) a master data set representable in columns and rows, and (ii) a query expression. The system may include a software application configured to apply a machine learning (ML) pipeline to an input data set. The system may include a computing device configured to: obtain the master data set and the query expression; apply the query expression to the master data set to generate a test data set, where applying the query expression comprises, based on content of the query expression, generating the test data set to have one or more columns or one or more rows fewer than the master data set; apply the ML pipeline to the test data set, where applying the ML pipeline results in either generation of a test ML model from the test data set or indication of an error in the test data set; and delete the test data set from the memory.
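
The generate, test, and delete cycle described in the abstract can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: the dictionary-based query format, the function names (apply_query, toy_pipeline, run_test), and the use of a temporary JSON file to stand in for "the memory" are all hypothetical.

```python
import json
import os
import tempfile


def apply_query(master_rows, query):
    """Derive a test data set with fewer columns and/or rows than the master.

    `query` is a hypothetical format: {"columns": [...], "where": predicate}.
    """
    columns = query.get("columns")
    keep = query.get("where", lambda row: True)
    return [
        {c: row[c] for c in columns} if columns else dict(row)
        for row in master_rows
        if keep(row)
    ]


def toy_pipeline(rows):
    """Stand-in for the ML pipeline: returns a 'model' or raises on bad input.

    The 'model' is just a count of rows per category (assumes a "category" column).
    """
    if not rows:
        raise ValueError("test data set is empty")
    counts = {}
    for row in rows:
        counts[row["category"]] = counts.get(row["category"], 0) + 1
    return counts


def run_test(master_rows, query):
    """Generate the test set, store it, apply the pipeline, then delete the set."""
    test_rows = apply_query(master_rows, query)
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        # Store the test data set (a temporary file plays the role of memory).
        with os.fdopen(fd, "w") as f:
            json.dump(test_rows, f)
        # Applying the pipeline yields either a test model or an error indication.
        try:
            outcome = ("model", toy_pipeline(test_rows))
        except ValueError as err:
            outcome = ("error", str(err))
    finally:
        # Delete the stored test data set whatever the outcome was.
        os.remove(path)
    return outcome, path


master = [
    {"id": 1, "category": "network", "text": "VPN down"},
    {"id": 2, "category": "email", "text": "mail bounce"},
    {"id": 3, "category": "network", "text": "VPN slow"},
]
query = {"columns": ["id", "category"], "where": lambda r: r["category"] == "network"}
outcome, stored_path = run_test(master, query)
```

In this sketch the test data set drops the "text" column and the one "email" row, so it is smaller than the master in both dimensions, and the stored copy no longer exists once run_test returns.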

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: memory containing: (i) a master data set representable in columns and rows, wherein the columns define fields of the master data set and the rows define entries in the master data set, and (ii) a query expression; a software application configured to apply a machine learning (ML) pipeline to a test data set, wherein the ML pipeline includes a build determination phase and an ML model building phase, wherein the build determination phase decides whether to invoke the ML model building phase based on characteristics of the test data set, and wherein the ML model building phase generates an ML model from the test data set; and a computing device configured to: obtain, from the memory, the master data set and the query expression; apply the query expression to the master data set to generate the test data set from the master data set, wherein applying the query expression comprises, based on content of the query expression, generating the test data set to have one or more columns or one or more rows fewer than the master data set, wherein the query expression specifies one or more columns of the master data set, one or more rows of the master data set, or a combination thereof; store, in the memory, the test data set; apply, by way of the software application, the ML pipeline to the test data set, wherein applying the ML pipeline results in either generation of a test ML model from the test data set or indication of an error in the test data set; and in response to applying the ML pipeline to the test data set, delete the test data set from the memory.

2. The system of claim 1, wherein the memory, the software application, and the computing device are disposed within a computational instance of a remote network management platform, and wherein the master data set was derived from activity that took place on a managed network associated with the computational instance.

3. The system of claim 2, wherein the computational instance is a centralized computational instance shared by a plurality of managed networks, and wherein the managed network accesses the central computational instance by way of a particular computational instance that is dedicated to the managed network.

4. The system of claim 1, wherein obtaining the master data set comprises: determining that the query expression specifies combining two or more input files; and performing a merge or a join operation on the two or more input files to generate the master data set.

5. The system of claim 1, wherein applying the query expression to the master data set comprises: generating the test data set to have only the one or more columns that were specified, only the one or more rows that were specified, or a combination thereof.

6. The system of claim 1, wherein the query expression specifies replacing instances of a string in a particular one of the columns with a replacement string, and wherein applying the query expression to the master data set comprises: finding each of the instances of the string in the particular one of the columns; and representing, in the test data set, each of the instances of the string with the replacement string.

7. The system of claim 1, wherein the query expression specifies replacing rows of text in a particular one of the columns with one of a plurality of replacement strings, and wherein applying the query expression to the master data set comprises: representing, in the test data set, rows of text in a particular one of the columns with a string randomly selected from the plurality of replacement strings.

8. The system of claim 1, wherein the query expression specifies translating rows of text in a particular one of the columns from a first language to a second language, and wherein applying the query expression to the master data set comprises: transmitting, to an external application programming interface, the rows of text; receiving, from the external application programming interface, the rows of text as translated into the second language; and representing, in the test data set, the rows of text with the translations thereof.

9. The system of claim 1, wherein the master data set is stored in an input file, wherein the query expression specifies the input file as a source and an output file as a destination, and wherein applying the query expression to the master data set comprises: reading, from the input file, the master data set; and writing, to the output file, the test data set.

10. The system of claim 1, wherein the query expression contains a filter to be applied to a particular one of the columns, wherein the filter is based on a type of content in the particular one of the columns, and wherein applying the query expression to the master data set comprises: representing, in the test data set, only rows with entries for the particular one of the columns that match the filter.

11. The system of claim 10, wherein the filter specifies a range of values or a text string.

12. The system of claim 10, wherein the filter specifies a density for the particular one of the columns, and wherein representing, in the test data set, only rows with entries for the particular one of the columns that match the filter comprises: representing, in the test data set, rows with null and non-null values in accordance with the density.

13. The system of claim 10, wherein the filter specifies a distribution for the particular one of the columns, and wherein representing, in the test data set, only rows with entries for the particular one of the columns that match the filter comprises: representing, in the test data set, rows that exhibit values in accordance with the distribution.

14. The system of claim 10, wherein the filter specifies a user-defined operation for the particular one of the columns, and wherein representing, in the test data set, only rows with entries for the particular one of the columns that match the filter comprises: representing, in the test data set, rows that exhibit values in accordance with the user-defined operation.

15. The system of claim 1, wherein the query expression specifies a limit to rows in the test data set, and wherein generating the test data set to have one or more columns or one or more rows fewer than the master data set comprises: generating the test data set to have no more than a number of rows defined by the limit.

16. A computer-implemented method comprising: obtaining, by a computing device and from a memory, a master data set and a query expression, wherein the master data set is representable in columns and rows, and wherein the columns define fields of the master data set and the rows define entries in the master data set; applying, by the computing device, the query expression to the master data set to generate a test data set from the master data set, wherein applying the query expression comprises, based on content of the query expression, generating the test data set to have one or more columns or one or more rows fewer than the master data set, wherein the query expression specifies one or more columns of the master data set, one or more rows of the master data set, or a combination thereof; storing, by the computing device and in the memory, the test data set; applying, by the computing device, a machine learning (ML) pipeline to the test data set, wherein the ML pipeline includes a build determination phase and an ML model building phase, wherein the build determination phase decides whether to invoke the ML model building phase based on characteristics of an input dat…
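
The two-phase pipeline recited in the independent claims, together with a few of the query operations from the dependent claims (string replacement, a range filter, a row limit), might be sketched in Python as follows. Everything here is a hypothetical illustration rather than the patent's implementation: the function names, the MIN_ROWS threshold, the specific checks in the build determination phase, and the majority-label "model" are assumptions chosen to keep the example small.

```python
MIN_ROWS = 3  # assumed threshold; the claims do not specify one


def replace_string(rows, column, old, new):
    """Replace each instance of `old` in one column with `new` (cf. claim 6)."""
    return [{**row, column: str(row[column]).replace(old, new)} for row in rows]


def filter_range(rows, column, low, high):
    """Keep only rows whose value in `column` lies in [low, high] (cf. claim 11)."""
    return [row for row in rows if low <= row[column] <= high]


def limit_rows(rows, limit):
    """Keep no more than `limit` rows (cf. claim 15)."""
    return rows[:limit]


def build_determination(rows, required_columns):
    """Decide from characteristics of the data whether model building should run."""
    if len(rows) < MIN_ROWS:
        return False, f"need at least {MIN_ROWS} rows, got {len(rows)}"
    missing = [c for c in required_columns if any(c not in row for row in rows)]
    if missing:
        return False, f"missing columns: {missing}"
    return True, "ok"


def build_model(rows, label_column):
    """Toy model building phase: the 'model' is simply the most common label."""
    counts = {}
    for row in rows:
        counts[row[label_column]] = counts.get(row[label_column], 0) + 1
    return max(counts, key=counts.get)


def apply_pipeline(rows, label_column):
    """Run the build determination phase, then model building only if it passes."""
    ok, reason = build_determination(rows, [label_column])
    if not ok:
        return {"status": "error", "detail": reason}
    return {"status": "model", "majority_label": build_model(rows, label_column)}


master = [
    {"priority": 1, "text": "VPN down", "label": "network"},
    {"priority": 2, "text": "VPN slow", "label": "network"},
    {"priority": 3, "text": "mail bounce", "label": "email"},
    {"priority": 4, "text": "VPN cert expired", "label": "network"},
]

# A test set that should build a model: same rows, one string replaced.
renamed = replace_string(master, "text", "VPN", "tunnel")
good_result = apply_pipeline(renamed, "label")

# A test set that should be rejected: filtered and limited to two rows.
small = limit_rows(filter_range(master, "priority", 1, 3), 2)
bad_result = apply_pipeline(small, "label")
```

Deriving several small variants like `renamed` and `small` from one master data set makes it possible to exercise both outcomes of the pipeline, a built model and a reported error, without maintaining separate hand-made test files.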

Assignees

  • ServiceNow Inc

Inventors

Classifications

  • Tablespace storage structures; Management thereof · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Classification techniques · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • Join operations · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11537936B2 cover?
A system may include memory containing: (i) a master data set representable in columns and rows, and (ii) a query expression. The system may include a software application configured to apply a machine learning (ML) pipeline to an input data set. The system may include a computing device configured to: obtain the master data set and the query expression; apply the query expression to the master…
Who is the assignee on this patent?
ServiceNow Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 27, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).