Data health evaluation using generative language models

US12579115B2 · US · B2

Patent metadata
Publication number: US-12579115-B2
Application number: US-202318374991-A
Country: US
Kind code: B2
Filing date: Sep 29, 2023
Priority date: Sep 29, 2023
Publication date: Mar 17, 2026
Grant date: Mar 17, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The disclosed concepts relate to leveraging a language model to identify data health issues in a data set. One example method involves accessing a data set. The example method also involves, using an automated evaluation planning agent, inputting a prompt to generate a data evaluation plan for the data set to a generative language model, the prompt including context describing the data set. The example method also involves receiving the data evaluation plan generated by the generative language model and identifying one or more data health issues in the data set by performing the data evaluation plan using an automated evaluation plan execution agent.
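The flow in the abstract (build a prompt with context, obtain a plan, carry out the plan) can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: `stub_model` is a hypothetical stand-in for the generative language model, the plan format (`action:field` lines) is invented for the example, and the execution agent dispatches to local checks rather than running model-written code.

```python
from typing import Callable

# Hypothetical data set: a list of records. Not taken from the patent.
ROWS = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 2, "age": 51},
]

def describe(rows: list[dict]) -> str:
    """Build the prompt context: per-field types, min, max, and unique count."""
    lines = []
    for field in rows[0]:
        values = [r[field] for r in rows if r[field] is not None]
        kinds = sorted({type(v).__name__ for v in values})
        lines.append(
            f"{field}: types={kinds} min={min(values, default=None)} "
            f"max={max(values, default=None)} unique={len(set(values))}"
        )
    return "\n".join(lines)

def stub_model(prompt: str) -> str:
    """Assumed stand-in for the language model API; returns a fixed plan."""
    return "check_missing\ncheck_duplicates:id"

def planning_agent(rows: list[dict], model: Callable[[str], str]) -> list[str]:
    """Send a prompt with data set context and parse the returned plan."""
    prompt = "Generate a data evaluation plan for this data set.\n" + describe(rows)
    return model(prompt).splitlines()

def execution_agent(rows: list[dict], plan: list[str]) -> list[str]:
    """Carry out each plan action and collect the data health issues found."""
    issues = []
    for action in plan:
        name, _, field = action.partition(":")
        if name == "check_missing":
            for f in rows[0]:
                n = sum(1 for r in rows if r[f] is None)
                if n:
                    issues.append(f"missing values in '{f}': {n}")
        elif name == "check_duplicates":
            values = [r[field] for r in rows]
            dupes = len(values) - len(set(values))
            if dupes:
                issues.append(f"duplicate values in '{field}': {dupes}")
    return issues

plan = planning_agent(ROWS, stub_model)
issues = execution_agent(ROWS, plan)
print(issues)
```

Passing the model as a plain callable keeps the sketch runnable offline; a real client call could be swapped in for `stub_model` without changing either agent.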

First claim

Opening claim text (preview).

The invention claimed is:

1. A method performed on a computing device, the method comprising: accessing a data set; by an automated evaluation planning agent: generating a prompt requesting generation of a data evaluation plan for the data set, the prompt having context describing the data set; invoking an application programming interface that provides the prompt to a generative language model; and receiving, via the application programming interface, the data evaluation plan generated by the generative language model; invoking the application programming interface with another prompt instructing the generative language model to write code to implement one or more data evaluation actions of the data evaluation plan; receiving the code from the generative language model; and executing the code to implement the data evaluation plan, the executing identifying one or more data health issues in the data set by performing the one or more data evaluation actions.

2. The method of claim 1, wherein the one or more data health issues include one or more of invalid values in the data set, inconsistent formats in the data set, inconsistent semantic types in the data set, missing values in the data set, outliers in the data set, duplicate unique values in the data set, or inconsistent units in the data set.

3. The method of claim 2, further comprising: invoking the application programming interface with another prompt instructing the generative language model to generate a summary of the data set; and inputting the summary of the data set to the generative language model as the context describing the data set.

4. The method of claim 3, further comprising, by an automated summarization agent: invoking the application programming interface with a further prompt instructing the generative language model to generate one or more annotations for the data set using a name of the data set, a name of a field of the data set, or values in the data set as context; receiving the one or more annotations for the data set from the generative language model; and including the one or more annotations in the summary, wherein the one or more annotations produced by the generative language model are employed as the context for generating the data evaluation plan.

5. The method of claim 4, the annotations including a semantic description of the data set produced by the generative language model, semantic types of fields of the data set produced by the generative language model, and textual descriptions of the fields produced by the generative language model.

6. The method of claim 1, further comprising: by an automated aggregation and scoring agent, determining a data health score for the data set based at least on the one or more data health issues.

7. The method of claim 6, the data health score being determined by the automated aggregation and scoring agent using at least one of a severity dictionary or a regression model.

8. The method of claim 1, wherein the code written by the generative language model obtains samples from the data set and performs the one or more data evaluation actions of the data evaluation plan on the samples.

9. The method of claim 8, wherein the code performs one or more data cleaning actions on the data set.

10. The method of claim 9, wherein the one or more data cleaning actions include removing values from the data set or changing values in the data set.

11. The method of claim 10, further comprising, by the automated evaluation plan execution agent: invoking the application programming interface with a further prompt instructing the generative language model to determine whether the one or more data cleaning actions improve data quality of the data set; and responsive to a response from the generative language model indicating that a particular data cleaning action does not improve the data quality of the data set, performing a different data cleaning action on the data set.

12. The method of claim 1, wherein the generative language model comprises a transformer decoder neural network.

13. The method of claim 12, further comprising: performing pruning or distillation on another generative language model having another transformer decoder neural network to obtain the generative language model, the generative language model having fewer parameters than the another generative language model.

14. A system comprising: a hardware processing unit; and a storage resource storing computer-readable instructions which, when executed by the hardware processing unit, cause the system to: access a data set; generate a prompt requesting a generative language model to generate a data evaluation plan for the data set, the prompt including a summary of the data set as context; invoke an application programming interface that provides the prompt to the generative language model; receive the data evaluation plan from the generative language model, the data evaluation plan including one or more data evaluation actions; invoke the application programming interface with another prompt instructing the generative language model to write code to implement the one or more data evaluation actions of the data evaluation plan; receive the code from the generative language model; and execute the code, wherein the code, when executed, implements the data evaluation plan by performing the one or more data evaluation actions, the one or more data evaluation actions identifying one or more data health issues in the data set.

15. The system of claim 14, wherein the computer-readable instructions, when executed by the hardware processing unit, cause the system to: invoke the application programming interface with another prompt instructing the generative language model to generate annotations of the data set; and provide the annotations as context to the generative language model for generation of the data evaluation plan.

16. The system of claim 15, wherein the annotations include a semantic description of the data set produced by the generative language model, semantic types of fields of the data set produced by the generative language model, and textual descriptions of the fields produced by the generative language model.

17. The system of claim 16, wherein the computer-readable instructions, when executed by the hardware processing unit, cause the system to: include, in the prompt, at least: data types of fields of the data set, and statistics for a particular field of the data set.

18. The system of claim 17, wherein the statistics include a minimum value, maximum value, and number of unique values of the particular field.

19. A computer-readable storage medium storing computer-readable instructions which, when executed by a processing unit, cause the processing unit to perform acts comprising: accessing a data set; generating a prompt instructing a generative language model to generate a data evaluation plan for the data set, the prompt including context describing the data set; invoking an application programming interface that provides the prompt to the generative language model; receiving the data evaluation plan produced by the generative language model; invoking the application programming interface with another prompt instructing the generative language model to write code to implement one or more data evaluation actions of the data evaluation plan; receiving the code from the generative language model; and executing the code to imp
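Claims 6 and 7 mention a data health score derived from a severity dictionary but do not give the dictionary's contents or a formula. The sketch below shows one way such a score might be computed; the weights, the issue names, and the normalization to a 0-100 scale are all assumptions made for illustration.

```python
# Hypothetical severity weights per issue type; the claims name a
# "severity dictionary" but do not specify its entries.
SEVERITY = {
    "missing_values": 2.0,
    "duplicate_unique_values": 3.0,
    "outliers": 1.0,
}

def health_score(issue_counts: dict[str, int], total_rows: int) -> float:
    """Return a score in [0, 100]; 100 means no issues were found.

    The formula is an assumption: a weighted issue count, normalized by the
    worst case in which every row has an issue of the highest severity.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # Issue types missing from the dictionary get a default weight of 1.0.
    penalty = sum(SEVERITY.get(kind, 1.0) * n for kind, n in issue_counts.items())
    max_penalty = max(SEVERITY.values()) * total_rows
    return round(100.0 * max(0.0, 1.0 - penalty / max_penalty), 1)

score = health_score(
    {"missing_values": 5, "duplicate_unique_values": 2}, total_rows=100
)
print(score)  # 94.7
```

A regression model, the alternative named in claim 7, would replace the fixed weights with learned ones; that variant is not shown here.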

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

  • Plan optimisation · CPC title

  • Discourse or dialogue representation · CPC title

  • Methods for reducing search complexity, pruning · CPC title

  • using artificial neural networks · CPC title

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12579115B2 cover?
The disclosed concepts relate to leveraging a language model to identify data health issues in a data set. One example method involves accessing a data set. The example method also involves, using an automated evaluation planning agent, inputting a prompt to generate a data evaluation plan for the data set to a generative language model, the prompt including context describing the data set. The…
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 17, 2026 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).