Dataset clustering via language model prompts

US2025094538A1 · US · A1

Patent metadata
Publication number: US-2025094538-A1
Application number: US-202418589323-A
Country: US
Kind code: A1
Filing date: Feb 27, 2024
Priority date: Sep 14, 2023
Publication date: Mar 20, 2025
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title: What the patent document calls the invention.
  2. Abstract: A short plain-language summary of the technical disclosure.
  3. Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
  4. Key dates: Filing, priority, publication, and grant dates set the timeline.
  5. First independent claim: The legal scope of protection; read this for what is actually claimed.
  6. CPC / IPC classifications: Technology tags used to group this patent with similar filings.
  7. Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Various embodiments discussed herein relate to prompting a model, such as a Large Language Model (LLM), to ingest natural language clustering instructions and generate corresponding natural language clustering information, such as a cluster description and/or a cluster label without the need to generate any numeric text embeddings.
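
The idea in the abstract can be illustrated with a minimal sketch: a plain-text clustering instruction goes in, and plain-text labels and descriptions come out, with no embedding step in between. The function names, the prompt wording, and the 'label: description' reply format below are assumptions made for illustration; the publication does not specify them.

```python
# Hypothetical helpers; names, prompt wording, and reply format are assumptions.
USE_CASES = ("sentiment", "user intent", "topic of a conversation")


def build_clustering_prompt(texts: list[str], use_case: str) -> str:
    """Compose a natural language clustering instruction (no embeddings involved)."""
    if use_case not in USE_CASES:
        raise ValueError(f"unsupported use case: {use_case!r}")
    numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(texts, start=1))
    return (
        f"Group the following texts into clusters by {use_case}.\n"
        "For each cluster, reply with one line in the form 'label: description'.\n\n"
        f"{numbered}"
    )


def parse_clusters(reply: str) -> dict[str, str]:
    """Turn 'label: description' lines from a model reply into a dictionary."""
    clusters: dict[str, str] = {}
    for line in reply.splitlines():
        label, sep, description = line.partition(":")
        if sep and label.strip():  # skip lines that do not follow the format
            clusters[label.strip()] = description.strip()
    return clusters
```

In use, the string returned by build_clustering_prompt would be sent to whatever model is available, and the reply passed to parse_clusters. Lines that do not match the assumed format are ignored rather than guessed at.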

First claim

Opening claim text (preview).

The invention claimed is:

1. A system comprising: at least one computer processor; and one or more computer storage media storing computer-useable instructions that, when used by the at least one computer processor, cause the at least one computer processor to perform operations comprising: receiving a plurality of datasets, each dataset including a set of natural language characters; for each dataset, providing a representation of the set of natural language characters as a first input into a machine learning model, wherein the machine learning model generates a natural language summary of the set of natural language characters for each dataset; providing a representation of each natural language summary as a second input into the machine learning model, wherein the machine learning model generates a label associated with at least a first natural language summary based at least in part on the second input, the label including fewer natural language characters than the first natural language summary; based at least in part on the generating of the label, assigning a dataset, of the plurality of datasets, to the label; and based at least in part on the assigning of the dataset to the label, causing presentation, at a user device, of an indication of the assignment of the dataset to the label.

2. The system of claim 1, wherein each natural language summary is included among a plurality of natural language summaries, wherein the operations further comprising: parsing the plurality of natural language summaries into two or more batches; for a first batch, of the two or more batches, generating at least one of a respective label for a set of clusters or a respective description for of the set of clusters, the respective label including the label; and for every other batch, of the two or more batches that excludes the first batch, revise at least one of: the respective label or the respective description.

3. The system of claim 1, wherein each natural language summary is included among a plurality of natural language summaries, wherein the operations further comprising: parsing the plurality of natural language summaries into two or more batches; for a first batch, of the two or more batches, generating one or more of the label for a first cluster or a first description for the first cluster; and for every other batch, of the two or more batches that exclude the first batch, assign the label to a second cluster in the two or more batches.

4. The system of claim 3, wherein the operations further comprising: for every other batch, of the two or more batches that excludes the first batch, revise at least one of: the one the label or the description.

5. The system of claim 1, wherein the machine learning model generating the natural language summary of the set of natural language characters for each dataset is further based on the machine learning model ingesting a zero-shot prompt that includes an instruction to summarize a dataset according to a specific use case: the use case including one of: sentiment, user intent, and a topic of a conversation.

6. The system of claim 1, wherein the machine learning model generating the label associated with at least the first natural language summary is further based on the machine learning model ingesting a prompt that includes an instruction to generate at least one of: a group or category representing a cluster, a description of the group or category, and a label of the group or category according to a specific use case, the use case including one of: sentiment, user intent, and a topic of a conversation.

7. The system of claim 1, wherein the assigning of the dataset to the label is based on feeding the machine learning model at least one of: each dataset, each natural language summary of each dataset, a final updated label and description, and a label assignment instruction.

8. The system of claim 1, wherein the first input and the second input include zero-shot prompts and the machine learning model is not prompt-tuned or fine-tuned, and wherein the first input and the second input is not encoded in any numeric text embedding that relies on numeric space.

9. The system of claim 1, wherein the operations further comprising: subsequent to the assigning and the causing presentation, receiving a particular dataset, the particular dataset not being among the plurality of datasets; based at least in part on the providing of the set of natural language characters as the first input, the providing of each natural language summary as the second input, and the assigning of the dataset to the label, generating a score indicative of a prediction that at least a portion of the particular dataset belongs to the label; and based at least in part on the score, causing presentation, at the user device, of a second indication of the particular dataset belonging to the first label.

10. A computer-implemented method comprising: receiving a dataset; receiving a natural language prompt that includes an instruction to generate, from the dataset, at least one of: a category representing a cluster according to a particular use case, a description that summarizes the cluster of the particular use case, or a label representing a name of the cluster; in response to the receiving of the natural language prompt, generating, via a machine learning model, at least one of: the category, the description, or the label; based at least in part on the generating, assigning the dataset to the label; and based at least in part on the assigning, causing presentation, at a user device, of an indication of the assignment of the dataset to the label.

11. The computer-implemented method of claim 10, wherein the dataset includes a plurality of natural language summaries, further comprising: parsing the plurality of natural language summaries into two or more batches; for a first batch, of the two or more batches, generating a respective label for each cluster, of a plurality of clusters, and generating a description for each cluster; and for every other batch, of the two or more batches that excludes the first batch, revise at least one of: the respective label of a first cluster or the description of the first cluster.

12. The computer-implemented method of claim 11, wherein the at least one dataset includes a plurality of natural language summaries, further comprising: parsing the plurality of natural language summaries into two or more batches; for a first batch, of the two or more batches, generating a respective label for each cluster, of a plurality of clusters, and generating a description for each cluster; and for every other batch, of the two or more batches that exclude the first batch, assign one of the respective labels to a first cluster in the two or more batches.

13. The computer-implemented method of 12, further comprising: for every other batch, of the two or more batches that excludes the first batch, revise at least one of: at least one of the respective labels of the first cluster or the description of the first cluster.

14. The computer-implemented method of claim 10, wherein the dataset represents a summary of a larger dataset, and wherein the summary is generated based at least in part on the machine learning model ingesting a zero-shot prompt that includes an instruction to summarize the dataset according to the specific use case: the use case including one of: sentiment, user intent, or a topic of a conversation.

15. The computer-implemented method of claim 10, wherein the generating is further based on the machine l

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2025094538A1 cover?
Various embodiments discussed herein relate to prompting a model, such as a Large Language Model (LLM), to ingest natural language clustering instructions and generate corresponding natural language clustering information, such as a cluster description and/or a cluster label without the need to generate any numeric text embeddings.
Who is the assignee on this patent?
Microsoft Technology Licensing, LLC
What technology area does this patent fall under?
Primary CPC classification G06F18/23211. Mapped technology areas include Physics.
When was this patent published?
Published Mar 20, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).