Who is the assignee on this patent?

Microsoft Technology Licensing, LLC

What technology area does this patent fall under?

Primary CPC classification: G06N3/091 (Active learning). Mapped technology area: Physics (CPC section G).

When was this patent published?

Publication date: Mar 13, 2025 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Generating small language model via two-phase training

US2025086471A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2025086471-A1
Application number	US-202418733226-A
Country	US
Kind code	A1
Filing date	Jun 4, 2024
Priority date	Sep 11, 2023
Publication date	Mar 13, 2025
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection. Read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for generating a small language model are provided. In particular, a computing device may obtain a general dataset including a plurality of general data, annotate a subset of the general dataset based on one or more classifier metrics indicative of a quality of the general dataset, train a classifier based on the annotated subset of the general dataset and the one or more classifier metrics, analyze each general data of the general dataset to determine a score for each of the one or more classifier metrics associated with the respective general data using the trained classifier, generate a filtered general dataset by filtering the general dataset based on one or more filters, train the small language model with the filtered general dataset, generate a synthetic dataset for refining the small language model, and train the small language model with the synthetic dataset.
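The scoring and filtering steps in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the `Filter` class, the `keep_if_at_least` flag, the metric names used here, and `toy_score_fn` (a keyword-based stand-in for the trained classifier) are all hypothetical, and the publication does not say whether a given threshold acts as a minimum or a maximum.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Filter:
    """One filter per classifier metric, carrying that metric's threshold score."""
    metric: str
    threshold: float
    # Assumption: the direction of the comparison is not stated in the source.
    keep_if_at_least: bool = True

    def accepts(self, scores: dict[str, float]) -> bool:
        score = scores[self.metric]
        if self.keep_if_at_least:
            return score >= self.threshold
        return score <= self.threshold


def filter_dataset(
    dataset: list[str],
    score_fn: Callable[[str], dict[str, float]],
    filters: list[Filter],
) -> list[str]:
    """Score every item, then keep only items that pass every filter."""
    kept = []
    for item in dataset:
        scores = score_fn(item)
        if all(f.accepts(scores) for f in filters):
            kept.append(item)
    return kept


def toy_score_fn(text: str) -> dict[str, float]:
    """Hypothetical stand-in for the trained classifier (keyword and length rules)."""
    words = text.lower().split()
    return {
        "toxicity": 1.0 if "idiot" in words else 0.0,
        "completeness": min(len(words) / 8, 1.0),
    }


dataset = [
    "Water boils at 100 degrees Celsius at sea level.",
    "Buy now",
    "You are an idiot and water is wet, obviously, always.",
]
filters = [
    Filter("toxicity", 0.5, keep_if_at_least=False),  # drop items scoring above 0.5
    Filter("completeness", 0.75),                     # drop items scoring below 0.75
]
filtered = filter_dataset(dataset, toy_score_fn, filters)
print(filtered)
```

In this toy run only the first sentence survives: the second item fails the completeness threshold and the third fails the toxicity threshold. In the described method, the filtered dataset would then be used for the first training phase, before the synthetic dataset is used for the second.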

First claim

Opening claim text (preview).

What is claimed is:

1. A method for generating a small language model, the method comprising: obtaining a general dataset, the general dataset including a plurality of general data; annotating a subset of the general dataset based on one or more classifier metrics indicative of a quality of the general dataset, the subset of the general dataset being representative of the general dataset; training a classifier based on the annotated subset of the general dataset and the one or more classifier metrics; analyzing each general data of the general dataset to determine a score for each of the one or more classifier metrics associated with the respective general data using the trained classifier; generating a filtered general dataset by filtering the general dataset based on one or more filters, the one or more filters indicative of threshold scores for corresponding classifier metrics; training the small language model with the filtered general dataset; generating a synthetic dataset for refining the small language model; and subsequent to training the small language model with the filtered general dataset, training the small language model with the synthetic dataset.

2. The method of claim 1, wherein each general data of the general dataset is associated with a score for each of the one or more classifier metrics.

3. The method of claim 1, wherein the one or more classifier metrics comprise factual knowledge, everyday knowledge, scientific knowledge, human behavior, toxicity, completeness, obscenity, obscurity, commonality, reasoning, promotional content, and/or unwanted content.

4. The method of claim 1, wherein generating the filtered general dataset by filtering the general dataset based on the one or more filters comprises: generating the one or more filters for the one or more classifier metrics, each filter corresponding to a respective classifier metric and indicative of a threshold score assigned for the respective classifier metric; and filtering the general dataset based on the one or more filters.

5. The method of claim 1, wherein generating a synthetic dataset for refining the small language model comprises: identifying one or more deficit skills in the small language model; determining one or more data formats to address the one or more deficit skills; generating the one or more prompts for generating the one or more data formats; injecting sources of randomization and diversity in the one or more prompts; and generating the synthetic dataset based on the one or more prompts using a generative transformer, the synthetic dataset including the one or more data formats.

6. The method of claim 5, wherein the one or more deficit skills include any skill or topic for boosting the capability of the small language model.

7. The method of claim 5, wherein the generative transformer is a multimodal large language model.

8. The method of claim 1, further comprising: prior to training the small language model with the filtered general dataset, performing a warm start by copying weights from an existing trained model into the small language model.

9. A computing device for generating a small language model, the computing device comprising: a processor; and a memory having a plurality of instructions stored thereon that, when executed by the processor, causes the computing device to: generate a small language model, the method comprising: obtain a general dataset, the general dataset including a plurality of general data; annotate a subset of the general dataset based on one or more classifier metrics indicative of a quality of the general dataset, the subset of the general dataset being representative of the general dataset; train a classifier based on the annotated subset of the general dataset and the one or more classifier metrics; analyze each general data of the general dataset to determine a score for each of the one or more classifier metrics associated with the respective general data using the trained classifier; generate a filtered general dataset by filtering the general dataset based on one or more filters, the one or more filters indicative of threshold scores for corresponding classifier metrics; train the small language model with the filtered general dataset; generate a synthetic dataset for refining the small language model; and subsequent to training of the small language model with the filtered general dataset, train the small language model with the synthetic dataset.

10. The computing device of claim 9, wherein each general data of the general dataset is associated with a score for each of the one or more classifier metrics.

11. The computing device of claim 9, wherein the one or more classifier metrics comprise factual knowledge, everyday knowledge, scientific knowledge, human behavior, toxicity, completeness, obscenity, obscurity, commonality, reasoning, promotional content, and/or unwanted content.

12. The computing device of claim 9, wherein to generate the filtered general dataset by filtering the general dataset based on the one or more filters comprises to: generate the one or more filters for the one or more classifier metrics, each filter corresponding to a respective classifier metric and indicative of a threshold score assigned for the respective classifier metric; and filter the general dataset based on the one or more filters.

13. The computing device of claim 9, to generate a synthetic dataset for refining the small language model comprises to: identify one or more deficit skills in the small language model; determine one or more data formats to address the one or more deficit skills; generate the one or more prompts for generating the one or more data formats; inject sources of randomization and diversity in the one or more prompts; and generate the synthetic dataset based on the one or more prompts using a generative transformer, the synthetic dataset including the one or more data formats.

14. The computing device of claim 13, wherein the one or more deficit skills include any skill or topic for boosting the capability of the small language model.

15. The computing device of claim 9, wherein the plurality of instructions, when executed, further cause the computing device to: prior to training of the small language model with the filtered general dataset, perform a warm start by copying weights from an existing trained model into the small language model.

16. A computer storage medium storing computer-executable instructions that when executed cause at least one processor to perform operations, comprising: obtaining a general dataset, the general dataset including a plurality of general data; annotating a subset of the general dataset based on one or more classifier metrics indicative of a quality of the general dataset, the subset of the general dataset being representative of the general dataset; training a classifier based on the annotated subset of the general dataset and the one or more classifier metrics; analyzing each general data of the general dataset to determine a score for each of the one or more classifier metrics associated with the respective general data using the trained classifier; generating a filtered general dataset by filtering the general dataset based on one or more filters, the one or more filters indicative of threshold scores for corresponding classifier metrics; training the small language model with the filtered general dataset; generating a synthetic dataset for refining the small language model; and subsequent t…
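The synthetic-data steps recited in claim 5 (deficit skills, data formats, prompts, injected randomization) can be sketched roughly as follows. Everything named here is an assumption made for illustration: the skill and format strings, the `topics` and `audiences` pools used as sources of diversity, the prompt template, and `echo_generator`, which merely stands in for a call to a generative transformer. The claims do not specify any of these details.

```python
import random
from typing import Callable


def build_prompts(
    data_formats: dict[str, list[str]],  # deficit skill -> data formats addressing it
    topics: list[str],
    audiences: list[str],
    n_per_format: int,
    seed: int = 0,
) -> list[str]:
    """Create prompts per (skill, format) pair, varying topic and audience at random."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    prompts = []
    for skill, formats in data_formats.items():
        for fmt in formats:
            for _ in range(n_per_format):
                topic = rng.choice(topics)        # injected randomization
                audience = rng.choice(audiences)  # injected diversity
                prompts.append(
                    f"Write a {fmt} that exercises {skill}, "
                    f"about {topic}, for {audience}."
                )
    return prompts


def generate_synthetic_dataset(
    prompts: list[str], generate: Callable[[str], str]
) -> list[str]:
    """Run each prompt through a generator; `generate` is supplied by the caller."""
    return [generate(p) for p in prompts]


def echo_generator(prompt: str) -> str:
    """Hypothetical placeholder for a generative transformer call."""
    return f"[generated text for: {prompt}]"


data_formats = {
    "multi-step arithmetic": ["worked exercise"],
    "reading comprehension": ["short story with questions", "dialogue"],
}
topics = ["cooking", "astronomy", "gardening"]
audiences = ["a ten-year-old", "a college student"]

prompts = build_prompts(data_formats, topics, audiences, n_per_format=2)
synthetic = generate_synthetic_dataset(prompts, echo_generator)
print(len(synthetic))
```

Passing the generator in as a callable keeps the prompt-building logic independent of any particular model or API, which is a design choice of this sketch rather than something the claims require.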

Assignees

Microsoft Technology Licensing, LLC

Inventors

Classifications

G06N3/045 · Combinations of networks
G06N3/0475 · Generative networks
G06N3/091 (Primary) · Active learning

Patent family

Related publications grouped by family.

View patent family 94872927
