Who is the assignee on this patent?

Microsoft Technology Licensing LLC

What technology area does this patent fall under?

Primary CPC classification G06F40/56. Mapped technology areas include Physics.

When was this patent published?

Publication date: Dec 30, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Pre-training a unified natural language model with corrupted span and replaced token detection

US12511498B2 · US · B2

Patent metadata
Field	Value
Publication number	US-12511498-B2
Application number	US-202217970174-A
Country	US
Kind code	B2
Filing date	Oct 20, 2022
Priority date	Aug 16, 2022
Publication date	Dec 30, 2025
Grant date	Dec 30, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods are provided for training and using a novel unified language foundation model. An encoder-decoder natural language model is obtained and various training data is obtained and used for training. The training process integrates a combination of replaced token detection, corrupted span reconstruction, and disentangled attention methodologies to produce a unified encoder-decoder model. The trained model is trained for performing both natural language understanding (NLU) tasks and natural language generation (NLG) tasks. Attention applied to the model is applied discretely to segmented chunks of encoded data during processing to improve the efficiency of applying attention by the model.
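The two training signals named in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the sentinel naming (`<extra_id_N>`) is borrowed from common span-corruption practice and is an assumption, as are the function names `corrupt_spans` and `rtd_labels`. The first function builds an encoder input and a reconstruction target from masked spans; the second produces per-token labels marking which positions were replaced.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each half-open (start, end) span with a single sentinel.

    Returns (encoder_input, decoder_target). The sentinel format is a
    hypothetical choice for illustration, not taken from the patent.
    """
    encoder_input: List[str] = []
    decoder_target: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end > len(tokens) or start >= end:
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        encoder_input.extend(tokens[cursor:start])  # keep uncorrupted text
        encoder_input.append(sentinel)              # hide the span
        decoder_target.append(sentinel)             # target names the span...
        decoder_target.extend(tokens[start:end])    # ...then spells it out
        cursor = end
    encoder_input.extend(tokens[cursor:])
    return encoder_input, decoder_target


def rtd_labels(original: List[str], modified: List[str]) -> List[int]:
    """Replaced token detection labels: 1 where a token differs, else 0."""
    if len(original) != len(modified):
        raise ValueError("sequences must have the same length")
    return [int(o != m) for o, m in zip(original, modified)]


tokens = "the quick brown fox jumps over the lazy dog".split()
enc, tgt = corrupt_spans(tokens, [(1, 3), (6, 8)])
print(enc)  # ['the', '<extra_id_0>', 'fox', 'jumps', 'over', '<extra_id_1>', 'dog']
print(tgt)  # ['<extra_id_0>', 'quick', 'brown', '<extra_id_1>', 'the', 'lazy']
print(rtd_labels(["the", "quick", "brown"], ["the", "slow", "brown"]))  # [0, 1, 0]
```

In this sketch the reconstruction target would be used as a generation objective and the labels as a per-token classification objective; how the patent combines the two is described in the claims, not here.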

First claim

Opening claim text (preview).

What is claimed is:
1. A computer-implemented method for performing a two-step pre-training process for a natural language model, the method comprising: accessing the natural language model; obtaining a set of training data; creating a set of tokens for the training data; as a part of a first step of the two-step pre-training process, generating corrupted span data from the set of tokens by masking a subset of the set of tokens; as a part of a second step of the two-step pre-training process, generating replacement token detection span data by replacing the subset of masked tokens in the corrupted span data with a set of ambiguous tokens; incrementally modifying the corrupted span data by replacing, in an incremental manner, other masked tokens in the corrupted span data with replacement tokens, wherein said incremental modification of the corrupted span data is incrementally performed until a total number of the replacement tokens that are included in the corrupted span data reaches at least a specified percentage relative to a total number of tokens included in the corrupted span data, and wherein the specified percentage for the total number of replacement tokens relative to the total number of tokens is at least 10%; and training an encoder of the natural language model with the replacement token detection span data and the incrementally modified corrupted span data.
2. The computer-implemented method of claim 1, further comprising: using the trained natural language model to perform a natural language generation task, the natural language generation task comprising abstractive document summarization, conversational summarization, data to text, cross-lingual summarization, or multi-lingual question answering.
3. The computer-implemented method of claim 1, wherein the encoder is trained with the corrupted span data first and the replacement token detection span data second.
4. The computer-implemented method of claim 1, wherein the encoder is trained with the corrupted span data and the replacement token detection span data simultaneously.
5. The computer-implemented method of claim 1, further comprising: training a decoder of the natural language model with the corrupted span data.
6. The computer-implemented method of claim 1, wherein the masked subset of the set of tokens consists of between 1% and 15% of the set of tokens.
7. The computer-implemented method of claim 1, further comprising: applying disentangled attention to the set of training data when generating the set of tokens.
8. The computer-implemented method of claim 1, further comprising: using the trained natural language model to perform a natural language understanding task, the natural language understanding task comprising sentence classification, multi-lingual sentence classification, or multi-lingual question answer.
9. A computer system comprising: one or more processors; and one or more hardware storage devices that store instructions that are executable by the one or more processors to cause the computer system to: access a natural language model; obtain a set of training data; create a set of tokens for the training data; as a part of a first step of the two-step pre-training process, generate corrupted span data from the set of tokens by masking a subset of the set of tokens; as a part of a second step of the two-step pre-training process, generate replacement token detection span data by replacing the subset of masked tokens in the corrupted span data with a set of ambiguous tokens; incrementally modify the corrupted span data by replacing, in an incremental manner, other masked tokens in the corrupted span data with replacement tokens wherein said incremental modification of the corrupted span data is incrementally performed until a total number of the replacement tokens that are included in the corrupted span data reaches at least a specified percentage relative to a total number of tokens included in the corrupted span data, and wherein the specified percentage for the total number of replacement tokens relative to the total number of tokens is at least 10%; and train an encoder of the natural language model with the replacement token detection span data and the corrupted span data.
10. The computer system of claim 9, wherein the instructions are further executable to cause the computer system to: use the trained natural language model to perform a natural language generation task, wherein the natural language generation task includes abstractive document summarization.
11. The computer system of claim 9, wherein the instructions are further executable to cause the computer system to: use the trained natural language model to perform a natural language generation task, wherein the natural language generation task includes conversational summarization.
12. The computer system of claim 9, wherein the instructions are further executable to cause the computer system to: use the trained natural language model to perform a natural language generation task, wherein the natural language generation task includes data to text.
13. The computer system of claim 9, wherein the instructions are further executable to cause the computer system to: use the trained natural language model to perform a natural language generation task, wherein the natural language generation task includes cross-lingual summarization.
14. The computer system of claim 9, wherein the instructions are further executable to cause the computer system to: use the trained natural language model to perform a natural language generation task, wherein the natural language generation task includes multi-lingual question answering.
15. The computer system of claim 9, wherein the instructions are further executable to cause the computer system to: use the trained natural language model to perform a natural language generation task, wherein the natural language generation task includes abstractive document summarization or conversational summarization.
16. The computer system of claim 9, wherein the instructions are further executable to cause the computer system to: use the trained natural language model to perform a natural language generation task, wherein the natural language generation task includes abstractive document summarization or conversational summarization or data to text.
17. One or more hardware storage devices that store instructions that are executable by one or more processors to cause the one or more processors to: access a natural language model; obtain a set of training data; create a set of tokens for the training data; as a part of a first step of the two-step pre-training process, generate corrupted span data from the set of tokens by masking a subset of the set of tokens; as a part of a second step of the two-step pre-training process, generate replacement token detection span data by replacing the subset of masked tokens in the corrupted span data with a set of ambiguous tokens; incrementally modify the corrupted span data by replacing, in an incremental manner, other masked tokens in the corrupted span data with replacement tokens, wherein said incremental modification of the corrupted span data is incrementally performed until a total number of the replacement tokens that are included in the corrupted span data reaches at least a specified percentage relative to a total number of tokens included in the corrupted span data, and wherein the specified percentage for the total number of replacement tokens relative to the total number of tokens relat…
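The incremental replacement step recited in the independent claims can be read as a loop with a stopping threshold. The sketch below is one possible reading, not the claimed method itself: the `[MASK]` symbol, the left-to-right replacement order, the function name `incrementally_replace`, and the `sample_replacement` callback (a stand-in for whatever produces replacement tokens) are all assumptions. Only the "at least 10%" threshold comes from the claim text.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, not specified in the claims


def incrementally_replace(
    corrupted: List[str],
    sample_replacement: Callable[[int], str],
    min_percent: int = 10,  # claim language: "at least 10%"
) -> Tuple[List[str], List[int], bool]:
    """Replace masked positions one at a time until replacements make up
    at least min_percent of all tokens in the sequence.

    Returns (modified sequence, replaced positions, threshold reached?).
    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    out = list(corrupted)
    total = len(out)
    replaced: List[int] = []
    mask_positions = [i for i, tok in enumerate(out) if tok == MASK]
    for pos in mask_positions:  # assumed order: left to right
        if len(replaced) * 100 >= min_percent * total:
            break  # threshold met, leave remaining masks untouched
        out[pos] = sample_replacement(pos)
        replaced.append(pos)
    reached = len(replaced) * 100 >= min_percent * total
    return out, replaced, reached


# Usage: 20 tokens with masks at positions 2, 5, 9 and 14.
seq = [f"t{i}" for i in range(20)]
for p in (2, 5, 9, 14):
    seq[p] = MASK
out, replaced, reached = incrementally_replace(seq, lambda pos: f"alt_{pos}")
print(replaced, reached)  # [2, 5] True
print(out.count(MASK))    # 2
```

With 20 tokens, two replacements reach the 10% threshold, so the remaining two masks are left in place. If a sequence has too few masked positions to reach the threshold, this sketch returns `reached` as `False` rather than inventing additional replacements.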

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

G06F40/40
Processing or translation of natural language (natural language analysis G06F40/20; semantic analysis G06F40/30) · CPC title
G06F40/149
Adaptation of the text data for streaming purposes, e.g. Efficient XML Interchange [EXI] format · CPC title
G06F40/284
Lexical analysis, e.g. tokenisation or collocates · CPC title
G06F40/51
Translation evaluation · CPC title
G06F40/56 (Primary)
Natural language generation · CPC title

Patent family

Related publications grouped by family.

View patent family 89906927
