What technology area does this patent fall under?

Primary CPC classification G06N3/08. Mapped technology areas include Physics.

When was this patent published?

Publication date: Jun 17, 2021 (kind code A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Training of neural network based natural language processing models using dense knowledge distillation

US2021182662A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2021182662-A1
Application number	US-201916717698-A
Country	US
Kind code	A1
Filing date	Dec 17, 2019
Priority date	Dec 17, 2019
Publication date	Jun 17, 2021
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for training a first neural network (NN) model using a pre-trained second NN model are disclosed. In an example, training data is input to the first and second models. The training data includes masked tokens and unmasked tokens. In response, the first model generates a first prediction associated with a masked token and a second prediction associated with an unmasked token, and the second model generates a third prediction associated with the masked token and a fourth prediction associated with the unmasked token. The first model is trained, based at least in part on the first, second, third, and fourth predictions. In another example, a prediction associated with a masked token, a prediction associated with an unmasked token, and a prediction associated with whether two sentences of training data are adjacent sentences are received from each of the first and second models. The first model is trained using the predictions.
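The first example in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names (`softmax`, `cross_entropy`, `dense_distillation_loss`), the toy logits, the temperature of 2.0, the use of the pre-trained model's output as the cross-entropy target, and the equal weighting of the masked and unmasked terms are all assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """H(target, predicted) = -sum_v target(v) * log(predicted(v))."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def dense_distillation_loss(student_logits, teacher_logits, is_masked,
                            temperature=2.0):
    """Compare both models at every position, masked and unmasked.

    student_logits / teacher_logits: one list of vocabulary logits per token.
    is_masked: one bool per token position.
    Returns (masked_loss, unmasked_loss, total_loss).
    """
    masked, unmasked = [], []
    for s, t, m in zip(student_logits, teacher_logits, is_masked):
        # Assumption: the pre-trained (second) model supplies the soft target.
        loss = cross_entropy(softmax(t, temperature), softmax(s, temperature))
        (masked if m else unmasked).append(loss)
    masked_loss = sum(masked) / len(masked) if masked else 0.0
    unmasked_loss = sum(unmasked) / len(unmasked) if unmasked else 0.0
    # Assumption: the two terms are simply added with equal weight.
    return masked_loss, unmasked_loss, masked_loss + unmasked_loss


# Toy data: two token positions, a three-word vocabulary.
student_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # first model (being trained)
teacher_logits = [[3.0, 0.0, -2.0], [0.0, 0.0, 4.0]]  # second model (pre-trained)
is_masked = [True, False]

masked_loss, unmasked_loss, total_loss = dense_distillation_loss(
    student_logits, teacher_logits, is_masked)
print("masked:", masked_loss, "unmasked:", unmasked_loss, "total:", total_loss)
```

The point of the sketch is that the unmasked position contributes its own loss term instead of being ignored, which is what makes the distillation "dense". In a real training loop the total would be minimized with respect to the first model's parameters only.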

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training a first neural network (NN) model using a pre-trained second neural network (NN) model, the method comprising: inputting training data to both the first NN model that is to be trained and the second NN model that is pre-trained, the training data including a plurality of masked tokens and a plurality of unmasked tokens; generating, by the first NN model, a first prediction and a second prediction, the first prediction associated with a masked token of the training data, and the second prediction associated with an unmasked token of the training data; generating, by the second NN model, a third prediction and a fourth prediction, the third prediction associated with the masked token, and the fourth prediction associated with the unmasked token; and training the first NN model, based at least in part on the first prediction, the second prediction, the third prediction, and the fourth prediction.

2. The method of claim 1, wherein training the first NN model comprises: training the first NN model based at least in part on both (i) a comparison of the first prediction and the third prediction, and (ii) a comparison of the second prediction and the fourth prediction.

3. The method of claim 1, wherein training the first NN model comprises: generating a first loss function, based at least in part on a comparison of the first prediction and the third prediction; generating a second loss function, based at least in part on a comparison of the second prediction and the fourth prediction; and training the first NN model, based at least in part on the first loss function and the second loss function.

4. The method of claim 3, wherein training the first NN model comprises: tuning one or more parameters of the first NN model, to reduce the first loss and the second loss function.

5. The method of claim 3, wherein generating the first loss function comprises: generating a first logit for a first probability vector associated with the first prediction; generating a second logit for a second probability vector associated with the third prediction; and generating the first loss function to be a cross entropy between (i) the first function that is based at least in part on the first logit and (ii) the second function that is based at least in part on the second logit.

6. The method of claim 5, wherein: the first function is a softmax of a ratio of the first logit and a temperature hyperparameter of the first NN model; and the second function is a softmax of a ratio of the second logit and the temperature hyperparameter of the first NN model.

7. The method of claim 1, wherein: the first prediction is in the form of a first probability vector comprising two or more corresponding probability values for two or more words, respectively, for the masked token; the second prediction is in the form of a second probability vector comprising two or more corresponding probability values for two or more words, respectively, for the unmasked token; the third prediction is in the form of a third probability vector comprising corresponding two or more probability values for two or more words, respectively, for the masked token; and the fourth prediction is in the form of a fourth probability vector comprising corresponding two or more probability values for two or more words, respectively, for the unmasked token.

8. The method of claim 1, wherein the training data comprises a pair of sentences comprising a first sentence and a second sentence, and wherein the method comprises: generating, by the first NN model, a first probability vector comprising (i) a first value indicating a probability of the second sentence being logically adjacent to the first sentence, and (ii) a second value indicating a probability of the second sentence not being logically adjacent to the first sentence; and generating, by the second NN model, a second probability vector comprising (i) a third indicating a probability of the second sentence being logically adjacent to the first sentence, and (ii) a fourth value indicating a probability of the second sentence not being logically adjacent to the first sentence, wherein training the first NN model comprises training the first NN model, based at least in part on the first probability vector and the second probability vector.

9. The method of claim 8, wherein training the first NN model comprises training the first NN model, based at least in part on a cross entropy between (i) a first function that is based at least in part on the first probability vector and (ii) a second function that is based at least in part on the second probability vector.

10. A system for training a first neural network (NN) model using a pre-trained second neural network (NN) model, the system comprising: one or more processors; and a model training system executable by the one or more processors to input training data to the first model and the second model, the training data including masked and unmasked tokens; receive predictions associated with masked and unmasked tokens from each of the first and second models; and train the first model, based at least in part on the predictions associated with the masked and unmasked tokens from each of the first and second models.

11. The system of claim 10, wherein to train the first model, the model training system is to: compare (i) a first function of a first prediction for an unmasked token from the first model and (ii) a second function of a second prediction for the unmasked token from the second model; compare (i) a third function of a third prediction for a masked token from the first model and (ii) a fourth function of a fourth prediction for the masked token from the second model; and train the first model, based at least in part on both (i) the comparison of the first function and the second function and (ii) the comparison of the third function and the fourth function.

12. The system of claim 10, wherein the training data has a first sentence and a second sentence, and wherein to train the first model, the model training system is to: receive, from each of the first and second models, predictions regarding whether the first sentence and the second sentence are adjacent sentences; and train the first mode, based at least in part on the predictions regarding whether the first sentence and the second sentence are adjacent sentences.

13. The system of claim 10, further comprising: the first model and the second model executable by the one or more processors to generate the predictions associated with the masked and unmasked tokens of the training data.

14. The system of claim 10, wherein the first model and the second model are Natural Language Processing (NLP) models configured to perform one or more NLP tasks.

15. The system of claim 10, wherein: the first model has a first storage size that is substantially less than a second storage size of the second model; and a number of tunable parameters in the first model is substantially less than that in the second model.

16. The system of claim 10, wherein: each of the first model and the second model is a corresponding one of a GPT (generative pre-training) model, OpenAI GPT-2 model, BERT (Bidirectional Encoder Representations from Transformers) model, ELMo (Embeddings from Language Model) model, or BiLSTM (bidirectional long short-term memory network) model.

17. The system of claim 16, wherein: the first model is of a different type than the second mod
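Claims 3 to 6 and claim 9, read together, describe temperature-softened cross-entropy losses. One possible way to write them down is given here. The notation is an assumption introduced for illustration and does not appear in the claim text: $z^{(1)}_i$ and $z^{(2)}_i$ are the logits of the first and second models at token position $i$, $T$ is the temperature hyperparameter, $M$ and $U$ are the sets of masked and unmasked positions, $a^{(1)}$ and $a^{(2)}$ are the two sentence-adjacency probability vectors of claim 8, and $g$ stands for the function that claim 9 leaves unspecified. Taking the second (pre-trained) model's distribution as the target of the cross entropy is also an assumption, since the claims only say "a cross entropy between" the two functions.

```latex
\begin{aligned}
p^{(1)}_i &= \operatorname{softmax}\!\left(z^{(1)}_i / T\right), \qquad
p^{(2)}_i = \operatorname{softmax}\!\left(z^{(2)}_i / T\right) \\
H(p, q) &= -\sum_{v} p(v)\,\log q(v) \\
\mathcal{L}_{\text{masked}} &= \sum_{i \in M} H\!\left(p^{(2)}_i,\; p^{(1)}_i\right), \qquad
\mathcal{L}_{\text{unmasked}} = \sum_{i \in U} H\!\left(p^{(2)}_i,\; p^{(1)}_i\right) \\
\mathcal{L}_{\text{adjacent}} &= H\!\left(g\!\left(a^{(2)}\right),\; g\!\left(a^{(1)}\right)\right)
\end{aligned}
```

Under this reading, claim 4 corresponds to tuning the first model's parameters so that $\mathcal{L}_{\text{masked}}$ and $\mathcal{L}_{\text{unmasked}}$ decrease. How the terms are weighted against each other is not stated in the previewed claims.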

Assignees

Adobe Inc

Inventors

Classifications

G06N3/045
Combinations of networks · CPC title
G06N3/0895
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
G06N3/0495
Quantised networks; Sparse networks; Compressed networks · CPC title
G06N3/0475
Generative networks · CPC title
G06N3/0442
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

Patent family

Related publications grouped by family.

View patent family 76318191

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021182662A1 cover?: Techniques for training a first neural network (NN) model using a pre-trained second NN model are disclosed. In an example, training data is input to the first and second models. The training data includes masked tokens and unmasked tokens. In response, the first model generates a first prediction associated with a masked token and a second prediction associated with an unmasked token, and the …
Who is the assignee on this patent?: Adobe Inc