Training of neural network based natural language processing models using dense knowledge distillation

US2021182662A1 · US · A1

Patent metadata
Publication number: US-2021182662-A1
Application number: US-201916717698-A
Country: US
Kind code: A1
Filing date: Dec 17, 2019
Priority date: Dec 17, 2019
Publication date: Jun 17, 2021
Grant date: (not shown)

Abstract

Official abstract text for this publication.

Techniques for training a first neural network (NN) model using a pre-trained second NN model are disclosed. In an example, training data is input to the first and second models. The training data includes masked tokens and unmasked tokens. In response, the first model generates a first prediction associated with a masked token and a second prediction associated with an unmasked token, and the second model generates a third prediction associated with the masked token and a fourth prediction associated with the unmasked token. The first model is trained, based at least in part on the first, second, third, and fourth predictions. In another example, a prediction associated with a masked token, a prediction associated with an unmasked token, and a prediction associated with whether two sentences of training data are adjacent sentences are received from each of the first and second models. The first model is trained using the predictions.
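
The first example in the abstract can be illustrated with a minimal sketch. It assumes the first model acts as a "student" and the pre-trained second model as a "teacher", that each model emits one vector of vocabulary scores per token, and that predictions are compared with cross entropy; the abstract itself does not fix the comparison. The names `dense_distillation_loss` and `is_masked`, the toy numbers, and the equal weighting of the two terms are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability vector (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))

def dense_distillation_loss(student_logits, teacher_logits, is_masked):
    """Return (masked_loss, unmasked_loss), each averaged over its positions.

    student_logits / teacher_logits: one list of vocabulary scores per token.
    is_masked: one boolean per token, True where the input token was masked.
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    masked, unmasked = [], []
    for s, t, m in zip(student_logits, teacher_logits, is_masked):
        # Assumption: the teacher's distribution is the target at every position.
        loss = cross_entropy(softmax(t), softmax(s))
        (masked if m else unmasked).append(loss)
    return mean(masked), mean(unmasked)

# Toy data: 3 token positions, vocabulary of 4 words; only position 1 is masked.
teacher = [[4.0, 1.0, 0.5, 0.1], [0.2, 3.0, 0.3, 0.1], [0.1, 0.2, 0.3, 5.0]]
student = [[2.0, 1.0, 1.0, 0.5], [0.5, 1.5, 0.5, 0.5], [0.3, 0.3, 0.3, 2.0]]

masked_loss, unmasked_loss = dense_distillation_loss(
    student, teacher, [False, True, False]
)
total_loss = masked_loss + unmasked_loss  # equal weighting is an assumption
print(masked_loss, unmasked_loss, total_loss)
```

The point of the sketch is that unmasked positions contribute a loss term alongside masked ones, so the first model receives a training signal from every token rather than only from the masked ones.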

First claim

Opening claim text (preview).

What is claimed is:

1. A method for training a first neural network (NN) model using a pre-trained second neural network (NN) model, the method comprising: inputting training data to both the first NN model that is to be trained and the second NN model that is pre-trained, the training data including a plurality of masked tokens and a plurality of unmasked tokens; generating, by the first NN model, a first prediction and a second prediction, the first prediction associated with a masked token of the training data, and the second prediction associated with an unmasked token of the training data; generating, by the second NN model, a third prediction and a fourth prediction, the third prediction associated with the masked token, and the fourth prediction associated with the unmasked token; and training the first NN model, based at least in part on the first prediction, the second prediction, the third prediction, and the fourth prediction.

2. The method of claim 1, wherein training the first NN model comprises: training the first NN model based at least in part on both (i) a comparison of the first prediction and the third prediction, and (ii) a comparison of the second prediction and the fourth prediction.

3. The method of claim 1, wherein training the first NN model comprises: generating a first loss function, based at least in part on a comparison of the first prediction and the third prediction; generating a second loss function, based at least in part on a comparison of the second prediction and the fourth prediction; and training the first NN model, based at least in part on the first loss function and the second loss function.

4. The method of claim 3, wherein training the first NN model comprises: tuning one or more parameters of the first NN model, to reduce the first loss [function] and the second loss function.

5. The method of claim 3, wherein generating the first loss function comprises: generating a first logit for a first probability vector associated with the first prediction; generating a second logit for a second probability vector associated with the third prediction; and generating the first loss function to be a cross entropy between (i) the first function that is based at least in part on the first logit and (ii) the second function that is based at least in part on the second logit.

6. The method of claim 5, wherein: the first function is a softmax of a ratio of the first logit and a temperature hyperparameter of the first NN model; and the second function is a softmax of a ratio of the second logit and the temperature hyperparameter of the first NN model.

7. The method of claim 1, wherein: the first prediction is in the form of a first probability vector comprising two or more corresponding probability values for two or more words, respectively, for the masked token; the second prediction is in the form of a second probability vector comprising two or more corresponding probability values for two or more words, respectively, for the unmasked token; the third prediction is in the form of a third probability vector comprising corresponding two or more probability values for two or more words, respectively, for the masked token; and the fourth prediction is in the form of a fourth probability vector comprising corresponding two or more probability values for two or more words, respectively, for the unmasked token.

8. The method of claim 1, wherein the training data comprises a pair of sentences comprising a first sentence and a second sentence, and wherein the method comprises: generating, by the first NN model, a first probability vector comprising (i) a first value indicating a probability of the second sentence being logically adjacent to the first sentence, and (ii) a second value indicating a probability of the second sentence not being logically adjacent to the first sentence; and generating, by the second NN model, a second probability vector comprising (i) a third [value] indicating a probability of the second sentence being logically adjacent to the first sentence, and (ii) a fourth value indicating a probability of the second sentence not being logically adjacent to the first sentence, wherein training the first NN model comprises training the first NN model, based at least in part on the first probability vector and the second probability vector.

9. The method of claim 8, wherein training the first NN model comprises training the first NN model, based at least in part on a cross entropy between (i) a first function that is based at least in part on the first probability vector and (ii) a second function that is based at least in part on the second probability vector.

10. A system for training a first neural network (NN) model using a pre-trained second neural network (NN) model, the system comprising: one or more processors; and a model training system executable by the one or more processors to input training data to the first model and the second model, the training data including masked and unmasked tokens; receive predictions associated with masked and unmasked tokens from each of the first and second models; and train the first model, based at least in part on the predictions associated with the masked and unmasked tokens from each of the first and second models.

11. The system of claim 10, wherein to train the first model, the model training system is to: compare (i) a first function of a first prediction for an unmasked token from the first model and (ii) a second function of a second prediction for the unmasked token from the second model; compare (i) a third function of a third prediction for a masked token from the first model and (ii) a fourth function of a fourth prediction for the masked token from the second model; and train the first model, based at least in part on both (i) the comparison of the first function and the second function and (ii) the comparison of the third function and the fourth function.

12. The system of claim 10, wherein the training data has a first sentence and a second sentence, and wherein to train the first model, the model training system is to: receive, from each of the first and second models, predictions regarding whether the first sentence and the second sentence are adjacent sentences; and train the first mode[l], based at least in part on the predictions regarding whether the first sentence and the second sentence are adjacent sentences.

13. The system of claim 10, further comprising: the first model and the second model executable by the one or more processors to generate the predictions associated with the masked and unmasked tokens of the training data.

14. The system of claim 10, wherein the first model and the second model are Natural Language Processing (NLP) models configured to perform one or more NLP tasks.

15. The system of claim 10, wherein: the first model has a first storage size that is substantially less than a second storage size of the second model; and a number of tunable parameters in the first model is substantially less than that in the second model.

16. The system of claim 10, wherein: each of the first model and the second model is a corresponding one of a GPT (generative pre-training) model, OpenAI GPT-2 model, BERT (Bidirectional Encoder Representations from Transformers) model, ELMo (Embeddings from Language Model) model, or BiLSTM (bidirectional long short-term memory network) model.

17. The system of claim 16, wherein: the first model is of a different type than the second mod…
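
The loss described in claims 5 and 6 (a cross entropy between softmaxes of logits divided by a temperature) and the sentence-adjacency comparison in claims 8 and 9 might be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the "logit" is treated as a raw pre-softmax score vector, the second (pre-trained) model's distribution is used as the target, the two terms are summed without weights, and the function names, the temperature value, and all numbers are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T); a larger T gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature, eps=1e-12):
    """Cross entropy between temperature-softened teacher and student outputs."""
    teacher_probs = softmax_with_temperature(teacher_logits, temperature)
    student_probs = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs))

T = 2.0  # hypothetical temperature; the claims do not give a value

# Token-level term (claims 5-6): scores over a toy 4-word vocabulary.
token_loss = soft_cross_entropy([2.0, 1.0, 1.0, 0.5], [4.0, 1.0, 0.5, 0.1], T)

# Sentence-level term (claims 8-9): two scores, [adjacent, not adjacent].
adjacency_loss = soft_cross_entropy([0.8, 0.2], [2.5, -1.0], T)

total_loss = token_loss + adjacency_loss  # unweighted sum is an assumption
print(token_loss, adjacency_loss, total_loss)
```

Dividing by a temperature above 1 flattens both distributions, so the first model is also pushed to match the relative probabilities the second model assigns to less likely words.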

Assignees

Adobe Inc

Inventors

Classifications

  • Combinations of networks · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • Generative networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021182662A1 cover?
Techniques for training a first neural network (NN) model using a pre-trained second NN model are disclosed. In an example, training data is input to the first and second models. The training data includes masked tokens and unmasked tokens. In response, the first model generates a first prediction associated with a masked token and a second prediction associated with an unmasked token, and the …
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 17, 2021 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).