Granular neural network architecture search over low-level primitives
US-2024428071-A1 · Dec 26, 2024 · US
US2021182662A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021182662-A1 |
| Application number | US-201916717698-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 17, 2019 |
| Priority date | Dec 17, 2019 |
| Publication date | Jun 17, 2021 |
| Grant date | — |
Official abstract text for this publication.
Techniques for training a first neural network (NN) model using a pre-trained second NN model are disclosed. In an example, training data is input to the first and second models. The training data includes masked tokens and unmasked tokens. In response, the first model generates a first prediction associated with a masked token and a second prediction associated with an unmasked token, and the second model generates a third prediction associated with the masked token and a fourth prediction associated with the unmasked token. The first model is trained, based at least in part on the first, second, third, and fourth predictions. In another example, a prediction associated with a masked token, a prediction associated with an unmasked token, and a prediction associated with whether two sentences of training data are adjacent sentences are received from each of the first and second models. The first model is trained using the predictions.
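The training signal described in the abstract can be sketched as follows. This is a minimal illustration, not the publication's implementation: the helper names (`softmax`, `cross_entropy`, `distillation_loss`), the toy three-word vocabulary, and the equal weighting of the masked and unmasked terms are all assumptions made for the example. The first (student) model and the second (pre-trained teacher) model each emit one logit vector per token, and the student is scored against the teacher at both masked and unmasked positions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, is_masked):
    """Return (masked_loss, unmasked_loss, total), each averaged per position.

    student_logits / teacher_logits: one list of vocabulary logits per token.
    is_masked: one bool per token, True where the input token was masked.
    """
    masked, unmasked = [], []
    for s, t, m in zip(student_logits, teacher_logits, is_masked):
        # The teacher distribution is the target; the student is the prediction.
        loss = cross_entropy(softmax(t), softmax(s))
        (masked if m else unmasked).append(loss)
    masked_loss = sum(masked) / len(masked) if masked else 0.0
    unmasked_loss = sum(unmasked) / len(unmasked) if unmasked else 0.0
    # Assumption: the two terms are weighted equally; the source gives no weights.
    return masked_loss, unmasked_loss, masked_loss + unmasked_loss


# Toy example: two tokens, three-word vocabulary, first token masked.
student = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]
teacher = [[2.5, 0.3, 0.0], [0.0, 0.1, 3.5]]
mask_flags = [True, False]

masked_loss, unmasked_loss, total_loss = distillation_loss(student, teacher, mask_flags)
print(masked_loss, unmasked_loss, total_loss)
```

In a real training loop the total would be minimized by adjusting the student's parameters, so the student's predictions move toward the teacher's at every position, not only at the masked ones.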
Opening claim text (preview).
What is claimed is:
1. A method for training a first neural network (NN) model using a pre-trained second neural network (NN) model, the method comprising: inputting training data to both the first NN model that is to be trained and the second NN model that is pre-trained, the training data including a plurality of masked tokens and a plurality of unmasked tokens; generating, by the first NN model, a first prediction and a second prediction, the first prediction associated with a masked token of the training data, and the second prediction associated with an unmasked token of the training data; generating, by the second NN model, a third prediction and a fourth prediction, the third prediction associated with the masked token, and the fourth prediction associated with the unmasked token; and training the first NN model, based at least in part on the first prediction, the second prediction, the third prediction, and the fourth prediction.
2. The method of claim 1, wherein training the first NN model comprises: training the first NN model based at least in part on both (i) a comparison of the first prediction and the third prediction, and (ii) a comparison of the second prediction and the fourth prediction.
3. The method of claim 1, wherein training the first NN model comprises: generating a first loss function, based at least in part on a comparison of the first prediction and the third prediction; generating a second loss function, based at least in part on a comparison of the second prediction and the fourth prediction; and training the first NN model, based at least in part on the first loss function and the second loss function.
4. The method of claim 3, wherein training the first NN model comprises: tuning one or more parameters of the first NN model, to reduce the first loss function and the second loss function.
5. The method of claim 3, wherein generating the first loss function comprises: generating a first logit for a first probability vector associated with the first prediction; generating a second logit for a second probability vector associated with the third prediction; and generating the first loss function to be a cross entropy between (i) the first function that is based at least in part on the first logit and (ii) the second function that is based at least in part on the second logit.
6. The method of claim 5, wherein: the first function is a softmax of a ratio of the first logit and a temperature hyperparameter of the first NN model; and the second function is a softmax of a ratio of the second logit and the temperature hyperparameter of the first NN model.
7. The method of claim 1, wherein: the first prediction is in the form of a first probability vector comprising two or more corresponding probability values for two or more words, respectively, for the masked token; the second prediction is in the form of a second probability vector comprising two or more corresponding probability values for two or more words, respectively, for the unmasked token; the third prediction is in the form of a third probability vector comprising corresponding two or more probability values for two or more words, respectively, for the masked token; and the fourth prediction is in the form of a fourth probability vector comprising corresponding two or more probability values for two or more words, respectively, for the unmasked token.
8. The method of claim 1, wherein the training data comprises a pair of sentences comprising a first sentence and a second sentence, and wherein the method comprises: generating, by the first NN model, a first probability vector comprising (i) a first value indicating a probability of the second sentence being logically adjacent to the first sentence, and (ii) a second value indicating a probability of the second sentence not being logically adjacent to the first sentence; and generating, by the second NN model, a second probability vector comprising (i) a third value indicating a probability of the second sentence being logically adjacent to the first sentence, and (ii) a fourth value indicating a probability of the second sentence not being logically adjacent to the first sentence, wherein training the first NN model comprises training the first NN model, based at least in part on the first probability vector and the second probability vector.
9. The method of claim 8, wherein training the first NN model comprises training the first NN model, based at least in part on a cross entropy between (i) a first function that is based at least in part on the first probability vector and (ii) a second function that is based at least in part on the second probability vector.
10. A system for training a first neural network (NN) model using a pre-trained second neural network (NN) model, the system comprising: one or more processors; and a model training system executable by the one or more processors to input training data to the first model and the second model, the training data including masked and unmasked tokens; receive predictions associated with masked and unmasked tokens from each of the first and second models; and train the first model, based at least in part on the predictions associated with the masked and unmasked tokens from each of the first and second models.
11. The system of claim 10, wherein to train the first model, the model training system is to: compare (i) a first function of a first prediction for an unmasked token from the first model and (ii) a second function of a second prediction for the unmasked token from the second model; compare (i) a third function of a third prediction for a masked token from the first model and (ii) a fourth function of a fourth prediction for the masked token from the second model; and train the first model, based at least in part on both (i) the comparison of the first function and the second function and (ii) the comparison of the third function and the fourth function.
12. The system of claim 10, wherein the training data has a first sentence and a second sentence, and wherein to train the first model, the model training system is to: receive, from each of the first and second models, predictions regarding whether the first sentence and the second sentence are adjacent sentences; and train the first model, based at least in part on the predictions regarding whether the first sentence and the second sentence are adjacent sentences.
13. The system of claim 10, further comprising: the first model and the second model executable by the one or more processors to generate the predictions associated with the masked and unmasked tokens of the training data.
14. The system of claim 10, wherein the first model and the second model are Natural Language Processing (NLP) models configured to perform one or more NLP tasks.
15. The system of claim 10, wherein: the first model has a first storage size that is substantially less than a second storage size of the second model; and a number of tunable parameters in the first model is substantially less than that in the second model.
16. The system of claim 10, wherein: each of the first model and the second model is a corresponding one of a GPT (generative pre-training) model, OpenAI GPT-2 model, BERT (Bidirectional Encoder Representations from Transformers) model, ELMo (Embeddings from Language Model) model, or BiLSTM (bidirectional long short-term memory network) model.
17. The system of claim 16, wherein: the first model is of a different type than the second mod
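Claims 5, 6, and 9 describe a loss built from a cross entropy between softmax outputs, where each logit is first divided by a temperature hyperparameter. The sketch below illustrates that computation under stated assumptions: the function names, the toy logit values, the temperature of 2.0, and the use of the teacher distribution as the cross-entropy target are illustrative choices, and the sentence-adjacency example assumes both models expose two raw logits (adjacent, not adjacent) before normalization.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; larger temperatures flatten the output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits, temperature=1.0, eps=1e-12):
    """Cross entropy between temperature-softened teacher and student outputs."""
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(s + eps) for t, s in zip(p_teacher, p_student))


# Token-level term (claims 5 and 6): logits over a toy three-word vocabulary.
teacher_token_logits = [2.5, 0.3, 0.0]
student_token_logits = [2.0, 0.5, 0.1]
token_loss = soft_cross_entropy(
    teacher_token_logits, student_token_logits, temperature=2.0
)

# Sentence-adjacency term (claims 8 and 9): two logits per model,
# ordered as [adjacent, not adjacent]. Hypothetical values.
teacher_adjacency_logits = [1.8, -0.4]
student_adjacency_logits = [0.9, 0.2]
adjacency_loss = soft_cross_entropy(
    teacher_adjacency_logits, student_adjacency_logits, temperature=2.0
)

print(token_loss, adjacency_loss)
```

Dividing by a temperature above 1 spreads probability mass onto the less likely classes, so the student also receives a signal about how the teacher ranks the alternatives it did not pick.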
Combinations of networks · CPC title
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Generative networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title