Training of neural network based natural language processing models using dense knowledge distillation
US-2021182662-A1 · Jun 17, 2021 · US
US11797862B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11797862-B2 |
| Application number | US-202016749570-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 22, 2020 |
| Priority date | Jan 22, 2020 |
| Publication date | Oct 24, 2023 |
| Grant date | Oct 24, 2023 |
Official abstract text for this publication.
Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERT_BASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
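The abstract mentions shared projection matrices that transfer layer-wise knowledge from teacher to student but does not give their form. The following is a minimal sketch of one way such a comparison could work, assuming the teacher weight matrix is projected down into the student's smaller space and compared by mean squared difference. The names `matmul`, `projection_loss`, `down_in`, and `down_out`, the projection direction, and the choice of loss are all illustrative assumptions, not the patent's stated method.

```python
# Hypothetical sketch: compare one teacher weight matrix with one student
# weight matrix after projecting the teacher into the student's space.
# The projection direction and the squared-error loss are assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def projection_loss(teacher_w, student_w, down_in, down_out):
    """Mean squared difference between down_in @ teacher_w @ down_out and student_w.

    teacher_w: d_t x d_t matrix, student_w: d_s x d_s matrix,
    down_in:   d_s x d_t matrix, down_out:  d_t x d_s matrix.
    In training, the two projection matrices would be learned and shared.
    """
    projected = matmul(matmul(down_in, teacher_w), down_out)
    diffs = [
        (p - s) ** 2
        for p_row, s_row in zip(projected, student_w)
        for p, s in zip(p_row, s_row)
    ]
    return sum(diffs) / len(diffs)


# Toy example with d_t = 3 and d_s = 2.
teacher_w = [[1, 0, 0], [0, 2, 0], [0, 0, 3]]
down_in = [[1, 0, 0], [0, 1, 0]]
down_out = [[1, 0], [0, 1], [0, 0]]

matched_student = [[1, 0], [0, 2]]     # equals the projected teacher
mismatched_student = [[1, 0], [0, 0]]  # differs in one entry

loss_matched = projection_loss(teacher_w, matched_student, down_in, down_out)
loss_mismatched = projection_loss(teacher_w, mismatched_student, down_in, down_out)
```

In this toy setup the loss is zero when the student matrix equals the projected teacher matrix and grows as the two diverge, which is the quantity a training step would try to reduce.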
Opening claim text (preview).
What is claimed is:
1. A computing system that performs language model compression, the computing system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store: a teacher language model, wherein a teacher vocabulary that contains a plurality of teacher sub-words is associated with the teacher language model, and wherein a plurality of teacher sub-word embeddings are respectively associated with the plurality of teacher sub-words; a student language model, wherein a student vocabulary that contains a plurality of student sub-words is associated with the student language model, wherein a plurality of student sub-word embeddings are respectively associated with the plurality of student sub-words, and wherein a number of student sub-words contained in the student vocabulary is less than a number of teacher sub-words contained in the teacher vocabulary; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a natural language training input; generating a first sub-word version of the natural language training input that comprises at least one of the teacher sub-word embeddings associated with the teacher vocabulary and at least one of the student sub-word embeddings associated with the student vocabulary; inputting the first sub-word version of the natural language training input into at least the teacher language model; receiving a teacher output generated by the teacher language model based on the first sub-word version of the natural language training input; evaluating a loss function to determine a loss associated with the teacher output; and modifying at least one of the plurality of student sub-word embeddings based at least in part on the loss associated with the teacher output.
2. The computing system of claim 1, wherein the operations further comprise: generating a second sub-word version of the natural language training input that comprises only student sub-word embeddings associated with the student vocabulary; inputting the second sub-word version of the natural language training input into at least the student language model; receiving a student output generated by the student language model based on the second sub-word version of the natural language training input; evaluating a second loss function to determine a second loss associated with the student output; and modifying, based at least in part on the second loss associated with the student output, one or both of: at least one of the plurality of student sub-word embeddings; and at least one parameter value of at least one student parameter included in the student language model.
3. The computing system of claim 1, wherein the operations further comprise: generating a second sub-word version of the natural language training input that comprises both teacher sub-word embeddings associated with the teacher vocabulary and student sub-word embeddings associated with the student vocabulary; inputting the second sub-word version of the natural language training input into at least the student language model; receiving a student output generated by the student language model based on the second sub-word version of the natural language training input; evaluating a second loss function to determine a second loss associated with the student output; and modifying at least one of the plurality of student sub-word embeddings based at least in part on the second loss associated with the teacher output.
4. The computing system of claim 1, wherein: generating the first sub-word version of the natural language training input comprises masking at least one word of the natural language training input; and the teacher output comprises a prediction of the at least one word of the natural language training input that was masked within a pre-selected one of the teacher or student vocabularies.
5. The computing system of claim 1, wherein the teacher language model and the student language model comprise respective Bidirectional Encoder Representations from Transformers (BERT) models.
6. The computing system of claim 1, wherein each of the teacher language model and the student language model comprise one or more transformer layers, and wherein the operations further comprise: modifying at least one parameter value of at least one transformer layer of the student language model to reduce a different between the at least one transformer layer of the student language model and at least one transformer layer of the teacher language model when projected into a shared space.
7. The computing system of claim 1, wherein the teacher language model and the student language model comprise an equal number of transformer layers, and wherein the student language model has a smaller number of parameters than the teacher language model.
8. The computing system of claim 1, wherein the teacher language model applies two separate softmax layers to respectively make predictions over the student vocabulary and the teacher vocabulary.
9. The computing system of claim 1, wherein each of the teacher vocabulary and the student vocabulary comprise respective sets of WordPiece tokens.
10. The computing system of claim 1, wherein the operations further comprising: deploying the student language model to a mobile or edge device for on-device inference at the mobile or edge device.
11. The computing system of claim 1, wherein generating the first sub-word version of the natural language training input comprises randomly selecting, according to a probability hyperparameter, tokens from the natural language training input to segment using the student vocabulary.
12. The computing system of claim 11, wherein the computing system performs the operations for a plurality of iterations, and wherein the computing system ramps the probability hyperparameter over the plurality of iterations to increase a ratio of tokens that are selected for segmentation using the student vocabulary.
13. A computer-implemented method, the method comprising: obtaining data descriptive of a teacher vocabulary that contains a plurality of teacher sub-words and a student vocabulary that contains a plurality of student sub-words, wherein a plurality of teacher sub-word embeddings are respectively associated with the plurality of teacher sub-words and a plurality of student sub-word embeddings are respectively associated with the plurality of student sub-words, and wherein a number of student sub-words contained in the student vocabulary is less than a number of teacher sub-words contained in the teacher vocabulary; obtaining a natural language training input; generating a sub-word version of the natural language training input that comprises at least one of the teacher sub-word embeddings associated with a teacher vocabulary and at least one of the student sub-word embeddings associated with a student vocabulary; inputting the sub-word version of the natural language training input into a language model; receiving an output generated by the language model based on the sub-word version of the natural language training input; evaluating a loss function to determine a loss associated with the output; and modifying, based at least in part on the loss associated with the output, one or both of: at least one of the plurality of student sub-word embeddings; and at least one parameter value of the language model.
14. The computer-implemented method of claim 13, wherein: generating the sub-word version of the natural language training input compri
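Claims 1, 11, and 12 describe building a training input that mixes teacher and student sub-words, choosing per token with a probability hyperparameter that is ramped over iterations. The sketch below illustrates that idea under stated assumptions: the segmenters are toy stand-ins rather than real WordPiece tokenizers, the ramp is linear (the claims only say the probability is ramped to increase the student ratio), and the names `mixed_segment` and `ramped_probability` are hypothetical.

```python
import random


def mixed_segment(words, teacher_segment, student_segment, p_student, rng):
    """Segment each word with the student vocabulary with probability
    p_student, otherwise with the teacher vocabulary.

    Returns a list of (vocabulary_tag, sub_word) pairs so that each piece
    can later be looked up in the matching embedding table.
    """
    pieces = []
    for word in words:
        if rng.random() < p_student:
            pieces.extend(("student", sw) for sw in student_segment(word))
        else:
            pieces.extend(("teacher", sw) for sw in teacher_segment(word))
    return pieces


def ramped_probability(step, total_steps, start=0.0, end=1.0):
    """Linearly ramp the selection probability from start to end.
    The linear schedule is an assumption made for illustration."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


# Toy segmenters (assumptions): the large teacher vocabulary keeps whole
# words, the small student vocabulary falls back to single characters.
def toy_teacher_segment(word):
    return [word]


def toy_student_segment(word):
    return list(word)


rng = random.Random(0)
words = ["deep", "nets"]
all_teacher = mixed_segment(words, toy_teacher_segment, toy_student_segment, 0.0, rng)
all_student = mixed_segment(words, toy_teacher_segment, toy_student_segment, 1.0, rng)
halfway = ramped_probability(step=5, total_steps=10)
```

With a probability of 0 every word is segmented by the teacher vocabulary, and with a probability of 1 every word is segmented by the student vocabulary; intermediate values yield the mixed inputs the claims describe.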
Transfer learning · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Supervised learning · CPC title
Non-supervised learning, e.g. competitive learning · CPC title
Lexical analysis, e.g. tokenisation or collocates · CPC title