What technology area does this patent fall under?

Primary CPC classification G06N3/088. Mapped technology areas include Physics.

When was this patent published?

Publication date: Oct 24, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Extreme language model compression with optimal sub-words and shared projections

US11797862B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11797862-B2
Application number	US-202016749570-A
Country	US
Kind code	B2
Filing date	Jan 22, 2020
Priority date	Jan 22, 2020
Publication date	Oct 24, 2023
Grant date	Oct 24, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to obtain optimal word embeddings for the student vocabulary. In some implementations, this approach can be combined with learning shared projection matrices that transfer layer-wise knowledge from the teacher language model to the student language model. Example experimental results have also demonstrated higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques, including the ability to compress the BERT_BASE model by more than 60×, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7 MB.
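
The shared-projection idea in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the function names, the direction of projection (teacher states projected down into the student's dimension), and the use of mean squared error are all assumptions made for illustration. The only point taken from the text is that one projection matrix is reused across layers to compare teacher and student layer by layer.

```python
def project(vec, proj):
    """Multiply a length-d_in vector by a d_in x d_out matrix (list of rows)."""
    d_out = len(proj[0])
    return [sum(vec[i] * proj[i][j] for i in range(len(vec))) for j in range(d_out)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def shared_projection_loss(teacher_layers, student_layers, proj):
    """Average per-layer MSE between projected teacher states and student states.

    The same matrix `proj` is used for every layer, which is the "shared"
    part. The projection direction and the MSE loss are illustrative
    assumptions, not details confirmed by the abstract.
    """
    losses = [
        mse(project(t, proj), s)
        for t, s in zip(teacher_layers, student_layers)
    ]
    return sum(losses) / len(losses)


# Hypothetical toy data: teacher states have 4 dimensions, student states have 2.
teacher_layers = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
student_layers = [[1.0, 2.0], [0.0, 3.0]]
proj = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
loss = shared_projection_loss(teacher_layers, student_layers, proj)
```

In a real training loop the projection matrix and the student parameters would be updated by gradient descent to reduce this loss; the sketch only shows how the quantity could be computed.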

First claim

Opening claim text (preview).

What is claimed is:
1. A computing system that performs language model compression, the computing system comprising: one or more processors; and one or more non-transitory computer-readable media that collectively store: a teacher language model, wherein a teacher vocabulary that contains a plurality of teacher sub-words is associated with the teacher language model, and wherein a plurality of teacher sub-word embeddings are respectively associated with the plurality of teacher sub-words; a student language model, wherein a student vocabulary that contains a plurality of student sub-words is associated with the student language model, wherein a plurality of student sub-word embeddings are respectively associated with the plurality of student sub-words, and wherein a number of student sub-words contained in the student vocabulary is less than a number of teacher sub-words contained in the teacher vocabulary; and instructions that, when executed by the one or more processors, cause the computing system to perform operations, the operations comprising: obtaining a natural language training input; generating a first sub-word version of the natural language training input that comprises at least one of the teacher sub-word embeddings associated with the teacher vocabulary and at least one of the student sub-word embeddings associated with the student vocabulary; inputting the first sub-word version of the natural language training input into at least the teacher language model; receiving a teacher output generated by the teacher language model based on the first sub-word version of the natural language training input; evaluating a loss function to determine a loss associated with the teacher output; and modifying at least one of the plurality of student sub-word embeddings based at least in part on the loss associated with the teacher output.
2. The computing system of claim 1, wherein the operations further comprise: generating a second sub-word version of the natural language training input that comprises only student sub-word embeddings associated with the student vocabulary; inputting the second sub-word version of the natural language training input into at least the student language model; receiving a student output generated by the student language model based on the second sub-word version of the natural language training input; evaluating a second loss function to determine a second loss associated with the student output; and modifying, based at least in part on the second loss associated with the student output, one or both of: at least one of the plurality of student sub-word embeddings; and at least one parameter value of at least one student parameter included in the student language model.
3. The computing system of claim 1, wherein the operations further comprise: generating a second sub-word version of the natural language training input that comprises both teacher sub-word embeddings associated with the teacher vocabulary and student sub-word embeddings associated with the student vocabulary; inputting the second sub-word version of the natural language training input into at least the student language model; receiving a student output generated by the student language model based on the second sub-word version of the natural language training input; evaluating a second loss function to determine a second loss associated with the student output; and modifying at least one of the plurality of student sub-word embeddings based at least in part on the second loss associated with the teacher output.
4. The computing system of claim 1, wherein: generating the first sub-word version of the natural language training input comprises masking at least one word of the natural language training input; and the teacher output comprises a prediction of the at least one word of the natural language training input that was masked within a pre-selected one of the teacher or student vocabularies.
5. The computing system of claim 1, wherein the teacher language model and the student language model comprise respective Bidirectional Encoder Representations from Transformers (BERT) models.
6. The computing system of claim 1, wherein each of the teacher language model and the student language model comprise one or more transformer layers, and wherein the operations further comprise: modifying at least one parameter value of at least one transformer layer of the student language model to reduce a different between the at least one transformer layer of the student language model and at least one transformer layer of the teacher language model when projected into a shared space.
7. The computing system of claim 1, wherein the teacher language model and the student language model comprise an equal number of transformer layers, and wherein the student language model has a smaller number of parameters than the teacher language model.
8. The computing system of claim 1, wherein the teacher language model applies two separate softmax layers to respectively make predictions over the student vocabulary and the teacher vocabulary.
9. The computing system of claim 1, wherein each of the teacher vocabulary and the student vocabulary comprise respective sets of WordPiece tokens.
10. The computing system of claim 1, wherein the operations further comprising: deploying the student language model to a mobile or edge device for on-device inference at the mobile or edge device.
11. The computing system of claim 1, wherein generating the first sub-word version of the natural language training input comprises randomly selecting, according to a probability hyperparameter, tokens from the natural language training input to segment using the student vocabulary.
12. The computing system of claim 11, wherein the computing system performs the operations for a plurality of iterations, and wherein the computing system ramps the probability hyperparameter over the plurality of iterations to increase a ratio of tokens that are selected for segmentation using the student vocabulary.
13. A computer-implemented method, the method comprising: obtaining data descriptive of a teacher vocabulary that contains a plurality of teacher sub-words and a student vocabulary that contains a plurality of student sub-words, wherein a plurality of teacher sub-word embeddings are respectively associated with the plurality of teacher sub-words and a plurality of student sub-word embeddings are respectively associated with the plurality of student sub-words, and wherein a number of student sub-words contained in the student vocabulary is less than a number of teacher sub-words contained in the teacher vocabulary; obtaining a natural language training input; generating a sub-word version of the natural language training input that comprises at least one of the teacher sub-word embeddings associated with a teacher vocabulary and at least one of the student sub-word embeddings associated with a student vocabulary; inputting the sub-word version of the natural language training input into a language model; receiving an output generated by the language model based on the sub-word version of the natural language training input; evaluating a loss function to determine a loss associated with the output; and modifying, based at least in part on the loss associated with the output, one or both of: at least one of the plurality of student sub-word embeddings; and at least one parameter value of the language model.
14. The computer-implemented method of claim 13, wherein: generating the sub-word version of the natural language training input compri…
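
Claims 11 and 12 describe choosing, per token and according to a probability hyperparameter, whether to segment with the student vocabulary, and ramping that probability over training. A minimal sketch of that step follows. The function names, the greedy longest-match segmenter, the `##` continuation prefix, the toy vocabularies, and the linear ramp schedule are all assumptions made for illustration; the claims only state that the ratio of student-segmented tokens increases.

```python
import random


def greedy_wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word in the vocabulary matches
        pieces.append(piece)
        start = end
    return pieces


def mixed_segment(words, teacher_vocab, student_vocab, p_student, rng):
    """Segment each word with the student vocabulary with probability
    p_student, otherwise with the teacher vocabulary. Each output piece is
    tagged "S" or "T" so the matching embedding table can be looked up."""
    out = []
    for word in words:
        use_student = rng.random() < p_student
        vocab = student_vocab if use_student else teacher_vocab
        tag = "S" if use_student else "T"
        out.extend((tag, piece) for piece in greedy_wordpiece(word, vocab))
    return out


def ramp_probability(step, total_steps, p_start=0.0, p_end=1.0):
    """Linear ramp of the probability hyperparameter (assumed schedule)."""
    if total_steps <= 0:
        return p_end
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)
```

For example, with hypothetical vocabularies `{"compress", "##ion", "model"}` (teacher) and `{"com", "##press", "##ion", "mod", "##el"}` (student), calling `mixed_segment(["compression", "model"], teacher_vocab, student_vocab, ramp_probability(step, total_steps), random.Random(0))` yields a sequence that mixes teacher and student sub-words early in training and moves toward student-only sub-words as `step` approaches `total_steps`.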

Assignees

Google LLC

Inventors

Classifications

G06N3/096
Transfer learning · CPC title
G06N3/0495
Quantised networks; Sparse networks; Compressed networks · CPC title
G06N3/09
Supervised learning · CPC title
G06N3/088 · Primary
Non-supervised learning, e.g. competitive learning · CPC title
G06F40/284 · Primary
Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

View patent family 76857907

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11797862B2 cover?: Provided is a knowledge distillation technique for training a student language model that, relative to a larger teacher language model, has a significantly smaller vocabulary, lower embedding dimensions, and/or hidden state dimensions. Specifically, aspects of the present disclosure are directed to a dual-training mechanism that trains the teacher and student language models simultaneously to o…
Who is the assignee on this patent?: Google LLC