What technology area does this patent fall under?

Primary CPC classification G06F18/2415. Mapped technology areas include Physics.

When was this patent published?

Publication date: Dec 13, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method for automatically compressing multitask-oriented pre-trained language model and platform thereof

US11526774B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11526774-B2
Application number	US-202117564071-A
Country	US
Kind code	B2
Filing date	Dec 28, 2021
Priority date	Dec 15, 2020
Publication date	Dec 13, 2022
Grant date	Dec 13, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed is a method for automatically compressing multi-task oriented pre-trained language model and a platform thereof. According to the method, a meta-network of a structure generator is designed, a knowledge distillation coding vector is constructed based on a knowledge distillation method of Transformer layer sampling, and a distillation structure model corresponding to a currently input coding vector is generated by using the structure generator; at the same time, a Bernoulli distribution sampling method is provided for training the structure generator; in each iteration, each encoder unit is transferred by Bernoulli distribution sampling to form a corresponding coding vector; by changing the coding vector input to the structure generator and a small batch of training data, the structure generator and the corresponding distillation structure are jointly trained, and a structure generator capable of generating weights for different distillation structures can be acquired.
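The layer-sampling step named in the abstract, together with the limits stated in claims 2 and 3 on this page (12 Transformer layers, at least 6 elements equal to 1), can be sketched roughly as below. This is an illustrative sketch, not the patented implementation: the function names, the fixed seed, and the reading of claim 2 as one independent draw per layer thresholded at 0.5 are all assumptions.

```python
import random

NUM_LAYERS = 12  # claim 2: 12 Transformer layers of the BERT model
MIN_KEPT = 6     # claim 3: at least 6 elements of the vector must be 1


def sample_coding_vector(rng, num_layers=NUM_LAYERS):
    """Draw one 0/1 element per Transformer layer.

    Assumption: each layer is an independent draw, and the element is 1
    (the layer is transferred) when the drawn value is >= 0.5.
    """
    return [1 if rng.random() >= 0.5 else 0 for _ in range(num_layers)]


def in_search_space(vector, min_kept=MIN_KEPT):
    """Search-space filter: keep vectors with enough transferred layers."""
    return sum(vector) >= min_kept


def sample_filtered(rng, count):
    """Keep sampling until `count` vectors pass the search-space filter."""
    kept = []
    while len(kept) < count:
        vector = sample_coding_vector(rng)
        if in_search_space(vector):
            kept.append(vector)
    return kept


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
vectors = sample_filtered(rng, 5)
for vector in vectors:
    print(vector, "layers kept:", sum(vector))
```

Each printed vector is one candidate input to the structure generator; vectors with fewer than 6 ones are discarded before they reach it.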

First claim

Opening claim text (preview).

What is claimed is:

1. A method for automatic compressing multi-task oriented pre-trained language model, comprising the following three stages: a first stage of constructing a knowledge distillation coding vector based on Transformer layer sampling: layer-sampling all Transformer units of a BERT model by Bernoulli distribution to generate the knowledge distillation coding vector; a second stage of training a knowledge distillation network of meta-learning comprising: generating a filtered knowledge distillation coding vector by: defining a search space, inputting the knowledge distillation coding vector constructed in the first stage into the search space, and removing unqualified knowledge distillation coding vectors; defining a structure generator, which takes the filtered knowledge distillation coding vector as an input, outputs a weight matrix for constructing a distillation structure model, and generates the corresponding distillation structure model; training the generated distillation structure model to update the structure generator; a third stage of searching the distillation structure model based on an evolutionary algorithm comprising: inputting a plurality of knowledge distillation coding vectors satisfying specific constraints into the updated structure generator in the second stage to generate the corresponding weight matrices to obtain a plurality of distillation structure models each based on one of the corresponding weight matrices; evaluating the accuracy of each of the plurality of distillation structure models; using the evolutionary algorithm to search the distillation structure model with the highest accuracy that meets the specific constraints, and obtaining a common compression structure.

2. The method for automatic compressing multi-task oriented pre-trained language model according to claim 1, wherein the first stage comprises: sequentially carrying out Bernoulli sampling on 12 layers of Transformer units of the BERT model to generate the knowledge distillation coding vector, each layer corresponding to a random variable; wherein when a probability of the random variable being 1 is greater than or equal to 0.5, an element corresponding to the knowledge distillation coding vector is 1, which represents that a current Transformer unit performs transfer learning; and when a probability value of the random variable being 1 is less than 0.5, the element corresponding to the layer sampling vector is 0, which represents that the current Transformer unit does not perform transfer learning.

3. The method for automatic compressing multi-task oriented pre-trained language model according to claim 2, wherein the step of defining a search space is that the number of elements being 1 in the knowledge distillation coding vector is not less than 6.

4. The method for automatic compressing multi-task oriented pre-trained language model according to claim 3, wherein the step of defining a structure generator comprises that the structure generator consists of two fully connected layers, the input of which is the knowledge distillation coding vector constructed in the first stage, and the output of which is the weight matrix for generating the distillation structure model.

5. The method for automatic compressing multi-task oriented pre-trained language model according to claim 4, wherein the step of training the generated distillation structure model to update the structure generator comprises the following substeps: step (2.1): inputting the knowledge distillation coding vector into the structure generator and outputting the weight matrix; step (2.2): constructing the distillation structure model based on the weight matrix output by the structure generator; step (2.3): jointly training the structure generator and the distillation structure model: inputting the training data into the distillation structure model generated in step (2.2) for model training, and updating the structure generator together; meanwhile, training the structure generator by combining a Bernoulli distribution sampling method.

6. The method for automatic compressing multi-task oriented pre-trained language model according to claim 5, wherein the step (2.2) comprises: performing layer sampling knowledge distillation on each Transformer layer of a teacher network according to the knowledge distillation coding vector constructed in the first stage, wherein each element corresponds to a layer of Transformer units, initializing the Transformer units transferred by a student model by using a weight of the Transformer unit with an element corresponding to the knowledge distillation coding vector being 1 in the teacher model, the Transformer unit corresponding to the student model and the weight thereof are generated from each element with a layer sampling being 1 through the structure generator; establishing a one-to-one mapping relationship between the teacher model and the student model through the knowledge distillation coding vector, and generating a corresponding distillation network structure according to the knowledge distillation coding vector.

7. The method for automatic compressing multi-task oriented pre-trained language model according to claim 6, wherein the step of training the structure generator by combining a Bernoulli distribution sampling method specifically comprises: using Bernoulli distribution to perform layer sampling for the Transformer units in each layer to construct different knowledge distillation coding vectors, using a training data set to carry out multiple iterative trainings, training the structure generator and the distillation structure model simultaneously based on one knowledge distillation coding vector in each iteration, and acquiring the structure generator capable of generating weight matrices for different distillation structure models by changing the input knowledge distillation coding vectors.

8. The method for automatic compressing multi-task oriented pre-trained language model according to claim 7, wherein the third stage comprises the following substeps: step (3.1): defining the knowledge distillation coding vector as genes of the distillation structure model, and randomly selecting a series of genes satisfying specific constraints as an initial population; step (3.2): evaluating the accuracy of the distillation structure model corresponding to each gene in an existing population, and selecting top k genes with a higher accuracy; step (3.3): using the top k genes with a higher accuracy selected in step (3.2) for gene recombination and gene mutation to generate new genes, and adding the new genes into the existing population; step (3.4): repeating and iterating steps (3.2) to (3.3) for a set number of rounds, selecting the top k genes with a higher accuracy in the existing population and generating new genes, and finally obtaining the genes with the highest accuracy that meet the specific constraints.

9. The method for automatic compressing multi-task oriented pre-trained language model according to claim 8, wherein in step (3.3), gene mutation refers to randomly changing the values of some elements in the gene; gene recombination refers to randomly recombining the genes of two parents; new genes that do not meet the specific constraints are eliminated.

10. A platform based on the method for automatic compressing multi-task oriented pre-trained language model according to claim 1, comprising: at least one processors, and a memory coupled to the at least one processors, wherein the memory stores programmable instructions which cause the at least one processor to: load obtain training samples of multi-task oriented pre-trained language model, wherein the training samples are tagged text samples tha
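As a rough illustration of the third stage (steps 3.1 to 3.4 in claim 8, with mutation and recombination as described in claim 9), the search loop might look like the Python sketch below. It is not the patented implementation. The claims do not spell out the "specific constraints", so the sketch assumes the claim 3 rule (at least 6 ones); `toy_accuracy` is a hypothetical placeholder for evaluating a distillation structure model whose weights come from the trained structure generator; the population size, `top_k`, round count, and mutation rate are arbitrary illustrative values.

```python
import random

NUM_LAYERS = 12
MIN_KEPT = 6  # assumption: reuse claim 3's rule as the "specific constraints"


def satisfies_constraints(gene):
    return sum(gene) >= MIN_KEPT


def toy_accuracy(gene):
    # Hypothetical stand-in for real evaluation: favors later layers and
    # mildly penalizes model size. A real system would score the
    # distillation structure model built from this gene on held-out data.
    weighted = sum((i + 1) * bit for i, bit in enumerate(gene))
    return weighted / 78 - 0.02 * sum(gene)


def random_gene(rng):
    """Draw random genes until one satisfies the constraints."""
    while True:
        gene = [rng.randint(0, 1) for _ in range(NUM_LAYERS)]
        if satisfies_constraints(gene):
            return gene


def mutate(gene, rng, rate=0.1):
    """Randomly flip some elements of the gene."""
    return [1 - bit if rng.random() < rate else bit for bit in gene]


def recombine(parent_a, parent_b, rng):
    """Randomly pick each element from one of the two parents."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def evolve(rng, population_size=20, top_k=5, rounds=10):
    population = [random_gene(rng) for _ in range(population_size)]  # step 3.1
    for _ in range(rounds):                                          # step 3.4
        ranked = sorted(population, key=toy_accuracy, reverse=True)
        top = ranked[:top_k]                                         # step 3.2
        for _ in range(population_size):                             # step 3.3
            if rng.random() < 0.5:
                child = recombine(rng.choice(top), rng.choice(top), rng)
            else:
                child = mutate(rng.choice(top), rng)
            if satisfies_constraints(child):  # claim 9: drop invalid genes
                population.append(child)
    return max(population, key=toy_accuracy)


best = evolve(random.Random(0))
print(best, "layers kept:", sum(best))
```

New genes are appended to the existing population rather than replacing it, following the wording of step (3.3); a bounded population would be an equally plausible design choice.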

Assignees

Zhejiang Lab

Inventors

Classifications

G06F18/214
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
G06N3/045
Combinations of networks · CPC title
G06F18/2415 (Primary)
based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate · CPC title
G06N3/063
using electronic means · CPC title
G06N3/086
using evolutionary algorithms, e.g. genetic algorithms or genetic programming · CPC title

Patent family

Related publications grouped by family.

View patent family 81942583

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11526774B2 cover?: Disclosed is a method for automatically compressing multi-task oriented pre-trained language model and a platform thereof. According to the method, a meta-network of a structure generator is designed, a knowledge distillation coding vector is constructed based on a knowledge distillation method of Transformer layer sampling, and a distillation structure model corresponding to a currently input …
Who is the assignee on this patent?: Zhejiang Lab

Related patents

Knowledge distillation for neural networks using multiple augmentation strategies

Compression method and platform of pre-training language model based on knowledge distillation

Method, System, and Computer Program Product for Local Approximation of a Predictive Model
