Method for automatically compressing multitask-oriented pre-trained language model and platform thereof

US11526774B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11526774-B2
Application number: US-202117564071-A
Country: US
Kind code: B2
Filing date: Dec 28, 2021
Priority date: Dec 15, 2020
Publication date: Dec 13, 2022
Grant date: Dec 13, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed is a method for automatically compressing multi-task oriented pre-trained language model and a platform thereof. According to the method, a meta-network of a structure generator is designed, a knowledge distillation coding vector is constructed based on a knowledge distillation method of Transformer layer sampling, and a distillation structure model corresponding to a currently input coding vector is generated by using the structure generator; at the same time, a Bernoulli distribution sampling method is provided for training the structure generator; in each iteration, each encoder unit is transferred by Bernoulli distribution sampling to form a corresponding coding vector; by changing the coding vector input to the structure generator and a small batch of training data, the structure generator and the corresponding distillation structure are jointly trained, and a structure generator capable of generating weights for different distillation structures can be acquired.
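The layer-sampling step in the abstract can be illustrated with a minimal sketch. It borrows the 12-layer count and 0.5 threshold from claim 2 and the "at least 6 ones" rule from claim 3 on this page. Treating the threshold as a fair Bernoulli(0.5) draw per layer is an assumption about how the claim is read, and all function names here are hypothetical, not taken from the patent.

```python
import random

NUM_LAYERS = 12  # claim 2: 12 layers of Transformer units in the BERT model
MIN_KEPT = 6     # claim 3: at least 6 elements of the vector must be 1


def sample_coding_vector(rng, num_layers=NUM_LAYERS):
    """Draw one 0/1 element per Transformer layer.

    Assumption: a uniform draw thresholded at 0.5, i.e. Bernoulli(0.5).
    """
    return [1 if rng.random() >= 0.5 else 0 for _ in range(num_layers)]


def in_search_space(vector, min_kept=MIN_KEPT):
    """Keep only vectors that transfer at least `min_kept` layers."""
    return sum(vector) >= min_kept


def sample_qualified_vector(rng):
    """Resample until the vector passes the search-space filter."""
    while True:
        vector = sample_coding_vector(rng)
        if in_search_space(vector):
            return vector


def layer_mapping(vector):
    """Teacher layer indices (0-based) whose element is 1.

    Student layer i would be initialised from teacher layer mapping[i].
    """
    return [i for i, bit in enumerate(vector) if bit == 1]


rng = random.Random(0)  # fixed seed so the run is repeatable
vector = sample_qualified_vector(rng)
print(vector, layer_mapping(vector))
```

Each qualified vector would then be fed to the structure generator, which the claims describe as producing the weight matrix for the matching distillation structure model. That part is omitted here.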

First claim

Opening claim text (preview).

What is claimed is:

1. A method for automatic compressing multi-task oriented pre-trained language model, comprising the following three stages: a first stage of constructing a knowledge distillation coding vector based on Transformer layer sampling: layer-sampling all Transformer units of a BERT model by Bernoulli distribution to generate the knowledge distillation coding vector; a second stage of training a knowledge distillation network of meta-learning comprising: generating a filtered knowledge distillation coding vector by: defining a search space, inputting the knowledge distillation coding vector constructed in the first stage into the search space, and removing unqualified knowledge distillation coding vectors; defining a structure generator, which takes the filtered knowledge distillation coding vector as an input, outputs a weight matrix for constructing a distillation structure model, and generates the corresponding distillation structure model; training the generated distillation structure model to update the structure generator; a third stage of searching the distillation structure model based on an evolutionary algorithm comprising: inputting a plurality of knowledge distillation coding vectors satisfying specific constraints into the updated structure generator in the second stage to generate the corresponding weight matrices to obtain a plurality of distillation structure models each based on one of the corresponding weight matrices; evaluating the accuracy of each of the plurality of distillation structure models; using the evolutionary algorithm to search the distillation structure model with the highest accuracy that meets the specific constraints, and obtaining a common compression structure.

2. The method for automatic compressing multi-task oriented pre-trained language model according to claim 1, wherein the first stage comprises: sequentially carrying out Bernoulli sampling on 12 layers of Transformer units of the BERT model to generate the knowledge distillation coding vector, each layer corresponding to a random variable; wherein when a probability of the random variable being 1 is greater than or equal to 0.5, an element corresponding to the knowledge distillation coding vector is 1, which represents that a current Transformer unit performs transfer learning; and when a probability value of the random variable being 1 is less than 0.5, the element corresponding to the layer sampling vector is 0, which represents that the current Transformer unit does not perform transfer learning.

3. The method for automatic compressing multi-task oriented pre-trained language model according to claim 2, wherein the step of defining a search space is that the number of elements being 1 in the knowledge distillation coding vector is not less than 6.

4. The method for automatic compressing multi-task oriented pre-trained language model according to claim 3, wherein the step of defining a structure generator comprises that the structure generator consists of two fully connected layers, the input of which is the knowledge distillation coding vector constructed in the first stage, and the output of which is the weight matrix for generating the distillation structure model.

5. The method for automatic compressing multi-task oriented pre-trained language model according to claim 4, wherein the step of training the generated distillation structure model to update the structure generator comprises the following substeps: step (2.1): inputting the knowledge distillation coding vector into the structure generator and outputting the weight matrix; step (2.2): constructing the distillation structure model based on the weight matrix output by the structure generator; step (2.3): jointly training the structure generator and the distillation structure model: inputting the training data into the distillation structure model generated in step (2.2) for model training, and updating the structure generator together; meanwhile, training the structure generator by combining a Bernoulli distribution sampling method.

6. The method for automatic compressing multi-task oriented pre-trained language model according to claim 5, wherein the step (2.2) comprises: performing layer sampling knowledge distillation on each Transformer layer of a teacher network according to the knowledge distillation coding vector constructed in the first stage, wherein each element corresponds to a layer of Transformer units, initializing the Transformer units transferred by a student model by using a weight of the Transformer unit with an element corresponding to the knowledge distillation coding vector being 1 in the teacher model, the Transformer unit corresponding to the student model and the weight thereof are generated from each element with a layer sampling being 1 through the structure generator; establishing a one-to-one mapping relationship between the teacher model and the student model through the knowledge distillation coding vector, and generating a corresponding distillation network structure according to the knowledge distillation coding vector.

7. The method for automatic compressing multi-task oriented pre-trained language model according to claim 6, wherein the step of training the structure generator by combining a Bernoulli distribution sampling method specifically comprises: using Bernoulli distribution to perform layer sampling for the Transformer units in each layer to construct different knowledge distillation coding vectors, using a training data set to carry out multiple iterative trainings, training the structure generator and the distillation structure model simultaneously based on one knowledge distillation coding vector in each iteration, and acquiring the structure generator capable of generating weight matrices for different distillation structure models by changing the input knowledge distillation coding vectors.

8. The method for automatic compressing multi-task oriented pre-trained language model according to claim 7, wherein the third stage comprises the following substeps: step (3.1): defining the knowledge distillation coding vector as genes of the distillation structure model, and randomly selecting a series of genes satisfying specific constraints as an initial population; step (3.2): evaluating the accuracy of the distillation structure model corresponding to each gene in an existing population, and selecting top k genes with a higher accuracy; step (3.3): using the top k genes with a higher accuracy selected in step (3.2) for gene recombination and gene mutation to generate new genes, and adding the new genes into the existing population; step (3.4): repeating and iterating steps (3.2) to (3.3) for a set number of rounds, selecting the top k genes with a higher accuracy in the existing population and generating new genes, and finally obtaining the genes with the highest accuracy that meet the specific constraints.

9. The method for automatic compressing multi-task oriented pre-trained language model according to claim 8, wherein in step (3.3), gene mutation refers to randomly changing the values of some elements in the gene; gene recombination refers to randomly recombining the genes of two parents; new genes that do not meet the specific constraints are eliminated.

10. A platform based on the method for automatic compressing multi-task oriented pre-trained language model according to claim 1, comprising: at least one processors, and a memory coupled to the at least one processors, wherein the memory stores programmable instructions which cause the at least one processor to: load obtain training samples of multi-task oriented pre-trained language model, wherein the training samples are tagged text samples tha
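The evolutionary search in claims 8 and 9 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation. The "specific constraints" are not spelled out in the text shown here, so the sketch reuses the "at least 6 ones" rule from claim 3. The accuracy evaluation is replaced by a hypothetical `toy_accuracy` stand-in, since a real run would score the distillation structure model built from each gene. The mutation rate, population size, and the even split between recombination and mutation are likewise invented for the example.

```python
import random


def is_valid(gene, min_kept=6):
    """Assumed constraint: at least `min_kept` layers are transferred."""
    return sum(gene) >= min_kept


def random_gene(rng, n=12, min_kept=6):
    """Step (3.1): draw random genes until one satisfies the constraint."""
    while True:
        gene = tuple(rng.randint(0, 1) for _ in range(n))
        if is_valid(gene, min_kept):
            return gene


def mutate(gene, rng, rate=0.1):
    """Gene mutation: flip each element with probability `rate`."""
    return tuple(1 - bit if rng.random() < rate else bit for bit in gene)


def recombine(a, b, rng):
    """Gene recombination: pick each element from one of two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolutionary_search(evaluate, rng, pop_size=20, top_k=5, rounds=10):
    population = {random_gene(rng) for _ in range(pop_size)}
    for _ in range(rounds):
        # Step (3.2): keep the top-k genes by evaluated accuracy.
        parents = sorted(population, key=evaluate, reverse=True)[:top_k]
        # Step (3.3): create new genes and add the valid ones.
        for _ in range(pop_size):
            if len(parents) >= 2 and rng.random() < 0.5:
                a, b = rng.sample(parents, 2)
                child = recombine(a, b, rng)
            else:
                child = mutate(rng.choice(parents), rng)
            if is_valid(child):  # invalid genes are eliminated (claim 9)
                population.add(child)
    # Step (3.4): return the best gene found after the set number of rounds.
    return max(population, key=evaluate)


def toy_accuracy(gene):
    """Hypothetical stand-in score; NOT a real model evaluation."""
    return sum((i - 2) * bit for i, bit in enumerate(gene))


best = evolutionary_search(toy_accuracy, random.Random(1))
print(best, toy_accuracy(best))
```

Because new genes are only ever added to the existing population, the best score never gets worse from one round to the next. In a real pipeline the `evaluate` callable would be the expensive part, so caching scores per gene would be a natural refinement.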

Assignees

Zhejiang Lab

Inventors

Classifications

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Combinations of networks · CPC title

  • based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate · CPC title

  • using electronic means · CPC title

  • using evolutionary algorithms, e.g. genetic algorithms or genetic programming · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11526774B2 cover?
Disclosed is a method for automatically compressing multi-task oriented pre-trained language model and a platform thereof. According to the method, a meta-network of a structure generator is designed, a knowledge distillation coding vector is constructed based on a knowledge distillation method of Transformer layer sampling, and a distillation structure model corresponding to a currently input …
Who is the assignee on this patent?
Zhejiang Lab
What technology area does this patent fall under?
Primary CPC classification G06F18/2415. Mapped technology areas include Physics.
When was this patent published?
Publication date: Dec 13, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).