Systems and methods for distilled BERT-based training model for text classification

US11922303B2 · US · B2

Patent metadata
Publication number: US-11922303-B2
Application number: US-202016877339-A
Country: US
Kind code: B2
Filing date: May 18, 2020
Priority date: Nov 18, 2019
Publication date: Mar 5, 2024
Grant date: Mar 5, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provides a training mechanism that transfers the knowledge from a trained BERT model into a much smaller model to approximate the behavior of BERT. Specifically, the BERT model may be treated as a teacher model, and a much smaller student model may be trained using the same inputs to the teacher model and the output from the teacher model. In this way, the student model can be trained within a much shorter time than the BERT teacher model, but with comparable performance with BERT.
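
The teacher-student setup described in the abstract can be illustrated with a loss that pulls the student's output distribution toward the teacher's for the same input. The following is a minimal sketch, not the patent's implementation: the KL-divergence form, the temperature parameter, the function names, and the logit values are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by a temperature (assumed)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one input over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

In a real training loop this loss would be backpropagated through the student only, with the teacher's outputs treated as fixed targets.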

First claim

Opening claim text (preview).

What is claimed is:

1. A method for distilling knowledge from a first neural network to train a second neural network, the method comprising:
    receiving a plurality of training samples corresponding to a first set of pre-defined classes from a given dataset;
    retrieving the first neural network that is pre-trained to classify input samples into the first set of pre-defined classes;
    obtaining a first plurality of classifications by feeding the plurality of training samples to the first neural network;
    transforming, using an out-of-distribution (OOD) sample generation module, one or more of the plurality of the training samples into one or more out-of-distribution (OOD) training samples, wherein the transforming further includes:
        computing, using a term frequency-inverse document frequency model, inter-class word importance probabilities of one or more words of a training sample in the plurality of training samples, wherein the inter-class word importance probabilities indicate that the one or more words distinguish between the first plurality of classifications;
        computing, using a discriminator model, in-distribution word importance probabilities of the one or more words of a training sample in the plurality of training samples, wherein the in-distribution word importance probabilities indicate contributions of the one or more words to a classification in the first plurality of classifications;
        identifying a set of the one or more words within the training sample based on the inter-class word importance probabilities and the in-distribution word importance probabilities; and
        replacing the set of the one or more words within the training sample with one or more random words;
    generating a second set of classes by adding an out-of-distribution class to the first set of pre-defined classes; and
    training the second neural network defined with the second set of classes based on the plurality of training samples, the one or more out-of-distribution training samples and the first plurality of classifications from the first neural network.

2. The method of claim 1, wherein the first neural network includes any combination of a bidirectional encoder representation from transformers (BERT) model and embeddings from language models (ELMO).

3. The method of claim 1, wherein the second neural network has a smaller size than the first neural network, and the second neural network is implementable on a central processing unit.

4. The method of claim 1, further comprising: training, using a customer dataset, the first neural network to classify input samples into the first set of pre-defined classes, wherein the customer dataset includes the plurality of training samples.

5. The method of claim 1, wherein the training the second neural network defined with the second set of classes comprises:
    generating a second plurality of classification outputs by feeding the plurality of training samples to the second neural network;
    computing a knowledge distillation loss between the first plurality of classifications and the second plurality of classification outputs; and
    using backpropagation on the second neural network by the knowledge distillation loss to update parameters for the second neural network.

6. The method of claim 5, further comprising:
    generating one or more additional classification outputs by feeding the one or more out-of-distribution training samples to the second neural network;
    computing a loss metric between the one or more additional classification outputs and a classification distribution corresponding to the added out-of-distribution class; and
    incorporating the loss metric into the knowledge distillation loss.

7. The method of claim 1, wherein the training the second neural network defined with the second set of classes further comprises: preprocessing the plurality of training samples or the one or more out-of-distribution training samples by adding a Gaussian noise component before feeding the plurality of training samples or the one or more out-of-distribution training samples to the second neural network.

8. The method of claim 1, wherein the training the second neural network defined with the second set of classes further comprises:
    generating a number of reference class vectors corresponding to the first set of pre-defined classes; and
    determining whether an input sample belongs to the added out-of-distribution class based on whether a vector representation of the input sample is orthogonal to the number of reference class vectors.

9. The method of claim 1, wherein the training the second neural network defined with the second set of classes further comprises:
    training the second neural network using the plurality of training samples having a first feature dimension;
    in response to receiving an input sample having the first feature dimension, using a Gaussian distribution based sparsification vector to reduce the first feature dimension to a second feature dimension; and
    generating, via the second neural network, an output based on the input sample having the second feature dimension.

10. A system for distilling knowledge from a first neural network to train a second neural network, the system comprising:
    a communication interface that receives a plurality of training samples;
    a memory containing machine readable medium storing machine executable code; and
    one or more processors coupled to the memory and configurable to execute the machine executable code to cause the one or more processors to:
        receive a plurality of training samples corresponding to a first set of pre-defined classes from a given dataset;
        retrieve the first neural network that is pre-trained to classify input samples into the first set of pre-defined classes;
        obtain a first plurality of classifications by feeding the plurality of training samples to the first neural network;
        transform, using an out-of-distribution (OOD) sample generation module, one or more of the plurality of the training samples into one or more out-of-distribution training samples, wherein the transformation further includes:
            computing, using a term frequency-inverse document frequency model, inter-class word importance probabilities of one or more words of a training sample in the plurality of training samples, wherein the inter-class word importance probabilities indicate that one or more words distinguish between the first plurality of classifications;
            computing, using a discriminator model, in-distribution word importance probabilities of the one or more words of a training sample in the plurality of training samples, wherein the in-distribution word importance probabilities indicate contributions of the one or more words to a classification in the first plurality of classifications;
            identifying a set of the one or more words within the training sample based on the inter-class word importance probabilities and the in-distribution word importance probabilities; and
            replacing the set of the one or more words within the training sample with one or more random words;
        generate a second set of classes by adding an out-of-distribution class to the first set of pre-defined classes; and
        train the second neural network defined with the second set of classes based on the plurality of training samples, the one or more out-of-distribution training samples and the first plurality of classifications from the first neural network.

11. The system of claim 10, wherein the first neural network includes any combination of a bidirectional encoder representation from transformers (BERT) model and embeddings from language models (ELMO).

12. The system of claim 10, wherein the second neural network
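
The out-of-distribution sample generation recited in claim 1 (score words, pick the most important ones, replace them with random words, add an extra class) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the claimed implementation: pooling each class's text into one TF-IDF document, the smoothed IDF formula, the uniform stand-in for the discriminator model, combining the two scores by multiplication, the top-k selection, and every name and data value below are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_word_probs(sample_words, sample_class, class_docs):
    """Normalized TF-IDF scores, treating each class's pooled words as one document."""
    n_classes = len(class_docs)
    tf = Counter(class_docs[sample_class])
    scores = []
    for word in sample_words:
        df = sum(1 for words in class_docs.values() if word in words)
        idf = math.log((1 + n_classes) / (1 + df)) + 1.0  # smoothed IDF (assumed form)
        scores.append(tf[word] * idf)
    total = sum(scores) or 1.0
    return [s / total for s in scores]


def uniform_discriminator_probs(sample_words):
    """Hypothetical stand-in for the discriminator model: equal weight per word."""
    return [1.0 / len(sample_words)] * len(sample_words)


def make_ood_sample(sample_words, inter_probs, in_dist_probs, vocab, k, rng):
    """Replace the k highest-scoring words with random words from vocab."""
    combined = [a * b for a, b in zip(inter_probs, in_dist_probs)]  # assumed combination
    ranked = sorted(range(len(sample_words)), key=lambda i: combined[i], reverse=True)
    out = list(sample_words)
    for i in ranked[:k]:
        out[i] = rng.choice([v for v in vocab if v != out[i]])
    return out


# Toy two-class dataset; "my" appears in both classes, so it scores lowest.
classes = ["billing", "shipping"]
class_docs = {
    "billing": "refund my invoice payment".split(),
    "shipping": "track my package delivery".split(),
}
random_vocab = ["weather", "guitar", "planet", "recipe"]

sample = class_docs["billing"]
inter = tfidf_word_probs(sample, "billing", class_docs)
in_dist = uniform_discriminator_probs(sample)
ood_sample = make_ood_sample(sample, inter, in_dist, random_vocab, k=2, rng=random.Random(0))

# The student is defined over the original classes plus one added OOD class.
student_classes = classes + ["out_of_distribution"]
print(ood_sample, student_classes)
```

The word shared by both classes keeps its place because it does little to distinguish them, while two of the class-specific words are swapped out, which pushes the sample away from either original class.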

Assignees

Salesforce Com Inc, Salesforce Inc

Inventors

Classifications

  • Transfer learning · CPC title

  • Supervised learning · CPC title

  • Quantised networks; Sparse networks; Compressed networks · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G06N3/08 (Primary)

    Learning methods · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11922303B2 cover?
Embodiments described herein provides a training mechanism that transfers the knowledge from a trained BERT model into a much smaller model to approximate the behavior of BERT. Specifically, the BERT model may be treated as a teacher model, and a much smaller student model may be trained using the same inputs to the teacher model and the output from the teacher model. In this way, the student m…
Who is the assignee on this patent?
Salesforce Com Inc, Salesforce Inc
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 5, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).