Slim embedding layers for recurrent neural language models

US11030997B2 · US · B2

Patent metadata
Publication number: US-11030997-B2
Application number: US-201816197945-A
Country: US
Kind code: B2
Filing date: Nov 21, 2018
Priority date: Nov 22, 2017
Publication date: Jun 8, 2021
Grant date: Jun 8, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are systems and methods for compressing or otherwise reducing the memory requirements for storing and computing the model parameters in recurrent neural language models. Embodiments include space compression methodologies that share the structured parameters at the input embedding layer, the output embedding layers, or both of a recurrent neural language model to significantly reduce the size of model parameters, but still compactly represent the original input and output embedding layers. Embodiments of the methodology are easy to implement and tune. Experiments on several data sets show that embodiments achieved similar perplexity and BLEU score results while only using a fraction of the parameters.
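As an illustration of the parameter sharing the abstract describes, the following is a minimal sketch of a shared-sub-vector input embedding, loosely following the shuffle-based mapping recited in claims 9 to 11. The function names, the sizes V, D, K, and M, the round-robin initialisation of the indicator list, and the uniform initialisation range are all assumptions made for this example and are not taken from the patent text.

```python
import random


def build_mapping_table(vocab_size, num_parts, num_shared, seed=0):
    """Assign each of the V*K embedding parts to one of M shared sub-vectors.

    Builds a list of V*K indicators, shuffles it randomly, and cuts it into
    one row of K indicators per word. Filling the list round-robin is an
    assumption; it makes every shared sub-vector roughly equally used.
    """
    total = vocab_size * num_parts
    indicators = [i % num_shared for i in range(total)]
    random.Random(seed).shuffle(indicators)
    return [indicators[w * num_parts:(w + 1) * num_parts]
            for w in range(vocab_size)]


def init_sub_vectors(num_shared, part_dim, seed=0):
    """Create M trainable sub-vectors of length D/K (hypothetical init range)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(part_dim)]
            for _ in range(num_shared)]


def embed(word_id, mapping, sub_vectors):
    """Rebuild a word's embedding by concatenating its K shared sub-vectors."""
    vector = []
    for index in mapping[word_id]:
        vector.extend(sub_vectors[index])
    return vector


# Hypothetical sizes: V words, D-dimensional embeddings, K parts, M shared sub-vectors.
V, D, K, M = 1000, 8, 4, 200
mapping = build_mapping_table(V, K, M)
sub_vectors = init_sub_vectors(M, D // K)
vector = embed(42, mapping, sub_vectors)

full_params = V * D         # trainable values in an uncompressed embedding matrix
slim_params = M * (D // K)  # trainable values in the shared sub-vectors
```

With these example sizes the trainable values drop from 8000 to 400; the mapping table adds V*K fixed integer indices, which, per claims 3 and 14, are not updated during training.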

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for compressing a matrix of a neural network model, the method comprising: for each vector from a set of vectors from the matrix, dividing the vector from the matrix into a plurality of parts; for each part of the vector from the matrix, mapping the part to a substitute sub-vector, which comprises one or more parameters, wherein the substitute sub-vector is selected from a set of substitute sub-vectors, which set has fewer substitute sub-vectors than there are parts mapped from the matrix; and training the neural network model using the mapped substitute sub-vectors until a stop condition is reached.

2. The computer-implemented method of claim 1 wherein the vector is an input word embedding vector, an output word embedding vector, or both.

3. The computer-implemented method of claim 1 wherein the step of mapping the part to a substitute sub-vector further comprises forming a mapping table and the mapping table is fixed during training but the one or more parameters of each substitute sub-vector are subject to updating during training.

4. The computer-implemented method of claim 1 wherein the step of mapping the part to a substitute sub-vector comprises: initializing a list of substitute sub-vector indicators, the list comprising a same number of entries for a sub-vector indicator as there are parts in the vector; shuffling the list; and generating a mapping table from the shuffled list.

5. The computer-implemented method of claim 4 wherein the list is randomly shuffled.

6. The computer-implemented method of claim 1 wherein the step of mapping the part to a substitute sub-vector comprises: using a pre-trained matrix to estimate sub-vectors to facilitate mapping of parts of the matrix with similar estimated sub-vectors to the same substitute sub-vector.

7. The computer-implemented method of claim 6 wherein the step of using a pre-trained matrix to estimate sub-vectors to facilitate mapping of parts of the matrix with similar estimated sub-vectors to the same substitute sub-vector comprises: clustering parts of the pre-trained source embedding matrix into a plurality of clusters; and mapping the parts of the matrix that correspond to parts of the pre-trained matrix that were in the same cluster to the same substitute sub-vector.

8. The computer-implemented method of claim 7 wherein the number of clusters in the plurality of clusters corresponds to the number of substitute sub-vectors.

9. A computer-implemented method for compressing embedding of a neural network model, the method comprising: dividing each word embedding vector of an embedding matrix having V word embedding vectors into K parts, K being a number larger than 1, each part comprising at least two elements; for each part of the V*K parts of the embedding matrix, mapping the part to one of M substitute sub-vectors comprising one or more parameters, wherein M is a number less than V*K; and training the neural network model using the mapped substitute sub-vectors.

10. The computer-implemented method of claim 9 wherein each word embedding vector is divided into K parts evenly.

11. The computer-implemented method of claim 9 wherein the step of mapping the part to one of M substitute sub-vectors comprises: initializing a list of V*K sub-vector indicator entries, each indicator entry representing one of the M substitute sub-vectors; randomly shuffling the list; and generating a mapping table from the randomly shuffled list.

12. The computer-implemented method of claim 9 wherein the step of mapping the part to one of M substitute sub-vectors comprises: using a pre-trained embedding matrix to estimate sub-vectors to facilitate mapping of parts of the embedding matrix with similar estimated sub-vectors to the same substitute sub-vector.

13. The computer-implemented method of claim 11 wherein the step of using a pre-trained embedding matrix to estimate sub-vectors to facilitate mapping of parts of the embedding matrix with similar estimated sub-vectors to the same substitute sub-vector comprises: clustering parts of the pre-trained embedding matrix into a plurality of clusters; and mapping the parts of the embedding matrix that correspond to parts of the pre-trained embedding matrix that were in the same cluster to the same substitute sub-vector.

14. The computer-implemented method of claim 9 wherein the step of mapping the parts to one of M substitute sub-vectors further comprises forming a mapping table and the mapping table is fixed during training but the one or more parameters of each substitute sub-vectors are subject to updating during training.

15. A computer-implemented method for compressing an output word embedding layer of a neural network model, the method comprising: mapping an output embedding vector into K sub-vectors, K being a number larger than 1, each part comprising at least two elements; dividing a hidden vector of the neural network model into K parts; for each pair in a set of pairs, obtaining and storing a partial dot product for the pair, in which a pair comprises a hidden vector part and a corresponding output embedding sub-vector; for a word, using at least some of the stored partial dot products, which are selected according to the mapping, to obtain a sum value; and normalizing the sum value by a softmax non-linearity function in a softmax layer in the neural network model to obtain an output probability for the word.

16. The computer-implemented method of claim 15 wherein the K sub-vectors are respectively selected from K non-overlap sub-vector sets.

17. The computer-implemented method of claim 15 wherein the K sub-vectors are uniformly mapped.

18. The computer-implemented method of claim 15 wherein the neural network model is a recurrent neural model.

19. The computer-implemented method of claim 15 wherein the K sub-vectors are estimated by pre-training an output embedding matrix and are assigned using a clustering method.

20. The computer-implemented method of claim 19 wherein the K sub-vectors are shared with an input word embedding layer of the neural network model.
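The output-layer computation recited in claim 15 can be sketched roughly as follows. This is an illustrative reading of the claim, not the patented implementation: the function names, the toy sizes, the random mapping, and the choice of one separate sub-vector set per part position (in the spirit of claim 16) are assumptions made for the example.

```python
import math
import random


def split(vector, num_parts):
    """Divide a vector evenly into num_parts contiguous parts."""
    size = len(vector) // num_parts
    return [vector[k * size:(k + 1) * size] for k in range(num_parts)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slim_output_probs(hidden, mapping, sub_vector_sets):
    """Softmax over words using stored partial dot products.

    sub_vector_sets[k] holds the shared sub-vectors for part position k;
    mapping[w][k] is the index of the sub-vector that word w uses there.
    """
    num_parts = len(sub_vector_sets)
    hidden_parts = split(hidden, num_parts)
    # Compute each (hidden part, sub-vector) dot product once and store it.
    partial = [[dot(hidden_parts[k], sv) for sv in sub_vector_sets[k]]
               for k in range(num_parts)]
    # A word's logit is the sum of the stored partials selected by its mapping row.
    logits = [sum(partial[k][row[k]] for k in range(num_parts))
              for row in mapping]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy sizes: V words, hidden size D, K parts, 3 sub-vectors per set.
rng = random.Random(0)
V, D, K, M_PER_SET = 6, 4, 2, 3
sub_vector_sets = [[[rng.uniform(-1, 1) for _ in range(D // K)]
                    for _ in range(M_PER_SET)] for _ in range(K)]
mapping = [[rng.randrange(M_PER_SET) for _ in range(K)] for _ in range(V)]
hidden = [rng.uniform(-1, 1) for _ in range(D)]
probs = slim_output_probs(hidden, mapping, sub_vector_sets)
```

In this sketch the dot-product work scales with the number of shared sub-vectors rather than with the vocabulary size; each word then costs only K table lookups and additions before the softmax.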

Assignees

Baidu USA LLC

Inventors

Classifications

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Supervised learning

  • Quantised networks; Sparse networks; Compressed networks

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11030997B2 cover?
Described herein are systems and methods for compressing or otherwise reducing the memory requirements for storing and computing the model parameters in recurrent neural language models. Embodiments include space compression methodologies that share the structured parameters at the input embedding layer, the output embedding layers, or both of a recurrent neural language model to significantly …
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/123. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 8, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).