Embeddings with classes

US11373042B2 · US · B2

Patent metadata
Publication number: US-11373042-B2
Application number: US-201916702299-A
Country: US
Kind code: B2
Filing date: Dec 3, 2019
Priority date: Dec 13, 2018
Publication date: Jun 28, 2022
Grant date: Jun 28, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are systems and methods for word embeddings to avoid the need to throw out rare words appearing less than a certain number of times in a corpus. Embodiments of the present disclosure involve grouping words into clusters/classes multiple times using different assignments of the vocabulary words to a number of classes. Multiple copies of the training corpus are then generated using the assignments to replace each word with the appropriate class. A word embedding generating model is run on the multiple class corpora to generate multiple class embeddings. An estimate of the gold word embedding matrix is then reconstructed from multiple pairs of assignments, class embeddings, and covariances. Test results show the effectiveness of embodiments of the present disclosure.
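The data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the seeded random-permutation assignment, and the toy sizes are all assumptions, and `placeholder_embedding_model` simply returns random vectors so the pipeline can run end to end (a real embedding trainer would be swapped in). Reconstruction here is the plain average of class embeddings; the covariance-based estimate mentioned in the abstract is not modeled.

```python
import numpy as np

def make_assignment(vocab_size, num_classes, seed):
    """Map each word id to a class id; a different seed gives a different grouping.
    Assumption: a seeded random permutation, not the patent's assignment scheme."""
    rng = np.random.default_rng(seed)
    return rng.permutation(vocab_size) % num_classes

def to_class_corpus(corpus_ids, assignment):
    """Replace every word id in the corpus with its class id."""
    return assignment[np.asarray(corpus_ids)]

def placeholder_embedding_model(class_corpus, num_classes, dim, seed):
    """Hypothetical stand-in for a word embedding generating model.
    It ignores the corpus and returns random vectors, one row per class."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_classes, dim))

def reconstruct_by_averaging(assignments, class_embeddings):
    """Estimate each word's vector as the mean of the embeddings of the
    classes it was assigned to across the n runs."""
    per_run = [emb[a] for a, emb in zip(assignments, class_embeddings)]
    return np.mean(per_run, axis=0)

# Toy sizes chosen for illustration only.
vocab_size, num_classes, dim, n = 20, 7, 4, 3
corpus_ids = np.random.default_rng(0).integers(0, vocab_size, size=200)

assignments = [make_assignment(vocab_size, num_classes, seed=i) for i in range(n)]
class_corpora = [to_class_corpus(corpus_ids, a) for a in assignments]
class_embeddings = [
    placeholder_embedding_model(c, num_classes, dim, seed=i)
    for i, c in enumerate(class_corpora)
]
word_embeddings = reconstruct_by_averaging(assignments, class_embeddings)
print(word_embeddings.shape)  # (20, 4)
```

Each of the n class corpora has a vocabulary of only `num_classes` symbols, so every symbol occurs far more often than a rare word would on its own; that is the motivation the abstract gives for working with classes.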

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for embedding words of a corpus, the method comprising: grouping a vocabulary of multiple words in the corpus into a set of blocks of words; for each block: selecting a number of classes such that each class appears no less than a threshold frequency in the corpus; in n iterations, assigning the words in the block into the number of classes to obtain n assignment matrices, n is an integer larger than 1; given the n assignment matrices, obtaining n input corpuses by replacing words in the corpus with its class identifier; obtaining n class embeddings by running a word embedding generating model on the n input corpuses; and reconstructing an estimated word embedding for the vocabulary based on at least the n class embeddings.

2. The computer-implemented method of claim 1 wherein the vocabulary is sorted by frequency.

3. The computer-implemented method of claim 2 wherein: for each block, assigning each word having a frequency greater than or equal to the threshold frequency of appearance in the corpus to a class by itself; and grouping two or more words with the appearance frequency in the corpus less than the threshold frequency together into a class.

4. The computer-implemented method of claim 1 wherein for each iteration, the assignment of words are performed to reduce intersection between words in a class across iterations.

5. The computer-implemented method of claim 4 wherein the step of assigning is performed by hashing.

6. The computer-implemented method of claim 1 wherein each assignment matrix is a sparse matrix with rows having unit length.

7. The computer-implemented method of claim 1 wherein the number of classes is a prime number.

8. The computer-implemented method of claim 1 wherein the reconstructed word embedding is an average of the n class embeddings.

9. The computer-implemented method of claim 1 wherein the reconstructed word embedding is reconstructed from an implementation of averaging from the n class embeddings, n is a number less than the number of words in the vocabulary divided by the number of classes.

10. The computer-implemented method of claim 1 wherein words from different blocks are not assigned to the same class.

11. A computer-implemented method for embedding words of a corpus comprising a number of words, the method comprising: in n iterations, assigning words in the corpus into a number of classes to obtain n assignment matrices, n is an integer larger than 1, the number of classes is less than the number of words in the corpus; given the n assignment matrices, obtaining n input corpuses by replacing words in the corpus with its class; obtaining n class embeddings by running a word embedding generating model on the n input corpuses; and reconstructing an estimated word embedding for the corpus based at least on the n assignment matrices and the n class embeddings.

12. The computer-implemented method of claim 11 further comprising: comparing the estimated word embedding with a ground truth embedding of the corpus for evaluation.

13. The computer-implemented method of claim 11 wherein for each iteration, the assignment of words are performed to reduce intersection between words in a class across iterations.

14. The computer-implemented method of claim 11 wherein each assignment matrix is a sparse matrix with rows having unit length.

15. The computer-implemented method of claim 11 wherein the number of classes is a prime number.

16. A computer-implemented method for embedding words of a corpus, the method comprising: receiving a block comprising a number of words; determining a number of classes for the block of words, the number of classes is less than the number of words in the block; and generating a class embedding matrix having the number of class embeddings to represent the block of words, the class embedding matrix uses a storage space less than a memory limit of a processor unit, the class embedding matrix is generated by: determining a threshold on frequency based on the number of words in the block and the number of classes; computing a class assignment matrix for the block to assign words in the block into corresponding classes based on an appearing frequency of each word in a corpus and the threshold on frequency; creating, using the class assignment matrix, a copy of the block, where in the copy of the block, each word in the block is replaced by an appropriate class; and running a word embedding generating model on the copy of the block to create the class embedding matrix.

17. The computer-implemented method of claim 16 wherein the processor unit is a graphic processor unit (GPU).

18. The computer-implemented method of claim 16 wherein the assignment matrix is a sparse matrix with rows having unit length.

19. The computer-implemented method of claim 16 wherein the threshold on frequency is set as the smallest integral value greater than a division result using the number of words divided by the number of classes.

20. The computer-implemented method of claim 19 wherein computing the class assignment matrix comprising: assigning each word having an appearing frequency greater than or equal to the threshold on frequency to a class by itself; and assigning two or more words having an appearing frequency less than the threshold on frequency together into a class.
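The frequency-based grouping in claims 3, 19, and 20 and the unit-length assignment matrix in claims 6, 14, and 18 can be illustrated with a small sketch. Everything below is an assumption made for illustration: the function names, the greedy rule that closes a class once its pooled count reaches the threshold, the handling of leftover rare words, and the orientation of the matrix (one row per class, one column per word). The claims call for a sparse matrix; a dense NumPy array is used here only to keep the example short.

```python
import numpy as np

def threshold_from_counts(num_words, num_classes):
    """Smallest integer strictly greater than num_words / num_classes
    (a literal reading of claim 19)."""
    return num_words // num_classes + 1

def assign_by_frequency(freqs, threshold):
    """Return one class id per word.
    Words at or above the threshold get a class of their own; rarer words
    are pooled (hypothetical greedy rule) until the pooled count reaches
    the threshold."""
    freqs = np.asarray(freqs)
    order = np.argsort(-freqs, kind="stable")  # most frequent first
    classes = np.empty(len(freqs), dtype=int)
    next_class, last_rare_class = 0, None
    pending, pending_total = [], 0
    for w in order:
        if freqs[w] >= threshold:
            classes[w] = next_class            # frequent word: class by itself
            next_class += 1
            continue
        pending.append(w)
        pending_total += freqs[w]
        if pending_total >= threshold:         # pooled class is frequent enough
            classes[pending] = next_class
            last_rare_class = next_class
            next_class += 1
            pending, pending_total = [], 0
    if pending:
        # Leftover rare words join the last pooled class; if there is none,
        # they form a class of their own (which may fall below the threshold).
        if last_rare_class is None:
            last_rare_class = next_class
        classes[pending] = last_rare_class
    return classes

def assignment_matrix(classes):
    """Build a (num_classes x num_words) matrix whose rows have unit length."""
    num_classes = int(classes.max()) + 1
    A = np.zeros((num_classes, len(classes)))
    A[classes, np.arange(len(classes))] = 1.0
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    return A

freqs = np.array([50, 40, 3, 3, 2, 2, 1, 1])   # toy corpus counts
classes = assign_by_frequency(freqs, threshold=5)
A = assignment_matrix(classes)
print(classes.tolist())  # [0, 1, 2, 2, 3, 3, 3, 3]
```

In this toy run the two frequent words keep their own classes, while the six rare words are pooled into two classes whose combined counts (6 and 6) both reach the threshold of 5.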

Assignees

Inventors

Classifications

  • G06F16/35 · Primary

    Clustering; Classification · CPC title

  • Semantic analysis · CPC title

  • Inference or reasoning models · CPC title

  • Ontology · CPC title

  • G06F40/284 · Primary

    Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11373042B2 cover?
Described herein are systems and methods for word embeddings to avoid the need to throw out rare words appearing less than a certain number of times in a corpus. Embodiments of the present disclosure involve grouping words into clusters/classes multiple times using different assignments of the vocabulary words to a number of classes. Multiple copies of the training corpus are then generated us…
Who is the assignee on this patent?
Baidu USA LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/35. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 28, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).