US2018032897A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2018032897-A1 |
| Application number | US-201615219401-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 26, 2016 |
| Priority date | Jul 26, 2016 |
| Publication date | Feb 1, 2018 |
| Grant date | — |
Official abstract text for this publication.
Embedding representation for a document is generated based on clustering words in the document. Representative clusters are selected and weighted sum of the embeddings of the words in the selected clusters is determined as a document embedding. Documents are labeled based on document embeddings. A machine learning algorithm is trained using the documents. The machine learning algorithm predicts a label of a given document based on the given document's document embedding.
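The abstract's pipeline for a titled document (cluster the words, keep the clusters closest to the title, take a weighted sum) can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation. The function names, the toy two-dimensional embeddings, the 0.8 threshold, the use of a cluster's summed embedding as its representative, and the uniform weights are all assumptions; the publication text shown here does not fix any of them.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def vec_sum(vectors):
    """Element-wise sum of a non-empty list of vectors."""
    return [sum(col) for col in zip(*vectors)]

def cluster_words(words, emb, threshold=0.8):
    """Greedy single pass: a word joins the first cluster whose summed
    embedding meets the threshold, otherwise it starts a new cluster.
    (Comparing against the cluster sum is an assumption.)"""
    clusters = []
    for w in words:
        for c in clusters:
            if cosine(emb[w], vec_sum([emb[x] for x in c])) >= threshold:
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters

def document_embedding(words, title_words, emb, k=1, threshold=0.8):
    """Rank clusters by similarity to the summed title embedding, keep the
    top k, and return a uniformly weighted sum of their word embeddings.
    Assumes `words` is non-empty. Uniform weights are an assumption."""
    clusters = cluster_words(words, emb, threshold)
    title_vec = vec_sum([emb[w] for w in title_words])
    ranked = sorted(
        clusters,
        key=lambda c: cosine(vec_sum([emb[x] for x in c]), title_vec),
        reverse=True,
    )
    chosen = [w for c in ranked[:k] for w in c]
    weight = 1.0 / len(chosen)
    return [weight * s for s in vec_sum([emb[w] for w in chosen])]

# Hypothetical toy embeddings, for illustration only.
emb = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "stock": [0.0, 1.0], "bond": [0.1, 0.9],
    "pets": [1.0, 0.05],
}
doc_vec = document_embedding(["cat", "stock", "dog", "bond"], ["pets"], emb, k=1)
```

With these toy vectors the words split into an animal cluster and a finance cluster, and the title "pets" selects the animal cluster, so `doc_vec` is the average of the "cat" and "dog" embeddings.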
Opening claim text (preview).
We claim:

1. A method of training a machine to learn to predict a label for data, the method performed by at least one hardware processor, comprising: receiving a document; creating clusters of words in the document based on cosine similarity of word embeddings of words in the document; responsive to determining that the document has a title, ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title; responsive to determining that the document has no title, ranking the clusters based on compactness of a cluster indicating how closely related the words in the cluster are and a semantic distance of the cluster from other clusters; selecting a top-k number of the ranked clusters; determining a document embedding as a weighted sum of word embeddings of words in the top-k number of ranked clusters, wherein the receiving, the creating, the ranking, the selecting, and the determining of the document embedding are performed for multiple documents; labeling each of the multiple documents; and training a machine learning algorithm based on the multiple documents that are labeled, wherein the training comprises separating the multiple documents as a training set and a test set, and generating a machine learning model that predicts a label for a given document based on the training set and the test set.

2. The method of claim 1, wherein the document is stored with the document embedding.

3. The method of claim 1, wherein the machine learning model comprises a support vector machine.

4. The method of claim 1, further comprising: receiving a given document to label; responsive to determining that the given document does not have a document embedding associated with the given document, generating the document embedding associated with the given document by performing the creating, the ranking, the selecting and the determining on the given document; executing the machine learning model based on the document embedding associated with the given document, the machine learning model predicting a label for the given document.

5. The method of claim 1, wherein the ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title, responsive to determining that the document has a title, comprises: determining cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title; and the selecting comprises selecting the top-k number of clusters determined to be most similar to the title based on the cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title.

6. The method of claim 1, wherein the compactness is measure by a variance of the words in the cluster.

7. The method of claim 1, wherein the creating clusters of words in the document based on cosine similarity of word embeddings of words in the document comprises: for a given word in the document, determining a cosine similarity between a word embedding of the given word and the word embeddings of an existing cluster; responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster meets a defined threshold, placing the given word in the existing cluster; and responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster does not meet the defined threshold, placing the given word in a new cluster.

8. A computer readable storage medium storing a program of instructions executable by a machine to perform a method of training a machine to learn to predict a label for data, the method performed by at least one hardware processor, the method comprising: receiving a document; creating clusters of words in the document based on cosine similarity of word embeddings of words in the document; responsive to determining that the document has a title, ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title; responsive to determining that the document has no title, ranking the clusters based on compactness of a cluster indicating how closely related the words in the cluster are and a semantic distance of the cluster from other clusters; selecting a top-k number of the ranked clusters; determining a document embedding as a weighted sum of word embeddings of words in the top-k number of ranked clusters, wherein the receiving, the creating, the ranking, the selecting, and the determining of the document embedding are performed for multiple documents; labeling each of the multiple documents; and training a machine learning algorithm based on the multiple documents that are labeled, wherein the training comprises separating the multiple documents as a training set and a test set, and generating a machine learning model that predicts a label for a given document based on the training set and the test set.

9. The computer readable storage medium of claim 8, wherein the document is stored with the document embedding.

10. The computer readable storage medium of claim 8, wherein the machine learning model comprises a support vector machine.

11. The computer readable storage medium of claim 8, further comprising: receiving a given document to label; responsive to determining that the given document does not have a document embedding associated with the given document, generating the document embedding associated with the given document by performing the creating, the ranking, the selecting and the determining on the given document; executing the machine learning model based on the document embedding associated with the given document, the machine learning model predicting a label for the given document.

12. The computer readable storage medium of claim 8, wherein the ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title, responsive to determining that the document has a title, comprises: determining cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title; and the selecting comprises selecting the top-k number of clusters determined to be most similar to the title based on the cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title.

13. The computer readable storage medium of claim 8, wherein the compactness is measure by a variance of the words in the cluster.

14. The computer readable storage medium of claim 8, wherein the creating clusters of words in the document based on cosine similarity of word embeddings of words in the document comprises: for a given word in the document, determining a cosine similarity between a word embedding of the given word and the word embeddings of an existing cluster; responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster meets a defined threshold, placing the given word in the existing cluster; and responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster does not meet the defined threshold, placing the given word in a new cluster.

15. A system of training a machine to learn to predict a label fo
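For documents without a title, the independent claims rank clusters by compactness and by semantic distance from the other clusters, and the dependent claims measure compactness as a variance. The claims shown here do not say how the two quantities are combined, so the sketch below simply subtracts the variance from the mean distance to the other cluster centroids. That scoring rule, the use of cosine distance between centroids, and all function names are assumptions made for illustration.

```python
import math

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def variance(vectors):
    """Mean squared Euclidean distance of the vectors from their centroid."""
    c = centroid(vectors)
    return sum(
        sum((a - b) ** 2 for a, b in zip(v, c)) for v in vectors
    ) / len(vectors)

def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def rank_untitled(clusters, emb):
    """Sort clusters best-first by (mean distance to other centroids) minus
    (variance of the cluster). The combination is a hypothetical choice."""
    cents = [centroid([emb[w] for w in c]) for c in clusters]
    scored = []
    for i, c in enumerate(clusters):
        others = [
            cosine_distance(cents[i], cents[j])
            for j in range(len(clusters)) if j != i
        ]
        separation = sum(others) / len(others) if others else 0.0
        scored.append((separation - variance([emb[w] for w in c]), i))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [clusters[i] for _, i in scored]

# Hypothetical toy embeddings, for illustration only.
emb = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "stock": [0.0, 1.0], "bond": [0.1, 0.9], "tax": [0.5, 1.0],
}
ranked = rank_untitled([["stock", "bond", "tax"], ["cat", "dog"]], emb)
```

With only two clusters the separation term is identical for both, so the tighter cluster ("cat", "dog") ranks first. A top-k selection would then slice the front of `ranked`.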
Physics · mapped topic
using kernel methods, e.g. support vector machines [SVM] · CPC title
Knowledge engineering; Knowledge acquisition · CPC title
Machine learning · CPC title