Event clustering and classification with document embedding

US2018032897A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2018032897-A1
Application number  US-201615219401-A
Country             US
Kind code           A1
Filing date         Jul 26, 2016
Priority date       Jul 26, 2016
Publication date    Feb 1, 2018
Grant date          (not listed)

Abstract

Official abstract text for this publication.

Embedding representation for a document is generated based on clustering words in the document. Representative clusters are selected and weighted sum of the embeddings of the words in the selected clusters is determined as a document embedding. Documents are labeled based on document embeddings. A machine learning algorithm is trained using the documents. The machine learning algorithm predicts a label of a given document based on the given document's document embedding.
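
As a rough illustration of the middle steps of the abstract, the sketch below ranks word clusters against a title vector, keeps the top k, and forms a document embedding as a weighted sum of the remaining word vectors. The toy 2-dimensional vectors, the function names (`cosine`, `vec_sum`, `rank_by_title`, `document_embedding`), the use of each cluster's summed vector for the comparison, and the uniform default weights are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def vec_sum(vectors):
    """Element-wise sum of a non-empty list of vectors."""
    return [sum(col) for col in zip(*vectors)]

def rank_by_title(clusters, title_vectors):
    """Order clusters by similarity of their summed vector to the summed title vector."""
    title = vec_sum(title_vectors)
    return sorted(clusters, key=lambda c: cosine(vec_sum(c), title), reverse=True)

def document_embedding(ranked_clusters, k, weight=lambda vec: 1.0):
    """Weighted sum of the word vectors in the first k clusters.

    `weight` is a hypothetical hook; uniform weights are an assumption.
    """
    words = [vec for cluster in ranked_clusters[:k] for vec in cluster]
    return vec_sum([[weight(vec) * x for x in vec] for vec in words])

# Toy data: each cluster is a list of word vectors.
clusters = [
    [[1.0, 0.0], [0.9, 0.1]],   # close to the title direction
    [[0.0, 1.0], [0.1, 0.9]],   # unrelated topic
    [[0.7, 0.7]],               # in between
]
title_vectors = [[1.0, 0.0], [1.0, 0.2]]

ranked = rank_by_title(clusters, title_vectors)
doc_vec = document_embedding(ranked, k=2)
```

With this toy data the unrelated cluster is ranked last and dropped, so it contributes nothing to `doc_vec`. A non-uniform `weight` function (for example, one based on word frequency) could be passed in without changing the rest of the sketch.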

First claim

Opening claim text (preview).

We claim:

1. A method of training a machine to learn to predict a label for data, the method performed by at least one hardware processor, comprising: receiving a document; creating clusters of words in the document based on cosine similarity of word embeddings of words in the document; responsive to determining that the document has a title, ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title; responsive to determining that the document has no title, ranking the clusters based on compactness of a cluster indicating how closely related the words in the cluster are and a semantic distance of the cluster from other clusters; selecting a top-k number of the ranked clusters; determining a document embedding as a weighted sum of word embeddings of words in the top-k number of ranked clusters, wherein the receiving, the creating, the ranking, the selecting, and the determining of the document embedding are performed for multiple documents; labeling each of the multiple documents; and training a machine learning algorithm based on the multiple documents that are labeled, wherein the training comprises separating the multiple documents as a training set and a test set, and generating a machine learning model that predicts a label for a given document based on the training set and the test set.

2. The method of claim 1, wherein the document is stored with the document embedding.

3. The method of claim 1, wherein the machine learning model comprises a support vector machine.

4. The method of claim 1, further comprising: receiving a given document to label; responsive to determining that the given document does not have a document embedding associated with the given document, generating the document embedding associated with the given document by performing the creating, the ranking, the selecting and the determining on the given document; executing the machine learning model based on the document embedding associated with the given document, the machine learning model predicting a label for the given document.

5. The method of claim 1, wherein the ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title, responsive to determining that the document has a title, comprises: determining cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title; and the selecting comprises selecting the top-k number of clusters determined to be most similar to the title based on the cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title.

6. The method of claim 1, wherein the compactness is measured by a variance of the words in the cluster.

7. The method of claim 1, wherein the creating clusters of words in the document based on cosine similarity of word embeddings of words in the document comprises: for a given word in the document, determining a cosine similarity between a word embedding of the given word and the word embeddings of an existing cluster; responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster meets a defined threshold, placing the given word in the existing cluster; and responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster does not meet the defined threshold, placing the given word in a new cluster.

8. A computer readable storage medium storing a program of instructions executable by a machine to perform a method of training a machine to learn to predict a label for data, the method performed by at least one hardware processor, the method comprising: receiving a document; creating clusters of words in the document based on cosine similarity of word embeddings of words in the document; responsive to determining that the document has a title, ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title; responsive to determining that the document has no title, ranking the clusters based on compactness of a cluster indicating how closely related the words in the cluster are and a semantic distance of the cluster from other clusters; selecting a top-k number of the ranked clusters; determining a document embedding as a weighted sum of word embeddings of words in the top-k number of ranked clusters, wherein the receiving, the creating, the ranking, the selecting, and the determining of the document embedding are performed for multiple documents; labeling each of the multiple documents; and training a machine learning algorithm based on the multiple documents that are labeled, wherein the training comprises separating the multiple documents as a training set and a test set, and generating a machine learning model that predicts a label for a given document based on the training set and the test set.

9. The computer readable storage medium of claim 8, wherein the document is stored with the document embedding.

10. The computer readable storage medium of claim 8, wherein the machine learning model comprises a support vector machine.

11. The computer readable storage medium of claim 8, further comprising: receiving a given document to label; responsive to determining that the given document does not have a document embedding associated with the given document, generating the document embedding associated with the given document by performing the creating, the ranking, the selecting and the determining on the given document; executing the machine learning model based on the document embedding associated with the given document, the machine learning model predicting a label for the given document.

12. The computer readable storage medium of claim 8, wherein the ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title, responsive to determining that the document has a title, comprises: determining cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title; and the selecting comprises selecting the top-k number of clusters determined to be most similar to the title based on the cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title.

13. The computer readable storage medium of claim 8, wherein the compactness is measured by a variance of the words in the cluster.

14. The computer readable storage medium of claim 8, wherein the creating clusters of words in the document based on cosine similarity of word embeddings of words in the document comprises: for a given word in the document, determining a cosine similarity between a word embedding of the given word and the word embeddings of an existing cluster; responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster meets a defined threshold, placing the given word in the existing cluster; and responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster does not meet the defined threshold, placing the given word in a new cluster.

15. A system of training a machine to learn to predict a label fo
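
The threshold-based clustering recited in claim 7 and the no-title ranking of claims 1 and 6 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: comparing a word against a cluster's mean vector, joining the best-matching cluster, the 0.8 threshold, and the score that subtracts variance from the mean cosine distance to other clusters are all hypothetical choices. The claims only state that the threshold is "defined", that compactness is measured by a variance, and that semantic distance from other clusters is taken into account.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean(vectors):
    """Element-wise mean (centroid) of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_words(word_vectors, threshold=0.8):
    """Single pass over the words: join the most similar existing cluster if
    the similarity meets the threshold, otherwise start a new cluster."""
    clusters = []
    for word, vec in word_vectors.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, mean([word_vectors[w] for w in cluster]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([word])
        else:
            best.append(word)
    return clusters

def variance(vectors):
    """Mean squared Euclidean distance of the vectors from their centroid."""
    c = mean(vectors)
    return sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors) / len(vectors)

def rank_without_title(clusters, word_vectors):
    """Assumed score: mean cosine distance to the other centroids minus the
    cluster's variance, so compact and distinctive clusters rank first."""
    centroids = [mean([word_vectors[w] for w in c]) for c in clusters]

    def score(i):
        others = [1.0 - cosine(centroids[i], c)
                  for j, c in enumerate(centroids) if j != i]
        separation = sum(others) / len(others) if others else 0.0
        return separation - variance([word_vectors[w] for w in clusters[i]])

    order = sorted(range(len(clusters)), key=score, reverse=True)
    return [clusters[i] for i in order]

# Toy 2-dimensional word embeddings (illustrative values only).
word_vectors = {
    "storm": [1.0, 0.0],
    "flood": [0.9, 0.2],
    "vote": [0.0, 1.0],
    "election": [0.1, 0.9],
    "rain": [0.95, 0.1],
}
clusters = cluster_words(word_vectors)
ranked = rank_without_title(clusters, word_vectors)
```

On this toy input the weather words and the voting words fall into separate clusters, and the tighter voting cluster ranks first because both clusters are equally far from each other. Other ways of combining compactness and distance (a ratio, or a weighted sum) would fit the claim language equally well.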

Classifications

  • Physics · mapped topic

  • G06N99/005 · Primary

    Physics · mapped topic

  • G06N20/10 · Primary

    using kernel methods, e.g. support vector machines [SVM] · CPC title

  • Knowledge engineering; Knowledge acquisition · CPC title

  • Machine learning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018032897A1 cover?
Embedding representation for a document is generated based on clustering words in the document. Representative clusters are selected and weighted sum of the embeddings of the words in the selected clusters is determined as a document embedding. Documents are labeled based on document embeddings. A machine learning algorithm is trained using the documents. The machine learning algorithm predicts…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N99/005. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 1, 2018 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).