What technology area does this patent fall under?

Primary CPC classification G06N99/005. Mapped technology areas include Physics.

When was this patent published?

Publication date: Feb 1, 2018 (A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Event clustering and classification with document embedding

US2018032897A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2018032897-A1
Application number	US-201615219401-A
Country	US
Kind code	A1
Filing date	Jul 26, 2016
Priority date	Jul 26, 2016
Publication date	Feb 1, 2018
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embedding representation for a document is generated based on clustering words in the document. Representative clusters are selected and weighted sum of the embeddings of the words in the selected clusters is determined as a document embedding. Documents are labeled based on document embeddings. A machine learning algorithm is trained using the documents. The machine learning algorithm predicts a label of a given document based on the given document's document embedding.
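The weighted-sum step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the weights are chosen, so the function below assumes one weight per selected cluster, shared by every word in that cluster. The function name, the toy 2-dimensional vectors, and the weights are all hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def weighted_sum_embedding(
    clusters: Sequence[Sequence[str]],
    word_vectors: Dict[str, Vector],
    cluster_weights: Sequence[float],
) -> Vector:
    """Combine the word vectors of the selected clusters into one document vector.

    Assumption: each word inherits the weight of its cluster. The abstract only
    says "weighted sum" and leaves the weighting scheme open.
    """
    dim = len(next(iter(word_vectors.values())))
    doc = [0.0] * dim
    for words, weight in zip(clusters, cluster_weights):
        for word in words:
            vec = word_vectors[word]
            for i in range(dim):
                doc[i] += weight * vec[i]
    return doc


# Toy 2-dimensional word vectors, invented for illustration only.
toy_vectors = {"storm": [1.0, 0.0], "flood": [0.5, 0.5], "bank": [0.0, 1.0]}
selected_clusters = [["storm", "flood"], ["bank"]]
doc_vec = weighted_sum_embedding(selected_clusters, toy_vectors, [0.5, 0.25])
print(doc_vec)  # [0.75, 0.5]
```

The resulting vector is what a downstream classifier would consume as the feature representation of the document.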

First claim

Opening claim text (preview).

We claim:

1. A method of training a machine to learn to predict a label for data, the method performed by at least one hardware processor, comprising: receiving a document; creating clusters of words in the document based on cosine similarity of word embeddings of words in the document; responsive to determining that the document has a title, ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title; responsive to determining that the document has no title, ranking the clusters based on compactness of a cluster indicating how closely related the words in the cluster are and a semantic distance of the cluster from other clusters; selecting a top-k number of the ranked clusters; determining a document embedding as a weighted sum of word embeddings of words in the top-k number of ranked clusters, wherein the receiving, the creating, the ranking, the selecting, and the determining of the document embedding are performed for multiple documents; labeling each of the multiple documents; and training a machine learning algorithm based on the multiple documents that are labeled, wherein the training comprises separating the multiple documents as a training set and a test set, and generating a machine learning model that predicts a label for a given document based on the training set and the test set.

2. The method of claim 1, wherein the document is stored with the document embedding.

3. The method of claim 1, wherein the machine learning model comprises a support vector machine.

4. The method of claim 1, further comprising: receiving a given document to label; responsive to determining that the given document does not have a document embedding associated with the given document, generating the document embedding associated with the given document by performing the creating, the ranking, the selecting and the determining on the given document; executing the machine learning model based on the document embedding associated with the given document, the machine learning model predicting a label for the given document.

5. The method of claim 1, wherein the ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title, responsive to determining that the document has a title, comprises: determining cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title; and the selecting comprises selecting the top-k number of clusters determined to be most similar to the title based on the cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title.

6. The method of claim 1, wherein the compactness is measure by a variance of the words in the cluster.

7. The method of claim 1, wherein the creating clusters of words in the document based on cosine similarity of word embeddings of words in the document comprises: for a given word in the document, determining a cosine similarity between a word embedding of the given word and the word embeddings of an existing cluster; responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster meets a defined threshold, placing the given word in the existing cluster; and responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster does not meet the defined threshold, placing the given word in a new cluster.

8. A computer readable storage medium storing a program of instructions executable by a machine to perform a method of training a machine to learn to predict a label for data, the method performed by at least one hardware processor, the method comprising: receiving a document; creating clusters of words in the document based on cosine similarity of word embeddings of words in the document; responsive to determining that the document has a title, ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title; responsive to determining that the document has no title, ranking the clusters based on compactness of a cluster indicating how closely related the words in the cluster are and a semantic distance of the cluster from other clusters; selecting a top-k number of the ranked clusters; determining a document embedding as a weighted sum of word embeddings of words in the top-k number of ranked clusters, wherein the receiving, the creating, the ranking, the selecting, and the determining of the document embedding are performed for multiple documents; labeling each of the multiple documents; and training a machine learning algorithm based on the multiple documents that are labeled, wherein the training comprises separating the multiple documents as a training set and a test set, and generating a machine learning model that predicts a label for a given document based on the training set and the test set.

9. The computer readable storage medium of claim 8, wherein the document is stored with the document embedding.

10. The computer readable storage medium of claim 8, wherein the machine learning model comprises a support vector machine.

11. The computer readable storage medium of claim 8, further comprising: receiving a given document to label; responsive to determining that the given document does not have a document embedding associated with the given document, generating the document embedding associated with the given document by performing the creating, the ranking, the selecting and the determining on the given document; executing the machine learning model based on the document embedding associated with the given document, the machine learning model predicting a label for the given document.

12. The computer readable storage medium of claim 8, wherein the ranking the clusters based on cosine similarity of word embeddings of words in a cluster and word embeddings of words in the title, responsive to determining that the document has a title, comprises: determining cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title; and the selecting comprises selecting the top-k number of clusters determined to be most similar to the title based on the cosine similarity between word embeddings of words in each of the clusters and a sum of word embeddings of words in the title.

13. The computer readable storage medium of claim 8, wherein the compactness is measure by a variance of the words in the cluster.

14. The computer readable storage medium of claim 8, wherein the creating clusters of words in the document based on cosine similarity of word embeddings of words in the document comprises: for a given word in the document, determining a cosine similarity between a word embedding of the given word and the word embeddings of an existing cluster; responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster meets a defined threshold, placing the given word in the existing cluster; and responsive to determining that the cosine similarity between the word embedding of the given word and the word embeddings of the existing cluster does not meet the defined threshold, placing the given word in a new cluster.

15. A system of training a machine to learn to predict a label fo
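The threshold clustering of claim 7 and the title-based ranking of claim 5 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claims do not say how a word is compared against "the word embeddings of an existing cluster", so the sketch summarizes each cluster by the sum of its member vectors, and it places a word in the most similar cluster among those that meet the threshold. All function names, the toy vectors, and the threshold value are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


def cluster_words(
    words: List[str], vectors: Dict[str, Vector], threshold: float
) -> List[List[str]]:
    """Greedy single-pass clustering in the spirit of claim 7."""
    clusters: List[List[str]] = []
    for word in words:
        best_index, best_score = -1, threshold
        for index, members in enumerate(clusters):
            # Assumption: a cluster is summarized by the sum of its member vectors.
            summary = vector_sum([vectors[m] for m in members])
            score = cosine(vectors[word], summary)
            if score >= best_score:
                best_index, best_score = index, score
        if best_index >= 0:
            clusters[best_index].append(word)  # threshold met: join existing cluster
        else:
            clusters.append([word])  # threshold not met: start a new cluster
    return clusters


def top_k_by_title(
    clusters: List[List[str]],
    title_words: List[str],
    vectors: Dict[str, Vector],
    k: int,
) -> List[List[str]]:
    """Rank clusters by similarity to the summed title embedding (claim 5)."""
    title_vec = vector_sum([vectors[w] for w in title_words])
    ranked = sorted(
        clusters,
        key=lambda members: cosine(vector_sum([vectors[m] for m in members]), title_vec),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional word vectors, invented for illustration only.
toy = {"storm": [1.0, 0.0], "flood": [0.9, 0.1], "bank": [0.0, 1.0], "loan": [0.1, 0.9]}
clusters = cluster_words(["storm", "bank", "flood", "loan"], toy, threshold=0.8)
print(clusters)  # [['storm', 'flood'], ['bank', 'loan']]
print(top_k_by_title(clusters, ["flood"], toy, k=1))  # [['storm', 'flood']]
```

The no-title branch of claim 1 (ranking by compactness and semantic distance from other clusters) is not covered by this sketch.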

Assignees

IBM

Inventors

Classifications

G06F17/30705
Physics · mapped topic
G06N99/005 · Primary
Physics · mapped topic
G06N20/10 · Primary
using kernel methods, e.g. support vector machines [SVM] · CPC title
G06N5/022
Knowledge engineering; Knowledge acquisition · CPC title
G06N20/00
Machine learning · CPC title

Patent family

Related publications grouped by family.

View patent family 61010119

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018032897A1 cover?: Embedding representation for a document is generated based on clustering words in the document. Representative clusters are selected and weighted sum of the embeddings of the words in the selected clusters is determined as a document embedding. Documents are labeled based on document embeddings. A machine learning algorithm is trained using the documents. The machine learning algorithm predicts…
Who is the assignee on this patent?: IBM