Unsupervised topic modeling for short texts

US2016110343A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2016110343-A1
Application number  US-201414519427-A
Country             US
Kind code           A1
Filing date         Oct 21, 2014
Priority date       Oct 21, 2014
Publication date    Apr 21, 2016
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Topics are determined for short text messages using an unsupervised topic model. In a training corpus created from a number of short text messages, a vocabulary of words is identified, and for each word a distributed vector representation is obtained by processing windows of the corpus having a fixed length. The corpus is modeled as a Gaussian mixture model in which Gaussian components represent topics. To determine a topic of a sample short text message, a posterior distribution over the corpus topics is obtained using the Gaussian mixture model.
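The first two steps of the abstract (identifying a vocabulary and processing fixed-length windows of the corpus) can be illustrated with a minimal sketch. It is not the patent's implementation: the function names, the whitespace tokenizer, the `min_count` threshold, and the choice to center each window on a target word without crossing message boundaries are all assumptions made for illustration. The pairs it yields are the kind of input a continuous bag of words model would train on.

```python
from collections import Counter


def build_vocabulary(messages, min_count=2):
    """Keep words that occur at least min_count times across all messages.

    min_count is a hypothetical parameter; the claims mention dropping words
    with fewer than five occurrences, but the toy corpus below is too small
    for that threshold.
    """
    counts = Counter(word for msg in messages for word in msg.lower().split())
    return {word for word, n in counts.items() if n >= min_count}


def context_windows(messages, vocab, half_width=2):
    """Yield (context_words, target_word) pairs from windows of the corpus.

    Assumption: each window is centered on a target word and holds up to
    half_width in-vocabulary words on each side, within a single message.
    Windows at the edge of a message are shorter.
    """
    for msg in messages:
        tokens = [w for w in msg.lower().split() if w in vocab]
        for i, target in enumerate(tokens):
            left = tokens[max(0, i - half_width):i]
            right = tokens[i + 1:i + 1 + half_width]
            if left or right:
                yield left + right, target


# Toy corpus of short messages (hypothetical data).
messages = [
    "my phone bill is too high",
    "phone bill payment is due",
    "the game is on tonight",
    "watch the game tonight",
]
vocab = build_vocabulary(messages, min_count=2)
pairs = list(context_windows(messages, vocab, half_width=2))
```

Each pair couples a target word with its neighbors; a word-embedding model would learn one vector per vocabulary word from many such pairs.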

First claim

Opening claim text (preview).

What is claimed is:

1. A method for determining a topic of a sample short text message, comprising: by a computer, identifying a vocabulary of words in a corpus, the corpus comprising a plurality of training short text messages; by the computer, obtaining distributed vector representations of the words in the vocabulary by processing windows of the corpus having a fixed length; by the computer, estimating a plurality of Gaussian components of a Gaussian mixture model of the corpus using the distributed vector representations, the Gaussian components representing corpus topics; by the computer, receiving a sample short text message comprising words in the vocabulary; and by the computer, determining the topic of the sample short text message based on a posterior distribution over the corpus topics for the sample short text message, the posterior distribution obtained using the Gaussian mixture model.

2. The method of claim 1, wherein obtaining distributed vector representations of the words in the vocabulary further comprises applying a continuous bag of words model to process the windows of the corpus.

3. The method of claim 2, wherein applying a continuous bag of words model further comprises using a log-linear model.

4. The method of claim 1, wherein obtaining distributed vector representations of the words in the vocabulary further comprises applying a methodology to process the windows of the corpus, the methodology being selected from a group of methodologies consisting of deep neural network, latent semantic indexing, log-linear model, feedforward neural network, convolutional neural network and recurrent neural network.

5. The method of claim 1 wherein identifying the vocabulary of words in a corpus further comprises using hierarchical sampling to reduce the vocabulary.

6. The method of claim 5 wherein the hierarchical sampling eliminates words having fewer than five occurrences.

7. The method of claim 1 wherein the plurality of training short text messages has an average text length of between 12 and 16 words.

8. The method of claim 1 wherein estimating the plurality of Gaussian components further comprises estimating means, covariances and mixture weights for each Gaussian component using an expectation-maximization algorithm.

9. The method of claim 8 wherein the covariances are estimated using a covariance matrix approximation wherein the covariances are diagonal matrices.

10. The method of claim 1 wherein the sample short text message has fewer than 20 words.

11. The method of claim 1 wherein the posterior distribution over the corpus topics for the short message is determined by evaluating:

$$k^* = \arg\max_{\theta_k} \; p(k) \prod_{i=1}^{N} p(w_i' \mid k)$$

where $k^*$ is a posterior distribution for a topic $k$, $\theta_k$ denotes the parameters for the $k$th Gaussian component of the Gaussian mixture model, $w_i'$ is the $i$th word in the sample short text message and the probabilities $p(k)$ and $p(w_i' \mid k)$ are obtained from the Gaussian mixture model.

12. The method of claim 1, wherein identifying the vocabulary of words in the corpus further comprises representing a phrase of words within the corpus by a single code word to minimize a description length of the corpus.

13. A message topic trend alert system of a communications network, comprising: at least one interface to the communications network configured for receiving short text messages transmitted within the short message communications network; at least one processor; and at least one computer readable storage device having stored thereon computer readable instructions that, when executed by the at least one processor, cause the at least one processor to perform operations for generating an alert based on a message topic trend, comprising: identifying a vocabulary of words in a corpus, the corpus comprising a plurality of training short text messages; obtaining distributed vector representations of the words in the vocabulary by processing windows of the corpus having a fixed length; estimating a plurality of Gaussian components of a Gaussian mixture model of the corpus using the distributed vector representations, the Gaussian components representing corpus topics; receiving a plurality of sample short text messages comprising words in the vocabulary; determining topics of the sample short text messages based on a posterior distribution over the corpus topics for the sample short text messages, the posterior distribution obtained using the Gaussian mixture model; identifying a trend in topics of the short text messages; and generating an alert based on the trend.

14. The system of claim 13, wherein obtaining distributed vector representations of the words in the vocabulary further comprises applying a continuous bag of words model to process the windows of the corpus.

15. The system of claim 14, wherein applying a continuous bag of words model further comprises using a log-linear model.

16. The system of claim 13 wherein identifying the vocabulary of words in a corpus further comprises using hierarchical sampling to reduce the vocabulary.

17. The system of claim 13 wherein estimating the plurality of Gaussian components further comprises estimating means, covariances and mixture weights for each Gaussian component using an expectation-maximization algorithm.

18. The system of claim 13 wherein the posterior distribution over the corpus topics for the short message is determined by evaluating: $k^* = \arg\max_{\theta_k}$ …
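The decision rule in claim 11 multiplies a topic prior by per-word likelihoods and picks the best-scoring topic. A minimal sketch of that rule follows, assuming the Gaussian mixture has already been fitted and uses diagonal covariances as in claim 9. The function names, the 2-D word vectors, and every parameter value are hypothetical; the computation is done in log space only to avoid numerical underflow, which the claims do not specify.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def topic_posterior(message, word_vectors, weights, means, variances):
    """Return (k_star, posterior) for a tokenized message.

    Each topic k is scored by log p(k) + sum_i log p(w_i | k); the scores
    are then normalized with log-sum-exp into a posterior over topics.
    Words without a vector are skipped.
    """
    vectors = [word_vectors[w] for w in message if w in word_vectors]
    scores = [
        math.log(pk) + sum(log_gaussian_diag(v, mean, var) for v in vectors)
        for pk, mean, var in zip(weights, means, variances)
    ]
    top = max(scores)
    norm = top + math.log(sum(math.exp(s - top) for s in scores))
    posterior = [math.exp(s - norm) for s in scores]
    k_star = max(range(len(scores)), key=scores.__getitem__)
    return k_star, posterior


# Toy 2-D word vectors and two topic components (all values hypothetical).
word_vectors = {
    "bill": (1.0, 0.1), "payment": (0.9, 0.0), "phone": (0.8, 0.2),
    "game": (-1.0, 0.1), "score": (-0.9, -0.1), "tonight": (-0.7, 0.2),
}
weights = [0.5, 0.5]                       # mixture weights p(k)
means = [(0.9, 0.1), (-0.9, 0.1)]          # one mean vector per topic
variances = [(0.05, 0.05), (0.05, 0.05)]   # diagonal covariance entries

k_star, posterior = topic_posterior(
    ["phone", "bill", "payment"], word_vectors, weights, means, variances
)
k_game, _ = topic_posterior(
    ["game", "tonight"], word_vectors, weights, means, variances
)
```

Here `k_star` is the index of the most probable topic and `posterior` holds the normalized probability of each topic for the message. Counting `k_star` values over a stream of messages is one way the trend detection of claim 13 could be fed, though the claims do not say how the trend is computed.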

Assignees

Inventors

Classifications

  • H04W4/14 · Primary

    Short messaging services, e.g. short message services [SMS] or unstructured supplementary service data [USSD] · CPC title

  • using neural networks · CPC title

  • G06F40/216 · Primary

    using statistical methods · CPC title

  • Semantic analysis · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016110343A1 cover?
Topics are determined for short text messages using an unsupervised topic model. In a training corpus created from a number of short text messages, a vocabulary of words is identified, and for each word a distributed vector representation is obtained by processing windows of the corpus having a fixed length. The corpus is modeled as a Gaussian mixture model in which Gaussian components represen…
Who is the assignee on this patent?
AT&T Intellectual Property I, L.P.
What technology area does this patent fall under?
Primary CPC classification H04W4/14. Mapped technology areas include Electricity.
When was this patent published?
Publication date Apr 21, 2016 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).