Method of automated discovery of new topics

US9626623B2 · US · B2

Patent metadata
Field · Value
Publication number · US-9626623-B2
Application number · US-201514919631-A
Country · US
Kind code · B2
Filing date · Oct 21, 2015
Priority date · Dec 2, 2013
Publication date · Apr 18, 2017
Grant date · Apr 18, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to a method for performing automated discovery of new topics from unlimited documents related to any subject domain, employing a multi-component extension of Latent Dirichlet Allocation (MC-LDA) topic models, to discover related topics in a corpus. The resulting data may contain millions of term vectors from any subject domain identifying the most distinguished co-occurring topics that users may be interested in, for periodically building new topic ID models using new content, which may be employed to compare one by one with existing model to measure the significance of changes, using term vectors differences with no correlation with a Periodic New Model, for periodic updates of automated discovery of new topics, which may be used to build a new topic ID model in-memory database to allow query-time linking on massive data-set for automated discovery of new topics.
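The comparison step described in the abstract, where term vectors of a periodic model are checked for "no correlation" with an existing model, can be illustrated with a minimal Python sketch. The abstract does not say how correlation is measured, so the use of cosine similarity, the `max_similarity` cutoff, the dictionary representation of term vectors, and the function names `cosine` and `find_new_topics` are all assumptions made for illustration. The sample topics are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term vectors (term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_new_topics(periodic_model, master_model, max_similarity=0.1):
    """Return IDs of periodic topics with no close match in the master model.

    A topic counts as new when its best cosine similarity against every
    master topic is at most max_similarity (an assumed, tunable cutoff).
    """
    new_ids = []
    for topic_id, vec in periodic_model.items():
        best = max((cosine(vec, m) for m in master_model.values()), default=0.0)
        if best <= max_similarity:
            new_ids.append(topic_id)
    return new_ids


# Invented example data: each topic is a small term vector.
master = {
    "m1": {"court": 0.5, "judge": 0.3, "trial": 0.2},
    "m2": {"election": 0.6, "vote": 0.4},
}
periodic = {
    "p1": {"court": 0.4, "judge": 0.4, "appeal": 0.2},  # overlaps with m1
    "p2": {"drone": 0.7, "battery": 0.3},               # shares no terms
}

new_topic_ids = find_new_topics(periodic, master)
print(new_topic_ids)
```

Here topic `p2` shares no terms with any master topic, so it is the only one flagged as new. A real system would obtain the term vectors from a trained topic model rather than from hand-written dictionaries.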

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: identifying, by a computer, in one or more document corpora of a data source, a topic of interest based upon one or more concurring topics identified in the one or more document corpora; automatically extracting, by the computer, from a document corpus, data associated with a plurality of co-occurring topics based on the topic of interest; in response to automatically extracting the data associated with the plurality of co-occurring topics, extracting, by the computer, a plurality of topic identifiers from the plurality of co-occurring topics; generating, by the computer, a periodic topic model comprising a set of one or more term vectors by comparing topic significance among the plurality of topic identifiers; periodically creating, by the computer, new topic ID models using data content in the periodic topic model by identifying a similarity of topics, wherein the new topic ID models are stored in an in-memory database; and linking, by the computer, data in the in-memory database for automated discovery of new topics.

2. The method of claim 1, further comprising determining, by the computer, a relationship of corresponding term vectors from the plurality of co-occurring topics, each co-occurring topic of the plurality of co-occurring topics containing one or more term vectors.

3. The method of claim 2, further comprising generating, by the computer, a master topic computer model comprising a first set of one or more term vectors identified in text of the document corpus upon determining the relationship of the corresponding term vectors from the plurality of co-occurring topics.

4. The method of claim 3, further comprising selecting, by the computer, one or more new topics by identifying one or more term vectors from the set of the one or more term vectors in the periodic topic computer model that has no correlation with the first set of one or more term vectors in the master topic computer model.

5. The method of claim 3, further comprising adding, via the computer, one or more new topics to the master topic computer model.

6. The method of claim 1, wherein comparing the topic significance among the plurality of topic identifiers is based on a predetermined significance threshold.

7. The method of claim 3, wherein the master topic computer model is a multi-component extension of a Latent Dirichlet Allocation (MC-LDA) topic model.

8. The method of claim 1, wherein the periodic topic computer model is a multi-component extension of a Latent Dirichlet Allocation (MC-LDA) topic model.

9. The method of claim 1, wherein the set of the one or more term vectors in the periodic topic computer model corresponds to a second set of the one or more term vectors.

10. A system comprising: a database source computer module configured to extract data associated with a plurality of co-occurring topics in a document corpus; and one or more computers comprising one or more processors configured to: identify, in the document corpus stored in the database source, an indication of a topic of interest; automatically extract from a document corpus, data associated with a plurality of co-occurring topics based on the topic of interest; extract a plurality of topic identifiers from the plurality of co-occurring topics in response to the extracting of the data associated with the plurality of co-occurring topics; create a periodic topic model comprising a set of one or more term vectors by comparing topic significance among the plurality of topic identifiers; periodically create new topic ID models using data content in the periodic topic model by identifying a similarity of topics, wherein the new topic ID models are stored in an in-memory database; and link data in the in-memory database for automated discovery of new topics.

11. The system of claim 10, wherein the one or more computers are further configured to determine a relationship of corresponding term vectors from the plurality of co-occurring topics where each co-occurring topic of the plurality of co-occurring topics containing one or more term vectors.

12. The system of claim 11, wherein the one or more computers are further configured to generate a master topic computer model comprising a first set of one or more term vectors identified in text of the document corpus upon determining the relationship of the corresponding term vectors from the plurality of co-occurring topics.

13. The system of claim 12, wherein the one or more computers are further configured to select one or more new topics by identifying one or more term vectors from the set of the one or more term vectors in the periodic topic model that has no correlation with the first set of one or more term vectors in the master topic computer model.

14. The system of claim 12, wherein the one or more computers are further configured to add one or more new topics to the master topic computer model.

15. The system of claim 10, wherein comparing the topic significance among the plurality of topic identifiers is based on a predetermined significance threshold.

16. The system of claim 12, wherein the master topic computer model is a multi-component extension of a Latent Dirichlet Allocation (MC-LDA) topic model.

17. The system of claim 10, wherein the periodic topic computer model is a multi-component extension of a Latent Dirichlet Allocation (MC-LDA) topic model.

18. The system of claim 10, wherein the set of the one or more term vectors in the periodic topic computer model corresponds to a second set of the one or more term vectors.
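Two elements of the claims lend themselves to a short illustration: the predetermined significance threshold (claims 6 and 15) and the storing and linking of topic ID models in an in-memory database (claims 1 and 10). The Python sketch below is one possible reading, not the patented implementation. The claims do not define how significance is scored or how linking works, so the numeric scores, the `build_periodic_model` helper, the `InMemoryTopicStore` class, its term index, and the whitespace tokenization are all hypothetical stand-ins. A plain dictionary plays the role of the in-memory database.

```python
from collections import defaultdict


def build_periodic_model(topic_scores, topic_terms, threshold):
    """Keep only topics whose significance score meets the threshold."""
    return {
        topic_id: topic_terms[topic_id]
        for topic_id, score in topic_scores.items()
        if score >= threshold
    }


class InMemoryTopicStore:
    """Toy stand-in for an in-memory database of topic ID models."""

    def __init__(self):
        self.models = {}                    # model name -> {topic id -> term vector}
        self.term_index = defaultdict(set)  # term -> {(model name, topic id)}

    def add_model(self, name, model):
        """Store a model and index each of its terms for fast lookup."""
        self.models[name] = model
        for topic_id, vec in model.items():
            for term in vec:
                self.term_index[term].add((name, topic_id))

    def link(self, text):
        """Return sorted (model name, topic id) pairs whose terms occur in text."""
        hits = set()
        for token in text.lower().split():
            hits |= self.term_index.get(token, set())
        return sorted(hits)


# Invented scores and term vectors for three candidate topics.
scores = {"t1": 0.9, "t2": 0.2, "t3": 0.6}
terms = {
    "t1": {"drone": 0.7, "battery": 0.3},
    "t2": {"the": 0.9},
    "t3": {"court": 0.5, "judge": 0.5},
}

periodic = build_periodic_model(scores, terms, threshold=0.5)

store = InMemoryTopicStore()
store.add_model("2015-10", periodic)

linked = store.link("Judge rules on drone flights")
print(linked)
```

With a threshold of 0.5, topic `t2` is dropped and the query text links to `t1` and `t3` through the terms "drone" and "judge". Indexing terms at insert time keeps each lookup cheap at query time, which is one plausible way to read "query-time linking" in the abstract.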

Assignees

Qbase Llc

Inventors

Classifications

  • Document management systems · CPC title

  • Parsing · CPC title

  • G06N20/00 · Primary

    Machine learning · CPC title

  • Handling natural language data (speech analysis or synthesis, speech recognition G10L) · CPC title

  • Clustering or classification · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9626623B2 cover?
The present disclosure relates to a method for performing automated discovery of new topics from unlimited documents related to any subject domain, employing a multi-component extension of Latent Dirichlet Allocation (MC-LDA) topic models, to discover related topics in a corpus. The resulting data may contain millions of term vectors from any subject domain identifying the most distinguished co…
Who is the assignee on this patent?
Qbase Llc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 18, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).