Scalable mining of trending insights from text

US10733221B2 · US · B2

Patent metadata
Publication number: US-10733221-B2
Application number: US-201615085714-A
Country: US
Kind code: B2
Filing date: Mar 30, 2016
Priority date: Mar 30, 2016
Publication date: Aug 4, 2020
Grant date: Aug 4, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for identifying trending topics in a document corpus are provided. First, multiple topics are identified, some of which topics may be filtered or removed based on co-occurrence. Then, for each remaining topic, a frequency of the topic in the document corpus is determined, one or more frequencies of the topic in one or more other document corpora are determined, a trending score of the topic is generated based on the determined frequencies. Lastly, the remaining topics are ranked based on the generated trending scores.
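The frequency, scoring, and ranking steps in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: topics are matched by case-insensitive substring, the score is the relative change of the current frequency against the mean of past frequencies (one reading of claims 4 and 5), and `smoother` is a fixed additive constant loosely inspired by the smoother coefficient of claim 6. The claims describe selecting that coefficient from the past frequencies, which this sketch does not attempt. All function and parameter names are hypothetical.

```python
from collections import Counter
from statistics import mean


def topic_frequencies(docs, topics):
    """Count how many documents mention each topic (assumed: substring match)."""
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        for topic in topics:
            if topic in lowered:
                counts[topic] += 1
    return counts


def trending_score(current, past, smoother=1.0):
    """Relative change of the current frequency versus the mean past frequency.

    `smoother` is an assumed additive constant that damps the score of
    topics whose past frequency is close to zero.
    """
    baseline = mean(past) if past else 0.0
    return (current - baseline) / (baseline + smoother)


def rank_topics(current_docs, past_corpora, topics):
    """Score every topic and return (topic, score) pairs, highest score first."""
    current = topic_frequencies(current_docs, topics)
    history = [topic_frequencies(corpus, topics) for corpus in past_corpora]
    scores = {
        t: trending_score(current[t], [h[t] for h in history]) for t in topics
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    week1 = ["layoffs at acme", "new phone"]
    week2 = ["new phone review"]
    this_week = ["layoffs again", "layoffs news", "new phone"]
    for topic, score in rank_topics(this_week, [week1, week2], ["layoffs", "new phone"]):
        print(f"{topic}: {score:.2f}")
```

In this toy data "layoffs" rises from a mean of 0.5 mentions to 2, while "new phone" stays flat at 1, so "layoffs" ranks first.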

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: storing, in an electronic data store, a plurality of digital documents; accessing the electronic data store to identify a first plurality of topics in the plurality of digital documents; determining a co-occurrence of each pair of topics in a plurality of pairs of topics in the first plurality of topics; based on a deduplication threshold and the co-occurrence of each pair of topics in the plurality of pairs of topics, identifying a strict subset of the plurality of pairs of topics; based on the strict subset of the plurality of pairs of topics, removing multiple topics from the first plurality of topics to identify a second plurality of topics that includes fewer topics than the first plurality of topics; for each topic in the second plurality of topics: determining one or more frequencies of said each topic, wherein determining the one or more frequencies comprises, for each time period of one or more time periods, determining a frequency of said each topic during said each time period; determining a particular frequency of said each topic in a particular time period that is subsequent to the one or more time periods; generating a trending score for said each topic based on the one or more frequencies and the particular frequency; generating a ranking of the second plurality of topics based on the trending score for each topic in the second plurality of topics; causing the second plurality of topics to be arranged on a screen of a computing device based on the ranking of the second plurality of topics; wherein the method is performed by one or more computing devices.

2. The method of claim 1, further comprising: storing a plurality of document corpora, wherein each document corpus of the plurality of document corpora is associated with a different time period of a plurality of time periods that includes the one or more time periods and the particular time period; for a first document corpus of the plurality of document corpora: analyzing the first document corpus to identify a first set of topics, and for each topic in the first set of topics, determining a number of instances, in the first document corpus, of said each topic; for a second document corpus of the plurality of document corpora: analyzing the second document corpus to identify a second set of topics, and for each topic in the second set of topics, determining a number of instances, in the second document corpus, of said each topic.

3. The method of claim 1, wherein: the one or more periods are a plurality of periods; the one or more frequencies are a plurality of frequencies; each frequency in the plurality of frequencies corresponds to a different period of the plurality of periods; generating the trending score comprises generating the trending score based on each individual frequency in the plurality of frequencies and the particular frequency.

4. The method of claim 3, wherein: generating the trending score comprises calculating a difference between the particular frequency and an aggregation of the plurality of frequencies, wherein the aggregation involves computing an average or a median of multiple frequency-related values.

5. The method of claim 4, wherein: generating the trending score comprises calculating a ratio of the difference and the aggregation.

6. The method of claim 1 wherein generating the trending score comprises: selecting, based on the one or more frequencies, a smoother coefficient that reduces the sensitivity of a normalized difference between the particular frequency and a past frequency that is based on the one or more frequencies; generating the trending score based on the smoother coefficient and a difference between the particular frequency and the past frequency.

7. The method of claim 6, wherein generating the trending score comprises: for a first topic in the plurality of topics: determining one or more first frequencies of the first topic; determining a first current frequency of the first topic; selecting, based on the one or more first frequencies, a first smoother coefficient that reduces the sensitivity of a first normalized difference between the first current frequency and a first past frequency that is based on the one or more first frequencies; generating a first trending score based on the first smoother coefficient and a difference between the first current frequency and the first past frequency; for a second topic, in the plurality of topics, that is different than the first topic: determining one or more second frequencies of the second topic; determining a second current frequency of the second topic; selecting, based on the one or more second frequencies, a second smoother coefficient that is different than the first smoother coefficient that reduces the sensitivity of a second normalized difference between the second current frequency and a second past frequency that is based on the one or more second frequencies; generating a second trending score based on the second smoother coefficient and a difference between the second current frequency and the second past frequency.

8. The method of claim 6, further comprising: determining which topics in the plurality of topics were selected based on user input; based on the user input, adjusting a smoother function that generates the smoother coefficient.

9. The method of claim 1, wherein determining the co-occurrence of pairs of topics in the first plurality of topics comprises limiting the determining to the same sentence, wherein a pair of topics co-occur only if both topics appear in the same sentence.

10. The method of claim 1, wherein a document in the plurality of digital documents is a blog post, a comment on an online posting, or a tweet.

11. The method of claim 1, further comprising: for each topic of the first plurality of topics: storing, in a second electronic data store, in association with said each topic, (1) a list of document identifiers, each of which identifies a digital document in which said each topic was detected and (2) a list of section identifiers that correspond to the list of document identifiers and identifies a section, of one of the digital documents identified by a document identifier in the list, in which said each topic was detected; wherein determining the co-occurrence of each pair of topics in the plurality of pairs of topics in the first plurality of topics comprises, for each pair of topics in the plurality of pairs of topics: identifying a first document identifier and a first section identifier of a first topic in said each pair of topics; identifying a second document identifier and a second section identifier of a second topic in said each pair of topics; determining that the first topic and the second topic co-occur in a digital document in response to determining that the first document identifier matches the second document identifier and that the first section identifier matches the second section identifier.

12. A system comprising: one or more processors; one or more storage media storing instructions which, when executed by the one or more instructions, cause: storing, in a database, a plurality of digital documents; accessing the database to identify a first plurality of topics within digital text of the plurality of digital documents; determining a co-occurrence of each pair of topics in a plurality of pairs of topics in the first plurality of topics; based on a deduplication threshold and the co-occurrence of each pair of topics in the plurality of pairs of topics, identifying a strict subset of the plurality of pairs of topics; bas
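The co-occurrence deduplication step of claims 1 and 11 can be sketched as follows. This is an illustrative sketch under assumptions, not the claimed implementation: each topic maps to a set of `(document_id, section_id)` pairs, two topics co-occur wherever both identifiers match, the co-occurrence measure is the shared count divided by the smaller posting set, and the less frequent topic of each flagged pair is removed. The measure, the removal rule, the default threshold, and the name `deduplicate_topics` are all hypothetical; the claim text shown here does not specify them.

```python
from itertools import combinations


def deduplicate_topics(postings, threshold=0.8):
    """Drop near-duplicate topics based on section-level co-occurrence.

    postings: dict mapping topic -> set of (document_id, section_id) pairs.
    Returns (kept_topics, flagged_pairs).
    """
    flagged = []
    for a, b in combinations(sorted(postings), 2):
        # A shared (document_id, section_id) pair means both topics were
        # detected in the same section of the same document.
        shared = len(postings[a] & postings[b])
        smaller = min(len(postings[a]), len(postings[b]))
        if smaller and shared / smaller >= threshold:
            flagged.append((a, b))

    removed = set()
    for a, b in flagged:
        # Assumed rule: keep the more frequent topic; on a tie, drop `b`.
        loser = a if len(postings[a]) < len(postings[b]) else b
        removed.add(loser)

    kept = [topic for topic in postings if topic not in removed]
    return kept, flagged


if __name__ == "__main__":
    postings = {
        "machine learning": {(1, 0), (2, 1), (3, 0)},
        "learning": {(1, 0), (2, 1), (3, 0), (4, 2)},
        "hiring": {(4, 2), (5, 0)},
    }
    kept, flagged = deduplicate_topics(postings)
    print("flagged pairs:", flagged)
    print("kept topics:", kept)
```

Here only the pair ("learning", "machine learning") crosses the threshold, so the flagged pairs form a strict subset of all pairs, and the surviving topic list is shorter than the input list.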

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • G06F40/279 (Primary)

    Recognition of textual entities · CPC title

  • Document management systems · CPC title

  • Indexing; Web crawling techniques · CPC title

  • into predefined classes · CPC title

  • using statistical methods · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10733221B2 cover?
A system and method for identifying trending topics in a document corpus are provided. First, multiple topics are identified, some of which topics may be filtered or removed based on co-occurrence. Then, for each remaining topic, a frequency of the topic in the document corpus is determined, one or more frequencies of the topic in one or more other document corpora are determined, a trending sc…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/279. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 4, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).