System and method for news events detection and visualization
US-2016004764-A1 · Jan 7, 2016 · US
US11663254B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11663254-B2 |
| Application number | US-201715418763-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 29, 2017 |
| Priority date | Jan 29, 2016 |
| Publication date | May 30, 2023 |
| Grant date | May 30, 2023 |
Official abstract text for this publication.
The present invention provides a seeded news event clustering and retrieval system configured to first create a candidate data set of documents, second create a set of initial clusters based on nearness or duplicate similarity status, and third create an aggregate cluster by merging initial clusters with seed documents. The invention generates top-level clusters for news events based on an editorially supplied topical label or “seed” component and generates sub-topic-focused clusters based on an algorithm. The system uses an agglomerative clustering algorithm to gather and structure documents into distinct result sets. Decisions on whether to merge related documents or clusters are made according to similarity of evidence derived from two distinct sources: one relies on a digital signature based on the unstructured text in the document; the other is based on the presence of named entity tags that have been assigned to the document by an event or named entity tagger such as the Thomson Reuters Calais engine/web service.
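The merge decision described in the abstract combines two kinds of evidence: text-derived signatures and named entity tags. The sketch below is one possible reading, not the patented method. It assumes a bag-of-words count vector as a stand-in for the digital signature, cosine similarity for the text side, Jaccard overlap for the tag side, and an equal-weight linear combination. The function names, the `text_weight` and `threshold` parameters, and the document layout (`text`, `tags`) are all hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words counts, an assumed stand-in for the text-derived signature."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Overlap between two sets of named entity tags."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_merge(doc_a, doc_b, text_weight=0.5, threshold=0.4):
    """Combine both evidence sources into one score (weights are assumptions)."""
    text_sim = cosine(term_vector(doc_a["text"]), term_vector(doc_b["text"]))
    tag_sim = jaccard(set(doc_a["tags"]), set(doc_b["tags"]))
    score = text_weight * text_sim + (1 - text_weight) * tag_sim
    return score >= threshold, score
```

In an agglomerative loop, a test like `should_merge` would be applied repeatedly to cluster representatives, merging the closest pair until no pair clears the threshold. The abstract does not state how the two evidence sources are weighted, so the linear combination here is only illustrative.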
Opening claim text (preview).
We claim:
1. A computer-based system connected via a communications network to a plurality of news content sources, the system comprising: a news repository database comprising a primary set of documents and a secondary set of documents, each of the primary set of documents having a story line feature and an assigned and predefined event label; a digital communications interface having an input and an output, the input adapted to retrieve information from the news repository database and receive an input retrieval expression; an event clustering engine adapted to cluster documents about an event and comprising: a data set creation module adapted to load a set of documents for potential news event clustering into a candidate data set, the candidate data set including documents from both the primary set of documents and the secondary set of documents; an initial cluster module adapted to generate digital signature metadata for each document in the set of documents, the digital signature metadata being separate from each document in the set of documents and comprising a data structure including an assigned event label representing a document topical nature derived from unstructured text for each document in the set of documents for the candidate data set, wherein the event label is a document feature stored in the data structure; the initial cluster module adapted to compare the digital signature metadata related to the candidate data set and 1) to identify and remove duplicate documents and 2) to cluster a set of documents from the candidate data set to form an initial cluster, the initial cluster module adapted to form a plurality of initial clusters, wherein each initial cluster is formed based at least in part on matching document event labels; and an aggregate cluster module adapted to execute an algorithmic similarity function to measure similarity between features related to initial clusters formed by the initial cluster module, the features including the event label feature, the aggregate cluster module further adapted to merge in whole or in part one or more initial clusters to form an aggregate cluster about a seed document from the primary set of documents based on measured similarity, wherein merging initial clusters to form the aggregate cluster is based at least in part on a similarity between an event label tagged to the seed document and one or more features associated with the initial clusters; and a retrieval engine comprising: an event identification module adapted to identify an event of interest related to a received input retrieval expression; and a match module adapted to match the identified event of interest with one or more aggregate clusters; wherein the output of the digital communications interface is adapted to output for display at a computing device a representation of an aggregated cluster in response to the received input retrieval expression.
2. The system of claim 1 further comprising a graphic user interface adapted to present a graphic representation of the aggregated cluster set of documents via a display associated with the computing device.
3. The system of claim 1, wherein the data set creation module comprises a recommendation classifier adapted to discriminate among documents to arrive at the candidate data set based on a set of criteria.
4. The system of claim 1, wherein the aggregate cluster module is further adapted to execute an algorithmic similarity function to measure similarity between a set of digital signatures.
5. The system of claim 1, wherein the initial clustering module is adapted to apply heuristic processes based on a set of features to first reduce the number of digital signatures compared in arriving at the initial cluster of document records.
6. The system of claim 1 wherein the data set creation module is further adapted to populate a candidate data set table, the initial cluster module is further adapted to populate an initial cluster table, and the aggregate cluster module is further adapted to populate an aggregate cluster table, wherein the aggregate cluster module applies an algorithm representing a set of document features stored in the initial cluster table to determine merging of initial clusters from the plurality of initial clusters into the aggregate cluster and storing data related to the aggregate cluster into the aggregate cluster table.
7. The system of claim 1 wherein the aggregate cluster module determines merging of clusters from the initial cluster set based on a determined similarity between two or more of: unstructured text contained in content received from the candidate data set; tagged entity names appearing in the candidate data set; and digital signatures derived from unstructured text contained in content from the candidate data set.
8. The system of claim 1 wherein the aggregate cluster module determines merging of clusters by analyzing data structures represented in vector form.
9. The system of claim 8 wherein a first vector representation of a digital signature associated with the unstructured text of a document is term-based and is used to determine a degree of overlap between two document representatives of their clusters and a second vector is tag-based and is associated with the structured text of a document in the cluster and is used to determine a degree of overlap between two document representatives of their clusters.
10. The system of claim 1 wherein the output of the digital communications interface is adapted to output for display at the computing device a graphical representation of an aggregated cluster.
11. A computer-based system connected via a communications network to a plurality of news content sources, the system comprising: a news repository database comprising a primary set of documents and a secondary set of documents, each of the primary set of documents having a story line feature and an assigned and predefined event label; a digital communications interface having an input and an output, the input adapted to retrieve information from the news repository database; an event clustering engine adapted to cluster documents from the news repository database about an event, the event clustering engine comprising: a data set creation module adapted to load a set of documents for potential news event clustering into a candidate data set, the candidate data set including documents from both the primary set of documents and the secondary set of documents; an initial cluster module adapted to generate digital signature metadata for each document in the set of documents, the digital signature metadata being separate from each document in the set of documents and comprising a data structure including an assigned event label representing a document topical nature derived from unstructured text for each document in the set of documents for the candidate data set, wherein the event label is a document feature stored in the data structure; the initial cluster module adapted to compare the digital signature metadata related to the candidate data set and 1) to identify and remove duplicate documents and 2) to cluster a set of documents from the candidate data set to form an initial cluster, the initial cluster module adapted to form a plurality of initial clusters, wherein each initial cluster is formed based at least in part on matching document event labels; and an aggregate cluster module adapted to execute an algorithmic similarity function to measure similarity between features related to initial clusters formed by the initial cluster module, the features including the event label feature, the aggregate cluster module further adapted to merge in whole or in part, based on measured s
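The module sequence recited in claim 1 (signature metadata kept apart from the document, duplicate removal, initial clusters keyed on matching event labels, then a seeded aggregate cluster) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the `Signature` layout, the SHA-1 hash of normalised text as the duplicate test, the word-level Jaccard comparison of event labels, and the `threshold` value are all hypothetical. An exact hash only catches identical text after normalisation; near-duplicate detection would need a different signature.

```python
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Metadata stored separately from the document (hypothetical layout)."""
    doc_id: str
    text_hash: str
    event_label: str


def make_signature(doc_id, text, event_label):
    # Normalise case, punctuation, and whitespace so reformatted copies hash alike.
    normalised = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
    return Signature(doc_id, digest, event_label)


def initial_clusters(signatures):
    """Drop exact duplicates, then group the survivors by event label."""
    seen = set()
    clusters = defaultdict(list)
    for sig in signatures:
        if sig.text_hash in seen:
            continue  # duplicate document: excluded from clustering
        seen.add(sig.text_hash)
        clusters[sig.event_label].append(sig.doc_id)
    return dict(clusters)


def label_similarity(a, b):
    """Jaccard overlap between the word sets of two event labels."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def aggregate_cluster(seed_label, clusters, threshold=0.5):
    """Merge every initial cluster whose label is close enough to the seed's label."""
    merged = []
    for label, doc_ids in clusters.items():
        if label_similarity(seed_label, label) >= threshold:
            merged.extend(doc_ids)
    return merged
```

For example, with a seed labelled `"acme recall"`, initial clusters labelled `"acme recall"` and `"acme vehicle recall"` would be merged under these assumptions, while one labelled `"rate decision"` would be left out.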
Creation or modification of classes or clusters · CPC title
Named entity recognition · CPC title
Calculation of difference between files · CPC title
Query execution (filtering based on additional data G06F16/335) · CPC title
Browsing; Visualisation therefor · CPC title