Who is the assignee on this patent?

Thomson Reuters Entpr Centre Gmbh

What technology area does this patent fall under?

Primary CPC classification G06F40/194. Mapped technology areas include Physics.

When was this patent published?

Publication date May 30, 2023 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).

System and engine for seeded clustering of news events

US11663254B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11663254-B2
Application number	US-201715418763-A
Country	US
Kind code	B2
Filing date	Jan 29, 2017
Priority date	Jan 29, 2016
Publication date	May 30, 2023
Grant date	May 30, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present invention provides a seeded news event clustering and retrieval system configured to first create a candidate data set of documents, second create a set of initial clusters based on nearness or duplicate similarity status, and third create an aggregate cluster by merging initial clusters with seed documents. The invention generates top-level clusters for news events based on an editorially supplied topical label or “seed” component and generates sub-topic-focused clusters based on algorithm. The system uses an agglomerative clustering algorithm to gather and structure documents into distinct result sets. Decisions on whether to merge related documents or clusters are made according to similarity of evidence derived from two distinct sources, one, relying on a digital signature based on the unstructured text in the document, the other based on the presence of named entity tags that have been assigned to the document by an event or named entity tagger such as the Thomson Reuters Calais engine/web service.
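The abstract describes merge decisions that draw on two sources of evidence: a signature derived from the unstructured text and the named entity tags assigned to a document. The following is a minimal Python sketch of that idea, not the patented implementation. The word-shingle signature, the Jaccard measure, the equal weighting, and the 0.5 threshold are all assumptions made for illustration, and the names `Doc`, `signature`, `jaccard`, and `should_merge` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    entity_tags: frozenset  # tags as an entity tagger might assign them


def signature(text: str, k: int = 2) -> frozenset:
    """Assumed stand-in for a text signature: the set of word k-shingles."""
    words = text.lower().split()
    if len(words) < k:
        return frozenset({" ".join(words)}) if words else frozenset()
    return frozenset(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a, b) -> float:
    """Overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_merge(d1: Doc, d2: Doc, text_weight: float = 0.5,
                 threshold: float = 0.5) -> bool:
    """Combine text-signature and entity-tag similarity (assumed weighting)."""
    text_sim = jaccard(signature(d1.text), signature(d2.text))
    tag_sim = jaccard(d1.entity_tags, d2.entity_tags)
    combined = text_weight * text_sim + (1.0 - text_weight) * tag_sim
    return combined >= threshold


# Made-up example documents, for illustration only.
a = Doc("a", "central bank raises interest rates again",
        frozenset({"Central Bank", "Interest Rates"}))
b = Doc("b", "central bank raises interest rates sharply",
        frozenset({"Central Bank", "Interest Rates"}))
c = Doc("c", "local team wins championship final",
        frozenset({"Championship"}))

print(should_merge(a, b), should_merge(a, c))
```

In an agglomerative setting, a check like `should_merge` would be applied repeatedly to cluster representatives, merging the closest pairs until no pair clears the threshold.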

First claim

Opening claim text (preview).

We claim:

1. A computer-based system connected via a communications network to a plurality of news content sources, the system comprising: a news repository database comprising a primary set of documents and a secondary set of documents, each of the primary set of documents having a story line feature and an assigned and predefined event label; a digital communications interface having an input and an output, the input adapted to retrieve information from the news repository database and receive an input retrieval expression; an event clustering engine adapted to cluster documents about an event and comprising: a data set creation module adapted to load a set of documents for potential news event clustering into a candidate data set, the candidate data set including documents from both the primary set of documents and the secondary set of documents; an initial cluster module adapted to generate digital signature metadata for each document in the set of documents, the digital signature metadata being separate from each document in the set of documents and comprising a data structure including an assigned event label representing a document topical nature derived from unstructured text for each document in the set of documents for the candidate data set, wherein the event label is a document feature stored in the data structure; the initial cluster module adapted to compare the digital signature metadata related to the candidate data set and 1) to identify and remove duplicate documents and 2) to cluster a set of documents from the candidate data set to form an initial cluster, the initial cluster module adapted to form a plurality of initial clusters, wherein each initial cluster is formed based at least in part on matching document event labels; and an aggregate cluster module adapted to execute an algorithmic similarity function to measure similarity between features related to initial clusters formed by the initial cluster module, the features including the event label feature, the aggregate cluster module further adapted to merge in whole or in part one or more initial clusters to form an aggregate cluster about a seed document from the primary set of documents based on measured similarity, wherein merging initial clusters to form the aggregate cluster is based at least in part on a similarity between an event label tagged to the seed document and one or more features associated with the initial clusters; and a retrieval engine comprising: an event identification module adapted to identify an event of interest related to a received input retrieval expression; and a match module adapted to match the identified event of interest with one or more aggregate clusters; wherein the output of the digital communications interface is adapted to output for display at a computing device a representation of an aggregated cluster in response to the received input retrieval expression.

2. The system of claim 1 further comprising a graphic user interface adapted to present a graphic representation of the aggregated cluster set of documents via a display associated with the computing device.

3. The system of claim 1, wherein the data set creation module comprises a recommendation classifier adapted to discriminate among documents to arrive at the candidate data set based on a set of criteria.

4. The system of claim 1, wherein the aggregate cluster module is further adapted to execute an algorithmic similarity function to measure similarity between a set of digital signatures.

5. The system of claim 1, wherein the initial clustering module is adapted to apply heuristic processes based on a set of features to first reduce the number of digital signatures compared in arriving at the initial cluster of document records.

6. The system of claim 1 wherein the data set creation module is further adapted to populate a candidate data set table, the initial cluster module is further adapted to populate an initial cluster table, and the aggregate cluster module is further adapted to populate an aggregate cluster table, wherein the aggregate cluster module applies an algorithm representing a set of document features stored in the initial cluster table to determine merging of initial clusters from the plurality of initial clusters into the aggregate cluster and storing data related to the aggregate cluster into the aggregate cluster table.

7. The system of claim 1 wherein the aggregate cluster module determines merging of clusters from the initial cluster set based on a determined similarity between two or more of: unstructured text contained in content received from the candidate data set; tagged entity names appearing in the candidate data set; and digital signatures derived from unstructured text contained in content from the candidate data set.

8. The system of claim 1 wherein the aggregate cluster module determines merging of clusters by analyzing data structures represented in vector form.

9. The system of claim 8 wherein a first vector representation of a digital signature associated with the unstructured text of a document is term-based and is used to determine a degree of overlap between two document representatives of their clusters and a second vector is tag-based and is associated with the structured text of a document in the cluster and is used to determine a degree of overlap between two document representatives of their clusters.

10. The system of claim 1 wherein the output of the digital communications interface is adapted to output for display at the computing device a graphical representation of an aggregated cluster.

11. A computer-based system connected via a communications network to a plurality of news content sources, the system comprising: a news repository database comprising a primary set of documents and a secondary set of documents, each of the primary set of documents having a story line feature and an assigned and predefined event label; a digital communications interface having an input and an output, the input adapted to retrieve information from the news repository database; an event clustering engine adapted to cluster documents from the news repository database about an event, the event clustering engine comprising: a data set creation module adapted to load a set of documents for potential news event clustering into a candidate data set, the candidate data set including documents from both the primary set of documents and the secondary set of documents; an initial cluster module adapted to generate digital signature metadata for each document in the set of documents, the digital signature metadata being separate from each document in the set of documents and comprising a data structure including an assigned event label representing a document topical nature derived from unstructured text for each document in the set of documents for the candidate data set, wherein the event label is a document feature stored in the data structure; the initial cluster module adapted to compare the digital signature metadata related to the candidate data set and 1) to identify and remove duplicate documents and 2) to cluster a set of documents from the candidate data set to form an initial cluster, the initial cluster module adapted to form a plurality of initial clusters, wherein each initial cluster is formed based at least in part on matching document event labels; and an aggregate cluster module adapted to execute an algorithmic similarity function to measure similarity between features related to initial clusters formed by the initial cluster module, the features including the event label feature, the aggregate cluster module further adapted to merge in whole or in part, based on measured s
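The module sequence recited in claim 1 (candidate data set, initial clusters with duplicate removal, aggregate cluster around a seed document) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the claimed system: the hash-based `digital_signature` only catches exact duplicates after whitespace and case normalization, `label_similarity` is an assumed token-overlap measure, the 0.5 threshold is arbitrary, and all function names and sample documents are hypothetical.

```python
import hashlib
from collections import defaultdict


def digital_signature(text: str) -> str:
    """Assumed signature: SHA-1 of case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def build_initial_clusters(candidates):
    """Drop duplicate documents, then group the rest by matching event label."""
    seen = set()
    clusters = defaultdict(list)
    for doc in candidates:
        sig = digital_signature(doc["text"])
        if sig in seen:
            continue  # duplicate of a document already kept
        seen.add(sig)
        clusters[doc["event_label"]].append(doc["id"])
    return dict(clusters)


def label_similarity(a: str, b: str) -> float:
    """Assumed measure: Jaccard overlap of the label tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def aggregate_around_seed(seed_label, initial_clusters, threshold=0.5):
    """Merge every initial cluster whose label is close enough to the seed's."""
    merged = []
    for label, doc_ids in initial_clusters.items():
        if label_similarity(seed_label, label) >= threshold:
            merged.extend(doc_ids)
    return merged


# Made-up candidate data set mixing a primary (seed) and secondary documents.
candidates = [
    {"id": "p1", "text": "Flood hits coastal city", "event_label": "coastal flood"},
    {"id": "s1", "text": "flood hits  coastal city", "event_label": "coastal flood"},
    {"id": "s2", "text": "Rescue teams respond to flooding",
     "event_label": "coastal flood rescue"},
    {"id": "s3", "text": "Election results announced",
     "event_label": "national election"},
]

initial = build_initial_clusters(candidates)
aggregate = aggregate_around_seed("coastal flood", initial)
print(initial)
print(aggregate)
```

Here `s1` is discarded as a duplicate of `p1`, and the aggregate cluster for the seed label "coastal flood" absorbs the "coastal flood rescue" cluster while leaving the election story out.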

Assignees

Thomson Reuters Entpr Centre Gmbh

Inventors

Classifications

G06F16/355
Creation or modification of classes or clusters · CPC title
G06F40/295
Named entity recognition · CPC title
G06F40/194 · Primary
Calculation of difference between files · CPC title
G06F16/334 · Primary
Query execution (filtering based on additional data G06F16/335) · CPC title
G06F16/358
Browsing; Visualisation therefor · CPC title

Patent family

Related publications grouped by family.

View patent family 59561526
