System and method for automatically summarizing documents pertaining to a predefined domain

US11074303B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11074303-B2
Application number  US-201815984479-A
Country             US
Kind code           B2
Filing date         May 21, 2018
Priority date       May 21, 2018
Publication date    Jul 27, 2021
Grant date          Jul 27, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed is a system for automatically summarizing documents pertaining to a predefined domain. A document finder module enables a web crawler to crawl web resources in order to find a plurality of documents. A keyword determination module determines a set of keywords from the plurality of documents and a rank associated to each keyword of the set of keywords. A clustering module clusters the plurality of documents into one or more clusters. A score computation module identifies a subset of the set of keywords for each cluster upon computing a similarity score, corresponding to each keyword, for each cluster. A summary generation module generates a summary for each cluster based on presence of one or more keywords, of the subset, in each document classified in the cluster.
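
The keyword determination step above assigns a rank to each keyword, and claim 1 ties that rank to a keyword extraction algorithm "based on text rank". As an illustration only, the following is a minimal Python sketch of TextRank-style keyword ranking: words that co-occur within a small window are linked in a graph, and a PageRank-style iteration scores them. The function name, stopword list, window size, damping factor, and iteration count are all assumptions made for this example; the patent text shown here does not specify them.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; the patent names none).
STOPWORDS = {"the", "and", "for", "are", "with", "from", "that", "this", "each"}


def textrank_keywords(text, window=2, damping=0.85, iterations=30, top_k=5):
    """Rank words with a TextRank-style walk over a co-occurrence graph."""
    words = [
        w for w in re.findall(r"[a-z]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    ]

    # Link each word to the distinct words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # PageRank-style update: a word inherits score from its neighbors,
    # each neighbor splitting its score evenly across its own links.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    sample = (
        "A clustering module clusters documents. "
        "A summary module summarizes documents in each cluster."
    )
    for word, score in textrank_keywords(sample):
        print(f"{word}: {score:.3f}")
```

The returned list pairs each keyword with its score, so the position in the list can serve as the rank in this sketch. A production system would likely add part-of-speech filtering and phrase merging, neither of which is shown here.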

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for automatically summarizing documents pertaining to a predefined domain, the method comprising: enabling, by a processor, a web crawler to crawl web resources in order to find a plurality of documents associated to a plurality of predefined domains; determining, by the processor, a set of keywords, relevant to each predefined domain, from the plurality of documents found by the web crawler, and a rank associated to each keyword of the set of keywords, wherein the set of keywords and the rank are determined by using at least one keyword extraction algorithm based on text rank; clustering, by the processor, the plurality of documents into one or more clusters by extracting a set of features for each document in order to make Deep Convolution Neural Networks (Deep CNN) learn the association of each document with one or more predefined domains, and classifying each document into a cluster based on the set of features learnt by the Deep CNN; wherein the set of features are referred as n-gram and are generated using convolution or filter of different sizes, identifying, by the processor, a subset of the set of keywords for each cluster upon computing a similarity score, corresponding to each keyword, for each cluster, wherein the similarity score indicates relevance of a keyword with the cluster; and generating, by the processor, a summary for each cluster based on presence of one or more keywords, of the subset, in each document classified in the cluster thereby automatically summarizing documents pertaining to the predefined domain and keep the user informed about a latest updates regarding a update in technological subject, learning, by the processor, when the system exposed to new environment, wherein learning comprises an active learning and a reinforcement learning, wherein the reinforcement learning is used to train an AI system and agents share information with other agents in terms of model parameters to cluster the plurality of documents into one or more clusters, each bot uses a reinforcement learning method to train the AI system and a label generation module uses the active learning, wherein based on reinforcement learning, details of one or more clusters allow users to provide feedback to vet assignation of the plurality of documents onto the one or more clusters.

2. The method as claimed in claim 1, wherein the at least one keyword extraction algorithm comprises Computational linguistic techniques including Term Frequency-Inverse Document Frequency (TF-IDF).

3. The method as claimed in claim 1, wherein the similarity score is computed based on the rank, determined, and a Part of speech score predefined for each keyword.

4. The method as claimed in claim 1, wherein the set of features comprises Number of title words, Number of phrase relevant to title, sentence location, context meaning of information associated to a document.

5. The method as claimed in claim 1, wherein the plurality of documents is clustered into the one or more clusters upon applying a Best Match algorithm on information described therein each document.

6. The method as claimed in claim 1, wherein the summary for each cluster is generated by, identifying a set of sentences, from each document classified in the cluster, having the presence of the one or more keywords of the subset; computing a confidence score corresponding to each sentence of the set of sentences, wherein the confidence score is computed based on a frequency of occurrence pertaining to each of the one or more keywords in the set of sentences and uniqueness of each sentence in the set of sentences; determining a set of candidate sentences from the set of sentences based on confidence score; and generating the summary based on the set of candidate sentences.

7. A system for automatically summarizing documents pertaining to a predefined domain, the system comprising: a processor; and a memory coupled to the processor, wherein the processor is capable of executing a plurality of modules stored in the memory, and wherein the plurality of modules comprising: a document finder module for enabling a web crawler to crawl web resources in order to find a plurality of documents associated to a plurality of predefined domains; a keyword determination module for determining a set of keywords, relevant to each predefined domain, from the plurality of documents found by the web crawler, and a rank associated to each keyword of the set of keywords, wherein the set of keywords and the rank are determined by using at least one keyword extraction algorithm based on text rank; a clustering module for clustering the plurality of documents into one or more clusters by, extracting a set of features for each document in order to make Deep Convolution Neural Networks (Deep CNN) learn the association of each document with one or more predefined domains, and classifying each document into a cluster based on the set of features learnt by the Deep CNN, wherein the set of features are referred as n-gram and are generated using convolution or filter of different sizes; a score computation module for identifying a subset of the set of keywords for each cluster upon computing a similarity score, corresponding to each keyword, for each cluster, wherein the similarity score indicates relevance of a keyword with the cluster; and a summary generation module for generating a summary for each cluster based on presence of one or more keywords, of the subset, in each document classified in the cluster thereby automatically summarizing documents pertaining to the predefined domain and keep the user informed about a latest updates regarding a updates in technological subject; learning, by the processor, when the system exposed to new environment, wherein learning comprises an active learning and a reinforcement learning, wherein the reinforcement learning is used to train an AI system and agents share information with other agents in terms of model parameters to cluster the plurality of documents into one or more clusters, each bot uses a reinforcement learning method to train the AI system and a label generation module uses the active learning, wherein based on reinforcement learning, details of the one or more clusters allow users to provide feedback to vet assignation of the plurality of documents onto the one or more clusters.

8. The system as claimed in claim 7, wherein the score computation module computes the similarity score based on the rank, determined, and a Part of speech score predefined for each keyword.

9. The system as claimed in claim 7, wherein the clustering module clusters the plurality of documents into the one or more clusters upon applying a Best Match algorithm on information described therein each document.

10. The system as claimed in claim 7, wherein the summary generation module generates the summary for each cluster is generated by, identifying a set of sentences, from each document classified in the cluster, having the presence of the one or more keywords of the subset; computing a confidence score corresponding to each sentence of the set of sentences, wherein the confidence score is computed based on a frequency of occurrence pertaining to each of the one or more keywords in the set of sentences and uniqueness of each sentence in the set of sentences; determining a set of candidate sentences from the set of sentences based on confidence score; and generating the summary based on the set of candidate sentences.

11. A non-transitory computer readable medium embodying a program executable in a computing device for automatically summarizing documents pertaining to a predefined domain, the program comprising:
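
Claims 6 and 10 describe the summary step as four operations: select sentences that contain cluster keywords, compute a confidence score from keyword frequency and sentence uniqueness, pick candidate sentences by score, and build the summary from them. The Python sketch below is one hedged reading of those steps, not the patented implementation. The claims do not give a formula, so the choices here are assumptions: keyword frequency is the share of keyword occurrences a sentence covers, uniqueness is one minus the highest Jaccard overlap with any earlier selected sentence, and the two are multiplied. All function and parameter names are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased alphanumeric tokens of a sentence."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def jaccard(a, b):
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize_cluster(documents, keywords, max_sentences=2):
    """Extractive summary of one cluster, following the steps of claim 6."""
    keywords = {k.lower() for k in keywords}

    # Step 1: keep sentences that contain at least one cluster keyword.
    sentences = [
        s for doc in documents for s in split_sentences(doc)
        if keywords & set(tokens(s))
    ]

    # Keyword frequency across the whole set of selected sentences.
    freq = Counter(t for s in sentences for t in tokens(s) if t in keywords)
    total = sum(freq.values()) or 1

    # Step 2: confidence = keyword frequency share * uniqueness (assumed formula).
    scored = []
    for i, sentence in enumerate(sentences):
        toks = set(tokens(sentence))
        freq_score = sum(freq[k] for k in toks & keywords) / total
        overlap = max(
            (jaccard(toks, set(tokens(prev))) for prev in sentences[:i]),
            default=0.0,
        )
        scored.append((freq_score * (1.0 - overlap), i, sentence))

    # Step 3: candidate sentences are the highest-scoring ones.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]

    # Step 4: emit candidates in their original order.
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))


if __name__ == "__main__":
    docs = [
        "Crawlers find documents. The weather is nice.",
        "Crawlers find documents. Keywords rank documents by relevance.",
    ]
    print(summarize_cluster(docs, {"documents", "keywords"}))
```

Measuring uniqueness only against earlier sentences means the first copy of a repeated sentence keeps its score while later copies drop to zero, which is one simple way to avoid duplicated lines in the summary.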

Assignees

HCL Technologies Ltd

Inventors

Classifications

  • Combinations of networks · CPC title

  • G06N3/08 · Primary

    Learning methods · CPC title

  • Reinforcement learning · CPC title

  • Transfer learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11074303B2 cover?
Disclosed is a system for automatically summarizing documents pertaining to a predefined domain. A document finder module enables a web crawler to crawl web resources in order to find a plurality of documents. A keyword determination module determines a set of keywords from the plurality of documents and a rank associated to each keyword of the set of keywords. A clustering module clusters the …
Who is the assignee on this patent?
HCL Technologies Ltd
What technology area does this patent fall under?
Primary CPC classification G06N3/08. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 27, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).