Self-learning based crawling and rule-based data mining for automatic information extraction

US10762437B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10762437-B2
Application number	US-201615077563-A
Country	US
Kind code	B2
Filing date	Mar 22, 2016
Priority date	Jun 19, 2015
Publication date	Sep 1, 2020
Grant date	Sep 1, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and Systems for automatic information extraction by performing self-learning crawling and rule-based data mining is provided. The method determines existence of crawl policy within input information and performs at least one of front-end crawling, assisted crawling and recursive crawling. Downloaded data set is pre-processed to remove noisy data and subjected to classification rules and decision tree based data mining to extract meaningful information. Performing crawling techniques leads to smaller relevant datasets pertaining to a specific domain from multi-dimensional datasets available in online and offline sources.
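The last two steps of the abstract (noise removal, then rule-based classification) can be illustrated with a minimal sketch. It is not the patented implementation: the rule list, the keyword matching, the definition of "noise" (blank and duplicate records) and the sample product name are all assumptions made for illustration, and the decision-tree stage is left out entirely.

```python
from typing import List, Tuple

# Hypothetical classification rules as (label, keyword) pairs. Specific rules are
# listed before the generic one so that the first match wins.
RULES: List[Tuple[str, str]] = [
    ("end_of_support", "end of support"),  # specific rule (assumed)
    ("release", "released"),               # specific rule (assumed)
    ("product_mention", "version"),        # generic rule (assumed)
]


def preprocess(records: List[str]) -> List[str]:
    """Remove noisy records: collapse whitespace, drop blanks and exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        text = " ".join(record.split())
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def classify(record: str) -> str:
    """Return the label of the first rule whose keyword occurs in the record."""
    lowered = record.lower()
    for label, keyword in RULES:
        if keyword in lowered:
            return label
    return "unclassified"


def extract(records: List[str]) -> List[Tuple[str, str]]:
    """Pre-process the downloaded records, then label each surviving record."""
    return [(record, classify(record)) for record in preprocess(records)]


if __name__ == "__main__":
    downloaded = [
        "  WidgetOS version 2   released ",
        "",
        "WidgetOS version 2 released",
        "WidgetOS 1 reaches End of Support",
    ]
    for record, label in extract(downloaded):
        print(label, "|", record)
```

Ordering the rules from specific to generic is one simple way to avoid conflicts between overlapping rules; the patent text on this page does not say how such conflicts are resolved.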

First claim

Opening claim text (preview).

What is claimed is:
1. A computer implemented method for automatic information extraction comprising: receiving a request for information extraction and retrieving input information from the request; determining existence of a crawl policy wherein such determination is performed on the input information retrieved from the request; performing assisted crawling, for the input information containing the crawl policy and performing recursive crawling for the input information not containing the crawl policy; computing valid paths and links for building a new crawl policy based on the assisted crawling and the recursive crawling, wherein the valid paths and links are computed recursively such that the links in a destination file and a web-page in a current crawling cycle matches with one or more previous crawling attempts; pre-processing dataset containing the new crawl policy obtained after the assisted crawling and the recursive crawling to remove noisy data and to obtain a pre-processed dataset; and subjecting the pre-processed relevant dataset to classification rules and decision tree based data mining to obtain extracted information.
2. The method of claim 1, wherein the request for automatic information extraction can be system based or user-based.
3. The method of claim 1, wherein the input information includes product information as at least one of search template and pattern template.
4. The method of claim 3, wherein the search template includes product names for which information is to be extracted.
5. The method of claim 3, wherein the pattern template includes patterns leading to destination web-pages.
6. The method of claim 1, wherein the assisted crawling is performed based on prioritization policy.
7. The method of claim 1, further comprising: determining whether the valid paths and links computed during recursive crawling need to be saved for future attempts of assisted crawling.
8. The method of claim 1, wherein a complexity of assisted crawling is determined according to an expression: Complexity: O(n×m + n×k), where O is a notation of complexity, n is a number of products to update, k is a number of web-pages downloaded and m is a complexity involved for crawling.
9. The method of claim 1, wherein a complexity of assisted crawling is a function of a number of products to update, a number of web-pages downloaded and a complexity involved for crawling.
10. The method of claim 1, wherein a complexity of recursive crawling is determined according to an expression: Complexity: O(n×v² + n×k), where O is a notation of complexity, n is a number of products to update, k is a number of web-pages downloaded and v is a number of hops required to reach the destination source for recursive crawling.
11. The method of claim 1, wherein complexity of recursive crawling is a function of the number of products to update, the number of web-pages downloaded and the number of hops required to reach the destination source for recursive crawling.
12. The method of claim 1, wherein the classification rules includes generic rules and specific rules.
13. The method of claim 1, wherein new rules can be formulated and added to classification rules.
14. The method of claim 1, further comprising intimating non-existence of the crawl policy.
15. The method of claim 1, further comprising: intimating any conflict caused due to one or more classification rules; and extracting information irrespective of type of file formats.
16. The method of claim 1, further comprising: receiving a request for information retrieval and retrieving input information from the request; providing the input information for performing front-end crawling; pre-processing dataset obtained after the front-end crawling to remove noisy data and to obtain the pre-processed dataset; and subjecting the pre-processed relevant dataset to classification rules and decision tree based data mining to extract information.
17. The method of claim 16, wherein the input information includes configuration files containing data dictionary for mapping data source.
18. The method of claim 16, wherein a complexity of front-end crawling is determined according to an expression: Complexity: O(n×m + n×k), where O is a notation of complexity, n is a number of products to update, k is a number of web-pages downloaded and m is a complexity involved for front-end crawling through website or a number of hops to arrive at destination source.
19. The method of claim 16, wherein a complexity of front-end crawling is a function of a number of products to update, a number of web-pages downloaded and a complexity involved for front-end crawling through website or a number of hops to arrive at destination source.
20. A computer implemented system for automatic information extraction comprising: an input module for receiving a request for information extraction and retrieving information from the request; a data source for information extraction; one or more processor configured to: responsive to the request for information extraction: determine the existence of a crawl policy, wherein such determination is performed within input information retrieved from the request; perform at least one of the front-end crawling, assisted crawling and recursive crawling on the input information; computing valid paths and links for building a new crawl policy based on the assisted crawling and the recursive crawling, wherein the valid paths and links are computed recursively such that the links in a destination file and a web-page in a current crawling cycle matches with one or more previous crawling attempts; pre-processing dataset containing the new crawl policy obtained after the assisted crawling and the recursive crawling to remove noisy data and to obtain a pre-processed dataset; an extractor to subject pre-processed data to classification rules and decision tree based data mining techniques; and an output module to provide extracted information.
21. The system of claim 20, wherein data source include online and offline sources.
22. A non-transitory computer readable medium embodying a program executable in a computing device for automatic information extraction, the program comprising: receiving a request for information extraction and retrieving input information from the request; determining existence of a crawl policy wherein such determination is performed on the input information retrieved from the request; performing assisted crawling, for the input information containing the crawl policy and performing recursive crawling for the input information not containing the crawl policy; computing valid paths and links for building a new crawl policy based on the assisted crawling and the recursive crawling, wherein the valid paths and links are computed recursively such that the links in a destination file and a web-page in a current crawling cycle matches with one or more previous crawling attempts; pre-processing dataset containing the new crawl policy obtained after the assisted crawling and the recursive crawling to remove noisy data and to obtain a pre-processed dataset; and subjecting the pre-processed relevant dataset to classification rules and decision tree based data mining to obtain extracted information.
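The "self-learning" idea in the independent claims (recursive crawling when no crawl policy exists, saving the valid path as a new policy, assisted crawling on later requests) can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed method: the in-memory `SITE` link graph, the function names and the use of breadth-first search in place of the patent's recursive crawling are all hypothetical, and no real pages are fetched.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory "site": page -> outgoing links (stands in for HTTP fetches).
SITE: Dict[str, List[str]] = {
    "/": ["/about", "/products"],
    "/about": [],
    "/products": ["/products/widget", "/products/gadget"],
    "/products/widget": [],
    "/products/gadget": [],
}


def recursive_crawl(start: str, target: str) -> Tuple[Optional[List[str]], int]:
    """Search outward from start (breadth-first); return (path, pages visited)."""
    queue = deque([[start]])
    seen = {start}
    visited = 0
    while queue:
        path = queue.popleft()
        page = path[-1]
        visited += 1
        if page == target:
            return path, visited
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(path + [link])
    return None, visited


def assisted_crawl(policy: List[str]) -> Tuple[Optional[List[str]], int]:
    """Follow a stored path hop by hop; return (None, n) if a hop no longer exists."""
    visited = 1  # the first page of the stored path
    for current, nxt in zip(policy, policy[1:]):
        if nxt not in SITE.get(current, []):
            return None, visited  # stored policy is stale
        visited += 1
    return policy, visited


def crawl(target: str, policies: Dict[str, List[str]]) -> Tuple[str, int]:
    """Use a stored policy when one exists; otherwise search and save the valid path."""
    policy = policies.get(target)
    if policy is not None:
        path, visited = assisted_crawl(policy)
        if path is not None:
            return "assisted", visited
    path, visited = recursive_crawl("/", target)
    if path is not None:
        policies[target] = path  # learned crawl policy for future requests
    return "recursive", visited


if __name__ == "__main__":
    learned: Dict[str, List[str]] = {}
    print(crawl("/products/gadget", learned))  # first request: no policy yet
    print(crawl("/products/gadget", learned))  # second request: reuses saved path
```

In this toy graph the second request touches fewer pages than the first, which is consistent in spirit with the claims giving recursive crawling a term that grows with the square of the hop count v, while assisted crawling depends on a crawl complexity m.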

Assignees

Tata Consultancy Services Ltd

Inventors

Classifications

G06F16/951 · Primary
Indexing; Web crawling techniques · CPC title
G06N20/00 · Primary
Machine learning · CPC title
G06F16/95
Retrieval from the web · CPC title
G06N5/045
Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence · CPC title
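For readers unfamiliar with CPC symbols, the sketch below splits a symbol such as G06F16/951 into its hierarchy levels. It also shows why this page reports "Physics" as the technology area: the leading letter G is the CPC section for Physics. The regular expression and the field names are assumptions made for illustration, not an official parser, and the title for section Y is abbreviated.

```python
import re
from typing import Dict

# CPC section letters and their titles (Y abbreviated).
CPC_SECTIONS: Dict[str, str] = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "General Tagging of New Technological Developments",
}

# Assumed layout: section letter, two-digit class, subclass letter,
# main group digits, "/", subgroup digits.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,4})/(\d{2,6})$")


def parse_cpc(symbol: str) -> Dict[str, str]:
    """Split a CPC symbol into section, class, subclass, main group and subgroup."""
    match = CPC_PATTERN.match(symbol.strip())
    if match is None:
        raise ValueError(f"not a recognisable CPC symbol: {symbol!r}")
    section, cls, subclass, main_group, subgroup = match.groups()
    return {
        "section": section,
        "section_title": CPC_SECTIONS[section],
        "class": section + cls,
        "subclass": section + cls + subclass,
        "main_group": main_group,
        "subgroup": subgroup,
    }


if __name__ == "__main__":
    for code in ["G06F16/951", "G06N20/00", "G06F16/95", "G06N5/045"]:
        print(parse_cpc(code))
```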

Patent family

Related publications grouped by family.

View patent family 55587127

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10762437B2 cover?: Methods and Systems for automatic information extraction by performing self-learning crawling and rule-based data mining is provided. The method determines existence of crawl policy within input information and performs at least one of front-end crawling, assisted crawling and recursive crawling. Downloaded data set is pre-processed to remove noisy data and subjected to classification rules and…
Who is the assignee on this patent?: Tata Consultancy Services Ltd
What technology area does this patent fall under?: Primary CPC classification G06F16/951. Mapped technology areas include Physics.
When was this patent published?: Publication date Sep 1, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).