Navigational ranking for focused crawling

US9922119B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9922119-B2
Application numberUS-74120410-A
CountryUS
Kind codeB2
Filing dateNov 8, 2007
Priority dateNov 8, 2007
Publication dateMar 20, 2018
Grant dateMar 20, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods of navigational ranking for focused crawling are disclosed. In an exemplary embodiment, a method may include using a classifier to distinguish at least one target web page from other web pages on a website. The method may also include modeling the web pages on the website by a directed graph G=(V, E), wherein each web page is represented by a vertex (V), and a link between two web pages is represented by an edge (E). The method may also include assigning each web page (u) in V is assigned a weight p(u) based on the classifier to calculate a navigational ranking indicating relevance of a web page.

First claim

Opening claim text (preview).

The invention claimed is: 1. A method comprising: accessing, by a computing device, a plurality of web pages; and determining, by the computing device, for each web page of the plurality of web pages, a navigational rank to a root node by, (a) determining a number of links pointing to the plurality of web pages; (b) setting an initial navigational rank of the web page to an initial value; (c) identifying navigational ranks of the web pages to which the web page points; (d) calculating a navigational rank of the web page based on a weight p(u) assigned to the web page (u) and an average of the identified navigational ranks of the web pages to which the web page points divided by the determined number of links pointing to the plurality of web pages; (e) normalizing the calculated navigational ranks of the web pages so that the calculated navigational ranks average to a certain numerical value; (f) determining whether a difference between the normalized navigational rank and the initial navigational rank of the web page is below a predefined error bound; (g) in response to the determined difference being less than the predefined error bound, setting the navigational rank of the web page to be the calculated navigational rank; (h) in response to the determined difference being below the predefined error bound; and repeating (c) to (h) for the web page based on the calculated navigational ranks of the web pages until the differences between the normalized navigational ranks and the initial navigational ranks of the plurality of web pages are each below the predefined error bound. 2. The method of claim 1 , wherein a higher weight p(u) corresponds to higher relevance of the web page to the root node. 3. The method of claim 1 , further comprising assigning each web page of the plurality of web pages a respective weight p(u) that corresponds to a likelihood that the web page is a target web page of the root node. 4. The method of claim 1 wherein the weight p(u) is a binary number or a real number. 5. The method of claim 1 wherein calculating the navigational rank further comprises calculating the navigational rank according to a static model, wherein under the static model, the plurality of web pages are downloaded and the navigational ranks of the plurality of web pages are determined from the downloaded plurality of web pages. 6. The method of claim 5 , further comprising calculating the navigational rank according to the static model where domain-specific crawling is implemented on many similar sites. 7. The method of claim 5 , further comprising determining the navigational rank of each of the plurality of web pages according to an active model based on the static model. 8. The method of claim 7 , wherein in the active model, a structure for each individual website is determined by dynamically adjusting a navigational rank of a node while crawling the website. 9. The method of claim 8 , wherein the active model is used in focused crawling for calculating navigational rank in real-time as subsets of the website are downloaded. 10. The method of claim 7 , wherein determining the navigational rank according to the active model is based on subsets of a website sequentially downloaded from the website to generate a graph of the website. 11. The method of claim 1 , further comprising modeling the plurality of web pages by a directed graph, wherein each of the plurality of web pages is represented by a vertex and links between the plurality of web pages are represented by edges. 12. The method of claim 1 , wherein calculating the navigational rank comprises using a one-direction score propagation strategy, from offspring to ancestor web pages. 13. The method of claim 1 , further comprising discovering a target page by following a shortest path from the root node and the navigational ranks of the web pages. 14. The method of claim 1 , further comprising approximating a navigational rank of each page encountering while crawling a new website, wherein a web page from the new website having a highest navigational rank is expanded. 15. A system of navigational ranking for focused crawling, comprising: a processor; a non-transitory computer-readable data storage medium storing program code executable by the processor to: access a plurality of web pages on a web site; assign each web page of the plurality of web pages a weight indicating relevance of the web page to a root web page; for each web page of the plurality of web pages, determine a navigational rank to the root web page, wherein to determine the navigational rank of each web page to, (a) determine a number of links pointing to the plurality of web pages; (b) set an initial navigational rank of the web page to an initial value; (c) identify navigational ranks of the web pages to which the web page points; (d) calculate a navigational rank of the web page based on a weight assigned to the web page and an average of the identified navigational ranks of the web pages to which the web page points divided by the determined number of links pointing to the plurality of web pages; (e) normalize the calculated navigational ranks of the web pages so that the calculated navigational ranks average to a certain numerical value; (f) determine whether a difference between the normalized navigational rank and the initial navigational rank of the web page is below a predefined error bound: (g) in response to the determined difference being less than the predefined error bound, set the navigational rank of the web page to be the calculated navigational rank; (h) in response to the determined difference being below the predefined error bound; and repeat (c) to (h) for the web page based on the calculated navigational ranks of the web pages until the differences between the normalized navigational ranks and the initial navigational ranks of the plurality of web pages are each below the predefined error bound. 16. The system of claim 15 , wherein a higher weight corresponds to higher relevance of the web page to the root web page. 17. The system of claim 15 , wherein the program code is further to cause the processor to assign each web page of the plurality of web pages a respective weight that corresponds to a relevance that the web page is a target web page of the root web page. 18. The system of claim 15 wherein the weight is a binary number or a real number. 19. The system of claim 15 , wherein to calculate the navigational rank, the program code is further to cause the processor to calculate the navigational rank according to a static model, wherein under the static model, the plurality of web pages are downloaded and the navigations ranks of the plurality of web pages are determined from the downloaded plurality of web pages. 20. The system of claim 15 , wherein the program code is further to cause the processor to determine the navigational rank of each of the plurality of web pages according to an active model. 21. The system of claim 20 , wherein to determine the navigational rank according to the active model, the program code is further to cause the processor to determine the navigational rankings based on subsets of the website sequentially downloaded from the website to generate a graph of the website. 22. A non-transitory computer-readable data storage medium on which is stored machine readable instructions that when executed by a processor cause the processor to: access a plurality of web pages on a website;

Assignees

Inventors

Classifications

  • G06F16/95Primary

    Retrieval from the web · CPC title

  • Trees, e.g. B+trees · CPC title

  • Indexing structures · CPC title

  • Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking · CPC title

  • Browsing optimisation, e.g. caching or content distillation · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9922119B2 cover?
Systems and methods of navigational ranking for focused crawling are disclosed. In an exemplary embodiment, a method may include using a classifier to distinguish at least one target web page from other web pages on a website. The method may also include modeling the web pages on the website by a directed graph G=(V, E), wherein each web page is represented by a vertex (V), and a link between t…
Who is the assignee on this patent?
Zhang Li, Feng Shi Cong, Xiong Yuhong, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F16/95. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Mar 20 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).