Annotating HTML segments with functional labels

US9594730B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9594730-B2
Application number  US-82926510-A
Country             US
Kind code           B2
Filing date         Jul 1, 2010
Priority date       Jul 1, 2010
Publication date    Mar 14, 2017
Grant date          Mar 14, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and apparatus are described for assigning functional labels to segments of web pages in an application-independent way. In the approach described herein, one of a generic set of functional labels is automatically assigned to each segment of a web page, where the generic functional labels may be topic-independent and application-independent. Applications with different needs can determine which segments of the web page to process based on which functional labels correspond to the types of information needed by each application. Thus, the work of classifying the function of each segment of a web page is separated from the work of selecting which segments satisfy the need of a particular application. The work of classification can be performed in an application-independent way, relieving every application developer of the burden of creating their own classifiers.
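
The separation the abstract describes can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Segment` type, the label strings, and `select_segments` are hypothetical names, and the labels are assumed to have been assigned beforehand by some application-independent classifier.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A web-page segment whose functional labels are already assigned (hypothetical type)."""
    text: str
    labels: set[str] = field(default_factory=set)


def select_segments(segments, wanted_labels):
    """Return the segments that carry at least one label the application asks for."""
    wanted = set(wanted_labels)
    return [s for s in segments if s.labels & wanted]


# Labels are assigned once, independently of any application (hard-coded here).
page = [
    Segment("Home | News | Sports", {"site_navigation"}),
    Segment("Local team wins the championship after extra time.", {"main_content"}),
    Segment("Buy tickets now!", {"advertisement"}),
    Segment("Great game! (fan42)", {"user_generated_content"}),
]

# Two applications with different needs choose different label sets.
indexer_segments = select_segments(page, {"main_content"})
ad_auditor_segments = select_segments(page, {"advertisement"})

print([s.text for s in indexer_segments])
print([s.text for s in ad_auditor_segments])
```

Neither application needs its own classifier here; each one only states which functional labels it cares about.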

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    processing a web page to determine a plurality of segments, wherein each segment from the plurality of segments includes one or more HTML elements;
    each machine-based classifier of a plurality of machine-based classifiers generating, based at least upon metadata associated with two or more segments from the plurality of segments that indicates one or more presentation features in the HTML elements of the two or more segments from the plurality of segments, a probability output for each segment of the two or more segments from the plurality of segments, wherein each functional category from the plurality of functional categories corresponds to a functional role of HTML elements in the web page;
    wherein each machine-based classifier from the plurality of machine-based classifiers corresponds to a functional category from the plurality of functional categories;
    assigning, based on the plurality of probability output, one or more functional categories to each segment of the two or more segments;
    a first application selecting a first set of functional categories from the plurality of functional categories;
    a second application that is different than the first application selecting a second set of functional categories from the plurality of functional categories, wherein the second set of functional categories does not include functional categories from the first set of functional categories;
    the first application selecting for processing, based upon the first set of functional categories and the functional categories assigned to the two or more segments, a first set of one or more segments from the two or more segments;
    the second application selecting for processing, based upon the second set of functional categories and the functional categories assigned to the two or more segments, a second set of one or more segments from the two or more segments, wherein the second set of one or more segments includes at least one segment that is not in the first set of one or more segments and the first set of one or more segments includes at least one segment that is not in the second set of one or more segments;
    the first application processing content contained in the first set of one or more segments and not processing content contained in the second set of one or more segments;
    the second application processing content contained in the second set of one or more segments and not processing content contained in the first set of one or more segments; and
    wherein the method is performed by one or more computing devices.

2. The method of claim 1, wherein the first set of one or more segments does not include a segment assigned a main content functional category, wherein segments assigned the main content functional category contain the most text that is relevant to the topic of the web page.

3. The method of claim 1, wherein the second application indexes the web page for a search engine by ignoring the content of each segment from the two or more segments that is assigned an advertisement functional category or a site navigation functional category.

4. The method of claim 1, wherein the functional categories assigned to the two or more segments are independent of a topic of one or more topics associated with the web page.

5. The method of claim 1, wherein the plurality of functional categories includes at least one of: a) user-generated content; b) site navigation; or c) boiler-plate.

6. The method of claim 1, wherein the first application determines a topic for the web page by processing the content contained in the first set of one or more segments, wherein the first set of one or more segments includes segments that are assigned a functional category of: a) main content; b) content pointers; and c) site navigation; wherein the first set of one or more segments does not include segments that are assigned a functional category of a) advertisement; or b) user-generated content.

7. The method of claim 1, wherein the first application determines a topic for the web page by analyzing a frequency of words that appear as text within the first set of one or more segments; and wherein the frequency of words that appear as text within segments of the first set of one or more segments assigned an advertisement functional category or user-generated content functional category are not analyzed.

8. The method of claim 1, wherein the metadata associated with the two or more segments includes at least one of: a) amount of screen space taken to display the segment; b) height or width of the screen space taken to display the segment; c) number of hyperlinks contained within the segment; d) amount of text within the segment; e) ratio of hyperlinks to text within the segment; f) types of html elements contained within the segment; or g) color of the section and whether the color of the segment is different from the color of other segments.

9. The method of claim 1, wherein assigning one or more functional categories further comprises: training a machine-based classifier based on the metadata associated with the two or more segments and functional category data provided by human editors; the machine-based classifier improving the accuracy of functional category assignment based on regression techniques.

10. The method of claim 1, wherein the first application is a web crawler and the first set of one or more segments comprises a set of segments that are assigned a main content functional category.

11. One or more non-transitory computer-readable media storing instructions which, when processed by one or more processors, cause:
    processing a web page to determine a plurality of segments, wherein each segment from the plurality of segments includes one or more HTML elements;
    each machine-based classifier of a plurality of machine-based classifiers generating, based at least upon metadata associated with two or more segments from the plurality of segments that indicates one or more presentation features in the HTML elements of the two or more segments from the plurality of segments, a probability output for each segment of the two or more segments from the plurality of segments, wherein each functional category from the plurality of functional categories corresponds to a functional role of HTML elements in the web page;
    wherein each machine-based classifier from the plurality of machine-based classifiers corresponds to a functional category from the plurality of functional categories;
    assigning, based on the plurality of probability output, one or more functional categories to each segment of the two or more segments;
    a first application selecting a first set of functional categories from the plurality of functional categories;
    a second application that is different than the first application selecting a second set of functional categories from the plurality of functional categories, wherein the second set of functional categories does not include functional categories from the first set of functional categories;
    the first application selecting for processing, based upon the first set of functional categories and the functional categories assigned to the two or more segments, a first set of one or more segments from the two or more segments;
    the second application selecting for processing, based upon the second set of functional categories and the functional categories assigned to the two or more segments, a second set of one or more segments from the two or more segments, wherein the second set of one or more segments includes at least one segment that is not in the first set of one or more segments and the first set of one or more seg…
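
The classification step in the claims (one classifier per functional category, each producing a probability from presentation metadata) can be sketched roughly as below. This is an assumption-laden illustration: the two features are loosely based on items c) to e) of claim 8, the logistic form is a stand-in for whatever trained model is actually used, and every weight, threshold, and name (`SegmentFeatures`, `CLASSIFIERS`, `assign_categories`) is invented for the example rather than taken from the patent.

```python
import math
from html.parser import HTMLParser


class SegmentFeatures(HTMLParser):
    """Collect a few presentation features from one segment's HTML."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())


def extract_features(segment_html):
    parser = SegmentFeatures()
    parser.feed(segment_html)
    parser.close()
    # Links per 100 characters of text; max() avoids division by zero.
    link_density = 100.0 * parser.link_count / max(parser.text_chars, 1)
    return {"link_density": link_density, "text_chars": parser.text_chars}


# One scorer per functional category. The numbers are made up for illustration;
# in practice they would come from training on editor-labeled segments.
CLASSIFIERS = {
    "site_navigation": {"bias": -2.0, "link_density": 0.5, "text_chars": -0.01},
    "main_content": {"bias": -1.0, "link_density": -0.5, "text_chars": 0.02},
}


def probability(weights, features):
    """Logistic score in (0, 1) for one category."""
    z = weights["bias"] + sum(weights[k] * v for k, v in features.items())
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def assign_categories(segment_html, threshold=0.5):
    """Return (assigned categories, per-category probabilities) for a segment."""
    features = extract_features(segment_html)
    scores = {cat: probability(w, features) for cat, w in CLASSIFIERS.items()}
    return {cat for cat, p in scores.items() if p >= threshold}, scores


nav_html = ('<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li>'
            '<li><a href="/sports">Sports</a></li></ul>')
article_html = ('<div><p>The local team won the championship after extra time, '
                'ending a twenty-year wait for the title.</p><p>Fans filled the '
                'streets to celebrate. <a href="/gallery">See photos</a></p></div>')

print(assign_categories(nav_html)[0])
print(assign_categories(article_html)[0])
```

With these made-up weights, the link-heavy list is assigned `site_navigation` and the text-heavy block is assigned `main_content`. A segment can receive several categories, or none, depending on the threshold.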

Classifications

  • G06F40/137 · Primary

    Hierarchical processing, e.g. outlines · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9594730B2 cover?
A method and apparatus are described for assigning functional labels to segments of web pages in an application-independent way. In the approach described herein, one of a generic set of functional labels is automatically assigned to each segment of a web page, where the generic functional labels may be topic-independent and application-independent. Applications with different needs can determine …
Who is the assignee on this patent?
Rajan Suju, Gaffney Scott J, Punera Kunal, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F40/137. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 14, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).