What technology area does this patent fall under?

The primary CPC classification is G06Q30/0204 (Market segmentation). Mapped technology areas include Physics.

When was this patent published?

Publication date Jun 4, 2019 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Method and apparatus for identifying webpage type

US10311120B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10311120-B2
Application number	US-201514627311-A
Country	US
Kind code	B2
Filing date	Feb 20, 2015
Priority date	Aug 22, 2012
Publication date	Jun 4, 2019
Grant date	Jun 4, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Various embodiments provide a method and an apparatus for identifying webpage type. The method includes: judging whether a web address to be classified matches with a webpage classification rule in at least two webpage classification rules; and determining the type of the webpage to be a type corresponding to a webpage classification rule which matches with the web address.
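As a rough illustration of the matching step the abstract describes, the Python sketch below checks a web address against a small list of rules and returns the type of the first rule that matches. The rule shape (a domain plus one regular expression for the first string component after the domain), the example rules, and every name in the code are assumptions made for illustration; the patent does not prescribe them.

```python
import re
from typing import NamedTuple, Optional, Sequence
from urllib.parse import urlsplit


class Rule(NamedTuple):
    # Hypothetical rule shape: a domain, a pattern for the first path
    # component after the domain, and the webpage type it maps to.
    domain: str
    component_pattern: str
    page_type: str


# Illustrative rules only; they do not come from the patent.
RULES = [
    Rule("news.example.com", r"\d+", "article"),
    Rule("video.example.com", r"watch", "video"),
]


def identify_type(url: str, rules: Sequence[Rule] = RULES) -> Optional[str]:
    """Return the type of the first rule matching the URL, else None."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    components = [c for c in parts.path.split("/") if c]
    first = components[0] if components else ""
    for rule in rules:
        if domain == rule.domain and re.fullmatch(rule.component_pattern, first):
            return rule.page_type
    # No rule matched. In claim 1 this is where a classifier trained on
    # web addresses would be used instead.
    return None
```

For example, `identify_type("http://news.example.com/12345/story.html")` returns `"article"` under these assumed rules, while an address on an unlisted domain returns `None`.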

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for identifying webpage type, comprising: at a device having a processor and a screen, reading pre-stored web addresses of a webpage type, obtaining a collection of string components of the web addresses by parsing the web addresses; converging web addresses having at least one identical string component into one group according to a pre-defined converging method to generate multiple groups; determining that a coverage rate of a group meets a requirement in response to a determination that a total number of webpages in the group is smaller than or equal to a first threshold and determining that an identification accuracy of the group meets the requirement in response to a determination that an entropy is smaller than a second threshold; determining the coverage rate and the identification accuracy of the group do not meet the requirement in response to a determination that the total number of webpages in the group is larger than the first threshold or the entropy is larger than or equal to a second threshold; wherein the entropy satisfies E=sum(pi*log(pi)), i=1, 2 . . . , n, wherein n is the total number of webpages in the group, pi is a probability of webpages of a same type occurring in the group; terminating converging in response to the determination that the coverage rate and the identification accuracy meet the requirement; generating a webpage classification rule using the multiple groups and the webpage type, and storing the webpage classification rule into a webpage classification rule base; judging whether a web address of a webpage to be classified matches a webpage classification rule; determining a type of the webpage to be a type corresponding to a webpage classification rule which matches the web address; in response to a judgment that the web address of the webpage to be classified does not match the webpage classification rule, using a classifier trained using a machine learning algorithm based on web addresses to determine the webpage type of the webpage to be classified; extracting a content of the webpage selectively according to the webpage type; and displaying, on the screen, the content to a user in a pre-defined manner corresponding to the webpage type.

2. The method of claim 1, wherein the webpage classification rule comprises a string expression associated with a webpage type, the string expression is extracted from a plurality of first web addresses pre-classified into the webpage type, the string expression describes characteristics shared by the plurality of first web addresses and comprises a first description string component describing a domain name and at least a second description string component describing at least one web address string component sequentially following the domain name in the plurality of first web addresses.

3. The method of claim 2, wherein judging whether the web address of the webpage to be classified matches the webpage classification rule in at least two webpage classification rule components comprises: extracting domain name and another string component from the web address of the webpage to be classified; and judging whether the extracted domain name and string components matches string expression corresponding to the webpage classification rule.

4. The method of claim 3, wherein judging whether the extracted domain name and string components matches string expression corresponding to the webpage classification rule comprises: determining whether the domain name matches the first description string component of the webpage classification rule; and determining whether the another string component matches the second description string component of the webpage classification rule.

5. The method of claim 3, further comprising: storing the web address and the webpage type of the web page to be classified in response to a determination that the domain name and the another string component of the web address of the webpage to be classified match the string expression of the webpage classification rule.

6. The method of claim 2, wherein extracting the string expression comprises: extracting a description of shared characteristics of web address string components in each of at least one tier of the plurality of first web addresses sequentially starting from a tier for domain names of the plurality of first web addresses to obtain at least one description corresponding to the at least one tier; sequentially arranging the at least one description corresponding to the at least one tier according to an order of the at least one tier arranged in the plurality of first web addresses to obtain the string expression.

7. The method of claim 6, wherein extracting the description of shared characteristics of the web address string components in each of at least one tier of the plurality of first web addresses comprises: arranging the plurality of first web addresses into a tree where a node for a web address string component of a first tier which follows a second tier in web addresses serves as a child node of a node for a web address string component of the second tier; and converging a plurality of nodes at a same tier having a same parent node into one node whose value is a description of shared characteristics of the plurality of nodes; wherein the string expression comprises a value of each node which is an only child node of a parent node, the parent node is a node for domain name or a descendant node of the node for domain name.

8. The method of claim 6, wherein extracting the description of shared characteristics of web address string components comprises: converting each of the web address string components into a string according to a pre-determined converting method; and in response to a determination that the web address string components are converted into a same string, determining the string to be the description of shared characteristics of the web address string components.

9. An apparatus for identifying webpage type, comprising: at least one processor; a display screen; and memory for storing computer-readable instructions, wherein the at least one processor, when executing the computer-readable instructions, is configured to: read pre-stored web addresses of a webpage type, obtaining a collection of string components of the web addresses by parsing the web addresses; converge web addresses having at least one identical string component into one group according to a pre-defined converging method to generate multiple groups; determine that a coverage rate of a group meets a requirement in response to a determination that a total number of webpages in the group is smaller than or equal to a first threshold and determining that an identification accuracy of the group meets the requirement in response to a determination that an entropy is smaller than a second threshold; determine the coverage rate and the identification accuracy of the group do not meet the requirement in response to a determination that the total number of webpages in the group is larger than the first threshold or the entropy is larger than or equal to a second threshold; wherein the entropy satisfies E=sum(pi*log(pi)), i=1, 2 . . . , n, wherein n is the total number of webpages in the group, pi is a probability of webpages of a same type occurring in the group; terminate converging in response to the determination that the coverage rate and the identification accuracy meet the requirement; generate a webpage classification rule using the multiple groups and the webpage type, and storing the webpage classification rule into a webpage classification rule base; judge whether a web address of a webpage to be classified matches a we
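The group test recited in claims 1 and 9 (a size check against a first threshold and an entropy check against a second threshold) can be sketched in Python as follows. This is only an illustration under stated assumptions: the claims write the entropy as E=sum(pi*log(pi)), whereas the sketch uses the conventional Shannon form with a leading minus sign so that a lower value means a purer group; it sums over the distinct webpage types found in the group; and the function names and threshold parameters are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def group_entropy(page_types: Sequence[str]) -> float:
    """Shannon entropy (natural log) of the type labels in one group.

    Assumption: uses the conventional -sum(p * log(p)) sign, which
    differs from the literal formula in the claim text.
    """
    total = len(page_types)
    counts = Counter(page_types)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def group_meets_requirement(
    page_types: Sequence[str], max_pages: int, max_entropy: float
) -> bool:
    # Coverage check as worded in the claim: group size <= first threshold.
    # Accuracy check: entropy strictly below the second threshold.
    return len(page_types) <= max_pages and group_entropy(page_types) < max_entropy
```

With hypothetical thresholds `max_pages=5` and `max_entropy=0.5`, a group of three pages that all share one type passes, while a group split evenly between two types (entropy log 2, about 0.693) does not.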

Assignees

Tencent Tech Shenzhen Co Ltd

Inventors

Cai Bing

Classifications

Classification	CPC title	Note
G06Q30/0204	Market segmentation	Primary
G06F16/954	Navigation, e.g. using categorised browsing	Primary
G06F16/958	Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
G06F16/95	Retrieval from the web
G06F16/955	using information identifiers, e.g. uniform resource locators [URL]	Primary

Patent family

Related publications grouped by family.

View patent family 50149442

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10311120B2 cover?: Various embodiments provide a method and an apparatus for identifying webpage type. The method includes: judging whether a web address to be classified matches with a webpage classification rule in at least two webpage classification rules; and determining the type of the webpage to be a type corresponding to a webpage classification rule which matches with the web address.
Who is the assignee on this patent?: Tencent Tech Shenzhen Co Ltd