System and method for identifying website verticals

US9330168B1 · US · B1

Patent metadata
Publication number: US-9330168-B1
Application number: US-201414180273-A
Country: US
Kind code: B1
Filing date: Feb 13, 2014
Priority date: Feb 19, 2010
Publication date: May 3, 2016
Grant date: May 3, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for the categorization of websites are presented. A website is categorized using one or a combination of its domain name and its web page content. The domain name is tokenized, and the tokens compared to categories in a category structure to determine probabilities that the token belongs to each category. Combinations of tokens are similarly compared to the categories. A category may be determined with reference to a vector space in which a training set of websites having known categories is converted according to a methodology into reference vectors containing keyword frequencies. A target website is converted to a target vector using the same methodology, and a distance score of the target vector to each reference vector is calculated. The website represented by the target vector is assigned the category of the reference vector having the lowest distance score.
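The vector-space step in the abstract can be illustrated with a minimal sketch. It assumes raw keyword counts as the vector elements and uses the arccosine-of-cosine-similarity distance recited in claim 12. The function names, the toy vocabulary, and the two reference categories are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def to_vector(scraped_keywords, vocabulary):
    """Count how often each vocabulary keyword appears in the scraped list."""
    counts = Counter(scraped_keywords)
    return [counts[word] for word in vocabulary]


def angular_distance(a, b):
    """Arccosine of the cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an all-zero vector is treated as orthogonal to everything.
        return math.pi / 2
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def categorize(target, references):
    """Return the category of the reference vector with the lowest distance.

    references is a list of (category, vector) pairs.
    """
    best_category, _ = min(
        references, key=lambda ref: angular_distance(target, ref[1])
    )
    return best_category


# Hypothetical vocabulary and training data, for illustration only.
vocabulary = ["hosting", "domain", "recipe", "oven"]
references = [
    ("technology", to_vector(["hosting", "domain", "domain", "hosting"], vocabulary)),
    ("cooking", to_vector(["recipe", "oven", "recipe"], vocabulary)),
]
target = to_vector(["domain", "hosting", "domain", "recipe"], vocabulary)
category = categorize(target, references)
```

The target and every reference vector are built with the same `to_vector` function, mirroring the abstract's requirement that both be converted "using the same methodology".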

First claim

Opening claim text (preview).

The invention claimed is:

1. A method, comprising: receiving, by at least one server communicatively coupled to a network, a list of a plurality of first keywords, the plurality of first keywords obtained by scraping each web page of a plurality of web pages in a target website, each of the plurality of web pages having at least one of the first keywords obtained therefrom; converting, by the at least one server, the list into a target vector representing the target website, the target vector comprising a plurality of elements each associated with a corresponding second keyword of a plurality of second keywords, the plurality of second keywords being selected from a corpus of websites, by: counting the number of times each second keyword of the plurality of second keywords appears in the list to produce a corresponding frequency of appearance of each second keyword in the target website; and storing, in each element of the plurality of elements, the corresponding frequency of appearance of the corresponding second keyword; comparing, by the at least one server, the target vector to a plurality of reference vectors each being assigned one or more categories of a category structure; and assigning, by the at least one server, the assigned one or more categories of the closest matching reference vector to the target website.

2. The method of claim 1, wherein the corpus of websites comprises all publicly available websites on the Internet.

3. The method of claim 1, wherein the corpus of websites comprises all publicly available websites previously categorized in one or more of the categories.

4. The method of claim 1, wherein the corpus of websites comprises all publicly available websites previously categorized in one or more of the categories that are assigned to one or more of the reference vectors.

5. The method of claim 1, wherein the plurality of second keywords is obtained by scraping the websites of the corpus for website data and identifying the second keywords from the website data.

6. The method of claim 5, wherein the corresponding frequency of appearance of each second keyword of the plurality of second keywords is a term frequency-inverse document frequency (TF-IDF) score for the second keyword, the method further comprising calculating, by the at least one server, the TF-IDF score for each of the second keywords from the input and the corpus of websites.

7. The method of claim 6, wherein calculating the TF-IDF score for each of the second keywords comprises: calculating an inverse document frequency (IDF) for the second keyword; counting the number of appearances of the second keyword in the plurality of first keywords obtained from the plurality of web pages in the target website; multiplying the number of appearances of the second keyword by the IDF of the second keyword to obtain the TF-IDF score for the second keyword; and storing the TF-IDF score in the element of the target vector associated with the second keyword.

8. The method of claim 1, further comprising: receiving, by the at least one server, a training set of websites; receiving, by the at least one server, assigned categories for each of the websites in the training set; and converting each of the websites in the training set into one of the reference vectors, each of the reference vectors comprising elements signifying the appearance of the plurality of second keywords on the associated website.

9. The method of claim 8, further comprising calculating, by the at least one server, an inverse document frequency (IDF) for each of the plurality of second keywords by: calculating the number of websites in the corpus that contain the second keyword, the corpus including the websites in the training set and one or both of the input and the target website; calculating a document frequency comprising the number of websites in the corpus that contain the second keyword, divided by the number of websites in the corpus; and calculating the natural logarithm of the document frequency to obtain the IDF.

10. The method of claim 9, wherein converting each of the websites in the training set into a corresponding reference vector of the plurality of reference vectors comprises: creating the corresponding reference vector containing an element for each of the plurality of second keywords; and for each of the second keywords: counting the number of appearances of the second keyword in the website in the training set; multiplying the number of appearances of the second keyword by the IDF of the second keyword to obtain a TF-IDF score for the second keyword; and storing the TF-IDF score in the element of the corresponding reference vector associated with the second keyword.

11. The method of claim 1, wherein comparing the target vector to the plurality of reference vectors comprises calculating a distance score of each of the reference vectors from the target vector, the closest matching reference vector having the lowest distance score.

12. The method of claim 11, wherein calculating the distance score for each reference vector comprises: calculating the target vector norm; calculating the reference vector norm; calculating the dot product of the target vector and the reference vector; dividing the dot product by the product of the target vector norm and the reference vector norm to obtain a cosine similarity value; and calculating the arccosine of the cosine similarity value to obtain the distance score.

13. The method of claim 12, wherein the corpus comprises all publicly available websites on the Internet.

14. A system, comprising: at least one server computer in communication with a network, the at least one server computer including a processor configured to: receive a list of a plurality of first keywords each collected from one of a plurality of web pages of a target website, each of the plurality of web pages having at least one of the first keywords collected therefrom; create a target vector representing the target website, the target vector comprising a plurality of elements each signifying a frequency of appearance of a corresponding second keyword of a plurality of second keywords within the target website, the plurality of second keywords being selected from a corpus of websites; determine, for each second keyword of the plurality of second keywords, a corresponding count of the number of times the second keyword appears in the list; determine, for each element of the plurality of elements, a corresponding value based on the corresponding count of the corresponding second keyword; compare the target vector to a plurality of reference vectors each being assigned one or more categories of a category structure; and assign the assigned one or more categories of the closest matching reference vector to the target website.

15. The system of claim 14, wherein the corpus comprises all publicly available websites on the Internet.

16. The system of claim 14, wherein comparing the target vector to the plurality of reference vectors comprises calculating a distance score of each of the reference vectors from the target vector, the closest matching reference vector having the lowest distance score.

17. The system of claim 14, wherein the processor is further configured to, for each second keyword of the plurality of second keywords: calculate a corresponding inverse document frequency (IDF) for the second keyword; multiply the corresponding count of the second keyword by the IDF of the second keyword to obtain a TF-IDF score for the second keyword; and s
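The TF-IDF weighting recited in claims 7, 9, and 10 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each website is represented as a plain list of scraped keywords, the function names and toy corpus are hypothetical, and the handling of a keyword that appears in no website is an assumption because the claims do not address it.

```python
import math
from collections import Counter


def idf_as_claimed(keyword, corpus):
    """IDF as worded in claim 9: ln(sites containing keyword / total sites).

    corpus is a list of keyword lists, one list per website.
    """
    containing = sum(1 for site in corpus if keyword in site)
    if containing == 0:
        # Assumption: ln(0) is undefined and the claim is silent, so use 0.
        return 0.0
    document_frequency = containing / len(corpus)
    return math.log(document_frequency)


def tfidf_vector(site_keywords, vocabulary, corpus):
    """Claims 7 and 10: multiply each keyword count by that keyword's IDF."""
    counts = Counter(site_keywords)
    return [counts[word] * idf_as_claimed(word, corpus) for word in vocabulary]


# Hypothetical data. Per claim 9, the corpus includes the training-set
# websites and the target website.
training_set = [
    ["hosting", "domain"],
    ["recipe", "oven", "recipe"],
    ["hosting", "recipe"],
]
target_site = ["domain", "domain", "hosting"]
corpus = training_set + [target_site]
vocabulary = ["hosting", "domain", "recipe"]

target_vector = tfidf_vector(target_site, vocabulary, corpus)
reference_vectors = [tfidf_vector(site, vocabulary, corpus) for site in training_set]
```

As worded in claim 9, the logarithm is taken of the document frequency itself, so the resulting values are zero or negative. The textbook definition, ln(N / df), differs only in sign. Because the target and reference vectors are scaled by the same per-keyword IDF, the cosine similarity of claim 12 is the same under either sign convention.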

Assignees

Go Daddy Operating Co Llc

Inventors

Classifications

  • Indexing; Web crawling techniques · CPC title

  • Parsing · CPC title

  • Navigation, e.g. using categorised browsing · CPC title

  • G06F16/355 · Primary

    Creation or modification of classes or clusters · CPC title

  • Price estimation or determination · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9330168B1 cover?
Systems and methods for the categorization of websites are presented. A website is categorized using one or a combination of its domain name and its web page content. The domain name is tokenized, and the tokens compared to categories in a category structure to determine probabilities that the token belongs to each category. Combinations of tokens are similarly compared to the categories. A cat…
Who is the assignee on this patent?
Go Daddy Operating Co Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/355. Mapped technology areas include Physics.
When was this patent published?
Publication date May 3, 2016 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).