Web knowledge extraction for search task simplification

US9020947B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9020947-B2
Application number: US-201113307836-A
Country: US
Kind code: B2
Filing date: Nov 30, 2011
Priority date: Nov 30, 2011
Publication date: Apr 28, 2015
Grant date: Apr 28, 2015

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are described for generating structured information from semi-structured web pages, and retrieving the structured knowledge in response to a user query that indicates a query intent. The structured information is automatically extracted offline from semi-structured web pages, through the use of an auto wrapper solution that is noise tolerant, scalable, and automatic. The structured information is stored in a knowledge base, and provided in response to a user search query that indicates a query intent. Extraction of structured information may also include clustering of pages based on their measured similarities. The clusters may be determined based on similar elements in the tag path text data of the pages. A minimum size threshold may be applied to the clusters.
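The clustering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes each page is represented by a count of the tag paths leading to its text nodes, compares pages with cosine similarity (the measure named in claim 3), and substitutes a simple greedy single-pass grouping for the K-means algorithm of claim 5. The names `TagPathCollector`, `tag_path_vector`, `cluster_pages`, and the default values of `threshold` and `min_size` are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Counts the tag path (e.g. 'html/body/div') of every text node."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.paths["/".join(self.stack)] += 1


def tag_path_vector(html):
    """Sparse vector: tag path -> number of text nodes found under it."""
    parser = TagPathCollector()
    parser.feed(html)
    return parser.paths


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_pages(pages, threshold=0.8, min_size=2):
    """Greedy grouping (an assumed stand-in for K-means).

    A page joins the first cluster whose first member is similar enough;
    clusters smaller than min_size are discarded.
    """
    clusters = []  # each cluster is a list of (page index, vector)
    for i, html in enumerate(pages):
        vec = tag_path_vector(html)
        for cluster in clusters:
            if cosine(vec, cluster[0][1]) >= threshold:
                cluster.append((i, vec))
                break
        else:
            clusters.append([(i, vec)])
    return [[i for i, _ in c] for c in clusters if len(c) >= min_size]


pages = [
    "<html><body><h1>A</h1><div><span>Price</span><span>10</span></div></body></html>",
    "<html><body><h1>B</h1><div><span>Price</span><span>20</span></div></body></html>",
    "<html><body><ul><li>x</li><li>y</li><li>z</li></ul></body></html>",
]
clusters = cluster_pages(pages)
```

In this example the first two pages share the same tag paths and fall into one cluster, while the third page forms a singleton cluster that the minimum size threshold removes.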

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: under control of one or more processors configured with executable instructions: forming a cluster for a plurality of web pages based at least in part on a similar characteristic of the plurality of web pages; transforming data of the cluster to determine one or more occurrence vectors and one or more position vectors for data items of the plurality of web pages; determining a root template for the cluster based on the one or more occurrence vectors; determining one or more detail templates for the cluster based on the one or more position vectors, wherein determining the one or more detail templates comprises: dividing document object model (DOM) trees of the plurality of web pages of the cluster into a plurality of blocks based at least in part on a plurality of tag path text items present in the root template; detecting one or more patterns in each block of the plurality of blocks to identify additional data fields; identifying an additional data field of the additional data fields to be one of a first type or a second type in a block of the plurality of blocks, the identifying enabling the block to be split into a plurality of new blocks, wherein the first type comprises a data field located after a tag path text item which occurs once in at least a predetermined number of web pages and the second type comprises a data field following an equivalence class that occurs one or more times and for which no other tag path is present; and repeating the detecting and the identifying for the plurality of new blocks; and extracting structured information for the cluster based on the root template and the one or more detail templates.

2. The method of claim 1, wherein the forming the cluster includes: determining a document object model (DOM) tree for each of the plurality of web pages; determining a vector representation for each of the plurality of web pages based on one or more tag paths in the DOM trees; calculating a similarity value based on the vector representation; and forming the cluster based on the similarity value.

3. The method of claim 2, wherein the calculating the similarity value is based on a cosine similarity measure.

4. The method of claim 1, wherein the similar characteristic is located in a body of at least some of the plurality of web pages.

5. The method of claim 1, wherein the forming the cluster includes employing a K-means algorithm.

6. The method of claim 1, further comprising determining a list of tag path text items for each of the plurality of web pages, wherein the one or more occurrence vectors and the one or more position vectors are determined based at least in part on the determined lists of tag path text items.

7. The method of claim 1, wherein the structured information includes comparison information.

8. The method of claim 1, wherein the determining the one or more detail templates is recursive.

9. A system comprising: one or more processors; a memory; and a web knowledge extraction component stored in the memory and executable by the one or more processors to: extract one or more occurrence vectors and one or more position vectors for data included in a plurality of pages; determine a root template based on the one or more occurrence vectors; determine one or more detail templates based on the one or more position vectors, wherein determining the one or more detail templates comprises: dividing document object model (DOM) trees of the plurality of pages of the cluster into a plurality of blocks based at least in part on a plurality of tag path text items present in the root template; detecting one or more patterns in each block of the plurality of blocks to identify additional data fields; identifying an additional data field of the additional data fields to be one of a first type or a second type in a block of the plurality of blocks, the identifying enabling the block to be split into a plurality of new blocks, wherein the first type comprises a data field located after a tag path text item which occurs once in at least a predetermined number of pages and the second type comprises a data field following an equivalence class that occurs one or more times and for which no other tag path is present; and repeating the detecting and the identifying for the plurality of new blocks; and provide structured information for the plurality of pages based on the root template and the one or more detail templates.

10. The system of claim 9, further comprising a clustering component stored in the memory and executable by the one or more processors to identify one or more clusters of pages from the plurality of pages.

11. The system of claim 10, wherein identifying the one or more clusters is based at least in part on similarity of at least one tag path text item of the plurality of pages.

12. The system of claim 10, wherein the clustering component further operates to determine whether each of the one or more clusters is larger than a minimum number of pages.

13. The system of claim 9, wherein the plurality of pages are semi-structured web pages.

14. One or more computer-readable storage media, storing instructions that enable a processor to perform actions comprising: automatically transforming data of a plurality of semi-structured pages, to determine one or more occurrence vectors and one or more position vectors; determining at least one root template based at least in part on the one or more occurrence vectors; determining at least one detail template based at least in part on the one or more position vectors, wherein determining the at least one detail template comprises: dividing document object model (DOM) trees of the plurality of semi-structured pages of the cluster into a plurality of blocks based at least in part on a plurality of tag path text items present in the root template; detecting one or more patterns in each block of the plurality of blocks to identify additional data fields; identifying an additional data field of the additional data fields to be one of a first type or a second type in a block of the plurality of blocks, the identifying enabling the block to be split into a plurality of new blocks, wherein the first type comprises a data field located after a tag path text item which occurs once in at least a predetermined number of semi-structured pages and the second type comprises a data field following an equivalence class that occurs one or more times and for which no other tag path is present; and repeating the detecting and the identifying for the plurality of new blocks; and extracting structured information from the at least one root template and the at least one detail template.

15. The one or more computer-readable storage media of claim 14, wherein the actions further comprise storing the structured information in a knowledge base.

16. The one or more computer-readable storage media of claim 14, wherein the actions further comprise providing the structured information in response to a search query based at least in part on a determination of at least one intent indicated by the search query.

17. The one or more computer-readable storage media of claim 14, wherein the actions further comprise clustering the plurality of semi-structured pages based at least in part on at least one similarity of at least some of the plurality of semi-structured pages.

18. The one or more computer-readable storage media of claim 17, wherein the at least one similarity is in tag path text data of at l
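The claims refer to occurrence vectors and position vectors without defining them in this preview, so the following sketch rests on stated assumptions rather than on the patent's own definitions. It assumes that a "tag path text item" is a (tag path, text) pair, that an item's occurrence vector holds its count in each page of the cluster, and that its position vector holds the index of its first appearance in each page's item list. The helper names and the `min_pages` parameter are hypothetical; `min_pages` stands in for the "predetermined number" of pages in the claims.

```python
from collections import defaultdict


def occurrence_and_position_vectors(page_items):
    """page_items: one list of (tag_path, text) items per page, in document order.

    Returns two dicts keyed by item:
      occurrence[item][p] = how many times the item occurs in page p
      position[item][p]   = index of its first occurrence in page p (-1 if absent)
    """
    n = len(page_items)
    occurrence = defaultdict(lambda: [0] * n)
    position = defaultdict(lambda: [-1] * n)
    for p, items in enumerate(page_items):
        for idx, item in enumerate(items):
            occurrence[item][p] += 1
            if position[item][p] == -1:
                position[item][p] = idx
    return dict(occurrence), dict(position)


def root_template_candidates(occurrence, min_pages):
    """Assumed heuristic: keep items that occur exactly once in at least
    min_pages pages, since such items look like fixed template labels."""
    return [
        item
        for item, counts in occurrence.items()
        if sum(1 for c in counts if c == 1) >= min_pages
    ]


# Two hypothetical product pages from the same cluster.
page_items = [
    [("body/h1", "Camera X"), ("body/div/span", "Price:"), ("body/div/span", "$10"),
     ("body/div/span", "Weight:"), ("body/div/span", "1kg")],
    [("body/h1", "Camera Y"), ("body/div/span", "Price:"), ("body/div/span", "$20"),
     ("body/div/span", "Weight:"), ("body/div/span", "2kg")],
]
occurrence, position = occurrence_and_position_vectors(page_items)
candidates = root_template_candidates(occurrence, min_pages=2)
```

Under these assumptions the labels "Price:" and "Weight:" occur once in every page and become root template candidates, while the values that follow them vary per page and would be treated as data fields.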

Assignees

Inventors

Classifications

  • Physics · mapped topic

  • Search customisation based on user profiles and personalisation · CPC title

  • G06F16/337 · Primary

    Profile generation, learning or modification · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9020947B2 cover?
Techniques are described for generating structured information from semi-structured web pages, and retrieving the structured knowledge in response to a user query that indicates a query intent. The structured information is automatically extracted offline from semi-structured web pages, through the use of an auto wrapper solution that is noise tolerant, scalable, and automatic. The structured i…
Who is the assignee on this patent?
Yan Jun, Ji Lei, Liu Ning, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06F17/30702. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 28, 2015 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).