System of generating new schema based on selective HTML elements

US9460231B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9460231-B2
Application number: US-201113637483-A
Country: US
Kind code: B2
Filing date: Mar 28, 2011
Priority date: Mar 26, 2010
Publication date: Oct 4, 2016
Grant date: Oct 4, 2016

Abstract

Official abstract text for this publication.

The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis.
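
The capture step described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that "similar elements" are those sharing a tag name and a class attribute; the abstract does not say how a schema is represented, and the names `ElementTextExtractor` and `extract_similar_elements` are hypothetical. The resulting strings are what would then be handed to a text classifier, which is not shown.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementTextExtractor(HTMLParser):
    """Collect the text of every element matching a (tag, class) pair.

    The (tag, class) pair is an assumed stand-in for a mark-up schema.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being captured
        self.results = []   # one captured string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if tag == self.tag and self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Normalise whitespace and store the captured text.
            self.results.append(" ".join("".join(self.buffer).split()))
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_similar_elements(html, tag, css_class):
    """Return the text content of all elements matching the assumed schema."""
    parser = ElementTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.results
```

For a page containing two `<p class="review">` paragraphs and one `<p class="ad">` paragraph, calling `extract_similar_elements(page, "p", "review")` returns only the two review texts.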

First claim

Opening claim text (preview).

What is claimed is:

1. A method of automatically generating a mark-up language schema, the method comprising the steps of:
  a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
  b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
  c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
  d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
  e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.

2. A method as claimed in claim 1, wherein the or each training sample comprises a uniform resource locator.

3. A method as claimed in claim 1, wherein the or each training sample further comprises a text sequence.

4. A method of analysing mark-up language text, the method comprising the steps of:
  i) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements;
  ii) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and
  iii) extracting those data elements identified in step ii),
  wherein the mark-up language schema is generated using a method in accordance with claim 1.

5. A non-transitory computer readable story medium storing computer executable code for performing a method according to claim 1.

6. An apparatus for generating a mark-up language schema, the apparatus comprising: a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of:
  a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
  b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
  c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
  d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c);
  e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.

7. An apparatus for analysing mark-up language text, the apparatus comprising: a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of:
  a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource;
  b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema;
  c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold;
  d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and
  e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema;
  f) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements;
  g) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and
  h) extracting those data elements identified in step g);
  wherein the mark-up language schema is generated using steps a)-e).

8. A apparatus as claimed in claim 6, wherein the or each training sample comprises a uniform resource locator.

9. A apparatus as claimed in claim 6, wherein the or each training sample further comprises a text sequence.
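
The selection loop in steps a) to e) of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each training sample is taken to be already resolved to the tag path of its element, a candidate schema is a suffix of that path, two schemas "match" when they are equal, and a "further" schema is produced by dropping one more outermost tag from every path. The claims specify none of these details, and the names `candidate_schema` and `select_schema` are hypothetical.

```python
def candidate_schema(path, level):
    """Assumed generator: the tag path minus its `level` outermost tags."""
    return tuple(path[level:])


def select_schema(sample_paths, threshold=0.5):
    """Return the first candidate that matches a large enough share of the others.

    sample_paths: tag paths, one per training sample (step a, already resolved).
    threshold:    proportion of the other candidates that must match (step c).
    Returns None if no level of generalisation produces agreement.
    """
    if len(sample_paths) < 2:
        raise ValueError("at least two training samples are required")

    max_level = max(len(p) for p in sample_paths)
    for level in range(max_level):                       # steps d) and e)
        candidates = [candidate_schema(p, level) for p in sample_paths]  # step b)
        for i, cand in enumerate(candidates):            # step c)
            if not cand:
                continue  # an empty schema would match trivially; skip it
            others = candidates[:i] + candidates[i + 1:]
            matches = sum(1 for other in others if other == cand)
            if matches / len(others) > threshold:
                return cand
    return None
```

With four sample paths that differ only in one ancestor tag (`div`, `div`, `section`, `article`) above a shared `ul > li`, the full paths agree for only one of three peers, so at the default threshold of 0.5 the sketch generalises until it reaches `("ul", "li")`, which all samples share.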

Classifications

  • Information retrieval; Database structures therefor; File system structures therefor · CPC title

  • G06F16/80 · Primary

    of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML (content-based retrieval of web data G06F16/95) · CPC title

  • Digital computing or data processing equipment or methods, specially adapted for specific functions (information retrieval, database structures or file system structures therefor G06F16/00) · CPC title

  • Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9460231B2 cover?
The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis.
Who is the assignee on this patent?
Thompson Simon G, Nguyen Duong T, Thint Marcus Alfred, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06F16/80. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 4, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).