Who is the assignee on this patent?

Thompson Simon G, Nguyen Duong T, Thint Marcus Alfred, and 2 more

What technology area does this patent fall under?

Primary CPC classification G06F16/80. Mapped technology areas include Physics.

When was this patent published?

Publication date Oct 4, 2016 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System of generating new schema based on selective HTML elements

US9460231B2 · US · B2

Patent metadata
Field	Value
Publication number	US-9460231-B2
Application number	US-201113637483-A
Country	US
Kind code	B2
Filing date	Mar 28, 2011
Priority date	Mar 26, 2010
Publication date	Oct 4, 2016
Grant date	Oct 4, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis.
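To make the idea of capturing the content of similar mark-up elements concrete, here is a minimal sketch. It is not the patented implementation. It assumes, purely for illustration, that a "schema" can be represented as a path of tag names from the document root, and that an element matches when its path equals the schema. The names `PathExtractor`, `extract` and `VOID_TAGS` are hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed onto the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathExtractor(HTMLParser):
    """Collect the text of every element whose tag path equals `schema`.

    Assumption: a schema is a tuple of tag names from the root downwards.
    """

    def __init__(self, schema):
        super().__init__()
        self.schema = tuple(schema)
        self.stack = []      # tags that are currently open
        self.buffer = None   # text pieces of the current match, or None
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if self.buffer is None and tuple(self.stack) == self.schema:
            self.buffer = []  # entered a matching element

    def handle_endtag(self, tag):
        # Ignore closing tags that do not close the innermost open element.
        if not self.stack or self.stack[-1] != tag:
            return
        if self.buffer is not None and tuple(self.stack) == self.schema:
            self.results.append("".join(self.buffer).strip())
            self.buffer = None  # left the matching element
        self.stack.pop()

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)


def extract(html, schema):
    """Return the text content of all elements matching the schema path."""
    parser = PathExtractor(schema)
    parser.feed(html)
    parser.close()
    return parser.results


page = ("<html><body><div><p>First <b>post</b></p><p>Second post</p></div>"
        "<p>Footer</p></body></html>")
print(extract(page, ("html", "body", "div", "p")))
# prints ['First post', 'Second post']
```

The extracted strings are the kind of text content that, per the abstract, could then be passed to a text classifier.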

First claim

Opening claim text (preview).

What is claimed is:
1. A method of automatically generating a mark-up language schema, the method comprising the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
2. A method as claimed in claim 1, wherein the or each training sample comprises a uniform resource locator.
3. A method as claimed in claim 1, wherein the or each training sample further comprises a text sequence.
4. A method of analysing mark-up language text, the method comprising the steps of: i) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements; ii) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and iii) extracting those data elements identified in step ii), wherein the mark-up language schema is generated using a method in accordance with claim 1.
5. A non-transitory computer readable story medium storing computer executable code for performing a method according to claim 1.
6. An apparatus for generating a mark-up language schema, the apparatus comprising: a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.
7. An apparatus for analysing mark-up language text, the apparatus comprising: a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema; f) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements; g) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and h) extracting those data elements identified in step g); wherein the mark-up language schema is generated using steps a)-e).
8. A apparatus as claimed in claim 6, wherein the or each training sample comprises a uniform resource locator.
9. A apparatus as claimed in claim 6, wherein the or each training sample further comprises a text sequence.
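Steps c) to e) of claim 1 describe a selection loop that can be sketched as follows. This is an illustrative reading only, not the patented implementation. The claims do not say how two schemas are compared or how a further schema is produced, so both are passed in as caller-supplied functions. The names `select_candidate`, `generate_schema`, `suffix_match` and `common_suffix`, the tag-path representation of a schema, and the `max_rounds` cap (added so the loop always terminates) are all assumptions made for this example.

```python
def select_candidate(candidates, matches, threshold):
    """Step c): return the first candidate whose share of matching
    *other* candidates exceeds the threshold, or None."""
    others = len(candidates) - 1
    if others < 1:
        return None
    for i, cand in enumerate(candidates):
        hits = sum(1 for j, other in enumerate(candidates)
                   if j != i and matches(cand, other))
        if hits / others > threshold:
            return cand
    return None


def generate_schema(candidates, matches, further, threshold=0.5, max_rounds=5):
    """Steps c) to e): add further schemas until one is selected.

    `further(pool)` returns one additional candidate (step d).
    `max_rounds` is an assumed safety cap; the claim has no such limit.
    """
    pool = list(candidates)
    chosen = select_candidate(pool, matches, threshold)
    rounds = 0
    while chosen is None and rounds < max_rounds:
        pool.append(further(pool))                            # step d)
        chosen = select_candidate(pool, matches, threshold)   # step c) again
        rounds += 1
    return chosen


# Hypothetical helpers: schemas are tuples of tag names (an assumption).
def suffix_match(cand, other):
    """A candidate matches another schema if the other path ends with it."""
    return len(cand) > 0 and tuple(other[-len(cand):]) == tuple(cand)


def common_suffix(pool):
    """Produce a more general schema: the longest shared trailing tag path."""
    suffix = []
    for tags in zip(*(reversed(p) for p in pool)):
        if len(set(tags)) != 1:
            break
        suffix.append(tags[0])
    return tuple(reversed(suffix))


samples = [
    ("html", "body", "div", "ul", "li"),
    ("html", "body", "section", "ul", "li"),
    ("html", "body", "main", "ul", "li"),
]
print(generate_schema(samples, suffix_match, common_suffix))
# prints ('ul', 'li')
```

In this run none of the three initial candidates matches any other, so one further schema, the shared trailing path, is added to the pool; it matches all three original candidates and is selected on the next pass.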

Classifications

G06F16/00
Information retrieval; Database structures therefor; File system structures therefor · CPC title
G06F16/80 · Primary
of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML (content-based retrieval of web data G06F16/95) · CPC title
G06F17/00
Digital computing or data processing equipment or methods, specially adapted for specific functions (information retrieval, database structures or file system structures therefor G06F16/00) · CPC title
G06F40/154
Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets · CPC title
G06F17/227
Physics · mapped topic

Patent family

Related publications grouped by family.

View patent family 42320314
