US9460231B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9460231-B2 |
| Application number | US-201113637483-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 28, 2011 |
| Priority date | Mar 26, 2010 |
| Publication date | Oct 4, 2016 |
| Grant date | Oct 4, 2016 |
Official abstract text for this publication.
The present invention provides a system which is able to detect similar web page elements which are described in mark-up language, such that the content of those elements can be captured. Text content may then be sent to a text classifier for further analysis.
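As a rough illustration of capturing the content of similar mark-up elements, the sketch below collects the text of every HTML element that sits at the same element path. Treating "similar" as "same path of tag names" is an assumption made for this example, not the patent's definition, and the names `PathTextExtractor` and `extract_text` are hypothetical. The captured strings are the kind of text that could then be passed to a classifier.

```python
from html.parser import HTMLParser

# HTML void elements never get an end tag, so they are kept off the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextExtractor(HTMLParser):
    """Collects the text of every element whose tag path equals target_path."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = tuple(target_path)
        self.stack = []        # tag names from the root down to the current element
        self.captured = []     # one text string per matching element
        self._buffer = None    # text pieces of the matching element being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tuple(self.stack) == self.target_path:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        while self.stack:
            at_target = tuple(self.stack) == self.target_path
            popped = self.stack.pop()
            if at_target and self._buffer is not None:
                self.captured.append("".join(self._buffer).strip())
                self._buffer = None
            if popped == tag:
                break

    def handle_data(self, data):
        # Text nested inside the matching element is captured as well.
        if self._buffer is not None:
            self._buffer.append(data)


def extract_text(html, target_path):
    """Return the text of all elements found at target_path in html."""
    parser = PathTextExtractor(target_path)
    parser.feed(html)
    parser.close()
    return parser.captured


page = ("<html><body><ul><li>Alpha</li><li>Beta <b>bold</b></li></ul>"
        "<p>skip</p></body></html>")
items = extract_text(page, ("html", "body", "ul", "li"))
```

Here `items` holds one string per `li` element, while the `p` element is ignored because its path differs.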
Opening claim text (preview).
What is claimed is:

1. A method of automatically generating a mark-up language schema, the method comprising the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.

2. A method as claimed in claim 1, wherein the or each training sample comprises a uniform resource locator.

3. A method as claimed in claim 1, wherein the or each training sample further comprises a text sequence.

4. A method of analysing mark-up language text, the method comprising the steps of: i) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements; ii) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and iii) extracting those data elements identified in step ii), wherein the mark-up language schema is generated using a method in accordance with claim 1.

5. A non-transitory computer readable story medium storing computer executable code for performing a method according to claim 1.

6. An apparatus for generating a mark-up language schema, the apparatus comprising: a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema.

7. An apparatus for analysing mark-up language text, the apparatus comprising: a processing system including one or more processors and one or more storage memories, the processing system being configured to perform at least the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schema, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schema match and selecting a candidate mark-up language schema if the proportion of matching candidate schema exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schema matches a sufficient number of the other schema, generating a further mark-up language schema and executing a further instance of step c); and e) reiterating step d) until one of the candidate schemas matches with a sufficient number of the other schema; f) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements; g) identifying one or more data elements comprised within the online data resource, the or each data elements being associated with a particular mark-up language element; and h) extracting those data elements identified in step g); wherein the mark-up language schema is generated using steps a)-e).

8. A apparatus as claimed in claim 6, wherein the or each training sample comprises a uniform resource locator.

9. A apparatus as claimed in claim 6, wherein the or each training sample further comprises a text sequence.
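Steps c) to e) of claim 1 can be read as an agreement vote among candidate schemas. The following is a minimal sketch under stated assumptions, not the patented implementation: a schema is represented as a tuple of element names, two schemas "match" when they are equal, and the further schemas of step d) are supplied by the caller. The names `Schema`, `select_by_agreement` and `choose_schema`, and the default threshold, are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Assumption: a schema is modelled as a path of element names.
Schema = Tuple[str, ...]


def select_by_agreement(candidates: Sequence[Schema],
                        threshold: float) -> Optional[Schema]:
    """Step c): return the first candidate whose share of matching
    other candidates exceeds threshold, or None if there is none."""
    for i, candidate in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return None
        matching = sum(1 for other in others if other == candidate)
        if matching / len(others) > threshold:
            return candidate
    return None


def choose_schema(initial: Iterable[Schema],
                  further: Iterable[Schema],
                  threshold: float = 0.4) -> Optional[Schema]:
    """Steps c) to e): vote on the initial candidates, then add one
    further candidate at a time and vote again until one is selected
    or the supply of further candidates runs out."""
    pool = list(initial)
    chosen = select_by_agreement(pool, threshold)
    if chosen is not None:
        return chosen
    for extra in further:
        pool.append(extra)                              # step d)
        chosen = select_by_agreement(pool, threshold)   # step c) again
        if chosen is not None:
            return chosen
    return None


A: Schema = ("html", "body", "div", "p")
B: Schema = ("html", "body", "div", "span")
C: Schema = ("html", "body", "ul", "li")

agreed = choose_schema([A, A, B], further=[])
needs_more = choose_schema([A, B, C], further=[A, A])
no_agreement = choose_schema([A, B, C], further=[])
```

With the three disagreeing candidates, the first further schema gives `A` a matching share of 1/3, which does not exceed 0.4; the second raises it to 2/4, so `A` is selected. Bounding the loop by the supplied `further` iterable is a design choice for this sketch, so that it terminates even when agreement is never reached.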
Information retrieval; Database structures therefor; File system structures therefor · CPC title
of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML (content-based retrieval of web data G06F16/95) · CPC title
Digital computing or data processing equipment or methods, specially adapted for specific functions (information retrieval, database structures or file system structures therefor G06F16/00) · CPC title
Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets · CPC title
Physics · mapped topic