Textual analysis system for automatic content extraction

US10545928B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10545928-B2
Application number: US-201214009027-A
Country: US
Kind code: B2
Filing date: Mar 29, 2012
Priority date: Mar 30, 2011
Publication date: Jan 28, 2020
Grant date: Jan 28, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present invention provides a method, and an associated apparatus configured to implement such a method, for analysing mark-up language text content, such as might be found on a website or within online user generated content. The method comprises a training phase, in which a plurality of schemas are automatically generated from a specified text and a final schema is compiled. This final schema can then be compared with other online text content, so that content which matches the final schema can be identified, for example for further analysis and comparison.
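As a rough illustration of the two phases the abstract describes, the following minimal Python sketch learns where example property values sit in a training page and then pulls the values found at the same positions in another page. It is not the patented implementation: the function names, the use of a (tag, class) path as the "containing element", and the well-formed sample markup are all assumptions made for illustration, and the sketch uses a single training instance directly instead of compiling a final schema from several.

```python
import xml.etree.ElementTree as ET


def element_paths(root):
    """Yield (path, text) for every element; path is a tuple of (tag, class) pairs."""
    def walk(node, prefix):
        path = prefix + ((node.tag, node.get("class")),)
        yield path, (node.text or "").strip()
        for child in node:
            yield from walk(child, path)
    yield from walk(root, ())


def train_instance(markup, example_values):
    """Training (hypothetical helper): record the path of the element holding each value."""
    root = ET.fromstring(markup)
    schema = {}
    for prop, value in example_values.items():
        for path, text in element_paths(root):
            if text == value:
                schema[prop] = path  # first containing element that matches the value
                break
    return schema


def extract(markup, schema):
    """Extraction (hypothetical helper): collect text found at the learned paths."""
    root = ET.fromstring(markup)
    found = {prop: [] for prop in schema}
    for path, text in element_paths(root):
        for prop, wanted in schema.items():
            if path == wanted:
                found[prop].append(text)
    return found


# Assumed sample pages; real pages would need a tolerant HTML parser.
training_page = '<div class="post"><span class="author">alice</span><p class="body">First post</p></div>'
other_page = '<div class="post"><span class="author">bob</span><p class="body">Hello there</p></div>'

schema = train_instance(training_page, {"author": "alice", "body": "First post"})
result = extract(other_page, schema)
print(result)  # {'author': ['bob'], 'body': ['Hello there']}
```

Matching on the full path keeps the sketch short; a real system would likely need a looser notion of a match to cope with pages whose structure varies.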

First claim

Opening claim text (preview).

What is claimed is:

1. A method of automatically extracting content from a data resource, the data resource comprising a plurality of hierarchical levels, each of the plurality of hierarchical levels comprising content defined using mark-up language and the method comprising a training phase and a content extraction phase, wherein the entirety of the training phase precedes the content extraction phase so that the content extraction phase can only begin when the entirety of the training phase has been completed; the training phase comprising the steps of: i) defining one or more hierarchical levels of interest; ii) defining an entity which is comprised within the one or more hierarchical levels of interest defined in step i) and one or more properties associated with that entity; and for said entity and the associated entity properties defined in step ii), executing a plurality of training instances, wherein each of the training instances comprises the steps of: a) defining a value for each of the one or more properties associated with said entity; b) for each of the property values, determining a containing element which provides a match to the property value and storing the containing element in an instance schema, the instance schema being associated with said entity; and iii) comparing each of a plurality of instance schemas associated with said entity to generate a final schema for said entity; and iv) storing the one or more final schemas in a composite schema which is associated with the data resource; the content extraction phase comprising the steps of: I) comparing a data resource from which content is to be extracted with the composite schema; II) identifying entities and their associated properties within the data resource which match the containing elements specified in the composite schema; and III) extracting those entities and their associated properties identified in step II) from the data resource; wherein the completion of the entirety of the training phase results in the generation of the composite schema, and the content extraction phase cannot begin without that generated composite schema.

2. A method according to claim 1 wherein in step iii) the first instance schema to be generated for an entity is retained and is assigned an occurrence count value of 1.

3. A method according to claim 1 wherein in step iii) if there is no adequate match between a first instance schema and a second instance schema then the second instance schema will be retained and is assigned an occurrence count value of 1.

4. A method according to claim 1 wherein in step iii) if a first instance schema is identical to a second instance schema then the occurrence count of the first instance schema will be incremented and the second instance schema will be discarded.

5. A method according to claim 1, wherein in step iii), a derived instance schema is created by merging a first instance schema with a second instance schema.

6. A method according to claim 5, wherein a derived instance schema is created by merging a first instance schema with a second instance schema if there is an adequate degree of similarity between the first and second schemas.

7. A method according to claim 6 wherein a derived instance schema is created by merging a first instance schema with a second instance schema if the first and second instance schema comprise: a) a common start-tag; b) identical sub-element hierarchies; and c) an equal number of property elements comprised within the sub-element hierarchies.

8. A method according to claim 7, wherein the predetermined threshold value is 60%.

9. A method according to claim 5, wherein the first derived instance schema to be generated for an entity is retained and is assigned an occurrence count value of 1.

10. A method according to claim 5, wherein if a first derived instance schema is identical to a second derived instance schema then the occurrence count of the first derived instance schema will be incremented and the second derived instance schema will be discarded.

11. A method according to claim 1, wherein step iii) comprises the step of determining which of the plurality of instance schemas and derived instance schemas has an occurrence frequency which exceeds a predetermined threshold value.

12. A method according to claim 1 in which three or more training instances are executed for each of the entities.

13. A non-transitory data carrier for use in a computing device, the data carrier comprising computer executable code which, in use, performs a method of automatically extracting content from a data resource, the data resource comprising a plurality of hierarchical levels, each of the plurality of hierarchical levels comprising content defined using mark-up language and the method comprising a training phase and a content extraction phase, wherein the entirety of the training phase precedes the content extraction phase so that the content extraction phase can only begin when the entirety of the training phase has been completed; the training phase comprising the steps of: i) defining one or more hierarchical levels of interest; ii) defining an entity which is comprised within the one or more hierarchical levels of interest defined in step i) and one or more properties associated with that entity; and for said entity and the associated entity properties defined in step ii), executing a plurality of training instances, wherein each of the training instances comprises the steps of: a) defining a value for each of the one or more properties associated with said entity; b) for each of the property values, determining a containing element which provides a match to the property value and storing the containing element in an instance schema, the instance schema being associated with said entity; and iii) comparing each of a plurality of instance schemas associated with said entity to generate a final schema for said entity; and iv) storing the one or more final schemas in a composite schema which is associated with the data resource; the content extraction phase comprising the steps of: I) comparing a data resource from which content is to be extracted with the composite schema; II) identifying entities and their associated properties within the data resource which match the containing elements specified in the composite schema; and III) extracting those entities and their associated properties identified in step II) from the data resource; wherein the completion of the entirety of the training phase results in the generation of the composite schema, and the content extraction phase cannot begin without that generated composite schema.

14. An apparatus comprising one or more central processing units, one or more data storage means and a network interface, the apparatus, in use, being configured to perform automatically extracting content from a data resource, the data resource comprising a plurality of hierarchical levels, each of the plurality of hierarchical levels comprising content defined using mark-up language and the extracting comprising a training phase and a content extraction phase, wherein the entirety of the training phase precedes the content extraction phase so that the content extraction phase can only begin when the entirety of the training phase has been completed; the training phase comprising the steps of: i) defining one or more hierarchical levels of interest; ii) defining an entity which is comprised within the one or more hierarchical levels of interest defined in step i) and one or more properties associated with that entity; and for said entity and the as
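The occurrence-count bookkeeping in claims 2 to 4, and the frequency test in claim 11, can be sketched in a few lines of Python. This is an illustrative reading only, not the patented implementation. The function names are hypothetical, "no adequate match" is simplified to "not identical" (so the merging of claims 5 to 7 is left out), frequency is taken as count divided by the number of training instances, and the 0.6 default borrows the 60% figure from claim 8 as an assumption.

```python
def tally_instance_schemas(instance_schemas):
    """Return [schema, occurrence_count] pairs in order of first appearance."""
    retained = []
    for schema in instance_schemas:
        for entry in retained:
            if entry[0] == schema:
                entry[1] += 1  # identical schema: bump the count, discard the duplicate
                break
        else:
            retained.append([schema, 1])  # first or unmatched schema: retain with count 1
    return retained


def select_final_schemas(retained, threshold=0.6):
    """Keep schemas whose occurrence frequency exceeds the threshold (assumed 60%)."""
    total = sum(count for _, count in retained)
    return [schema for schema, count in retained if count / total > threshold]


# Assumed instance schemas from four training instances of one entity.
instances = [
    {"author": "div/span"},
    {"author": "div/span"},
    {"author": "div/b"},
    {"author": "div/span"},
]

retained = tally_instance_schemas(instances)
final = select_final_schemas(retained)
print(retained)  # [[{'author': 'div/span'}, 3], [{'author': 'div/b'}, 1]]
print(final)     # [{'author': 'div/span'}]
```

With four training instances, the schema seen three times has a frequency of 0.75 and survives, while the one seen once (0.25) is dropped.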

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10545928B2 cover?
The present invention provides a method, and an associated apparatus configured to implement such a method, for analysing mark-up language text content, such as might be found on a website or within online user generated content. The method comprises a training phase, in which a plurality of schemas are automatically generated from a specified text and a final schema is compiled. This final schem…
Who is the assignee on this patent?
Gharib Hamid, Thompson Simon, Nguyen Duong, and 2 more
What technology area does this patent fall under?
Primary CPC classification G06F40/216. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 28, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).