Document analysis for region classification

US10013488B1 · US · B1

Patent metadata
Field               Value
Publication number  US-10013488-B1
Application number  US-201213627621-A
Country             US
Kind code           B1
Filing date         Sep 26, 2012
Priority date       Sep 26, 2012
Publication date    Jul 3, 2018
Grant date          Jul 3, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A document analysis module analyzes electronic media items and identifies regions and region types for the electronic media items. The document analysis module may use rules, typographical feature sets, and cluster analysis to identify regions and region types. The document analysis module may also receive user input and may use the user input to identify regions and region types. The document analysis module may further use template pages to identify regions and region types.
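
As a rough illustration of the rule-based step the abstract describes, the sketch below labels a region by checking its typographical features against per-type ranges. The feature names (font size, indentation) and region types (chapter heading, footnote, body text) appear in the claims; the `Region` and `Rule` classes, the `classify` helper, and every numeric range are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Region:
    """A content region with its typographical feature set (hypothetical structure)."""
    text: str
    features: Dict[str, float]  # e.g. font_size, indentation


@dataclass
class Rule:
    """Associates a region type with allowed (min, max) ranges per feature."""
    region_type: str
    ranges: Dict[str, Tuple[float, float]]

    def matches(self, region: Region) -> bool:
        # A region conforms only if every listed feature is present and in range.
        return all(
            name in region.features and lo <= region.features[name] <= hi
            for name, (lo, hi) in self.ranges.items()
        )


def classify(region: Region, rules: List[Rule]) -> Optional[str]:
    """Return the first region type whose rule the region satisfies, else None."""
    for rule in rules:
        if rule.matches(region):
            return rule.region_type
    return None


# Illustrative initial rules; the numeric ranges are invented for this example.
INITIAL_RULES = [
    Rule("chapter heading", {"font_size": (18.0, 40.0), "indentation": (0.0, 10.0)}),
    Rule("footnote", {"font_size": (6.0, 9.0)}),
    Rule("body text", {"font_size": (9.5, 13.0)}),
]

heading = Region("Chapter 1", {"font_size": 24.0, "indentation": 0.0})
paragraph = Region("It was a dark night.", {"font_size": 11.0, "indentation": 12.0})

print(classify(heading, INITIAL_RULES))    # chapter heading
print(classify(paragraph, INITIAL_RULES))  # body text
```

A region that satisfies no rule comes back as `None`, which is one simple way to leave unrecognized regions for later refinement or user correction.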

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: applying predetermined initial sets of rules to an electronic media item to identify content regions within the electronic media item, wherein a first content region is identified as a first region type of a plurality of region types based, at least in part, on a conformance of the first content region to a first predetermined set of rules corresponding to the first region type, the first predetermined set of rules specifying a combination of typographical features used to identify the first region type; and for the first region type: identifying a set of regions of the first region type, wherein the first content region and a second content region in the set of regions are associated with the first region type; analyzing the set of regions to determine a first typographical feature set for the first content region; analyzing the set of regions to determine a second typographical feature set for the second content region; performing a cluster analysis to identify a cluster of regions from the set of regions, the cluster comprising the first content region and the second content region, where the first typographical feature set and the second typographical feature set comprise values that are within a threshold of a desired value for the cluster; updating the first predetermined set of rules corresponding to the first region type to account for typographical feature values determined from a centroid of the cluster of regions to generate a first updated set of rules; and applying the first updated set of rules to the electronic media item to identify one or more regions associated with the first region type.

2. The method of claim 1, wherein the first typographical feature set comprises data indicative of one or more of: font size, line spacing, line length, token spacing, margin size, indentation, or region area.

3. The method of claim 1, wherein the first region type comprises one or more of: a chapter heading, a graphic, a body text, a header, a footer, a table, a list item, a footnote, a table of contents entry, or an equation.

4. The method of claim 1, further comprising: receiving user input correcting one or more regions; analyzing the corrected one or more regions to obtain updated typographical feature sets; modifying the first updated set of rules based on the updated typographical feature sets to generate a second updated set of rules; and applying the second updated set of rules to the electronic media item to identify one or more regions associated with the first region type.

5. The method of claim 1, wherein applying the first updated set of rules comprises: identifying one or more regions that have typographical features that satisfy one or more rules in the first updated set of rules.

6. An apparatus comprising: a processing device to: apply a first predetermined set of rules to an electronic media item to identify a first set of regions, wherein a plurality of regions in the first set of regions is associated with a first region type, the first predetermined set of rules specifying a combination of typographical features used to identify the first region type; analyze one or more regions in the first set of regions to determine a typographical feature set for the one or more regions; perform a cluster analysis to identify a first cluster of regions from the first set of regions, where a first typographical feature set of a first region in the first cluster and a second typographical feature set of a second region in the first cluster comprise values that are within a threshold of a desired value for the first cluster; update the first predetermined set of rules to account for typographical feature values determined from a centroid of the first cluster of regions to generate a first updated set of rules; and apply the first updated set of rules to the electronic media item to identify one or more regions associated with the first region type.

7. The apparatus of claim 6, wherein the processing device is further to: apply a second predetermined set of rules to the electronic media item to identify a second set of regions, wherein a plurality of regions in the second set of regions is associated with a second region type; analyze one or more regions in the second set of regions to determine an additional typographical feature set for the second set of regions; perform a second cluster analysis based on the additional typographical feature set to identify a second cluster of regions; update the second predetermined set of rules based on the second cluster analysis to generate a second updated set of rules; and apply the second updated set of rules to the electronic media item to identify one or more regions associated with the second region type.

8. The apparatus of claim 7, wherein the processing device is further to: correct one or more regions in response to a user input to generate a corrected one or more regions; analyze the corrected one or more regions to obtain an updated typographical feature set; modify one or more of the first updated set of rules or the second updated set of rules to reflect the updated typographical feature set to generate a third updated set of rules or a fourth updated set of rules; and apply one or more of the third updated set of rules or the fourth updated set of rules to the electronic media item to identify one or more regions associated with the first region type or the second region type.

9. The apparatus of claim 8, wherein the user input is indicative of one or more of: corrected sizes for the one or more regions or corrected region types associated with the one or more regions.

10. The apparatus of claim 6, wherein the processing device is further to: generate a page layout for a first page of the electronic media item; compare the page layout with one or more template page layouts; and update one or more regions in the first page to correspond to the one or more template page layouts.

11. The apparatus of claim 10, wherein to generate the page layout for the first page, the processing device is further to: perform one or more morphological operations on the first page.

12. The apparatus of claim 6, wherein the first typographical feature set comprises data indicative of one or more of: font size, line spacing, line length, token spacing, margin size, indentation, or region area.

13. The apparatus of claim 6, wherein the first region type comprises one or more of: a chapter heading, a graphic, a body text, a header, a footer, a table, a list item, a footnote, a table of contents entry, or an equation.

14. The apparatus of claim 6, wherein the processing device is further to: iterate the applying, the analyzing, the performing, and the updating, until the first cluster of regions converges.

15. The apparatus of claim 14, wherein the first cluster of regions converges when the first cluster of regions stops changing between iterations.

16. The apparatus of claim 14, wherein the first cluster of regions converges when a size of the first cluster of regions is greater than a threshold.

17. A non-transitory computer-readable storage medium storing instructions which, when executed, cause a processing device to: apply a first predetermined set of rules to an electronic media item to identify a first set of regions, wherein a plurality of regions in the first set of regions is associated with a first region type, the first predetermined set of rules specifying a combination of typographical features used to identify the first re
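
The loop recited in claims 1, 14, and 15 (apply rules, cluster the matched regions, update the rules from the cluster centroid, repeat until the cluster stops changing) can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the claims do not name a clustering algorithm, so a single cluster is formed here by keeping regions within a Euclidean distance `threshold` of the centroid, and the updated rule is assumed to be a box of half-width `tolerance` around that centroid. The function names, parameters, and sample feature values are all hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Features = Tuple[float, ...]  # e.g. (font_size, line_spacing)


def apply_rule(regions: Sequence[Features], lo: Features, hi: Features) -> List[int]:
    """Return indices of regions whose features all fall inside [lo, hi]."""
    return [
        i for i, f in enumerate(regions)
        if all(l <= v <= h for v, l, h in zip(f, lo, hi))
    ]


def centroid(points: Sequence[Features]) -> Features:
    """Component-wise mean of the feature vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def refine_rule(regions: Sequence[Features], lo: Features, hi: Features,
                threshold: float, tolerance: float, max_iter: int = 20):
    """Apply rule -> cluster around centroid -> update rule, until the cluster is stable."""
    previous: List[int] = []
    for _ in range(max_iter):
        matched = apply_rule(regions, lo, hi)
        if not matched:
            break
        c = centroid([regions[i] for i in matched])
        # Cluster: matched regions within `threshold` of the centroid (assumed criterion).
        cluster = [i for i in matched if dist(regions[i], c) <= threshold]
        if not cluster:
            break
        c = centroid([regions[i] for i in cluster])
        # Updated rule: a box of half-width `tolerance` around the cluster centroid.
        lo = tuple(v - tolerance for v in c)
        hi = tuple(v + tolerance for v in c)
        if cluster == previous:  # converged: cluster unchanged between iterations
            break
        previous = cluster
    return lo, hi, previous


# Made-up (font_size, line_spacing) values for regions in one document.
regions = [(11.0, 14.0), (11.2, 14.1), (10.9, 13.9), (12.8, 16.0), (24.0, 30.0)]

# Loose initial "body text" rule, then refinement against this document.
lo, hi, cluster = refine_rule(regions, (9.0, 10.0), (13.0, 18.0),
                              threshold=1.0, tolerance=0.5)
print(cluster)  # [0, 1, 2]
print(lo, hi)
```

In this toy data the loose initial rule also admits the fourth region, but it lies too far from the centroid to join the cluster, so the updated rule tightens around the three consistent regions and the loop stops on the second pass.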

Assignees

Amazon Tech Inc

Inventors

Classifications

  • Graphical querying, e.g. query-by-region, query-by-sketch, query-by-trajectory, GUIs for designating a person/face/object as a query predicate (end-user interface involving hot spots associated with the video H04N21/4725; end-user interface for selecting a Region of Interest H04N21/4728) · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10013488B1 cover?
A document analysis module analyzes electronic media items and identifies regions and region types for the electronic media items. The document analysis module may use rules, typographical feature sets, and cluster analysis to identify regions and region types. The document analysis module may also receive user input and may use the user input to identify regions and region types. The document analysis module may further use template pages to identify regions and region types.
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/7335. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 3, 2018 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).