Robust method to find layout similarity between two documents

US9235758B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-9235758-B1
Application numberUS-201414318890-A
CountryUS
Kind codeB1
Filing dateJun 30, 2014
Priority dateJun 30, 2014
Publication dateJan 12, 2016
Grant dateJan 12, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for comparing documents may be provided. For example, a comparison between layouts of the documents may be performed. The comparison may include segmenting the documents into blocks, where an arrangement of blocks of a document represents a layout of the document. Once segmented, similarity metrics, such as distances, between blocks of one document and blocks of the other document may be computed. The similarity metrics may be used to match the blocks between the documents. Further, the similarity metrics between the matched blocks may be added to determine an overall similarity metric between the documents. This overall similarity metric may indicate how similar the documents may be.

First claim

Opening claim text (preview).

The invention claimed is: 1. A computer-implemented method comprising: segmenting a first document into first blocks and a second document into second blocks, each of the first blocks and second blocks representing a respective contiguous document portion comprising information associated by a relation; computing, by a processor, distances between the first blocks and the second blocks, wherein the distances are indicative of similarities between the first blocks and the second blocks; matching the first blocks to the second blocks to minimize the distances; and computing, by the processor, a document distance between the first document and the second document based on the distances between the matched first blocks and the second blocks, wherein the document distance is indicative of a similarity between the first document and the second document. 2. The computer-implemented method of claim 1 further comprising: identifying a block type for each of the first blocks and second blocks; and computing the distances between first blocks and second blocks by only computing distances between blocks having a same respective block type. 3. The computer-implemented method of claim 1 , wherein computing the distances between the first blocks and the second blocks comprises calculating one or more distances: (a) between centers of a first block and a second block; (b) between corner points of the first block and the second block; (c) based on differences in heights of the first block and the second block; (d) based on differences in widths of the first block and the second block; (e) based on differences in areas of the first block and the second block; and (f) an amount of overlap between the first block and the second block. 4. The computer-implemented method of claim 1 , wherein computing the distances between the first blocks and the second blocks comprises: selecting properties of the first blocks and second blocks from available properties; and computing the distances based on the selected properties, wherein the distances comprise a first distance computed based on a first property and a second distance computed based on a second property. 5. The computer-implemented method of claim 1 , wherein computing the document distance between the first document and the second document comprises summing the distances between the first blocks and the second blocks based on weights associated with the distances. 6. The computer-implemented method of claim 1 , wherein computing the distances between the first blocks and the second blocks comprises comparing blocks based on block type, the comparing comprising: comparing first blocks of a text block type only with second blocks of the text block type; and comparing first blocks of a non-text block type only with second blocks of the non-text block type. 7. The computer-implemented method of claim 1 , wherein matching the first blocks to the second blocks comprises: matching text blocks from the first blocks and the second blocks based on a first cost assignment, wherein the first cost assignment minimizes a total distance between the text blocks; matching non-text blocks from the first blocks and second blocks based on a second cost assignment, wherein the second cost assignment minimizes a total distance between the non-text blocks; and computing the distances between the first blocks and the second blocks based on individual distances between the matched text blocks and individual distances between the matched non-text blocks. 8. The computer-implemented method of claim 1 , wherein matching the first blocks to the second blocks comprises: matching text blocks from the first blocks and the second blocks; matching non-text blocks from the first blocks and second blocks; determining that text blocks from the first blocks remain; determining that non-text blocks from the second blocks remain unmatched; matching the unmatched text blocks to the unmatched non-text blocks based on a cost assignment, wherein the cost assignment minimizes a total distance between the unmatched text blocks and the unmatched non-text blocks; and computing the distances between the first blocks and the second blocks based on individual distances between the matched text blocks, individual distances between the matched non-text blocks, and individual distances associated with the matching of the unmatched text blocks and unmatched non-text blocks. 9. The computer-implemented method of claim 1 , wherein matching the first blocks to the second blocks comprises: matching text blocks from the first blocks and the second blocks; matching non-text blocks from the first blocks and second blocks; computing a penalty for unmatched text blocks or non-text blocks; and computing the distances between the first blocks and the second blocks based on individual distances between the matched text blocks, individual distances between the matched non-text blocks, and the penalty. 10. The computer-implemented method of claim 1 , matching the first blocks to the second blocks comprises: matching text blocks from the first blocks and the second blocks; matching non-text blocks from the first blocks and second blocks; determining that text blocks and non-text blocks from the first blocks remain unmatched; computing a first penalty for the unmatched text blocks and a second penalty for the unmatched non-text blocks; and computing the distances between the first blocks and the second blocks based on individual distances between the matched text blocks, individual distances between the matched non-text blocks, and the first penalty and the second penalty. 11. The computer-implemented method of claim 1 , wherein segmenting the first document into the first blocks comprise: identifying lines of text in the first document; setting a portion of the first document that contains a subset of less than all of the lines of text as a paragraph block based on spacing between the lines of text. 12. The computer-implemented method of claim 1 , wherein segmenting the first document into the first blocks comprises setting a portion of the first document that contains an image as an image block. 13. A system comprising: a processor; a memory communicatively coupled to the processor and bearing instructions that, upon execution by the processor, cause the system to at least: divide a document into blocks, wherein the blocks indicate a layout of the document; compute similarity metrics of the blocks relative to blocks of another document, wherein the similarity metrics are based on properties of the blocks; match the blocks of the document to the blocks of the other document based on the similarity metric; and compute a document similarity metric of the document relative to the other document based on the similarity metrics of the matched blocks, wherein the document similarity metric indicates a similarity between the layout of the document and a layout of the other document. 14. The system of claim 13 , wherein the blocks have a geometric shape, and wherein the properties are based on the geometric shape and comprise one or more of: centers, heights, widths, areas, or amounts of overlap of the blocks. 15. The system of claim 13 , wherein the instructions, upon execution by the processor, further cause the processor to: compare the similarity metric to a threshold; and determine whether the document and the other document have similar layouts based on the comparison. 16. The system of claim 15 , wherein the instructions, upon execution by the processor, further ca

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9235758B1 cover?
Techniques for comparing documents may be provided. For example, a comparison between layouts of the documents may be performed. The comparison may include segmenting the documents into blocks, where an arrangement of blocks of a document represents a layout of the document. Once segmented, similarity metrics, such as distances, between blocks of one document and blocks of the other document ma…
Who is the assignee on this patent?
Adobe Systems Inc
What technology area does this patent fall under?
Primary CPC classification G06K9/00463. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 12 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).