Method and apparatus for grouping documents based on high-level features clustering

US2019005038A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2019005038-A1
Application number: US-201715639541-A
Country: US
Kind code: A1
Filing date: Jun 30, 2017
Priority date: Jun 30, 2017
Publication date: Jan 3, 2019
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and apparatus for creating a file directory of documents in a database that are clustered based on one or more high level features are disclosed. For example, the method includes identifying the one or more high level features for each one of a plurality of documents stored in the database, comparing the one or more high level features of the each one of the plurality of documents to other documents of the plurality of documents, grouping documents of the plurality of documents into a plurality of clusters based on common high level features that are identified in the comparing and creating the file directory of documents in the database based on the plurality of clusters.
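The grouping and directory steps in the abstract can be illustrated with a minimal sketch. It assumes that feature identification has already happened and that each document is represented by a set of feature names. It also assumes that "common high level features" means an identical feature set, which the abstract does not specify. The function names `cluster_by_features` and `build_directory`, the file names, and the folder-label format are hypothetical.

```python
from collections import defaultdict


def cluster_by_features(doc_features):
    """Group document names whose high-level feature sets are identical.

    Assumption: an exact match on the feature set defines a cluster.
    """
    clusters = defaultdict(list)
    for name, features in doc_features.items():
        clusters[frozenset(features)].append(name)
    return clusters


def build_directory(clusters):
    """Map a readable folder label to the sorted documents of each cluster."""
    directory = {}
    for features, names in clusters.items():
        # Hypothetical label format: feature names joined alphabetically.
        label = "+".join(sorted(features)) or "no_features"
        directory[label] = sorted(names)
    return directory


# Illustrative input only; these documents and feature sets are made up.
docs = {
    "invoice_a.pdf": {"table", "address field"},
    "invoice_b.pdf": {"address field", "table"},
    "flyer.pdf": {"spot title", "margin icon"},
}
directory = build_directory(cluster_by_features(docs))
```

Here the two invoices land in one folder because they share the same feature set, while the flyer gets its own. A looser criterion, such as a minimum overlap between feature sets, would need a different grouping function.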

First claim

Opening claim text (preview).

What is claimed is:

1. A method for creating a file directory of documents in a database that are clustered based on one or more high level features, comprising: identifying, by a processor, the one or more high level features for each one of a plurality of documents stored in the database; comparing, by the processor, the one or more high level features of the each one of the plurality of documents to other documents of the plurality of documents; grouping, by the processor, documents of the plurality of documents into a plurality of clusters based on common high level features that are identified in the comparing; and creating, by the processor, the file directory of documents in the database based on the plurality of clusters.

2. The method of claim 1, wherein the one or more high level features comprises a spot title, an address field, a margin icon, a table, a border area, or a text flow.

3. The method of claim 1, wherein the one or more high level features are identified based on a predefined set of rules.

4. The method of claim 3, wherein the predefined set of rules comprises a size of a feature and a location of the feature relative to an origin.

5. The method of claim 4, wherein the origin comprises a top left corner of the document.

6. The method of claim 3, wherein the one or more high level features comprise a pre-defined priority level.

7. The method of claim 6, wherein a feature comprising two different rules of the predefined set of rules is identified based on the pre-defined priority level.

8. The method of claim 1, wherein the identifying and the comparing is performed based on only a first page the each one of the plurality of documents.

9. The method of claim 1, wherein the documents in each one of the plurality of clusters share a same number of different high level features.

10. A non-transitory computer-readable medium storing a plurality of instructions, which when executed by a processor, cause the processor to perform operations for creating a file directory of documents in a database that are clustered based on one or more high level features, the operations comprising: identifying the one or more high level features for each one of a plurality of documents stored in the database; comparing the one or more high level features of the each one of the plurality of documents to other documents of the plurality of documents; grouping documents of the plurality of documents into a plurality of clusters based on common high level features that are identified in the comparing; and creating the file directory of documents in the database based on the plurality of clusters.

11. The non-transitory computer-readable medium of claim 10, wherein the one or more high level features comprises a spot title, an address field, a margin icon, a table, a border area, or a text flow.

12. The non-transitory computer-readable medium of claim 10, wherein the one or more high level features are identified based on a predefined set of rules.

13. The non-transitory computer-readable medium of claim 12, wherein the predefined set of rules comprises a size of a feature and a location of the feature relative to an origin.

14. The non-transitory computer-readable medium of claim 13, wherein the origin comprises a top left corner of the document.

15. The non-transitory computer-readable medium of claim 12, wherein the one or more high level features comprise a pre-defined priority level.

16. The non-transitory computer-readable medium of claim 15, wherein a feature comprising two different rules of the predefined set of rules is identified based on the pre-defined priority level.

17. The non-transitory computer-readable medium of claim 10, wherein the identifying and the comparing is performed based on only a first page the each one of the plurality of documents.

18. The non-transitory computer-readable medium of claim 10, wherein the documents in each one of the plurality of clusters share a same number of different high level features.

19. A method for creating a file directory of documents in a database that are clustered based on one or more high level features, comprising: scanning, by a processor, a plurality of segments of each one of a plurality of documents stored in the database, wherein the plurality segments have a predefined size; comparing, by the processor, images in each one of the plurality of segments to a plurality of predefined rules, wherein each one of the plurality of predefined rules is associated with a different high level feature; identifying, by the processor, the one or more high level features based on the comparing for the each one of a plurality of documents; comparing, by the processor, the one or more high level features of the each one of the plurality of documents to other documents of the plurality of documents; grouping, by the processor, documents of the plurality of documents into a plurality of clusters, wherein the documents in each one of the plurality of clusters share a same number of different high level features that are identified based on the comparing; and creating, by the processor, the file directory of documents in the database based on the plurality of clusters.

20. The method of claim 19, wherein the one or more high level features comprises a spot title, an address filed, a margin icon, a table, a border area, or a text flow.
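Claims 3 to 7 describe identifying features from predefined rules based on size and on location relative to the top left corner, with a priority level deciding between two matching rules. The sketch below is one possible reading, not the patent's implementation. The `Region` and `Rule` types, the threshold fields, the numeric values, and the convention that a lower priority number wins are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    """A rectangular page region; x and y are offsets from the top-left corner."""
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a location bound plus a minimum size for one feature."""
    feature: str
    priority: int  # assumption: the lower number takes precedence
    max_x: float
    max_y: float
    min_width: float
    min_height: float

    def matches(self, region: Region) -> bool:
        return (
            region.x <= self.max_x
            and region.y <= self.max_y
            and region.width >= self.min_width
            and region.height >= self.min_height
        )


def identify(region: Region, rules: Sequence[Rule]) -> Optional[str]:
    """Return the feature of the highest-priority matching rule, or None."""
    hits = [rule for rule in rules if rule.matches(region)]
    if not hits:
        return None
    return min(hits, key=lambda rule: rule.priority).feature


# Made-up thresholds, in arbitrary page units.
RULES = [
    Rule("spot title", 1, max_x=300, max_y=100, min_width=200, min_height=20),
    Rule("border area", 2, max_x=50, max_y=50, min_width=10, min_height=10),
]

wide_header = Region(x=20, y=30, width=250, height=40)   # satisfies both rules
small_corner = Region(x=10, y=10, width=30, height=30)   # satisfies only "border area"
```

With these values, `identify(wide_header, RULES)` resolves the double match in favor of "spot title", and `identify(small_corner, RULES)` yields "border area". Restricting the input regions to the first page would correspond to claims 8 and 17.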

Assignees

Xerox Corp

Inventors

Classifications

  • using shape and object relationship · CPC title

  • G06F16/51 · Primary

    Indexing; Data structures therefor; Storage structures · CPC title

  • G06F16/93 · Primary

    Document management systems · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2019005038A1 cover?
A method and apparatus for creating a file directory of documents in a database that are clustered based on one or more high level features are disclosed. For example, the method includes identifying the one or more high level features for each one of a plurality of documents stored in the database, comparing the one or more high level features of the each one of the plurality of documents to o…
Who is the assignee on this patent?
Xerox Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/51. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 3, 2019 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).