Identifying artifacts in digital documents

US10949604B1 · US · B1

Patent metadata
Publication number: US-10949604-B1
Application number: US-201916664335-A
Country: US
Kind code: B1
Filing date: Oct 25, 2019
Priority date: Oct 25, 2019
Publication date: Mar 16, 2021
Grant date: Mar 16, 2021

Abstract

Official abstract text for this publication.

Techniques described herein implement identifying artifacts in digital documents in a digital medium environment. A document analysis system is leveraged to extract page features from a digital document and to determine whether certain page features represent page artifacts such as headers and footers. Those page features determined to be page artifacts can be extracted from the digital document to generate a reflowed version of the digital document that preserves primary content. The primary content, for instance, is rearranged in the reflowed document to compensate for the extracted page artifacts.
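As a rough illustration of the reflow step the abstract describes, the sketch below drops blocks already flagged as page artifacts and returns the remaining primary content in reading order. The `Block` type, its fields, and the `reflow` function are hypothetical names chosen for this example; the patent does not prescribe a data model, and the artifact flag is assumed to come from an earlier analysis step.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical page feature: a piece of text at a vertical position."""
    text: str
    y: float            # assumed normalized position: 0.0 = top, 1.0 = bottom
    is_artifact: bool   # assumed to be set by an upstream analysis step


def reflow(pages: list[list[Block]]) -> list[str]:
    """Remove artifact blocks and return primary content in reading order."""
    reflowed = []
    for page in pages:
        for block in sorted(page, key=lambda b: b.y):
            if not block.is_artifact:
                reflowed.append(block.text)
    return reflowed


pages = [
    [
        Block("Annual Report", 0.02, True),
        Block("Intro paragraph.", 0.30, False),
        Block("Page 1", 0.97, True),
    ],
    [
        Block("Annual Report", 0.02, True),
        Block("Second paragraph.", 0.30, False),
        Block("Page 2", 0.97, True),
    ],
]

print(reflow(pages))  # ['Intro paragraph.', 'Second paragraph.']
```

With the headers and footers removed, the two body paragraphs become adjacent, which is the sense in which primary content is rearranged to compensate for the extracted artifacts.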

First claim

Opening claim text (preview).

What is claimed is:

1. A system for generating a reflowed digital document, the system comprising: one or more processors; and one or more computer-readable storage media storing instructions that are executable by the one or more processors to perform operations including: extracting page features from pages of the digital document, the page features including a first content type and a second content type; determining feature scores for the page features, each feature score indicating a likelihood that a respective page feature represents the second content type on a page of the digital document; sorting the page features by applying their respective feature scores to a first likelihood threshold and a second likelihood threshold to identify: a set of high confidence features with feature scores above the first likelihood threshold that include a first page feature most likely to correspond to the second content type of the digital document; set of mid confidence features that include a second page feature with a feature score between the first threshold and the second threshold; and a set of low confidence features that include a third page feature with a feature score lower than the second threshold; defining an artifact region of one or more pages of the digital document based on a page position of the first page feature; determining that a page position of the second page feature spatially coincides with the defined artifact region on at least one page of the digital document and identifying the second page feature as the second content type; and extracting the first page feature and the second page feature from the digital document to generate the reflowed digital document that includes the first content type.

2. A system as described in claim 1, wherein said determining the feature scores comprises inputting the page features into a machine learning model, and receiving the feature scores as output from the machine learning model.

3. A system as described in claim 1, wherein said operations further include disregarding the third page feature as part of identifying the second content type of the digital document.

4. A system as described in claim 1, wherein the operations further include defining the artifact region based on determining that the digital document includes a threshold number of pages with an instance of the second content type located at the artifact region.

5. A system as described in claim 1, wherein the operations further include defining the artifact region based on determining that the artifact region includes an instance of the second content type that spans more than a threshold amount of a width of a page of the digital document.

6. A system as described in claim 1, wherein the second page feature comprises at least one of a header or a footer of the digital document, and wherein the operations further include reflowing the digital document to generate the reflowed document by repositioning the primary content to account for the extracted second page feature.

7. A system as described in claim 1, wherein the instructions are executable by the one or more processors to perform the operations in response to receiving a selection of a selectable control displayed along with the digital document.

8. A system as described in claim 7, wherein the selectable control is selectable to toggle back and forth between the digital document and the reflowed document.

9. A method implemented by at least one computing device for generating a reflowed version of a digital document, the method comprising: determining, by the at least one computing device, feature scores for page features of the digital document, the page features including a first content type and a second content type, and each feature score indicating a likelihood that a respective page feature represents the second content type of the digital document; sorting, by the at least one computing device, the page features by applying their respective feature scores to a first likelihood threshold and a second likelihood threshold to identify a first set of page features with feature scores that exceed the first likelihood threshold, a second set of page features with feature scores between the first likelihood threshold and the second likelihood threshold, and a third set of page features with feature scores that are below the second likelihood threshold; defining, by the at least one computing device, an artifact region on at least one page of the digital document based on a position of a page feature of the first set of page features; determining, by the at least one computing device, that a page feature from the second set of pages features represents the second content type by determining that a position of the page feature from the second set of page features spatially coincides with a position of the artifact region; and generating, by the at least one computing device, a reflowed version of the digital document by extracting the page feature of the first set of page features and the page feature of the second set of page features and preserving the first content type of the digital document.

10. A method as described in claim 9, wherein the third set of page features is disregarded for identifying instances of the second content type of the digital document.

11. A method as described in claim 9, wherein said defining the artifact region is further based on determining that the page feature of the first set of page features spans more than a threshold amount of a width of a page of the digital document.

12. A method as described in claim 9, wherein the page feature from the second set of page features comprises at least one of a header or a footer of the digital document.

13. A method as described in claim 9, wherein said defining the artifact region is further based on determining, by the at least one computing device, that the page feature of the first set of page feature occurs on one or more of a threshold number of pages of the digital document, or a threshold percentage of pages of the digital document.

14. A method as described in claim 9, wherein said generating further comprises rearranging, by the at least one computing device, the first content type to account for the extracted page features.

15. A method implemented by at least one computing device for generating a reflowed version of a digital document, the method comprising: extracting page features from pages of the digital document, the page features including a first content type and a second content type; determining feature scores for the page features, each feature score indicating a likelihood that a respective page feature represents the second content type on a page of the digital document; sorting the page features by applying their respective feature scores to a first likelihood threshold and a second likelihood threshold to identify: a set of high confidence features with feature scores above the first likelihood threshold that include a first page feature most likely to correspond to the second content type of the digital document; set of mid confidence features that include a second page feature with a feature score between the first threshold and the second threshold; and a set of low confidence features that include a third page feature with a feature score lower than the second threshold; defining an artifact region of one or more pages of the digital document based on a page position of the first page feature; determining that a page position of the second page feature spatially coincides with the defined artifact region on
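The two-threshold sorting and region-matching steps recited in claims 1, 9, and 15 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the threshold values (0.8 and 0.4), the `Feature` fields, the purely vertical overlap test, and every function name are hypothetical. Scores are taken as given; claim 2 mentions a machine learning model as one possible source. Treating a score exactly equal to a threshold as mid confidence is also an assumption, since the claims only say "between".

```python
from __future__ import annotations

from dataclasses import dataclass

HIGH_T = 0.8  # assumed value for the first likelihood threshold
LOW_T = 0.4   # assumed value for the second likelihood threshold


@dataclass(frozen=True)
class Feature:
    """Hypothetical page feature with a precomputed artifact score."""
    text: str
    page: int
    top: float     # assumed normalized vertical extent, 0.0 = top of page
    bottom: float
    score: float   # likelihood that the feature is a page artifact


def sort_by_confidence(features):
    """Split features into high, mid, and low confidence sets."""
    high = [f for f in features if f.score > HIGH_T]
    mid = [f for f in features if LOW_T <= f.score <= HIGH_T]
    low = [f for f in features if f.score < LOW_T]
    return high, mid, low


def overlaps(feature, region):
    """True if the feature's vertical extent intersects the region."""
    top, bottom = region
    return feature.top < bottom and feature.bottom > top


def find_artifacts(features):
    """High confidence features, plus mid confidence ones inside a region."""
    high, mid, _low = sort_by_confidence(features)  # low tier is disregarded
    regions = [(f.top, f.bottom) for f in high]     # regions from high tier
    promoted = [f for f in mid if any(overlaps(f, r) for r in regions)]
    return high + promoted


def reflow(features):
    """Drop identified artifacts; keep the rest in page and position order."""
    artifacts = set(find_artifacts(features))
    kept = [f for f in features if f not in artifacts]
    return [f.text for f in sorted(kept, key=lambda f: (f.page, f.top))]


features = [
    Feature("Annual Report 2019", 1, 0.00, 0.05, 0.95),      # high confidence
    Feature("Body text on page one.", 1, 0.10, 0.90, 0.05),  # low confidence
    Feature("Annual Rep0rt 2019", 2, 0.01, 0.05, 0.60),      # mid, in region
    Feature("Figure 3: revenue", 2, 0.50, 0.55, 0.55),       # mid, outside
    Feature("Body text on page two.", 2, 0.10, 0.45, 0.10),  # low confidence
]

print(reflow(features))
# ['Body text on page one.', 'Body text on page two.', 'Figure 3: revenue']
```

In this example the mid confidence header on page two is removed only because it coincides with the region defined by the high confidence header on page one, while the mid confidence figure caption elsewhere on the page is preserved. The additional checks in claims 4, 5, 11, and 13 (a threshold number or percentage of pages, or a minimum span of the page width) would be further conditions on which high confidence features are allowed to define a region; they are omitted here for brevity.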

Assignees

Adobe Inc

Inventors

Classifications

  • Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

  • removing elements interfering with the pattern to be recognised

  • Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR

  • Classification of content, e.g. text, photographs or tables

  • Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10949604B1 cover?
Techniques described herein implement identifying artifacts in digital documents in a digital medium environment. A document analysis system is leveraged to extract page features from a digital document and to determine whether certain page features represent page artifacts such as headers and footers.
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/114. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 16, 2021 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).