Detecting phishing PDFs with an image-based deep learning approach

US12348560B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12348560-B2
Application number: US-202217734956-A
Country: US
Kind code: B2
Filing date: May 2, 2022
Priority date: Apr 25, 2022
Publication date: Jul 1, 2025
Grant date: Jul 1, 2025

Abstract

Official abstract text for this publication.

The detection of phishing Portable Document Format (PDF) files using an image-based deep learning approach is disclosed. A PDF document that includes a Universal Resource Locator is received. A likelihood that the received PDF document represents a phishing threat is determined, at least in part, by using an image based model. A verdict for the PDF document is provided as output based at least in part on the determined likelihood.
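
The flow in the abstract (receive a PDF that contains a URL, score it, emit a verdict) can be sketched as below. This is a minimal illustration, not the patented implementation: the names `has_url`, `classify`, `score_fn`, the `0.5` threshold, and the `"not_analyzed"` label are all hypothetical, and the byte-level `/URI` scan is a naive stand-in that misses links stored in compressed object streams or written as hex strings. The image-based model is represented only by a caller-supplied scoring function.

```python
import re
from typing import Callable

# Hypothetical cutoff; the abstract does not state how likelihood maps to a verdict.
PHISHING_THRESHOLD = 0.5

# Matches an uncompressed URI action such as: /A << /S /URI /URI (http://...) >>
URI_ACTION = re.compile(rb"/URI\s*\(")


def has_url(pdf_bytes: bytes) -> bool:
    """Naive gate: True if the raw bytes contain a literal-string URI action."""
    return URI_ACTION.search(pdf_bytes) is not None


def classify(
    pdf_bytes: bytes,
    score_fn: Callable[[bytes], float],
    threshold: float = PHISHING_THRESHOLD,
) -> str:
    """Return a verdict string for a PDF.

    score_fn stands in for the image-based model and is assumed to return
    a phishing likelihood in [0, 1].
    """
    if not has_url(pdf_bytes):
        # Assumption: documents without a URL are not sent to the model.
        return "not_analyzed"
    likelihood = score_fn(pdf_bytes)
    return "phishing" if likelihood >= threshold else "benign"
```

A caller would pass the real model's scoring function as `score_fn`; the verdict string is what a downstream security appliance would act on.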

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: a processor configured to: receive a Portable Document Format (PDF) document in response to a determination having been made that the PDF document includes at least one clickable link to a Uniform Resource Locator (URL); determine a likelihood that the received PDF document represents a phishing threat, at least in part, using an image-based model that was previously trained, at least in part, using a plurality of images that were generated using one or more tools that collectively convert a set of PDF given document files to the plurality of images, wherein at least one given document file has a ground truth label of being a phishing PDF; and provide as output a verdict for the PDF document based at least in part on the determined likelihood, wherein the verdict is usable by a security appliance to take a remedial action associated with the received PDF document; and a memory coupled to the processor and configured to provide the processor with instructions.

2. The system of claim 1, wherein the processor is further configured to determine whether the PDF document includes at least one the clickable link.

3. The system of claim 1, wherein the verdict is that the received PDF document is benign.

4. The system of claim 1, wherein the verdict is that the received PDF document does not represent a phishing threat.

5. The system of claim 1, wherein determining the likelihood includes converting at least one page of the received PDF document into an image.

6. The system of claim 1, wherein at least some of the images labeled as phishing PDFs belong, collectively, to a multi-page PDF document.

7. The system of claim 1, wherein, prior to training the image-based model, an image hash-based filtering operation is performed on at least some of the images labeled as phishing PDFs.

8. The system of claim 7, wherein filtered images are stored using a TFRecord data format.

9. The system of claim 1, wherein the processor is further configured to generate the image-based model.

10. The system of claim 1, wherein the image-based model is a convolutional neural network model.

11. The system of claim 1, wherein, at least in part in response to receiving an indication of a false positive result, the image-based model is retrained using a benign data set that includes the false positive result.

12. A method, comprising: receiving a Portable Document Format (PDF) document in response to a determination having been made that the PDF document includes at least one clickable link to a Uniform Resource Locator (URL); determining a likelihood that the received PDF document represents a phishing threat, at least in part, using an image-based model that was previously trained, at least in part, using a plurality of images that were generated using one or more tools that collectively convert a set of PDF given document files to the plurality of images, wherein at least one given document file has a ground truth label of being a phishing PDF; and providing as output a verdict for the PDF document based at least in part on the determined likelihood, wherein the verdict is usable by a security appliance to take a remedial action associated with the received PDF document.

13. The method of claim 12, further comprising determining whether the PDF document includes the at least one clickable link.

14. The method of claim 12, wherein the verdict is that the received PDF document is benign.

15. The method of claim 12, wherein the verdict is that the received PDF document does not represent a phishing threat.

16. The method of claim 12, wherein determining the likelihood includes converting at least one page of the received PDF document into an image.

17. The method of claim 12, wherein at least some of the images labeled as phishing PDFs belong, collectively, to a multi-page PDF document.

18. The method of claim 12, wherein, prior to training the image-based model, an image hash-based filtering operation is performed on at least some of the images labeled as phishing PDFs.

19. The method of claim 18, wherein filtered images are stored using a TFRecord data format.

20. The method of claim 12, further comprising generating the image-based model.

21. The method of claim 12, wherein the image-based model is a convolutional neural network model.

22. The method of claim 12, wherein, at least in part in response to receiving an indication of a false positive result, the image-based model is retrained using a benign data set that includes the false positive result.

23. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: receiving a Portable Document Format (PDF) document in response to a determination having been made that the PDF document includes at least one clickable link to a Uniform Resource Locator (URL); determining a likelihood that the received PDF document represents a phishing threat, at least in part, using an image-based model that was previously trained, at least in part, using a plurality of images that were generated using one or more tools that collectively convert a set of PDF given document files to the plurality of images, wherein at least one given document file has a ground truth label of being a phishing PDF; and providing as output a verdict for the PDF document based at least in part on the determined likelihood, wherein the verdict is usable by a security appliance to take a remedial action associated with the received PDF document.
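
Claims 7 and 18 recite an image hash-based filtering operation on the training images but do not name a specific hash. As one possible reading, the sketch below uses an average hash to drop near-duplicate page images before training. Everything here is an assumption for illustration: the choice of average hash, the function names, the `max_distance` parameter, and the representation of each image as a small grid of grayscale values (a real pipeline would first render and downscale PDF pages with an imaging tool).

```python
from typing import Iterable, List, Sequence

# Assumed input: an already-downscaled grayscale image, rows of 0-255 values.
Gray = Sequence[Sequence[int]]


def average_hash(pixels: Gray) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def filter_near_duplicates(images: Iterable[Gray], max_distance: int = 0) -> List[Gray]:
    """Keep an image only if its hash differs from every kept hash
    by more than max_distance bits (hypothetical tuning parameter)."""
    kept: List[Gray] = []
    hashes: List[int] = []
    for image in images:
        h = average_hash(image)
        if all(hamming(h, seen) > max_distance for seen in hashes):
            kept.append(image)
            hashes.append(h)
    return kept
```

Filtering of this kind keeps many visually identical phishing pages from dominating the training set; raising `max_distance` makes the filter more aggressive.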

Assignees

  • Palo Alto Networks Inc

Inventors

Classifications

  • G06F21/577 (Primary)

    Assessing vulnerabilities and evaluating computer system security · CPC title

  • service impersonation, e.g. phishing, pharming or web spoofing (detection of rogue wireless access points H04W12/12) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12348560B2 cover?
The detection of phishing Portable Document Format (PDF) files using an image-based deep learning approach is disclosed. A PDF document that includes a Universal Resource Locator is received. A likelihood that the received PDF document represents a phishing threat is determined, at least in part, by using an image based model. A verdict for the PDF document is provided as output based at least …
Who is the assignee on this patent?
Palo Alto Networks Inc
What technology area does this patent fall under?
Primary CPC classification G06F21/577. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 1, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).