Detecting webpages that share malicious content
US-2019190946-A1 · Jun 20, 2019 · US
US2021105298A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2021105298-A1 |
| Application number | US-202017020232-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 14, 2020 |
| Priority date | Dec 20, 2017 |
| Publication date | Apr 8, 2021 |
| Grant date | — |
Official abstract text for this publication.
Methods and systems for detecting webpages that share malicious content are presented. A first set of webpages that hosts a web account checker is identified. A baseline page structure score and a baseline language score are calculated based on the identified first set of webpages. Content from a second set of webpages is collected and analyzed based on the calculated baseline page structure and the calculated baseline language scores. One or more of the second set of webpages is flagged as malicious based on the analyzing of the content collected from the second set of webpages.
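The flow in the abstract (build baselines from known pages, then score and flag candidate pages) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the publication: the names `TagCounter`, `features`, `cosine`, `build_baseline`, and `flag_pages` are hypothetical, tag counts stand in for the page structure score, word counts stand in for the language score, and cosine similarity against a fixed threshold of 0.8 stands in for whatever analysis the applicant actually uses.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags and collects text nodes from an HTML string."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

    def handle_data(self, data):
        self.text.append(data)


def features(html):
    """Return (tag counts, word counts) for one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    words = Counter(re.findall(r"[a-z]+", " ".join(parser.text).lower()))
    return parser.tags, words


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_baseline(known_pages):
    """Aggregate tag and word counts over the first (already identified) set of pages."""
    tag_base, word_base = Counter(), Counter()
    for html in known_pages:
        tags, words = features(html)
        tag_base.update(tags)
        word_base.update(words)
    return tag_base, word_base


def flag_pages(candidates, baseline, threshold=0.8):
    """Return URLs whose structure AND language both resemble the baseline.

    The threshold is an illustrative assumption, not a value from the source.
    """
    tag_base, word_base = baseline
    flagged = []
    for url, html in candidates.items():
        tags, words = features(html)
        if cosine(tags, tag_base) >= threshold and cosine(words, word_base) >= threshold:
            flagged.append(url)
    return flagged
```

A caller would build the baseline once from the first set of pages and then pass a mapping of URL to HTML for the second set, for example `flag_pages(crawled, build_baseline(known_html))`. Requiring both scores to clear the threshold is a design choice of this sketch; the claims instead recite a machine learning classifier trained on the baseline scores.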
Opening claim text (preview).
1. (canceled)
2. A system for detecting malicious activity on webpages comprising: a non-transitory memory; and one or more hardware processors coupled to the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform operations comprising: categorizing each of a first set of webpages as a host of a respective online phishing instrument; calculating a baseline page structure score and a baseline language score based on the identified first set of webpages hosting the respective online phishing instruments; collecting content from a second set of webpages for analysis; analyzing the content collected from the second set of webpages using a machine learning classifier trained based on the calculated baseline page structure and the calculated baseline language scores; and flagging one or more of the second set of webpages as malicious based on the analyzing of the content collected from the second set of webpages using the machine learning classifier.
3. The system of claim 2, wherein the categorizing each of the first set of webpages as the host of the respective online phishing instrument comprises searching a plurality of webpages for predefined terms known to be used in webpages associated with hosts of online phishing instruments, wherein each of the first set of webpages is categorized as the host of the respective online phishing instrument based on at least one of the predefined terms being found to be included on a respective webpage of the first set of webpages.
4. The system of claim 2, wherein the baseline page structure score is calculated based on Hypertext Markup Language (HTML) feature elements discovered in the first set of webpages.
5. The system of claim 2, wherein the baseline language score is calculated based on terms identified from the respective host of the online phishing instrument.
6. The system of claim 2, wherein collecting content from the second set of webpages for analysis comprises using a web crawler to identify website content that requires analysis, the identifying being based on searching for terms found within a dictionary of pre-computed terms associated with hosts of online phishing instruments.
7. The system of claim 6, wherein the collecting the content from the second set of webpages for analysis further comprises filtering out results from an approved list of domains and advertisements.
8. The system of claim 2, wherein analyzing the content collected from the second set of webpages using the machine learning classifier comprises: determining that the collected content does not fit into the known malicious classification by calculating a page structure score based on the baseline page structure score, and calculating a language score based on the baseline page structure score.
9. A method comprising: categorizing each of a first set of webpages as a host of a respective online phishing instrument; calculating a baseline page structure score and a baseline language score based on the identified first set of webpages hosting the respective online phishing instruments; collecting content from a second set of webpages for analysis; analyzing the content collected from the second set of webpages using a machine learning classifier trained based on the calculated baseline page structure and the calculated baseline language scores; and flagging one or more of the second set of webpages as malicious based on the analyzing of the content collected from the second set of webpages using the machine learning classifier.
10. The method of claim 9, wherein the categorizing each of the first set of webpages as the host of the respective online phishing instrument comprises searching a plurality of webpages for predefined terms known to be used in webpages associated with hosts of online phishing instruments, wherein each of the first set of webpages is categorized as the host of the respective online phishing instrument based on at least one of the predefined terms being found to be included on a respective webpage of the first set of webpages.
11. The method of claim 9, wherein the baseline page structure score is calculated based on Hypertext Markup Language (HTML) feature elements discovered in the first set of webpages.
12. The method of claim 9, wherein the baseline language score is calculated based on terms identified from the respective host of the online phishing instrument.
13. The method of claim 9, wherein collecting content from the second set of webpages for analysis comprises using a web crawler to identify website content that requires analysis, the identifying being based on searching for terms found within a dictionary of pre-computed terms associated with hosts of online phishing instruments.
14. The method of claim 13, wherein the collecting the content from the second set of webpages for analysis further comprises filtering out results from an approved list of domains and advertisements.
15. The method of claim 9, wherein analyzing the content collected from the second set of webpages using the machine learning classifier comprises: determining that the collected content does not fit into the known malicious classification by calculating a page structure score based on the baseline page structure score, and calculating a language score based on the baseline page structure score.
16. A non-transitory machine-readable medium having stored thereon machine-readable instructions executable to cause performance of operations comprising: categorizing each of a first set of webpages as a host of a respective online phishing instrument; calculating a baseline page structure score and a baseline language score based on the identified first set of webpages hosting the respective online phishing instruments; collecting content from a second set of webpages for analysis; analyzing the content collected from the second set of webpages using a machine learning classifier trained based on the calculated baseline page structure and the calculated baseline language scores; and flagging one or more of the second set of webpages as malicious based on the analyzing of the content collected from the second set of webpages using the machine learning classifier.
17. The non-transitory machine-readable medium of claim 16, wherein the categorizing each of the first set of webpages as the host of the respective online phishing instrument comprises searching a plurality of webpages for predefined terms known to be used in webpages associated with hosts of online phishing instruments, wherein each of the first set of webpages is categorized as the host of the respective online phishing instrument based on at least one of the predefined terms being found to be included on a respective webpage of the first set of webpages.
18. The non-transitory machine-readable medium of claim 16, wherein the baseline page structure score is calculated based on Hypertext Markup Language (HTML) feature elements discovered in the first set of webpages.
19. The non-transitory machine-readable medium of claim 16, wherein the baseline language score is calculated based on terms identified from the respective host of the online phishing instrument.
20. The non-transitory machine-readable medium of claim 16, wherein collecting content from the second set of webpages for analysis comprises using a web crawler to identify website content that requires analysis, the identifying being
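The collection step recited in claims 6, 7, 13, and 14 (select crawled pages that contain dictionary terms, then drop results from approved domains) might look roughly like the sketch below. The function name `select_for_analysis`, the substring matching, and the suffix-based domain check are all illustrative assumptions; the claims do not specify how matching is done, and the advertisement filtering they mention is not modeled here.

```python
from urllib.parse import urlparse


def select_for_analysis(pages, dictionary_terms, approved_domains):
    """Pick crawled pages that need analysis.

    pages: mapping of URL -> page text already fetched by a crawler.
    dictionary_terms: pre-computed terms associated with the targeted hosts.
    approved_domains: domains to skip (the host itself or any subdomain).
    """
    terms = [t.lower() for t in dictionary_terms]
    approved = [d.lower() for d in approved_domains]
    selected = []
    for url, text in pages.items():
        host = (urlparse(url).hostname or "").lower()
        # Skip anything served from an approved domain or one of its subdomains.
        if any(host == d or host.endswith("." + d) for d in approved):
            continue
        lowered = text.lower()
        # Keep the page if at least one dictionary term appears in its text.
        if any(term in lowered for term in terms):
            selected.append(url)
    return selected
```

For example, with `approved_domains=["example.com"]`, a page at `https://shop.example.com/` is skipped even if it contains a dictionary term, while a page on an unlisted domain containing the same term is kept.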
Protocols · CPC title
Tree-structured documents (parsing G06F40/205; validation G06F40/226) · CPC title
Markup, e.g. Standard Generalized Markup Language [SGML] or Document Type Definition [DTD] · CPC title
based on web technology, e.g. hypertext transfer protocol [HTTP] · CPC title
Countermeasures against malicious traffic (countermeasures against attacks on cryptographic mechanisms H04L9/002) · CPC title