Method and device for identifying url legitimacy
US-2017126723-A1 · May 4, 2017 · US
US2019190946A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2019190946-A1 |
| Application number | US-201715849395-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 20, 2017 |
| Priority date | Dec 20, 2017 |
| Publication date | Jun 20, 2019 |
| Grant date | — |
Official abstract text for this publication.
Methods and systems for detecting webpages that share malicious content are presented. A first set of webpages that hosts a web account checker is identified. A baseline page structure score and a baseline language score are calculated based on the identified first set of webpages. Content from a second set of webpages is collected and analyzed based on the calculated baseline page structure and the calculated baseline language scores. One or more of the second set of webpages is flagged as malicious based on the analyzing of the content collected from the second set of webpages.
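The flow in the abstract (build baselines from known account-checker pages, score new pages against them, flag those that score high) can be sketched as follows. This is a minimal illustration, not the patented implementation: the publication does not say how the scores are computed, so cosine similarity over term counts and HTML tag counts is an assumption, as are the term list `CHECKER_TERMS`, the `0.8` threshold, and all function names.

```python
import math
import re
from collections import Counter

# Hypothetical dictionary of terms associated with account checkers (illustrative only).
CHECKER_TERMS = {"checker", "combo", "proxy", "hits", "capture"}


def term_profile(page):
    """Count occurrences of dictionary terms in the lower-cased page source."""
    words = re.findall(r"[a-z]+", page.lower())
    return Counter(w for w in words if w in CHECKER_TERMS)


def tag_profile(page):
    """Count opening HTML tag names (a regex is assumed adequate for a sketch)."""
    return Counter(t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", page))


def cosine(a, b):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_baseline(pages, profile):
    """Aggregate a profile over the first set of webpages (known checkers)."""
    total = Counter()
    for page in pages:
        total.update(profile(page))
    return total


def flag_pages(known_pages, candidates, threshold=0.8):
    """Return URLs whose language or structure score exceeds the threshold."""
    language_baseline = build_baseline(known_pages, term_profile)
    structure_baseline = build_baseline(known_pages, tag_profile)
    flagged = []
    for url, page in candidates.items():
        language_score = cosine(term_profile(page), language_baseline)
        structure_score = cosine(tag_profile(page), structure_baseline)
        if language_score > threshold or structure_score > threshold:
            flagged.append(url)
    return flagged
```

Scoring each candidate as a similarity to the baseline keeps the two scores on a common 0 to 1 scale, so one threshold can serve both; a real system might well use separate thresholds or a trained classifier.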
Opening claim text (preview).
What is claimed is:
1. A system for detecting malicious activity on webpages comprising: a non-transitory memory; and one or more hardware processors coupled to the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform operations comprising: identifying a first set of webpages that hosts a web account checker; calculating a baseline page structure score based on the identified first set of webpages; calculating a baseline language score based on the identified first set of webpages; collecting content from a second set of webpages for analysis; analyzing the content collected from the second set of webpages based on the calculated baseline page structure and the calculated baseline language scores; and flagging one or more of the second set of webpages as malicious based on the analyzing of the content collected from the second set of webpages.
2. The system of claim 1, wherein the first set of webpages that hosts the web account checker is identified by searching for specific terms listed in a dictionary of predefined terms associated with web account checkers.
3. The system of claim 1, wherein the baseline page structure score is calculated based on Hypertext Markup Language (HTML) feature elements discovered in the first set of webpages.
4. The system of claim 1, wherein the baseline language score is calculated based on terms identified from known web account checkers.
5. The system of claim 1, wherein collecting content from the second set of webpages for analysis comprises using a web crawler to identify website content that requires analysis, the identifying being based on searching for terms found within a dictionary of pre-computed terms associated with web account checkers.
6. The system of claim 5, wherein collecting content from the second set of webpages for analysis further comprises filtering out results from whitelisted domains and advertisements.
7. The system of claim 1, wherein analyzing the content collected from the second set of webpages comprises: determining whether the collected content fits into a known malicious classification; and in response to a determination that the collected content does not fit into a known malicious classification: calculating a page structure score based on the baseline page structure score, and calculating a language score based on the baseline page structure score.
8. The system of claim 7, wherein one or more of the second set of webpages is flagged as malicious in response to determining that at least one of the calculated page structure score or the calculated language score exceed a predetermined threshold.
9. A method comprising: performing a precomputation by calculating a baseline page structure score and a baseline language score based on a first set of webpages identified as webpages that host web account checkers; collecting content from a second set of webpages for analysis; analyzing the content collected from the second set of webpages by calculating a page structure score based on the baseline page structure score, and calculating a language score based on the baseline language score; determining whether at least one of the calculated page structure score or the calculated language score exceeds a predetermined threshold; and flagging one or more of the second set of webpages as malicious in response to determining that at least one of the calculated page structure score or the calculated language score exceeds the predetermined threshold.
10. The method of claim 9, further comprising identifying the first set of webpages as hosts of web account checkers by performing a search for specific terms listed in a dictionary of predefined terms associated with web account checkers.
11. The method of claim 10, wherein an initial set of terms found in the dictionary are manually input.
12. The method of claim 11 further comprising providing feedback based on the analyzing of the collected content, wherein the corpus of the specific terms in the dictionary increases based on reinforced learning from the analyzing.
13. The method of claim 10, wherein the precomputation that is performed generates an adaptive fingerprint on which the analyzing of the collected content is based.
14. The method of claim 10, wherein the baseline page structure score and the baseline language score is continuously recalculated based on machine learning.
15. The method of claim 10, wherein collecting content from the second set of webpages for analysis further comprises identifying whitelisted domains and filtering results of the collecting based on the identified whitelisted domains.
16. A non-transitory machine-readable medium having stored thereon machine-readable instructions executable to cause performance of operations comprising: identifying a first set of webpages that hosts a web account checker; calculating a baseline page structure score based on the identified first set of webpages; calculating a baseline language score based on the identified first set of webpages; collecting content from a second set of webpages for analysis; analyzing the content collected from the second set of webpages based on the calculated baseline page structure and the calculated baseline language scores; and flagging one or more of the second set of webpages as malicious based on the analyzing of the content collected from the second set of webpages.
17. The non-transitory machine-readable medium of claim 16, wherein collecting content from the second set of webpages for analysis comprises using a web crawler to identify website content that requires analysis, the identifying being based on searching for terms found within a dictionary of pre-computed terms associated with web account checkers.
18. The non-transitory machine-readable medium of claim 16, wherein analyzing the content collected from the second set of webpages comprises: determining whether the collected content fits into a known malicious classification; and in response to a determination that the collected content does not fit into a known malicious classification: calculating a page structure score based on the baseline page structure score, and calculating a language score based on the baseline page structure score.
19. The non-transitory machine-readable medium of claim 17, wherein one or more of the second set of webpages is flagged as malicious in response to determining that at least one of the calculated page structure score or the calculated language score exceed a predetermined threshold.
20. The non-transitory machine-readable medium of claim 16, wherein the baseline page structure score and the baseline language score is continuously recalculated based on machine learning.
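The collection step in claims 5, 6, and 15 (search crawled content for dictionary terms, then drop results from whitelisted domains) might look roughly like the sketch below. It assumes the crawler has already produced `(url, text)` pairs; the whitelist, the search phrases, and the function names are hypothetical, and the advertisement filtering mentioned in claim 6 is left out because the claims give no detail on how it works.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration; the claims do not list actual entries.
WHITELIST = {"example.com"}
SEARCH_TERMS = {"account checker", "combo list"}


def is_whitelisted(url, whitelist=WHITELIST):
    """True if the URL's host is a whitelisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in whitelist)


def select_for_analysis(crawled, terms=SEARCH_TERMS, whitelist=WHITELIST):
    """Keep URLs of non-whitelisted pages whose text contains a dictionary term.

    `crawled` is an iterable of (url, text) pairs from a crawler (assumed input).
    """
    selected = []
    for url, text in crawled:
        if is_whitelisted(url, whitelist):
            continue  # skip trusted domains before any term matching
        lowered = text.lower()
        if any(term in lowered for term in terms):
            selected.append(url)
    return selected
```

Checking the whitelist before the term search avoids spending matching work on pages that would be discarded anyway.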
specially adapted for terminals or networks with limited capabilities; specially adapted for terminal portability · CPC title
Dictionaries · CPC title
based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO] · CPC title
Machine learning · CPC title
Countermeasures against malicious traffic (countermeasures against attacks on cryptographic mechanisms H04L9/002) · CPC title