US2017126723A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017126723-A1 |
| Application number | US-201615275303-A |
| Country | US |
| Kind code | A1 |
| Filing date | Sep 23, 2016 |
| Priority date | Oct 30, 2015 |
| Publication date | May 4, 2017 |
| Grant date | — |
Official abstract text for this publication.
The present invention provides a method and device for identifying URL legitimacy. Through obtaining a URL to be identified, and then obtaining, based on the URL to be identified, a legitimate URL corresponding to the URL to be identified as a comparison object, and calculating a degree of similarity between the URL to be identified and the comparison object, the present invention makes it possible to identify the legitimacy of the URL to be identified based on the degree of similarity, enabling timely discovering of illegitimate URLs and thus improving the safety of information processing.
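The abstract does not say how the degree of similarity is computed. As a minimal sketch, assuming a normalized edit-distance measure (an illustrative choice, not taken from the publication), the comparison step could look like the following. The function names `edit_distance` and `similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (classic row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(candidate: str, reference: str) -> float:
    """Assumed similarity measure: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(candidate, reference) / longest


# A look-alike name differs from the legitimate one by a single character.
print(similarity("ba1du", "baidu"))
```

Under this assumption, a score of exactly 1 means the compared strings are identical, and a high score below 1 indicates a near-copy of a legitimate name.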
Opening claim text (preview).
We claim:
1. A method for identifying URL legitimacy, wherein the method comprises: obtaining a URL to be identified; obtaining, based on the URL to be identified, a legitimate URL corresponding to the URL to be identified as a comparison object; calculating a degree of similarity between the URL to be identified and the comparison object; identifying the legitimacy of the URL to be identified based on the degree of similarity.
2. The method according to claim 1, wherein the step of obtaining, based on the URL to be identified, a legitimate URL corresponding to the URL to be identified as a comparison object comprises: obtaining, based on the URL to be identified and an inverted index of legitimate URLs, a legitimate URL corresponding to the URL to be identified as the comparison object.
3. The method according to claim 2, wherein, the method comprises, before the step of obtaining, based on the URL to be identified and an inverted index of legitimate URLs, a legitimate URL corresponding to the URL to be identified as the comparison object, the following: collecting at least one legitimate URL; carrying out word segmentation on each of the legitimate URLs of the at least one legitimate URL with a N-Gram model, so as to obtain a segmentation result; obtaining the inverted index of legitimate URLs based on each of the legitimate URLs and the segmentation result of each of the legitimate URLs.
4. The method according to claim 3, wherein, the step of carrying out a word segmentation on each of the legitimate URLs of the at least one legitimate URL with a N-Gram model, so as to obtain a segmentation result comprises: obtaining the domain name of each of the legitimate URLs based each of the legitimate URLs; removing the prefix and suffix of the domain name of each of the legitimate URLs, so as to obtain an essential word of each of the legitimate URLs; carrying out word segmentation on the essential word of each of the legitimate URLs with a N-Gram model, so as to obtain a segmentation result.
5. The method according to claim 1, wherein, the step of identifying the legitimacy of the URL to be identified based on the degree of similarity comprises: identifying the URL to be identified as a legitimate URL if the degree of similarity is equal to 1 and the suffix of the URL to be identified is consistent with the suffix of the comparison object; or identifying the URL to be identified as a suspected illegitimate URL if the degree of similarity is equal to 1 and the suffix of the URL to be identified is inconsistent with the suffix of the comparison object; or identifying the URL to be identified as an illegitimate URL if the degree of similarity is greater than or equal to a first threshold value and less than 1; identifying the URL to be identified as a suspected illegitimate URL if the degree of similarity is greater than or equal to a second threshold value and less than the first threshold value, wherein the second threshold value is less than the first threshold value; identifying the URL to be identified as a legitimate URL if the degree of similarity is less than the second threshold value or equal to 1.
6. The method according to claim 5, wherein, before the step of identifying the legitimacy of the URL to be identified based on the degree of similarity, the method further comprises: carrying out legitimacy identification processing on at least one sample URL with the at least one legitimate URL, so as to obtain an identification result; obtaining the first threshold value and the second threshold value based on the identification result and a labeling result of each of the sample URLs of the at least one sample URL.
7. The method according to claim 1, wherein after the step of identifying the legitimacy of the URL to be identified based on the degree of similarity, the method further comprises: sending the identification result to a terminal so that: the terminal displays the identification result; and/or the terminal allows or prohibits, based on the identification result, executing access operations based on the URL to be identified.
8. A nonvolatile computer storage medium, stored with one or more programs, which, when executed by an apparatus, make the apparatus to execute the following operation: obtaining a URL to be identified; obtaining, based on the URL to be identified, a legitimate URL corresponding to the URL to be identified as a comparison object; calculating a degree of similarity between the URL to be identified and the comparison object; identifying the legitimacy of the URL to be identified based on the degree of similarity.
9. The nonvolatile computer storage medium according to claim 8, wherein the operation of obtaining, based on the URL to be identified, a legitimate URL corresponding to the URL to be identified as a comparison object comprises: obtaining, based on the URL to be identified and an inverted index of legitimate URLs, a legitimate URL corresponding to the URL to be identified as the comparison object.
10. The nonvolatile computer storage medium according to claim 9, wherein, before the operation of obtaining, based on the URL to be identified and an inverted index of legitimate URLs, a legitimate URL corresponding to the URL to be identified as the comparison object, the one or more programs make the apparatus to further execute the following operation: collecting at least one legitimate URL; carry out word segmentation on each of the legitimate URLs of the at least one legitimate URL with a N-Gram model, so as to obtain a segmentation result; obtaining the inverted index of legitimate URLs based on each of the legitimate URLs and the segmentation result of each of the legitimate URLs.
11. The nonvolatile computer storage medium according to claim 10, wherein the operation of carrying out a word segmentation on each of the legitimate URLs of the at least one legitimate URL with a N-Gram model, so as to obtain a segmentation result comprises: obtaining the domain name of each of the legitimate URLs based each of the legitimate URLs; removing the prefix and suffix of the domain name of each of the legitimate URLs, so as to obtain an essential word of each of the legitimate URLs; carrying out word segmentation on the essential word of each of the legitimate URLs with a N-Gram model, so as to obtain a segmentation result.
12. The nonvolatile computer storage medium according to claim 8, wherein, the operation of identifying the legitimacy of the URL to be identified based on the degree of similarity comprises: identifying the URL to be identified as a legitimate URL if the degree of similarity is equal to 1 and the suffix of the URL to be identified is consistent with the suffix of the comparison object; or identifying the URL to be identified as a suspected illegitimate URL if the degree of similarity is equal to 1 and the suffix of the URL to be identified is inconsistent with the suffix of the comparison object; or identifying the URL to be identified as an illegitimate URL if the degree of similarity is greater than or equal to a first threshold value and less than 1; identifying the URL to be identified as a suspected illegitimate URL if the degree of similarity is greater than or equal to a second threshold value and less than the first threshold value, wherein the second threshold value is less than the first threshold value; identifying the URL to be identified as a legitimate URL if the degree of similarity is less than the second threshold value or equal to 1.
13. The nonvolatile computer storage medium according to
Comparing digital values (G06F7/06, {G06F7/22,} G06F7/38 take precedence) · CPC title
Parsing · CPC title
Traffic logging, e.g. anomaly detection · CPC title
above the transport layer · CPC title
Entity profiles · CPC title