Determining string similarity using syntactic edit distance

US2016294852A1 · US · A1

Patent metadata
Publication number: US-2016294852-A1
Application number: US-201514679757-A
Country: US
Kind code: A1
Filing date: Apr 6, 2015
Priority date: Apr 6, 2015
Publication date: Oct 6, 2016
Grant date:

Abstract

Official abstract text for this publication.

Examples relate to determining string similarity using syntactic edit distance. In one example, a computing device may receive domain name system (DNS) packets that were sent by a client device, each DNS packet specifying a domain name; generate, for each domain name, a syntax string by replacing each character of the domain name with one of a plurality of metacharacters, each metacharacter representing a category of characters that is different from each other category of characters represented by each other metacharacter; determine, for each domain name, a syntactic edit distance between the domain name and each other domain name, the syntactic edit distance between domain names being determined based on syntax strings of the corresponding domain names; cluster each domain name into one of a plurality of clusters based on the syntactic edit distances; and identify the client device as a potential source of malicious software based on the clusters.
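
The first two steps of the abstract (building a syntax string, then comparing syntax strings) can be sketched in Python. This is a minimal illustration, not the patent's implementation: the category scheme (`V` vowel, `C` consonant, `D` digit, `.` period, `-` dash, `?` anything else) and the function names are assumptions chosen for the example, and plain Levenshtein distance stands in for the edit distance.

```python
def metachar(ch: str) -> str:
    """Map one character to a metacharacter (hypothetical category scheme)."""
    if ch.isalpha():
        # Any letter that is not an ASCII vowel is treated as a consonant here.
        return "V" if ch.lower() in "aeiou" else "C"
    if ch.isdigit():
        return "D"
    if ch in ".-":
        return ch  # periods and dashes stand for their own categories
    return "?"     # everything else


def syntax_string(domain: str) -> str:
    """Replace every character of the domain name with its metacharacter."""
    return "".join(metachar(ch) for ch in domain)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def syntactic_edit_distance(d1: str, d2: str) -> int:
    """Edit distance between the syntax strings of two domain names."""
    return edit_distance(syntax_string(d1), syntax_string(d2))
```

Under these assumed categories, `xk3f9.com` and `qz7b2.com` share no characters before the `.com` suffix, yet both map to the same syntax string, so their syntactic edit distance is zero even though their ordinary edit distance is not. That is the property the abstract relies on to group names with a common structure.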

First claim

Opening claim text (preview).

1. A non-transitory machine-readable storage medium encoded with instructions executable by a hardware processor of a computing device for determining string similarity, the machine-readable storage medium comprising instructions to cause the hardware processor to: receive domain name system (DNS) query packets that were sent by a particular client computing device, each DNS query packet specifying a query domain name; generate, for each query domain name included in the received DNS query packets, a syntax string by replacing each character of the query domain name with one of a plurality of metacharacters, each of the plurality of metacharacters representing a category of characters that is different from each other category of characters represented by each other metacharacter in the plurality of metacharacters; determine, for each query domain name included in the received DNS query packets, a syntactic edit distance between the query domain name and each other query domain name included in the received DNS packets, the syntactic edit distance between query domain names being determined based on syntax strings of the corresponding domain names; cluster each query domain name included in the received DNS query packets into one of a plurality of clusters based on the syntactic edit distances; and identify the particular client computing device as a potential source of malicious software based on the plurality of clusters.

2. The storage medium of claim 1, wherein the instructions further cause the processor to: generate, for each syntax string, a sorted syntax string by sorting the metacharacters of each syntax string, and wherein the syntactic edit distance between query domain names is determined based on the sorted syntax strings of the corresponding domain names.

3. The storage medium of claim 1, wherein each syntactic edit distance between query domain names is determined based on an edit distance between syntax strings of the corresponding query domain names.

4. The storage medium of claim 1, wherein the particular client computing device is identified as a potential source of malicious software in response to determining that one of the plurality of clusters includes a number of query domain names that exceeds a threshold number of query domain names.

5. The storage medium of claim 1, wherein at least one category of characters represented by one of the plurality of metacharacters includes at least one of: alphabetical letters; lower-case letters; upper-case letters; vowel letters; consonant letters; foreign language characters; digits; punctuation marks; dashes; periods; underscores; or unprintable characters.

6. A computing device for determining string similarity, the computing device comprising: a hardware processor; and a data storage device storing instructions that, when executed by the hardware processor, cause the hardware processor to: obtain, from at least one network egress point of a network, domain name system (DNS) query packets that were sent by at least one computing device operating on the network, each DNS query packet specifying a query domain name; generate, for each query domain name included in the DNS query packets, a syntax string by replacing a subset of the characters of the query domain name with one of a plurality of metacharacters, each of the plurality of metacharacters representing a category of characters that is different from each other category of characters represented by each other metacharacter in the plurality of metacharacters; determine, for each query domain name, a syntactic edit distance between the query domain name and each other query domain name included in the DNS query packets, the syntactic edit distance between the query domain name and each other domain name being determined based on the syntax string of the query domain name and each syntax string of each other domain name; cluster each of the query domain names into one of a plurality of domain name clusters based on the syntactic edit distances between the query domain names; and determine, based on the plurality of domain name clusters, use of a domain name generation algorithm by the at least one computing device operating on the network.

7. The system of claim 6 wherein the instructions further cause the processor to: generate, for each syntax string, a sorted syntax string by sorting the metacharacters of each syntax string, and wherein the syntactic edit distance between query domain names is determined by: calculating an edit distance between sorted syntax strings of the corresponding domain names.

8. The system of claim 6, wherein each syntactic edit distance between query domain names is determined by: calculating an edit distance between syntax strings of the corresponding query domain names.

9. The system of claim 8, wherein the instructions further cause the processor to: determine, for each query domain name, a measure of similarity to each other query domain name, each measure of similarity being determined between a first domain name and a second domain name by: determining an edit distance between the first query domain name and the second query domain name; and calculating the measure of similarity between the first query domain name and the second query domain name based on the edit distance and the syntactic edit distance.

10. The system of claim 6, wherein use of the domain name generation algorithm is determined based on a number of query domain names in a particular cluster of the plurality of clusters relative to other numbers of query domain names in each of the other clusters of the plurality of clusters.

11. A computer-implemented method for determining string similarity, implemented by a hardware processor, the method comprising executing on the hardware processor the steps of: receiving over a computer network a first string of characters and a second string of characters from domain name system (DNS) query packets originating from a particular computing device, the second string of characters being different from the first string of characters; generating a first syntax string by replacing each character of the first string with one of a plurality of metacharacters, each of the plurality of metacharacters representing a category of characters that is different from each other category of characters represented by each other metacharacter in the plurality of metacharacters; generating a second syntax string by replacing each character of the second string with one of the plurality of metacharacters; and generating network anomaly data for the particular computing device by determining a measure of similarity between the first string and the second string using a syntactic edit distance between the first string and the second string, the syntactic edit distance between first string and the second string being determined based on the first syntax string and second syntax string.

12. The method of claim 11, further comprising: identifying the particular computing device as a potential source of malicious software based on the measure of similarity between the first string and the second string.

13. The method of claim 11, further comprising: receiving a plurality of additional strings of characters originating from the particular computing device; generating, for each additional string, an additional syntax string by replacing each character of the additional string with one of the plurality of metacharacters; and determining, for each additional string, an additional measure of similarity between the additional string and each of
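
The sorted syntax strings of claims 2 and 7 and the cluster-size threshold of claim 4 can be illustrated with a short Python sketch. The claims do not name a clustering algorithm in this excerpt, so the greedy distance-threshold grouping below is an assumption, as are the three-category scheme (`L` letter, `D` digit, `P` anything else), the function names, and the default values of `max_distance` and `size_threshold`.

```python
def syntax_string(domain: str) -> str:
    """Hypothetical scheme: L = letter, D = digit, P = any other character."""
    return "".join("L" if c.isalpha() else "D" if c.isdigit() else "P"
                   for c in domain)


def sorted_syntax_string(domain: str) -> str:
    """Sort the metacharacters so that only the category counts matter."""
    return "".join(sorted(syntax_string(domain)))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cluster_domains(domains, max_distance=1):
    """Greedy grouping (an assumed stand-in for the claimed clustering):
    a name joins the first cluster containing a member whose sorted syntax
    string is within max_distance; otherwise it starts a new cluster."""
    clusters = []
    for name in domains:
        s = sorted_syntax_string(name)
        for cluster in clusters:
            if any(edit_distance(s, sorted_syntax_string(m)) <= max_distance
                   for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def looks_like_dga(domains, max_distance=1, size_threshold=3):
    """Flag the client if any cluster holds more than size_threshold names."""
    return any(len(c) > size_threshold
               for c in cluster_domains(domains, max_distance))


queries = ["xk3f9.com", "qz7b2.com", "a1b2c.com", "m9n8p.com",
           "example.org", "mail.example.org"]
print(cluster_domains(queries))
print(looks_like_dga(queries))
```

Sorting makes `xk3f9.com` and `a1b2c.com` identical at the syntax level even though their digits sit in different positions, which is why the four machine-looking names in `queries` fall into a single cluster while the two ordinary names each form their own.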

Assignees

Trend Micro Inc

Inventors

Classifications

  • Electricity · mapped topic

  • Event detection, e.g. attack signature detection · CPC title

  • using domain name system [DNS] · CPC title

  • Traffic logging, e.g. anomaly detection · CPC title

  • Name conversion · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016294852A1 cover?
Examples relate to determining string similarity using syntactic edit distance. In one example, a computing device may receive domain name system (DNS) packets that were sent by a client device, each DNS packet specifying a domain name; generate, for each domain name, a syntax string by replacing each character of the domain name with one of a plurality of metacharacters, each metacharacter rep…
Who is the assignee on this patent?
Trend Micro Inc
What technology area does this patent fall under?
Primary CPC classification H04L63/1416. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Oct 6, 2016 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).