Identification of host names generated by a domain generation algorithm

US9756063B1 · US · B1

Patent metadata
Field: Value
Publication number: US-9756063-B1
Application number: US-201414553879-A
Country: US
Kind code: B1
Filing date: Nov 25, 2014
Priority date: Nov 25, 2014
Publication date: Sep 5, 2017
Grant date: Sep 5, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Host name raw data from access logs of computers is grouped into distinct groups. At least one feature, an alphanumeric or alphabetic-only digest, is extracted from each group and its characters are ordered depending upon their frequency of use. Sampling is performed upon host names from a database of known normal host names to generate groups of randomly selected host names. Similar digests are also extracted from these groups. The digest from the raw data is compared to each of the digests from the normal host names using a string matching algorithm to determine a value. If the value is above a threshold then it is likely that the host names from the raw data group are domain-generated. The suspect host names are used to reference the raw data access log in order to determine which user computers have accessed these host names and these user computers are alerted.
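
The digest comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (frequency_digest, edit_distance, looks_generated) are hypothetical, Levenshtein edit distance stands in for the unspecified "string matching algorithm", and taking the smallest distance across the normal digests is an assumption about how the per-digest comparisons are combined.

```python
import string
from collections import Counter

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)


def frequency_digest(host_names, alphabetic_only=False):
    """Return the distinct characters used in a group, most frequent first."""
    allowed = LETTERS if alphabetic_only else LETTERS | DIGITS
    counts = Counter()
    for name in host_names:
        for ch in name.lower():
            if ch in allowed:  # dots, hyphens and other symbols are skipped
                counts[ch] += 1
    # Descending count; ties broken by character so the digest is deterministic.
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))


def edit_distance(a, b):
    """Levenshtein distance (assumed stand-in for the string matching step)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def looks_generated(suspect_group, normal_groups, threshold):
    """Flag a suspect group whose digest is far from every normal digest."""
    suspect = frequency_digest(suspect_group)
    distances = [edit_distance(suspect, frequency_digest(g)) for g in normal_groups]
    # Assumption: the closest normal digest decides the outcome.
    return min(distances) > threshold
```

The threshold and the way the normal groups are sampled would need tuning against real data; the abstract does not give values for either.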

First claim

Opening claim text (preview).

I claim:

1. A method of detecting host names generated by a domain generation algorithm, said method comprising: grouping a suspect set of host names obtained from a raw access log of an endpoint computer into a plurality of distinct suspect groups by at least one of a destination IP address and a sub-parent domain, wherein said raw access log reflects Web sites accessed by said endpoint computer over an access period of time and identifies said endpoint computer; extracting, from one of said suspect groups, a suspect alphanumeric digest string in which characters are ordered by frequency of use within said one suspect group; grouping a normal set of host names known to not have been generated randomly into a plurality of distinct normal groups wherein said host names in said normal set were generated by humans; for each of said normal groups, extracting a normal alphanumeric digest string in which characters are ordered by frequency of use within said each normal group; calculating a distance measure between said suspect alphanumeric digest string and said normal alphanumeric digest strings from said normal groups; determining that said one suspect group includes host names generated by a domain generation algorithm, indicative of an opportunity for the endpoint computer to be compromised by malicious software, when said distance measure is above a threshold; identifying said endpoint computer as having accessed host names of said one suspect group; and determining that said endpoint computer has accessed at least a predetermined number of host names from said one suspect group in a predetermined time period and outputting an indication that said endpoint computer has been compromised by said malicious software.

2. The method as recited in claim 1 wherein said suspect alphanumeric digest strings and said normal alphanumeric digest strings do not include numerals.

3. The method as recited in claim 1 further comprising: for each of said suspect groups, extracting a suspect alphabetic digest string that does not include numerals and in which characters are ordered by frequency of use within said each suspect group; for each of said normal groups, extracting a normal alphabetic digest string that does not include numerals and in which characters are ordered by frequency of use within said each normal group; and calculating an alphabetic distance measure between a suspect alphabetic digest string from one of said suspect groups and said normal alphabetic digest strings from said normal groups.

4. A method of detecting host names generated by a domain generation algorithm, said method comprising: grouping a suspect set of host names obtained from raw access logs from a plurality of computers into a plurality of distinct suspect groups by at least one of a destination IP address and a sub-parent domain, wherein each of said computers is an endpoint computer and wherein each of said raw access log reflects Web sites accessed by said endpoint computers over an access period of time and identifies said endpoint computers; for each of said suspect groups, extracting a suspect alphanumeric digest string in which characters are ordered by frequency of use within said each suspect group; grouping a normal set of host names known to not have been generated randomly into a plurality of distinct normal groups; for each of said normal groups, extracting a normal alphanumeric digest string in which characters are ordered by frequency of use within said each normal group; calculating a distance measure between a suspect alphanumeric digest string from one of said suspect groups and said normal alphanumeric digest strings from said normal groups; determining that said one suspect group includes host names generated by a domain generation algorithm when said distance measure is above a threshold; identifying one of said computers as having accessed host names of said one suspect group; and determining that one of said computers has accessed at least a predetermined number of host names from said one suspect group in a predetermined time period and outputting an indication that said one computer has been compromised by malicious software.

5. The method as recited in claim 4, further comprising: cross-referencing said at least one host name generated by a domain generation algorithm with said raw access logs in order to output an identification of one of said computers that has accessed said at least one host name.

6. The method as recited in claim 4 wherein said suspect alphanumeric digest strings and said normal alphanumeric digest strings do not include numerals.

7. The method as recited in claim 4 further comprising: for each of said suspect groups, extracting a suspect alphabetic digest string that does not include numerals and in which characters are ordered by frequency of use within said each suspect group; for each of said normal groups, extracting a normal alphabetic digest string that does not include numerals and in which characters are ordered by frequency of use within said each normal group; and calculating an alphabetic distance measure between a suspect alphabetic digest string from one of said suspect groups and said normal alphabetic digest strings from said normal groups.

8. A method of detecting host names generated by a domain generation algorithm, said method comprising: accessing sample groups of host names from a database of host names, with each of said host names known to not have been generated randomly such that the host names represent a candidate data set of non-malicious host names; for each of said sample groups, extracting a normal alphanumeric digest string in which characters are ordered by frequency of use within said each normal group; grouping a suspect set of host names obtained from a raw access log of a computer into a plurality of distinct suspect groups, wherein said computer is an endpoint computer and wherein said raw access log reflects Web sites accessed by said endpoint computer over an access period of time and identifies said endpoint computer; extracting, from one of said suspect groups, a suspect alphanumeric digest string in which characters are ordered by frequency of use within said suspect group; calculating a distance measure between said suspect alphanumeric digest string and said normal alphanumeric digest strings from said sample groups; determining that said one suspect group includes host names generated by a domain generation algorithm when said distance measure is above a threshold; identifying said computer as having accessed host names of said one suspect group using said raw access log of said endpoint computer; and determining that said computer has accessed at least a predetermined number of host names from said one suspect group in a predetermined time period and outputting an indication that said computer has been compromised by malicious software.

9. The method as recited in claim 8 wherein said suspect alphanumeric digest strings and said normal alphanumeric digest strings do not include numerals.

10. The method as recited in claim 8 further comprising: grouping said suspect set of host names by an IP address of each of said host names or by a sub-parent domain of each of said host names.

11. The method as recited in claim 8 further comprising: for each of said suspect groups, extracting a suspect alphabetic digest string that does not include numerals and in which characters are ordered by frequency of use within said each suspect group; for each of said sample groups, extracting a normal alphabetic digest string that does not include numerals and in which characters are ordered by frequency of use
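
The grouping and endpoint-identification steps recited in the independent claims can be illustrated with a short sketch. Everything here is an assumption made for illustration: the LogEntry record layout, the function names, grouping by destination IP address only (the sub-parent domain alternative is not shown), and the sliding-window reading of "a predetermined number of host names ... in a predetermined time period".

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical shape of one raw access log record."""
    endpoint: str
    time: datetime
    host_name: str
    dest_ip: str


def group_by_destination_ip(entries):
    """Group the host names seen in the log by destination IP address."""
    groups = defaultdict(set)
    for entry in entries:
        groups[entry.dest_ip].add(entry.host_name)
    return dict(groups)


def compromised_endpoints(entries, suspect_hosts, min_hosts, window):
    """Return endpoints that reached at least min_hosts distinct suspect
    host names within any time span no longer than window."""
    hits_by_endpoint = defaultdict(list)
    for entry in entries:
        if entry.host_name in suspect_hosts:
            hits_by_endpoint[entry.endpoint].append((entry.time, entry.host_name))

    flagged = set()
    for endpoint, hits in hits_by_endpoint.items():
        hits.sort()  # chronological order
        for i, (start, _) in enumerate(hits):
            # Distinct suspect hosts accessed in the window opening at `start`.
            seen = {host for t, host in hits[i:] if t - start <= window}
            if len(seen) >= min_hosts:
                flagged.add(endpoint)
                break
    return flagged
```

A possible usage, with made-up log data:

```python
t0 = datetime(2014, 11, 25, 12, 0)
log = [
    LogEntry("pc-1", t0, "qxzvk.example", "203.0.113.7"),
    LogEntry("pc-1", t0 + timedelta(minutes=5), "wjrpl.example", "203.0.113.7"),
    LogEntry("pc-2", t0, "qxzvk.example", "203.0.113.7"),
]
suspects = group_by_destination_ip(log)["203.0.113.7"]
print(compromised_endpoints(log, suspects, min_hosts=2, window=timedelta(hours=1)))
```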

Assignees

Inventors

Classifications

  • Processing captured monitoring data, e.g. for logfile generation · CPC title

  • Traffic logging, e.g. anomaly detection · CPC title

  • involving long-term monitoring or reporting · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9756063B1 cover?
Host name raw data from access logs of computers is grouped into distinct groups. At least one feature, an alphanumeric or alphabetic-only digest, is extracted from each group and its characters are ordered depending upon their frequency of use. Sampling is performed upon host names from a database of known normal host names to generate groups of randomly selected host names. Similar digests ar…
Who is the assignee on this patent?
Chung Yueh Hsuan, Trend Micro Inc
What technology area does this patent fall under?
Primary CPC classification H04L63/1425. Mapped technology areas include Electricity.
When was this patent published?
Publication date Sep 5, 2017 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).