Identifying malware communications with DGA generated domains by discriminative learning

US9781139B2 · US · B2

Patent metadata
Publication number: US-9781139-B2
Application number: US-201514806236-A
Country: US
Kind code: B2
Filing date: Jul 22, 2015
Priority date: Jul 22, 2015
Publication date: Oct 3, 2017
Grant date: Oct 3, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are presented to identify malware communication with domain generation algorithm (DGA) generated domains. Sample domain names are obtained and labeled as DGA domains, non-DGA domains or suspicious domains. A classifier is trained in a first stage based on the sample domain names. Sample proxy logs including proxy logs of DGA domains and proxy logs of non-DGA domains are obtained to train the classifier in a second stage based on the plurality of sample domain names and the plurality of sample proxy logs. Live traffic proxy logs are obtained and the classifier is tested by classifying the live traffic proxy logs as DGA proxy logs, and the classifier is forwarded to a second computing device to identify network communication of a third computing device as malware network communication with DGA domains via a network interface unit of the third computing device based on the trained and tested classifier.
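The first training stage uses domain names alone, so its features must be computable from the name itself. The abstract does not list those features. The sketch below assumes three common lexical statistics (label length, character entropy, and digit ratio) purely for illustration; the function name `domain_features` and the feature set are hypothetical, not taken from the patent.

```python
import math
from collections import Counter


def domain_features(domain: str) -> dict:
    """Illustrative lexical statistics for a domain name (assumed feature set)."""
    # Look only at the leftmost label, e.g. "example" from "example.com".
    label = domain.lower().split(".")[0]
    n = len(label)
    if n == 0:
        return {"length": 0, "entropy": 0.0, "digit_ratio": 0.0}
    counts = Counter(label)
    # Shannon entropy of the character distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    digit_ratio = sum(ch.isdigit() for ch in label) / n
    return {"length": n, "entropy": entropy, "digit_ratio": digit_ratio}
```

Calling `domain_features("ab12.net")` returns a small dictionary that could serve as one row of stage-one training data, paired with a label of DGA, non-DGA, or suspicious.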

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising: at a first computing device, obtaining a plurality of sample domain names and labeling each of the plurality of sample domain names as a domain generation algorithm (DGA) domain, a non-DGA domain or a suspicious domain; training a classifier in a first stage based on the plurality of sample domain names without a proxy log; obtaining a plurality of sample proxy logs including proxy logs of DGA domains and proxy logs of non-DGA domains; training the classifier in a second stage based on the plurality of sample domain names and the plurality of sample proxy logs; obtaining a plurality of live traffic proxy logs; testing the classifier by classifying the plurality of live traffic proxy logs as DGA proxy logs; and using the trained and tested classifier to identify network communication of a second computing device as malware network communication with DGA domains via a network interface unit of the second computing device.

2. The method of claim 1, wherein training the classifier in the first stage comprises: obtaining DGA domains and non-DGA domains from whitelists and blacklists; obtaining suspicious domains from a domain contacted by an isolated malicious program; and obtaining unknown DGA domains from output of a Domain Name System anomaly detection process.

3. The method of claim 1, wherein training the classifier in the second stage comprises: generating an artificial proxy log of a DGA domain by selecting a proxy log of a DGA domain comprising a first domain name from the plurality of sample proxy logs and replacing the first domain name with a second domain name classified as a DGA domain name; and adding the artificial proxy log to the plurality of sample proxy logs.

4. The method of claim 3, further comprising: storing statistical training data calculated from the plurality of sample proxy logs and the corresponding domain names in a sample database, wherein the statistical data includes features calculated based on the domain names and flow based features; and training the classifier based on the statistical training data.

5. The method of claim 4, wherein testing the trained classifier comprises: extracting statistical features from the live traffic proxy logs; forming a first input feature vector from the statistical features extracted from the live traffic proxy logs; and generating a test result by applying the trained classifier to the first input feature vector.

6. The method of claim 4, wherein each of the plurality of sample proxy logs comprises a domain name in the form of a uniform resource locator (URL); and wherein storing the statistical training data comprises: parsing the URL into logical parts; and calculating statistics for each logical part of the URL.

7. The method of claim 4, further comprising: obtaining proxy logs of the malware network communication; extracting statistical features from the proxy logs of the malware network communication; forming a second input feature vector from the statistical features extracted from the proxy logs of the malware network communication; and identifying the network communication as the malware network communication with DGA domains by applying the trained and tested classifier to the second input feature vector.

8. An apparatus comprising: one or more processors; one or more memory devices in communication with the one or more processors; and a network interface unit coupled to the one or more processors, wherein the one or more processors are configured to: obtain a plurality of sample domain names and labeling each of the plurality of sample domain names as a domain generation algorithm (DGA) domain, a non-DGA domain or a suspicious domain; train a classifier in a first stage based on the plurality of sample domain names without a proxy log; obtain a plurality of sample proxy logs including proxy logs of DGA domains and proxy logs of non-DGA domains; train the classifier in a second stage based on the plurality of sample domain names and the plurality of sample proxy logs; obtain a plurality of live traffic proxy logs; test the classifier by classifying the plurality of live traffic proxy logs as DGA proxy logs; and use the trained and tested classifier to identify network communication of another computing device as malware network communication with DGA domains via a network interface unit of the other computing device.

9. The apparatus of claim 8, wherein the one or more processors are configured to train the classifier in the first stage by: obtaining DGA domains and non-DGA domains from whitelists and blacklists; obtaining suspicious domains from a domain contacted by an isolated malicious program; and obtaining unknown DGA domains from output of a Domain Name System anomaly detection process.

10. The apparatus of claim 8, wherein the one or more processors are configured to train the classifier in the second stage by: generating an artificial proxy log of a DGA domain by selecting a proxy log of a DGA domain comprising a first domain name from the plurality of sample proxy logs and replacing the first domain name with a second domain name classified as a DGA domain name; and adding the artificial proxy log to the plurality of sample proxy logs.

11. The apparatus of claim 10, wherein the one or more processors are configured to: store statistical training data calculated from the plurality of sample proxy logs and the corresponding domain names in a sample database, wherein the statistical data includes features calculated based on the domain names and flow based features; and train the classifier based on the statistical training data.

12. The apparatus of claim 10, wherein the one or more processors are configured to test the trained classifier by: extracting statistical features from the live traffic proxy logs; forming a first input feature vector from the statistical features extracted from the live traffic proxy logs; and generating a test result by applying the trained classifier to the first input feature vector.

13. The apparatus of claim 10, wherein each of the plurality of sample proxy logs comprises a domain name in the form of a uniform resource locator (URL), and wherein the one or more processors are configured to store the statistical training data by: parsing the URL into logical parts; and calculating statistics for each logical part of the URL.

14. The apparatus of claim 10, wherein the one or more processors are configured to: obtain proxy logs of the malware network communication; extract statistical features from the proxy logs of the malware network communication; form a second input feature vector from the statistical features extracted from the proxy logs of the malware network communication; and identify the network communication as the malware network communication with DGA domains by applying the trained and tested classifier to the second input feature vector.

15. One or more computer readable non-transitory storage media encoded with software comprising computer executable instructions that when executed by one or more processors, cause the one or more processors to: obtain a plurality of sample domain names and labeling each of the plurality of sample domain names as a domain generation algorithm (DGA) domain, a non-DGA domain or a suspicious domain; train a classifier in a first stage based on the plurality of sample domain names without a proxy log; obtain a plurality of sample proxy logs including proxy logs of DGA domains and proxy logs of
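Two steps in the claims can be illustrated concretely: parsing a proxy-log URL into logical parts and computing statistics for each part, and generating an artificial proxy log by replacing the domain name in an existing DGA proxy log with another domain labeled as DGA. The following is a minimal sketch under stated assumptions, not the patented implementation. The log record layout (a dict with `url` and `bytes_up` keys), the choice of logical parts (host, path, query), the per-part statistics (length and digit count), and both function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def url_part_stats(url: str) -> dict:
    """Split a URL into logical parts and compute simple per-part statistics."""
    parts = urlsplit(url)
    # Assumed logical parts; the claims do not enumerate them.
    logical = {
        "host": parts.hostname or "",
        "path": parts.path,
        "query": parts.query,
    }
    stats = {}
    for name, text in logical.items():
        stats[f"{name}_length"] = len(text)
        stats[f"{name}_digits"] = sum(ch.isdigit() for ch in text)
    return stats


def make_artificial_log(log: dict, new_domain: str) -> dict:
    """Copy a DGA proxy log, swapping its domain for another DGA-labeled one."""
    parts = urlsplit(log["url"])
    # Replace the whole network location (any port or credentials are dropped);
    # scheme, path, query, and fragment are carried over unchanged.
    new_url = urlunsplit(
        (parts.scheme, new_domain, parts.path, parts.query, parts.fragment)
    )
    # Return a new record so the original sample stays untouched.
    return {**log, "url": new_url}


if __name__ == "__main__":
    sample = {"url": "http://old.example/a/b?id=7", "bytes_up": 120}
    artificial = make_artificial_log(sample, "xkq3v9.example")
    print(artificial["url"])
    print(url_part_stats(artificial["url"]))
```

In this sketch the flow fields of the original record (here only `bytes_up`) are preserved, so the artificial log pairs realistic flow behavior with a different DGA-labeled name before it is added to the sample set.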

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9781139B2 cover?
Techniques are presented to identify malware communication with domain generation algorithm (DGA) generated domains. Sample domain names are obtained and labeled as DGA domains, non-DGA domains or suspicious domains. A classifier is trained in a first stage based on the sample domain names. Sample proxy logs including proxy logs of DGA domains and proxy logs of non-DGA domains are obtained to t…
Who is the assignee on this patent?
Cisco Tech Inc
What technology area does this patent fall under?
Primary CPC classification H04L63/1416. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Oct 3, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).