Classifying potentially malicious and benign software modules through similarity analysis

US9998484B1 · US · B1

Patent metadata
Publication number: US-9998484-B1
Application number: US-201615082731-A
Country: US
Kind code: B1
Filing date: Mar 28, 2016
Priority date: Mar 28, 2016
Publication date: Jun 12, 2018
Grant date: Jun 12, 2018

Abstract

Official abstract text for this publication.

A method comprises obtaining at least a first software module not classified as benign or potentially malicious, extracting a set of features associated with the first software module including static, behavior and context features, computing distance metrics between the extracted feature set and feature sets of a plurality of clusters including one or more clusters of software modules previously classified as benign and exhibiting a first threshold level of similarity relative to one another and one or more clusters of software modules previously classified as potentially malicious and exhibiting a second threshold level of similarity relative to one another, classifying the first software module as belonging to a given cluster based at least in part on the computed distance metrics, and modifying access by a given client device to the first software module responsive to the given cluster being a cluster of software modules previously classified as potentially malicious.
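The classify-then-restrict flow in the abstract can be illustrated with a minimal Python sketch. Several things here are assumptions made for illustration and are not taken from the patent text: features are reduced to a single set of string tags, Jaccard distance stands in for the combined distance metric, the distance to a cluster is taken as the minimum distance to any of its members, and the names `Cluster`, `classify`, `access_decision`, and the `max_distance` cutoff are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    label: str      # "benign" or "potentially_malicious"
    members: list   # feature sets (frozensets) of previously classified modules


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def classify(features: frozenset, clusters: list, max_distance: float = 0.5):
    """Return (nearest cluster, distance), or (None, distance) if nothing is close.

    Assumption: distance to a cluster is the minimum distance to any member,
    and every cluster has at least one member.
    """
    best, best_d = None, float("inf")
    for cluster in clusters:
        d = min(jaccard_distance(features, m) for m in cluster.members)
        if d < best_d:
            best, best_d = cluster, d
    return (best, best_d) if best_d <= max_distance else (None, best_d)


def access_decision(cluster) -> str:
    """Map the classification result to an access action (hypothetical labels)."""
    if cluster is None:
        return "unclassified"
    return "restrict" if cluster.label == "potentially_malicious" else "allow"


clusters = [
    Cluster("benign", [frozenset({"signed", "reads_registry"})]),
    Cluster("potentially_malicious",
            [frozenset({"unsigned", "autostart", "connects_out"})]),
]
unknown = frozenset({"unsigned", "autostart", "connects_out", "temp_path"})
cluster, distance = classify(unknown, clusters)
print(access_decision(cluster), round(distance, 2))
```

In this sketch "restrict" is a placeholder for any of the access modifications the claims list, such as removing the module, blocking its download, or opening it in a sandbox.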

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: obtaining at least a first software module not classified as benign or potentially malicious; extracting a set of features associated with the first software module, the set of features comprising one or more static features, one or more behavior features and one or more context features; computing distance metrics between the extracted feature set of the first software module and feature sets of a plurality of clusters, the plurality of clusters comprising: one or more clusters of software modules previously classified as benign and exhibiting a first threshold level of similarity relative to one another; and one or more clusters of software modules previously classified as potentially malicious and exhibiting a second threshold level of similarity relative to one another; classifying the first software module as belonging to a given one of the plurality of clusters based at least in part on the computed distance metrics; and modifying access by a given client device to the first software module responsive to the given cluster being one of the one or more clusters of software modules previously classified as potentially malicious; wherein the method is performed by at least one processing device comprising a processor coupled to a memory; and wherein extracting the set of features associated with the first software module comprises: extracting the one or more static features from the first software module; and obtaining the one or more behavior features and the one or more context features of the first software module from at least one of a plurality of client devices on which the first software module is installed.

2. The method of claim 1 wherein the processing device comprises a network security system configured to communicate with a plurality of client devices, including the given client device, over at least one network.

3. The method of claim 1 wherein the first software module comprises one of: an executable module; and a dynamic link library module.

4. The method of claim 1 wherein the one or more static features comprise: one or more descriptive features; one or more numerical features; and one or more binary features.

5. The method of claim 1 wherein the one or more behavior features comprise: one or more file system access features; one or more process access features; and one or more network connection features.

6. The method of claim 1 wherein the one or more context features comprise: one or more file system path features; one or more path of destination events features; one or more file metadata features; and one or more auto-start functionality features.

7. The method of claim 1 wherein computing the distance metrics comprises one or more of: utilizing a normalized edit distance for respective ones of the extracted features represented as string values; utilizing a Jaccard distance for respective ones of the extracted features represented as sets; utilizing a normalized L1 distance for respective ones of the extracted features represented as real or integer values; and utilizing binary distance for respective ones of the extracted features represented as binary values.

8. The method of claim 1 wherein computing the distance metrics comprises assigning weights to distance between the extracted features, the weights being proportional to entropies of the extracted features in previously classified software modules in the plurality of clusters.

9. The method of claim 1 wherein computing the distance metrics comprises assigning a penalty value to distances between features missing from the extracted feature set of the first software module.

10. The method of claim 1 further comprising determining the plurality of clusters by computing pairwise distances for pairs of previously classified software modules utilizing a density-based spatial clustering of applications with noise (DBSCAN) clustering algorithm.

11. The method of claim 10 wherein computing the pairwise distances comprises: indexing the previously classified software modules based on a subset of the extracted features, the subset of extracted features being numerical features; building a range query based on the indexing; utilizing the range query with a first threshold to retrieve a subset of the previously classified software modules in a neighborhood of a given one of the previously classified software modules; computing pairwise distances between the given previously classified software module and respective ones of the retrieved previously classified software modules in the neighborhood; and clustering the given previously classified software module with one or more of the retrieved previously classified software modules having pairwise distances less than a second threshold, the second threshold being smaller than the first threshold.

12. The method of claim 1 wherein classifying the first software module comprises comparing distance metrics between static features in the extracted feature set of the first software module and corresponding static features of previously classified software modules in the plurality of clusters.

13. The method of claim 12 further comprising classifying the first software module as benign based at least in part on determining that the distance metrics between the static features in the extracted feature set of the first software module and corresponding static features of a given cluster of software modules previously classified as benign is below a given threshold.

14. The method of claim 13 further comprising classifying the first software module as potentially malicious based at least in part on: determining that the distance metrics between the static features in the extracted feature set of the first software module and corresponding static features of a given cluster of software modules previously classified as potentially malicious is below a first threshold; and determining that the distance metrics between the static features, the behavior features and the context features in the extracted feature set of the first software module and corresponding features of the given cluster of software modules previously classified as potentially malicious is below a second threshold.

15. The method of claim 1 wherein modifying access by the given client device to the first software module comprises at least one of: removing the first software module from a memory or storage of the given client device; preventing the given client device from obtaining the first software module; and causing the first software module to be opened in a sandboxed application environment on the given client device.

16. The method of claim 1, wherein the one or more behavior features are associated with actions performed by the first software module installed on said at least one of the plurality of client devices and wherein the one or more context features are associated with installation of the first software module on said at least one of the plurality of client devices.

17. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device cause the at least one processing device: to obtain at least a first software module not classified as benign or potentially malicious; to extract a set of features associated with the first software module, the set of features comprising one or more static features, one or more behavior fe
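Claims 7 to 9 describe per-type feature distances, entropy-proportional weights, and a penalty for missing features. The Python sketch below shows one way these pieces might fit together. The claims do not give the normalizations, so the choices here are assumptions: edit distance is divided by the longer string's length, L1 distance is divided by the sum of the absolute values, and the penalty defaults to 1.0. All function names and the sample feature names are hypothetical.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def feature_distance(x, y) -> float:
    """Distance in [0, 1], selected by how the feature is represented."""
    if isinstance(x, bool):                      # binary value (checked before int)
        return 0.0 if x == y else 1.0
    if isinstance(x, str):                       # normalized edit distance
        longest = max(len(x), len(y))
        return edit_distance(x, y) / longest if longest else 0.0
    if isinstance(x, (set, frozenset)):          # Jaccard distance
        union = x | y
        return 1.0 - len(x & y) / len(union) if union else 0.0
    denom = abs(x) + abs(y)                      # normalized L1 (assumed normalization)
    return abs(x - y) / denom if denom else 0.0


def entropy(values) -> float:
    """Shannon entropy (bits) of the observed values; values must be hashable."""
    counts, n = Counter(values), len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def entropy_weights(modules, names) -> dict:
    """Weights proportional to each feature's entropy, scaled to sum to 1."""
    raw = {k: entropy([m[k] for m in modules if k in m]) for k in names}
    total = sum(raw.values())
    return {k: (h / total if total else 1.0 / len(names)) for k, h in raw.items()}


def module_distance(a: dict, b: dict, weights: dict, penalty: float = 1.0) -> float:
    """Weighted sum of feature distances; a missing feature costs `penalty`."""
    total = 0.0
    for name, w in weights.items():
        if name not in a or name not in b:
            total += w * penalty
        else:
            total += w * feature_distance(a[name], b[name])
    return total


known = [
    {"name": "svchost.exe", "size": 100, "signed": True,
     "imports": frozenset({"a", "b"})},
    {"name": "svch0st.exe", "size": 300, "signed": False,
     "imports": frozenset({"a", "c"})},
]
weights = entropy_weights(known, ["name", "size", "signed", "imports"])
print(module_distance(known[0], known[1], weights))
```

Weighting by entropy means a feature that takes the same value in every known module contributes nothing to the distance, while features that vary widely across modules dominate it.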
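The two-threshold neighborhood search in claim 11 can be sketched as a cheap range query on one numerical feature followed by a full distance check on the retrieved candidates. This is only the neighbor-finding step that a DBSCAN-style algorithm would call, not a full clustering implementation. The `size` feature, the `full_distance` function, and all names below are hypothetical. The sketch assumes the full distance is never smaller than the difference in the indexed feature, so filtering with the larger first threshold cannot discard a true neighbor.

```python
import bisect


def build_index(modules, key="size"):
    """Sort modules by one numerical feature so a range query is a binary search."""
    order = sorted(modules, key=lambda m: m[key])
    return [m[key] for m in order], order


def range_query(index, value, radius):
    """Return all modules whose indexed feature lies in [value - radius, value + radius]."""
    keys, order = index
    lo = bisect.bisect_left(keys, value - radius)
    hi = bisect.bisect_right(keys, value + radius)
    return order[lo:hi]


def full_distance(a, b) -> float:
    """Hypothetical full distance: size gap plus 10 per differing import."""
    return abs(a["size"] - b["size"]) + 10 * len(a["imports"] ^ b["imports"])


def neighbors(module, index, first_threshold, second_threshold, key="size"):
    """Coarse range query first, then keep candidates under the tighter threshold."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    candidates = range_query(index, module[key], first_threshold)
    return [c for c in candidates
            if c is not module and full_distance(module, c) < second_threshold]


modules = [
    {"id": "a", "size": 100, "imports": frozenset({"x", "y"})},
    {"id": "b", "size": 104, "imports": frozenset({"x", "y"})},
    {"id": "c", "size": 108, "imports": frozenset({"x", "z"})},
    {"id": "d", "size": 500, "imports": frozenset({"x", "y"})},
]
index = build_index(modules)
print([m["id"] for m in neighbors(modules[0], index, 20, 10)])
```

The range query keeps the expensive pairwise distance from being computed against every known module, which is the practical benefit of indexing on numerical features before clustering.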

Assignees

EMC Corp; EMC IP Holding Co LLC

Inventors

Classifications

  • Physics · mapped topic

  • Event detection, e.g. attack signature detection · CPC title

  • using dedicated hardware · CPC title

  • Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

  • Clustering or classification · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9998484B1 cover?
A method comprises obtaining at least a first software module not classified as benign or potentially malicious, extracting a set of features associated with the first software module including static, behavior and context features, computing distance metrics between the extracted feature set and feature sets of a plurality of clusters including one or more clusters of software modules previous…
Who is the assignee on this patent?
EMC Corp, EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification H04L63/1416. Mapped technology areas include Electricity.
When was this patent published?
Published Jun 12, 2018 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).