Machine-learning based approach for malware sample clustering

US2021304013A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2021304013-A1
Application number  US-202016836883-A
Country             US
Kind code           A1
Filing date         Mar 31, 2020
Priority date       Mar 31, 2020
Publication date    Sep 30, 2021
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for a machine learning based approach for identification of malware using static analysis and a machine-learning based automatic clustering of malware are provided. According to various embodiments of the present disclosure, a processing resource of a computer system receives a potential malware sample. A plurality of feature vectors is extracted from the potential malware sample and is converted into an input vector. A byte sequence is generated by walking a plurality of decision trees based on the input vector. Further, a hash value for the byte sequence is calculated and a determination is made regarding whether the hash value matches a malware hash value of a plurality of malware hash values corresponding to a known malware sample. Upon said determination being affirmative, the potential malware sample is classified as malware and is associated with a malware family of the known malware sample.
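The hash-matching step described in the abstract can be illustrated with a short Python sketch. It is not taken from the patent: the abstract does not name a hash function, so SHA-256 is an assumption here, and `sequence_hash`, `classify`, `known_hashes`, and the family name `FamilyA` are hypothetical names chosen for illustration. The byte sequence is taken as given rather than produced by a trained model.

```python
import hashlib
from typing import Dict, Optional


def sequence_hash(byte_sequence: bytes) -> str:
    """Hash the byte sequence. SHA-256 is an assumption, not the patent's stated choice."""
    return hashlib.sha256(byte_sequence).hexdigest()


def classify(byte_sequence: bytes, known_hashes: Dict[str, str]) -> Optional[str]:
    """Return the malware family whose stored hash matches, or None if nothing matches."""
    return known_hashes.get(sequence_hash(byte_sequence))


# Hypothetical database: hash of a known sample's byte sequence -> family name.
known_hashes = {sequence_hash(b"\x01\x00\x01\x01"): "FamilyA"}

print(classify(b"\x01\x00\x01\x01", known_hashes))  # FamilyA
print(classify(b"\x00\x00\x01\x01", known_hashes))  # None
```

Because the lookup is an exact hash comparison, two samples land in the same family only when the decision trees produce identical byte sequences for both.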

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by a processing resource of a computer system, a potential malware sample; extracting, by the processing resource, a plurality of feature vectors from the potential malware sample, wherein the plurality of feature vectors represent values of static features of the potential malware sample; converting, by the processing resource, the plurality of feature vectors into an input vector; generating, by the processing resource, a byte sequence by walking a plurality of decision trees based on the input vector, wherein the plurality of decision trees are associated with a machine-learning model that has been trained based on the static features of a set of known malware samples; calculating, by the processing resource, a hash value for the byte sequence; determining, by the processing resource, whether the hash value matches a malware hash value of a plurality of malware hash values corresponding to a known malware sample of the set of known malware samples; and when said determining is affirmative, classifying, by the processing resource, the potential malware sample as malware and associating the malware with a malware family of the known malware sample.

2. The method of claim 1, further comprising when said determining is negative: determining, by the processing resource, whether the byte sequence meets a predetermined or configurable threshold of similarity with a malware byte sequence of a particular known malware sample of the set of known malware samples; and when said determining whether the byte sequence meets a predetermined or configurable threshold of similarity is affirmative: classifying, by the processing resource, the potential malware sample as malware; and treating, by the processing resource, the malware as a variant of a malware family of the particular known malware sample by adding the potential malware sample to the set of known malware samples as part of a new cluster within the set of known malware samples.

3. The method of claim 1, wherein the machine-learning model comprises a Random Forest model and wherein the plurality of decision trees comprises binary decision trees.

4. The method of claim 1, wherein said walking the plurality of decision trees based on the input vector comprises: for each a binary decision tree of the plurality of decision trees: evaluating an expression involving one or more features of the plurality of features associated with a current node starting with a root node of the binary decision tree and ending at a leaf node of the binary decision tree; when said evaluating causes a left branch of the current node to be taken, assigning a first value to a portion of the byte sequence corresponding to the current node; and when said evaluating causes a right branch of the current node to be taken, assigning a second value to the portion of the byte sequence.

5. The method of claim 3, wherein the binary decision trees are Classification and Regression Trees (CART), where each of a node of the CART trees has at most two branches.

6. The method of claim 1, wherein when the hash value of the malware matches to at least one of the malware hash value of the plurality of malware hash values corresponding to the at least one of known malware sample of the set of known malware samples, associating, by the processing resource, the malware with the malware family of the matched at least one of known malware sample.

7. The method of claim 1, wherein the hash value is calculated by concatenating the generated byte sequence to form a unique predefined byte sequence.

8. The method of claim 1, wherein the plurality of feature vectors comprises any or a combination of entry point information, an import table, resource information, a DOTNET structural data, and a set of text strings pertaining to the potential malware sample.

9. The method of claim 1, wherein the processing resource is configured on a cloud based service.

10. A non-transitory computer-readable storage medium embodying a set of instructions, which when executed by a processing resource of a computing system, causes the processing resource to perform a method comprising: receiving a potential malware sample; extracting a plurality of feature vectors from the potential malware sample, wherein the plurality of feature vectors represent values of static features of the potential malware sample; converting the plurality of feature vectors into an input vector; generating a byte sequence by walking a plurality of decision trees based on the input vector, wherein the plurality of decision trees are associated with a machine-learning model that has been trained based on the static features of a set of known malware samples; calculating a hash value for the byte sequence; determining whether the hash value matches a malware hash value of a plurality of malware hash values corresponding to a known malware sample of the set of known malware samples; and when said determining is affirmative, classifying the potential malware sample as malware and associating the malware with a malware family of the known malware sample.

11. The non-transitory computer-readable storage medium of claim 10, further comprising when said determining is negative: determining whether the byte sequence meets a predetermined or configurable threshold of similarity with a malware byte sequence of a particular known malware sample of the set of known malware samples; and when said determining whether the byte sequence meets a predetermined or configurable threshold of similarity is affirmative: classifying the potential malware sample as malware; and treating the malware as a variant of a malware family of the particular known malware sample by adding the potential malware sample to the set of known malware samples as part of a new cluster within the set of known malware samples.

12. The non-transitory computer-readable storage medium of claim 10, wherein the machine-learning model comprises a Random Forest model and wherein the plurality of decision trees comprises binary decision trees.

13. The non-transitory computer-readable storage medium of claim 10, wherein said walking the plurality of decision trees based on the input vector comprises: for each a binary decision tree of the plurality of decision trees: evaluating an expression involving one or more features of the plurality of features associated with a current node starting with a root node of the binary decision tree and ending at a leaf node of the binary decision tree; when said evaluating causes a left branch of the current node to be taken, assigning a first value to a portion of the byte sequence corresponding to the current node; and when said evaluating causes a right branch of the current node to be taken, assigning a second value to the portion of the byte sequence.

14. The non-transitory computer-readable storage medium of claim 12, wherein the binary decision trees are Classification and Regression Trees (CART), where each of a node of the CART trees has at most two branches.

15. The non-transitory computer-readable storage medium of claim 10, wherein when the hash value of the malware matches to at least one of the malware hash value of the plurality of malware hash values corresponding to the at least one of known malware sample of the set of known malware samples, associating, by the processing resource, the malware with the malware family of the matched at least one of known malware sample.

16. The non-transitory computer-r
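The tree walk recited in claims 4 and 13 and the similarity fallback of claims 2 and 11 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patent's implementation: the `Node` layout, the `feature <= threshold` test, the values `0x00` and `0x01` for the left and right branches, and the position-wise `similarity` measure are all hypothetical choices, and every internal node is assumed to have both children. The trees are hand-built rather than trained.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

LEFT, RIGHT = 0x00, 0x01  # hypothetical "first" and "second" values


@dataclass
class Node:
    """One node of a binary decision tree; a leaf has no children."""
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def walk_tree(root: Node, x: Sequence[float]) -> bytes:
    """Record one byte per branch decision from the root down to a leaf."""
    path = bytearray()
    node = root
    while not node.is_leaf:
        if x[node.feature] <= node.threshold:  # assumed node expression
            path.append(LEFT)
            node = node.left
        else:
            path.append(RIGHT)
            node = node.right
    return bytes(path)


def walk_forest(trees: Sequence[Node], x: Sequence[float]) -> bytes:
    """Concatenate the per-tree paths into a single byte sequence."""
    return b"".join(walk_tree(tree, x) for tree in trees)


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions that agree, measured against the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(p == q for p, q in zip(a, b)) / longest


# Two hand-built trees standing in for a trained forest.
leaf = Node()
tree_a = Node(feature=0, threshold=0.5, left=leaf,
              right=Node(feature=1, threshold=2.0, left=leaf, right=leaf))
tree_b = Node(feature=1, threshold=1.0, left=leaf, right=leaf)
forest = [tree_a, tree_b]

candidate = walk_forest(forest, [0.9, 1.5])  # b"\x01\x00\x01"
known = walk_forest(forest, [0.9, 3.0])      # b"\x01\x01\x01"

THRESHOLD = 0.6  # hypothetical configurable threshold
print(similarity(candidate, known) >= THRESHOLD)  # True (2 of 3 positions agree)
```

In this reading, a candidate whose hash does not match exactly can still be treated as a variant when its branch decisions mostly agree with those of a known sample. Other similarity measures, such as an edit distance, would fit the claim language equally well.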

Assignees

Fortinet Inc

Inventors

Classifications

  • G06N5/01 (Primary)

    Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

  • Ensemble learning

  • G06F21/564 (Primary)

    by virus signature recognition

  • Test or assess software

  • by checking file integrity

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021304013A1 cover?
Systems and methods for a machine learning based approach for identification of malware using static analysis and a machine-learning based automatic clustering of malware are provided. According to various embodiments of the present disclosure, a processing resource of a computer system receives a potential malware sample. A plurality of feature vectors is extracted from the potential malware s…
Who is the assignee on this patent?
Fortinet Inc
What technology area does this patent fall under?
Primary CPC classification G06N5/01. Mapped technology areas include Physics.
When was this patent published?
Publication date: Sep 30, 2021 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).