Machine-learning based approach for malware sample clustering

US2021304013A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2021304013-A1
Application number	US-202016836883-A
Country	US
Kind code	A1
Filing date	Mar 31, 2020
Priority date	Mar 31, 2020
Publication date	Sep 30, 2021
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for a machine learning based approach for identification of malware using static analysis and a machine-learning based automatic clustering of malware are provided. According to various embodiments of the present disclosure, a processing resource of a computer system receives a potential malware sample. A plurality of feature vectors is extracted from the potential malware sample and is converted into an input vector. A byte sequence is generated by walking a plurality of decision trees based on the input vector. Further, a hash value for the byte sequence is calculated and a determination is made regarding whether the hash value matches a malware hash value of a plurality of malware hash values corresponding to a known malware sample. Upon said determination being affirmative, the potential malware sample is classified as malware and is associated with a malware family of the known malware sample.
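The flow in the abstract (walk the trees, build a byte sequence, hash it, look the hash up) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the `Node` layout, the one-byte-per-branch encoding (0 for left, 1 for right), the use of SHA-256, the toy forest, and the `ExampleFamily` label are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Node:
    """Internal node of a binary decision tree; a child of None is a leaf."""
    feature: int                      # index into the input vector
    threshold: float                  # take the left branch when value <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def walk_tree(root: Node, x: Sequence[float]) -> List[int]:
    """Record 0 for each left branch and 1 for each right branch taken."""
    path: List[int] = []
    node: Optional[Node] = root
    while node is not None:
        go_left = x[node.feature] <= node.threshold
        path.append(0 if go_left else 1)
        node = node.left if go_left else node.right
    return path


def byte_sequence(forest: Sequence[Node], x: Sequence[float]) -> bytes:
    """Concatenate the branch records of every tree into one byte string."""
    out = bytearray()
    for root in forest:
        out.extend(walk_tree(root, x))
    return bytes(out)


def signature(forest: Sequence[Node], x: Sequence[float]) -> str:
    """Hash the byte sequence (SHA-256 is an assumed choice)."""
    return hashlib.sha256(byte_sequence(forest, x)).hexdigest()


def classify(forest: Sequence[Node], x: Sequence[float],
             known: Dict[str, str]) -> Optional[str]:
    """Return the family of a known sample with the same hash, else None."""
    return known.get(signature(forest, x))


# Toy forest over a 3-element input vector (hypothetical features).
FOREST = [
    Node(0, 0.5, left=Node(1, 10.0), right=None),
    Node(2, 3.0, left=None, right=Node(0, 0.9)),
]

# One known sample, stored by the hash of its byte sequence.
KNOWN_SAMPLE = [0.2, 42.0, 7.0]
KNOWN_HASHES = {signature(FOREST, KNOWN_SAMPLE): "ExampleFamily"}
```

In this sketch, two samples that take the same branches in every tree produce the same byte sequence and therefore the same hash, so `classify(FOREST, [0.3, 50.0, 8.0], KNOWN_HASHES)` resolves to the stored family even though the raw feature values differ from `KNOWN_SAMPLE`.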

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving, by a processing resource of a computer system, a potential malware sample; extracting, by the processing resource, a plurality of feature vectors from the potential malware sample, wherein the plurality of feature vectors represent values of static features of the potential malware sample; converting, by the processing resource, the plurality of feature vectors into an input vector; generating, by the processing resource, a byte sequence by walking a plurality of decision trees based on the input vector, wherein the plurality of decision trees are associated with a machine-learning model that has been trained based on the static features of a set of known malware samples; calculating, by the processing resource, a hash value for the byte sequence; determining, by the processing resource, whether the hash value matches a malware hash value of a plurality of malware hash values corresponding to a known malware sample of the set of known malware samples; and when said determining is affirmative, classifying, by the processing resource, the potential malware sample as malware and associating the malware with a malware family of the known malware sample.

2. The method of claim 1, further comprising when said determining is negative: determining, by the processing resource, whether the byte sequence meets a predetermined or configurable threshold of similarity with a malware byte sequence of a particular known malware sample of the set of known malware samples; and when said determining whether the byte sequence meets a predetermined or configurable threshold of similarity is affirmative: classifying, by the processing resource, the potential malware sample as malware; and treating, by the processing resource, the malware as a variant of a malware family of the particular known malware sample by adding the potential malware sample to the set of known malware samples as part of a new cluster within the set of known malware samples.

3. The method of claim 1, wherein the machine-learning model comprises a Random Forest model and wherein the plurality of decision trees comprises binary decision trees.

4. The method of claim 1, wherein said walking the plurality of decision trees based on the input vector comprises: for each a binary decision tree of the plurality of decision trees: evaluating an expression involving one or more features of the plurality of features associated with a current node starting with a root node of the binary decision tree and ending at a leaf node of the binary decision tree; when said evaluating causes a left branch of the current node to be taken, assigning a first value to a portion of the byte sequence corresponding to the current node; and when said evaluating causes a right branch of the current node to be taken, assigning a second value to the portion of the byte sequence.

5. The method of claim 3, wherein the binary decision trees are Classification and Regression Trees (CART), where each of a node of the CART trees has at most two branches.

6. The method of claim 1, wherein when the hash value of the malware matches to at least one of the malware hash value of the plurality of malware hash values corresponding to the at least one of known malware sample of the set of known malware samples, associating, by the processing resource, the malware with the malware family of the matched at least one of known malware sample.

7. The method of claim 1, wherein the hash value is calculated by concatenating the generated byte sequence to form a unique predefined byte sequence.

8. The method of claim 1, wherein the plurality of feature vectors comprises any or a combination of entry point information, an import table, resource information, a DOTNET structural data, and a set of text strings pertaining to the potential malware sample.

9. The method of claim 1, wherein the processing resource is configured on a cloud based service.

10. A non-transitory computer-readable storage medium embodying a set of instructions, which when executed by a processing resource of a computing system, causes the processing resource to perform a method comprising: receiving a potential malware sample; extracting a plurality of feature vectors from the potential malware sample, wherein the plurality of feature vectors represent values of static features of the potential malware sample; converting the plurality of feature vectors into an input vector; generating a byte sequence by walking a plurality of decision trees based on the input vector, wherein the plurality of decision trees are associated with a machine-learning model that has been trained based on the static features of a set of known malware samples; calculating a hash value for the byte sequence; determining whether the hash value matches a malware hash value of a plurality of malware hash values corresponding to a known malware sample of the set of known malware samples; and when said determining is affirmative, classifying the potential malware sample as malware and associating the malware with a malware family of the known malware sample.

11. The non-transitory computer-readable storage medium of claim 10, further comprising when said determining is negative: determining whether the byte sequence meets a predetermined or configurable threshold of similarity with a malware byte sequence of a particular known malware sample of the set of known malware samples; and when said determining whether the byte sequence meets a predetermined or configurable threshold of similarity is affirmative: classifying the potential malware sample as malware; and treating the malware as a variant of a malware family of the particular known malware sample by adding the potential malware sample to the set of known malware samples as part of a new cluster within the set of known malware samples.

12. The non-transitory computer-readable storage medium of claim 10, wherein the machine-learning model comprises a Random Forest model and wherein the plurality of decision trees comprises binary decision trees.

13. The non-transitory computer-readable storage medium of claim 10, wherein said walking the plurality of decision trees based on the input vector comprises: for each a binary decision tree of the plurality of decision trees: evaluating an expression involving one or more features of the plurality of features associated with a current node starting with a root node of the binary decision tree and ending at a leaf node of the binary decision tree; when said evaluating causes a left branch of the current node to be taken, assigning a first value to a portion of the byte sequence corresponding to the current node; and when said evaluating causes a right branch of the current node to be taken, assigning a second value to the portion of the byte sequence.

14. The non-transitory computer-readable storage medium of claim 12, wherein the binary decision trees are Classification and Regression Trees (CART), where each of a node of the CART trees has at most two branches.

15. The non-transitory computer-readable storage medium of claim 10, wherein when the hash value of the malware matches to at least one of the malware hash value of the plurality of malware hash values corresponding to the at least one of known malware sample of the set of known malware samples, associating, by the processing resource, the malware with the malware family of the matched at least one of known malware sample.

16. The non-transitory computer-r
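Claims 2 and 11 describe a fallback for when the hash does not match: compare the byte sequence itself against known malware byte sequences and, if a similarity threshold is met, treat the sample as a variant and add it to the known set as a new cluster. The Python sketch below shows one way this could look. The claims do not name a similarity measure, so the position-wise agreement ratio used here is an assumption, as are the function names, the default threshold of 0.9, and the dictionary layout.

```python
from typing import Dict, List, Optional, Tuple


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions at which two equal-length byte sequences agree."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def match_variant(seq: bytes, known: Dict[bytes, str],
                  threshold: float) -> Optional[Tuple[str, float]]:
    """Return (family, score) of the closest known sequence at or above threshold."""
    best: Optional[Tuple[str, float]] = None
    for known_seq, family in known.items():
        score = similarity(seq, known_seq)
        if score >= threshold and (best is None or score > best[1]):
            best = (family, score)
    return best


def classify_variant(seq: bytes, known: Dict[bytes, str],
                     clusters: Dict[str, List[bytes]],
                     threshold: float = 0.9) -> Optional[str]:
    """On a match, record the sample as a new cluster under the matched family."""
    hit = match_variant(seq, known, threshold)
    if hit is None:
        return None                   # below threshold: not classified here
    family, _score = hit
    known[seq] = family               # the sample joins the known set
    clusters.setdefault(family, []).append(seq)
    return family
```

Because the threshold is a parameter, it can be fixed in advance or tuned per deployment, which corresponds to the "predetermined or configurable" wording of the claims.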

Assignees

Fortinet Inc

Inventors

Classifications

G06N5/01 (Primary) · Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
G06N20/20 · Ensemble learning
G06F21/564 (Primary) · by virus signature recognition
G06F2221/033 · Test or assess software
G06F21/565 · by checking file integrity

Patent family

Related publications grouped by family.

View patent family 77857002

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2021304013A1 cover?: Systems and methods for a machine learning based approach for identification of malware using static analysis and a machine-learning based automatic clustering of malware are provided. According to various embodiments of the present disclosure, a processing resource of a computer system receives a potential malware sample. A plurality of feature vectors is extracted from the potential malware s…
Who is the assignee on this patent?: Fortinet Inc
What technology area does this patent fall under?: Primary CPC classification G06N5/01. Mapped technology areas include Physics.
When was this patent published?: Publication date Sep 30, 2021 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).