Training a machine learning model for container file analysis

US11188646B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11188646-B2
Application number  US-201916663252-A
Country             US
Kind code           B2
Filing date         Oct 24, 2019
Priority date       Sep 1, 2016
Publication date    Nov 30, 2021
Grant date          Nov 30, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In one respect, there is provided a system for training a machine learning model to detect malicious container files. The system may include at least one processor and at least one memory. The at least one memory may include program code that provides operations when executed by the at least one processor. The operations may include: training, based on a training data, a machine learning model to enable the machine learning model to determine whether at least one container file includes at least one file rendering the at least one container file malicious; and providing the trained machine learning model to enable the determination of whether the at least one container file includes at least one file rendering the at least one container file malicious. Related methods and articles of manufacture, including computer program products, are also disclosed.

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising:
    training, based at least on training data, a machine learning model to enable the machine learning model to determine whether at least one container file includes at least one file rendering the at least one container file malicious, each container file encapsulating a plurality of files; and
    providing the trained machine learning model to enable a determination of whether at least one subsequently received container file includes at least one file rendering the at least one subsequently received container file malicious, the determination comprising a classification of the at least one subsequently received container file which is used to determine whether to access the plurality of files contained within the at least one subsequently received container file;
    wherein:
    the training data comprises a plurality of historical container files at least a portion of which are known to include the at least one file rendering the historical container file malicious;
    the trained machine learning model prevents misclassification by the trained machine learning model for different container files storing identical or similar sets of files in a different order;
    the trained machine learning model is a convolutional network comprising at least one convolutional layer having one or more learnable kernels configured to extract features from each of the files in the container files and to detect certain combinations of features in overlapping groups of two or more files;
    the convolutional neural network further comprises a pooling layer configured to identify maximum features from across the files; and
    the convolutional neural network classifies the container file as malicious or benign based on the identified maximum features.

2. The method of claim 1, wherein the features comprise one or more of: file name, file path or location, size, creator, owner, or Universal Resource Locator (URL).

3. The method of claim 1, wherein the at least one file rendering the historical container file malicious comprises a malicious file.

4. The method of claim 3, wherein the malicious file comprises unwanted data, an unwanted portion of a script, and/or an unwanted portion of program code.

5. The method of claim 1, wherein the at least one file rendering the historical container file malicious comprises a benign file rendering the historical container file malicious when combined with another benign file from the historical container file.

6. The method of claim 1, wherein the plurality of files includes a first file, a second file, and a third file.

7. The method of claim 6 further comprising: receiving the training data by at least receiving a first feature vector, a second feature vector, and a third feature vector that include one or more features of the respective first file, the second file, and the third file.

8. The method of claim 7, wherein the at least one convolution layer is configured to generate a first feature map by at least applying a first kernel to a plurality of overlapping groups of feature vectors.

9. The method of claim 8, wherein a first overlapping group of feature vectors includes the first feature vector and the second feature vector, and wherein a second overlapping group of feature vectors includes the second feature vector and the third feature vector.

10. The method of claim 9, wherein applying the first kernel includes computing a dot product between features included in the first kernel and features included in the first overlapping group of feature vectors to generate a first entry in the first feature map, and computing another dot product between features included in the first kernel and features included in the second overlapping group of feature vectors to generate a second entry in the first feature map.

11. The method of claim 10, wherein the computing of the dot product and the other dot product detects a presence of the features included in the first kernel in the first and second overlapping group of feature vectors.

12. The method of claim 8, wherein the convolution layer is further configured to generate a second feature map by at least applying a second kernel to the plurality of overlapping groups of feature vectors.

13. The method of claim 12, wherein the first kernel includes a combination of features, and wherein the second kernel includes a different combination of features.

14. The method of claim 13, wherein training the machine learning model includes processing the training data with the machine learning model to detect a presence of the at least one file in the training data, back propagating an error in the detection of the at least one file, and adjusting one or more weights and/or biases applied by the machine learning model to minimize the error in the detection of the at least one file.

15. The method of claim 14 further comprising: receiving another training data; and processing the other training data with the machine learning model to detect a presence of at least one file in the other training data rendering the other training data malicious, wherein the training includes readjusting the one or more weights and/or biases applied by the machine learning model to minimize an error in the detection of the at least one file in the other training data.

16. The method of claim 1, wherein the convolutional neural network further comprises: a dense layer to process the maximum features identified the pooling layer; and a dropout layer to drop out at least a portion of an output of the dense layer to remove sampling noise.

17. A computer-implemented method comprising:
    training, based at least on training data, a machine learning model to enable the machine learning model to determine whether at least one container file includes at least one file rendering the at least one container file malicious, each container file encapsulating a plurality of files; and
    providing the trained machine learning model to enable a determination of whether at least one subsequently received container file includes at least one file rendering the at least one subsequently received container file malicious, the determination comprising a classification of the at least one subsequently received container file which is used to determine whether to access the plurality of files contained within the at least one subsequently received container file;
    wherein:
    the training data comprises a plurality of historical container files at least a portion of which are known to include the at least one file rendering the historical container file malicious;
    features utilized by the trained machine learning model are selected from a group consisting of: file name, file path or location, size, creator, owner, or Universal Resource Locator (URL);
    the trained machine learning model is a convolutional network comprising at least one convolutional layer having one or more learnable kernels configured to extract features from each of the files in the container files and to detect certain combinations of features in overlapping groups of two or more files;
    the convolutional neural network further comprises a pooling layer configured to identify maximum features from across the files; and
    the convolutional neural network classifies the container file as malicious or benign based on the identified maximum features.

18. The method of claim 17, wherein the convolutional neural network further comprises: a dense layer to process the maximum features identified the pooling layer;
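The kernel and pooling steps recited in claims 8 to 10 and in the pooling clause of claim 1 can be illustrated with a small numeric sketch. This is not the patented implementation. The function names, the two-feature numeric encoding of each file, the kernel values, and the single logistic unit standing in for the dense layer are all assumptions chosen for illustration. The claims do not specify how features such as file name or size are turned into numbers.

```python
import numpy as np

def feature_map(file_vectors, kernel):
    """Apply one kernel to overlapping groups of adjacent file feature vectors.

    file_vectors: array (num_files, num_features), one row per file.
    kernel: array (group_size, num_features) of weights (hypothetical values).
    Returns one dot-product entry per overlapping group.
    """
    group_size = kernel.shape[0]
    num_groups = file_vectors.shape[0] - group_size + 1
    return np.array([
        np.sum(file_vectors[i:i + group_size] * kernel)  # dot product with the group
        for i in range(num_groups)
    ])

def classify(file_vectors, kernels, weights, bias):
    # One feature map per kernel; keep only each map's maximum entry.
    pooled = np.array([feature_map(file_vectors, k).max() for k in kernels])
    # Assumption: a single logistic unit stands in for the dense layer.
    score = 1.0 / (1.0 + np.exp(-(pooled @ weights + bias)))
    return pooled, score

# Toy container with three files and two made-up numeric features per file.
files = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# Two kernels, each spanning a group of two files (illustrative weights).
kernels = [np.array([[1.0, 0.0], [0.0, 1.0]]),
           np.array([[0.0, 1.0], [1.0, 1.0]])]
weights = np.array([0.5, 0.5])
bias = -1.0

# Groups are (file 1, file 2) and (file 2, file 3), as in claim 9.
first_map = feature_map(files, kernels[0])
pooled, score = classify(files, kernels, weights, bias)
print(first_map, pooled, round(float(score), 3))
```

With three files and a group size of two, each feature map has two entries, one per overlapping group. Because pooling keeps only the largest entry of each map, the pooled value reflects the strongest match for a kernel rather than the position in the container where that match occurred.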

Assignees

Cylance Inc

Classifications

  • Combinations of networks · CPC title

  • Supervised learning · CPC title

  • Convolutional networks [CNN, ConvNet] · CPC title

  • G06F21/562 · Primary

    Static detection · CPC title

  • using kernel methods, e.g. support vector machines [SVM] · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11188646B2 cover?
In one respect, there is provided a system for training a machine learning model to detect malicious container files. The system may include at least one processor and at least one memory. The at least one memory may include program code that provides operations when executed by the at least one processor. The operations may include: training, based on a training data, a machine learning model …
Who is the assignee on this patent?
Cylance Inc
What technology area does this patent fall under?
Primary CPC classification G06F21/562. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 30, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).