Data-driven identification of malicious files using machine learning and an ensemble of malware detection procedures

US10853489B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10853489-B2
Application number: US-201816165051-A
Country: US
Kind code: B2
Filing date: Oct 19, 2018
Priority date: Oct 19, 2018
Publication date: Dec 1, 2020
Grant date: Dec 1, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are provided for data-driven ensemble-based malware detection. An exemplary method comprises obtaining a file; extracting metadata from the file; obtaining a plurality of malware detection procedures; selecting a subset of the plurality of malware detection procedures to apply to the file utilizing a likelihood that each of the plurality of malware detection procedures will result in a malware detection for the file based on the extracted metadata; applying the selected subset of the malware detection procedures to the file; and processing results of the subset of malware detection procedures using a machine learning model to determine a probability of the file being malware.
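The steps in the abstract can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the metadata fields (`ext`, `size`), the function names, the `likelihood_fn` callback, and the logistic weights are all hypothetical and chosen only to make the flow concrete.

```python
import math
from typing import Callable, Dict, List, Tuple

Metadata = Dict[str, str]
Procedure = Callable[[bytes], bool]  # returns True when it flags the file


def extract_metadata(name: str, data: bytes) -> Metadata:
    """Hypothetical metadata: file extension and a coarse size bucket."""
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return {"ext": ext, "size": "small" if len(data) < 1024 else "large"}


def select_procedures(likelihoods: Dict[str, float],
                      max_procedures: int,
                      threshold: float) -> List[str]:
    """Keep procedures whose detection likelihood clears the threshold,
    most likely first, capped at max_procedures."""
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, p in ranked if p >= threshold][:max_procedures]


def score(results: Dict[str, bool], weights: Dict[str, float], bias: float) -> float:
    """Assumed logistic model over the binary results of the procedures run."""
    z = bias + sum(weights[name] for name, hit in results.items() if hit)
    return 1.0 / (1.0 + math.exp(-z))


def detect(name: str,
           data: bytes,
           procedures: Dict[str, Procedure],
           likelihood_fn: Callable[[str, Metadata], float],
           weights: Dict[str, float],
           bias: float,
           max_procedures: int = 2,
           threshold: float = 0.1) -> Tuple[List[str], float]:
    """Extract metadata, select a subset, apply it, and score the results."""
    meta = extract_metadata(name, data)
    likelihoods = {n: likelihood_fn(n, meta) for n in procedures}
    chosen = select_procedures(likelihoods, max_procedures, threshold)
    results = {n: procedures[n](data) for n in chosen}
    return chosen, score(results, weights, bias)
```

In this sketch, `detect` returns both the procedures that were actually run and the resulting malware probability, so a caller can see which procedures were skipped for a given file.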

First claim

Opening claim text (preview).

What is claimed is:
1. A method, comprising: obtaining a file; extracting metadata from the file; obtaining a plurality of malware detection procedures; selecting, using at least one processing device, a subset of the plurality of malware detection procedures to apply to the file utilizing a likelihood that each of the plurality of malware detection procedures will result in a malware detection for the file based on the extracted metadata; applying, using the at least one processing device, the selected subset of the malware detection procedures to the file; and processing, using the at least one processing device, results of the subset of malware detection procedures using a machine learning model to determine a probability of the file being malware.
2. The method of claim 1, wherein the step of selecting the subset of the malware detection procedures to apply to the file employs a Bayesian model that determines a probability that a given malware detection procedure will detect malware in the given file based on one or more historical executions of the given malware detection procedure and characteristics of historical files on which the given malware detection procedure was previously executed.
3. The method of claim 2, further comprising the step of updating the Bayesian model as new files are tested by the given malware detection procedure.
4. The method of claim 2, further comprising the step of obtaining a configuration of one or more of a substantially maximum number of malware detection procedures to be executed for a given file, a detection probability threshold, and one or more metadata features to be used for training the Bayesian model.
5. The method of claim 1, wherein the step of processing the results of the subset of the malware detection procedures using the machine learning model employs a supervised machine learning model that processes the results of the subset of the malware detection procedures as an input and models relationships within the results to generate a health score indicating whether the file is malware.
6. The method of claim 5, wherein the supervised machine learning model is trained using a plurality of historical files classified as malware as positive examples and a plurality of historical files classified as non-malicious as negative examples.
7. The method of claim 1, further comprising the step of generating one or more alerts for a detected malware based on one or more of a user configuration, at least one predefined rule and a predefined policy.
8. A system, comprising: a memory; and at least one processing device, coupled to the memory, operative to implement the following steps: obtaining a file; extracting metadata from the file; obtaining a plurality of malware detection procedures; selecting a subset of the plurality of malware detection procedures to apply to the file utilizing a likelihood that each of the plurality of malware detection procedures will result in a malware detection for the file based on the extracted metadata; applying the selected subset of the malware detection procedures to the file; and processing results of the subset of malware detection procedures using a machine learning model to determine a probability of the file being malware.
9. The system of claim 8, wherein the step of selecting the subset of the malware detection procedures to apply to the file employs a Bayesian model that determines a probability that a given malware detection procedure will detect malware in the given file based on one or more historical executions of the given malware detection procedure and characteristics of historical files on which the given malware detection procedure was previously executed.
10. The system of claim 9, further comprising the steps of updating the Bayesian model as new files are tested by the given malware detection procedure and obtaining a configuration of one or more of a substantially maximum number of malware detection procedures to be executed for a given file, a detection probability threshold, and one or more metadata features to be used for training the Bayesian model.
11. The system of claim 8, wherein the step of processing the results of the subset of the malware detection procedures using the machine learning model employs a supervised machine learning model that processes the results of the subset of the malware detection procedures as an input and models relationships within the results to generate a health score indicating whether the file is malware.
12. The system of claim 11, wherein the supervised machine learning model is trained using a plurality of historical files classified as malware as positive examples and a plurality of historical files classified as non-malicious as negative examples.
13. The system of claim 8, further comprising the step of generating one or more alerts for a detected malware based on one or more of a user configuration, at least one predefined rule and a predefined policy.
14. A computer program product, comprising a non-transitory machine-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by at least one processing device perform the following steps: obtaining a file; extracting metadata from the file; obtaining a plurality of malware detection procedures; selecting a subset of the plurality of malware detection procedures to apply to the file utilizing a likelihood that each of the plurality of malware detection procedures will result in a malware detection for the file based on the extracted metadata; applying the selected subset of the malware detection procedures to the file; and processing results of the subset of malware detection procedures using a machine learning model to determine a probability of the file being malware.
15. The computer program product of claim 14, wherein the step of selecting the subset of the malware detection procedures to apply to the file employs a Bayesian model that determines a probability that a given malware detection procedure will detect malware in the given file based on one or more historical executions of the given malware detection procedure and characteristics of historical files on which the given malware detection procedure was previously executed.
16. The computer program product of claim 15, further comprising the step of updating the Bayesian model as new files are tested by the given malware detection procedure.
17. The computer program product of claim 15, further comprising the step of obtaining a configuration of one or more of a substantially maximum number of malware detection procedures to be executed for a given file, a detection probability threshold, and one or more metadata features to be used for training the Bayesian model.
18. The computer program product of claim 14, wherein the step of processing the results of the subset of the malware detection procedures using the machine learning model employs a supervised machine learning model that processes the results of the subset of the malware detection procedures as an input and models relationships within the results to generate a health score indicating whether the file is malware.
19. The computer program product of claim 18, wherein the supervised machine learning model is trained using a plurality of historical files classified as malware as positive examples and a plurality of historical files classified as non-malicious as negative examples.
20. The
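The Bayesian selection described in claims 2 to 4 could look roughly like the following. The claims do not specify the form of the Bayesian model, so this sketch assumes a Beta-Bernoulli model with a uniform Beta(1, 1) prior, keyed on the configured metadata features; the class names, default values, and the `ext` feature are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SelectorConfig:
    """Configuration in the spirit of claim 4 (all defaults are assumptions)."""
    max_procedures: int = 2               # cap on procedures run per file
    threshold: float = 0.2                # minimum detection probability
    features: Tuple[str, ...] = ("ext",)  # metadata features the model uses


@dataclass
class BayesianSelector:
    config: SelectorConfig
    # (procedure name, feature values) -> [detections, runs]
    counts: Dict[tuple, List[int]] = field(default_factory=dict)

    def _key(self, procedure: str, metadata: Dict[str, str]) -> tuple:
        values = tuple(metadata.get(f) for f in self.config.features)
        return (procedure, values)

    def probability(self, procedure: str, metadata: Dict[str, str]) -> float:
        """Posterior mean of the detection rate under a Beta(1, 1) prior."""
        detections, runs = self.counts.get(self._key(procedure, metadata), (0, 0))
        return (detections + 1) / (runs + 2)

    def update(self, procedure: str, metadata: Dict[str, str], detected: bool) -> None:
        """Fold one new execution result into the model (claim 3)."""
        entry = self.counts.setdefault(self._key(procedure, metadata), [0, 0])
        entry[0] += int(detected)
        entry[1] += 1

    def select(self, procedures: List[str], metadata: Dict[str, str]) -> List[str]:
        """Rank by posterior probability, drop those below the threshold,
        and keep at most max_procedures."""
        ranked = sorted(procedures,
                        key=lambda p: self.probability(p, metadata),
                        reverse=True)
        keep = [p for p in ranked
                if self.probability(p, metadata) >= self.config.threshold]
        return keep[: self.config.max_procedures]
```

With no history for a given feature value, every procedure starts at probability 0.5 under this prior, so unseen file types fall back to running the first `max_procedures` procedures in the order supplied.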

Assignees

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks

  • Machine learning

  • Test or assess software

  • involving event detection and direct action

  • eliminating virus, restoring damaged files

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10853489B2 cover?
Techniques are provided for data-driven ensemble-based malware detection. An exemplary method comprises obtaining a file; extracting metadata from the file; obtaining a plurality of malware detection procedures; selecting a subset of the plurality of malware detection procedures to apply to the file utilizing a likelihood that each of the plurality of malware detection procedures will result in…
Who is the assignee on this patent?
EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F21/565. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 1, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).