Scalable training of random forests for high precise malware detection

US2019102337A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2019102337-A1
Application number: US-201715722412-A
Country: US
Kind code: A1
Filing date: Oct 2, 2017
Priority date: Oct 2, 2017
Publication date: Apr 4, 2019
Grant date: (none listed)

Abstract

Official abstract text for this publication.

In one embodiment, a device trains a machine learning-based malware classifier using a first randomly selected subset of samples from a training dataset. The classifier comprises a random decision forest. The device identifies, using at least a portion of the training dataset as input to the malware classifier, a set of misclassified samples from the training dataset that the malware classifier misclassifies. The device retrains the malware classifier using a second randomly selected subset of samples from the training dataset and the identified set of misclassified samples. The device adjusts prediction labels of individual leaves of the random decision forest of the retrained malware classifier based in part on decision changes in the forest that result from assessing the entire training dataset with the classifier. The device sends the malware classifier with the adjusted prediction labels for deployment into a network.
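
The train, identify, and retrain steps in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the names `iterative_train`, `fit`, and `subset_size` are hypothetical, the whole dataset is used to find misclassified samples (the abstract only requires "at least a portion"), and a 1-nearest-neighbour learner stands in for the random decision forest so the example runs with the standard library alone. The cap of five rounds and the "no additional misclassified samples" exit mirror the stopping criteria named in claims 3 and 4.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

Features = Tuple[float, ...]
Sample = Tuple[Features, int]        # (feature vector, label)
Model = Callable[[Features], int]    # maps a feature vector to a predicted label


def iterative_train(
    data: Sequence[Sample],
    fit: Callable[[List[Sample]], Model],
    subset_size: int,
    max_iters: int = 5,
    seed: int = 0,
) -> Tuple[Model, List[int]]:
    """Train on a random subset, then retrain on fresh subsets plus known errors."""
    rng = random.Random(seed)
    indices = range(len(data))
    hard: Set[int] = set()  # indices of every sample misclassified so far

    # Initial training on a first randomly selected subset.
    model = fit([data[i] for i in rng.sample(indices, subset_size)])

    for _ in range(max_iters):
        # Assumption: evaluate on the full dataset to find misclassified samples.
        wrong = {i for i in indices if model(data[i][0]) != data[i][1]}
        new = wrong - hard
        if not new:  # stop when no additional misclassified samples appear
            break
        hard |= new
        # Retrain on a new random subset together with the misclassified samples.
        chosen = set(rng.sample(indices, subset_size)) | hard
        model = fit([data[i] for i in sorted(chosen)])

    return model, sorted(hard)


def fit_nearest_neighbor(train: List[Sample]) -> Model:
    """Hypothetical stand-in learner (1-nearest neighbour), NOT a random forest."""
    def predict(x: Features) -> int:
        nearest = min(
            train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x))
        )
        return nearest[1]
    return predict


# Toy data: one numeric feature, label 1 for every multiple of five.
data = [((float(x),), int(x % 5 == 0)) for x in range(40)]
model, hard = iterative_train(data, fit_nearest_neighbor, subset_size=8)
print(f"{len(hard)} samples were added as previously misclassified")
```

Any learner that exposes the same `fit` shape, including a random-forest trainer, could be passed in place of the stand-in; each retraining set stays close to `subset_size` plus the accumulated errors rather than growing to the full dataset.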

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: training, by a device, a machine learning-based malware classifier using a first randomly selected subset of samples from a training dataset, wherein the classifier comprises a random decision forest; identifying, by the device and using at least a portion of the training dataset as input to the malware classifier, a set of misclassified samples from the training dataset that the malware classifier misclassifies; retraining, by the device, the malware classifier using a second randomly selected subset of samples from the training dataset and the identified set of misclassified samples; adjusting, by the device, prediction labels of individual leaves of the random decision forest of the retrained malware classifier based in part on decision changes in the forest that result from assessing the entire training dataset with the classifier; and sending, by the device, the malware classifier with the adjusted prediction labels for deployment into a network.

2. The method as in claim 1, further comprising: iteratively, by the device and before adjusting the prediction labels, repeating the identifying and retraining steps using different randomly selected subsets of samples from the training dataset until a stopping criterion is met.

3. The method as in claim 2, wherein the stopping criterion comprises at least one of: a predefined number of iterations or no additional misclassified samples are identified in an iteration.

4. The method as in claim 3, wherein the predefined number of iterations is five or fewer.

5. The method as in claim 1, further comprising: pruning, by the device and after adjusting the prediction labels of the individual leaves of the random decision forest, leaves from the random decision forest.

6. The method as in claim 5, wherein pruning the leaves from the random decision forest comprises: merging child nodes of a parent node in the random decision forest into the parent node, when the child nodes give equivalent malware predictions.

7. The method as in claim 1, wherein adjusting the prediction labels of individual leaves of the random decision forest comprises: computing, by the device, a histogram of objects from the training dataset that are assessed by a particular leaf of the random decision forest; and using, by the device, the histogram to determine a final prediction label for the particular leaf.

8. The method as in claim 7, wherein using the histogram to determine the final prediction label for the leaf comprises: performing, by the device, soft voting of class labels in the particular leaf.

9. The method as in claim 7, wherein using the histogram to determine the final prediction label for the leaf comprises: identifying, by the device, the final prediction label as a predicted class having the highest object count.

10. An apparatus comprising: one or more network interfaces to communicate with a network; a processor coupled to the network interfaces and configured to execute one or more processes; and a memory configured to store a process executable by the processor, the process when executed configured to: train a machine learning-based malware classifier using a first randomly selected subset of samples from a training dataset, wherein the classifier comprises a random decision forest; identify, using at least a portion of the training dataset as input to the malware classifier, a set of misclassified samples from the training dataset that the malware classifier misclassifies; retrain the malware classifier using a second randomly selected subset of samples from the training dataset and the identified set of misclassified samples; adjust prediction labels of individual leaves of the random decision forest of the retrained malware classifier based in part on decision changes in the forest that result from assessing the entire training dataset with the classifier; and send the malware classifier with the adjusted prediction labels for deployment into a network.

11. The apparatus as in claim 10, wherein the process when executed is further configured to: iteratively, and before adjusting the prediction labels, repeating the identifying and retraining steps using different randomly selected subsets of samples from the training dataset until a stopping criterion is met.

12. The apparatus as in claim 11, wherein the stopping criterion comprises at least one of: a predefined number of iterations or no additional misclassified samples are identified in an iteration.

13. The apparatus as in claim 12, wherein the predefined number of iterations is five or fewer.

14. The apparatus as in claim 10, wherein the process when executed is further configured to: prune, after adjusting the prediction labels of the individual leaves of the random decision forest, leaves from the random decision forest.

15. The apparatus as in claim 14, wherein pruning the leaves from the random decision forest comprises: merging child nodes of a parent node in the random decision forest into the parent node, when the child nodes give equivalent malware predictions.

16. The apparatus as in claim 10, wherein the apparatus adjusts the prediction labels of individual leaves of the random decision forest by: computing a histogram of objects from the training dataset that are assessed by a particular leaf of the random decision forest; and using the histogram to determine a final prediction label for the particular leaf.

17. The apparatus as in claim 16, wherein the apparatus uses the histogram to determine the final prediction label for the leaf by: performing, soft voting of class labels in the particular leaf.

18. The apparatus as in claim 16, wherein the apparatus uses the histogram to determine the final prediction label for the leaf by: identifying the final prediction label as a predicted class having the highest object count.

19. A tangible, non-transitory, computer-readable medium storing program instructions that cause a device to execute a process comprising: training, by the device, a machine learning-based malware classifier using a first randomly selected subset of samples from a training dataset, wherein the classifier comprises a random decision forest; identifying, by the device and using at least a portion of the training dataset as input to the malware classifier, a set of misclassified samples from the training dataset that the malware classifier misclassifies; retraining, by the device, the malware classifier using a second randomly selected subset of samples from the training dataset and the identified set of misclassified samples; adjusting, by the device, prediction labels of individual leaves of the random decision forest of the retrained malware classifier based in part on decision changes in the forest that result from assessing the entire training dataset with the classifier; and sending, by the device, the malware classifier with the adjusted prediction labels for deployment into a network.

20. The computer-readable medium as in claim 19, wherein the process when executed further comprises: pruning, by the device and after adjusting the prediction labels of the individual leaves of the random decision forest, leaves from the random decision forest.
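
The leaf relabeling of claims 7 and 9 and the pruning of claims 5 and 6 can be illustrated on a single toy tree. The sketch below is an assumption-laden illustration rather than the patented procedure: the `Node` class and the functions `relabel_leaves`, `prune`, and `count_leaves` are hypothetical names, the histogram is assumed to count the true labels of the training samples that reach each leaf, and the label with the highest count becomes the leaf's prediction. Soft voting (claim 8) is not shown. In a forest, the same two passes would presumably be applied to every tree.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional, Sequence, Tuple


@dataclass
class Node:
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None                       # meaningful on leaves only
    hist: Counter = field(default_factory=Counter)    # label counts at this leaf

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaf_for(node: Node, x: Sequence[float]) -> Node:
    """Follow the split rules from the root down to the leaf that assesses x."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node


def relabel_leaves(root: Node, data: Sequence[Tuple[Sequence[float], int]]) -> None:
    """Histogram the full dataset per leaf, then label each leaf by highest count."""
    for x, y in data:
        leaf_for(root, x).hist[y] += 1
    stack = [root]
    while stack:
        node = stack.pop()
        if node.is_leaf():
            if node.hist:  # leaves that no sample reaches keep their old label
                node.label = node.hist.most_common(1)[0][0]
        else:
            stack.extend([node.left, node.right])


def prune(node: Node) -> Node:
    """Merge sibling leaves into their parent when both predict the same label."""
    if node.is_leaf():
        return node
    node.left, node.right = prune(node.left), prune(node.right)
    if (node.left.is_leaf() and node.right.is_leaf()
            and node.left.label == node.right.label):
        return Node(label=node.left.label, hist=node.left.hist + node.right.hist)
    return node


def count_leaves(node: Node) -> int:
    if node.is_leaf():
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


# Toy tree trained on a subset: the middle leaf (2 < x <= 5) was labeled 1.
root = Node(feature=0, threshold=5.0,
            left=Node(feature=0, threshold=2.0,
                      left=Node(label=0), right=Node(label=1)),
            right=Node(label=1))
full_data = [((1.0,), 0), ((2.0,), 0), ((3.0,), 0), ((4.0,), 0),
             ((5.0,), 1), ((6.0,), 1), ((7.0,), 1), ((8.0,), 1)]

leaves_before = count_leaves(root)
relabel_leaves(root, full_data)   # middle leaf sees labels {0: 2, 1: 1}
root = prune(root)                # both left-hand leaves now predict 0
leaves_after = count_leaves(root)
print(leaves_before, "leaves before,", leaves_after, "after")
```

In this toy case the full dataset flips the middle leaf from 1 to 0, which makes it equivalent to its sibling, so the two merge and the tree shrinks from three leaves to two without changing any prediction.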

Assignees

Cisco Tech Inc

Inventors

Classifications

  • G06N20/20 · Primary

    Ensemble learning · CPC title

  • Machine learning · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • Tree-organised classifiers · CPC title

  • Classification techniques · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2019102337A1 cover?
In one embodiment, a device trains a machine learning-based malware classifier using a first randomly selected subset of samples from a training dataset. The classifier comprises a random decision forest. The device identifies, using at least a portion of the training dataset as input to the malware classifier, a set of misclassified samples from the training dataset that the malware classifier…
Who is the assignee on this patent?
Cisco Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06N20/20. Mapped technology areas include Physics.
When was this patent published?
Publication date: Apr 4, 2019 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).