Evaluating automatic malware classifiers in the absence of reference labels

US11977632B2 · US · B2

Patent metadata
Publication number: US-11977632-B2
Application number: US-202117238248-A
Country: US
Kind code: B2
Filing date: Apr 23, 2021
Priority date: Apr 23, 2020
Publication date: May 7, 2024
Grant date: May 7, 2024

Abstract

Official abstract text for this publication.

Disclosed are methods and apparatuses for classifier evaluation. The evaluation involves constructing a ground truth refinement having a degree of error within specified bounds from a malware reference dataset as an approximate ground truth refinement. The evaluation further involves using the approximate ground truth refinement to determine at least one of: a lower bound on precision or an upper bound on recall and accuracy. The evaluation further involves evaluating a classifier by evaluating at least one of a classification method or clustering method by examining changes to the upper bound and/or the lower bound produced by the approximate ground truth refinement.
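The bounds named in the abstract can be illustrated with a small sketch. The abstract does not define precision or recall, so this sketch assumes the max-overlap definitions that are common in clustering evaluation: precision averages, over predicted clusters, the largest overlap with any reference cluster, and recall does the same with the roles swapped. It also assumes a refinement with no errors, meaning any two samples grouped together in the refinement really belong to the same reference class. The function names and the toy labels are hypothetical, and the accuracy bound is not covered.

```python
from collections import Counter, defaultdict


def max_overlap_score(groups_a, groups_b):
    """For each cluster in A, take its largest overlap with any cluster in B,
    sum those overlaps, and divide by the number of samples.

    Both arguments are sequences of cluster ids aligned by sample index.
    """
    if len(groups_a) != len(groups_b):
        raise ValueError("label sequences must have the same length")
    if not groups_a:
        raise ValueError("label sequences must not be empty")
    overlap = defaultdict(Counter)
    for a, b in zip(groups_a, groups_b):
        overlap[a][b] += 1
    return sum(max(counts.values()) for counts in overlap.values()) / len(groups_a)


def precision_lower_bound(predicted, refinement):
    # Assumption: precision is the max-overlap score of predicted vs. reference.
    # Scoring against a refinement can only shrink each overlap.
    return max_overlap_score(predicted, refinement)


def recall_upper_bound(predicted, refinement):
    # Assumption: recall is the max-overlap score of reference vs. predicted.
    # Splitting reference clusters into finer pieces can only raise the sum.
    return max_overlap_score(refinement, predicted)


# Toy data (hypothetical). `truth` is shown only to check the bounds;
# the evaluation itself never needs it.
predicted = ["a", "a", "b", "b", "b", "b"]
refinement = [0, 0, 1, 2, 2, 3]
truth = ["X", "X", "X", "Y", "Y", "Y"]

p_lower = precision_lower_bound(predicted, refinement)
r_upper = recall_upper_bound(predicted, refinement)
true_precision = max_overlap_score(predicted, truth)
true_recall = max_overlap_score(truth, predicted)
```

On this toy data the lower bound on precision is 4/6 against a true precision of 5/6, and the upper bound on recall is 1.0 against a true recall of 5/6. Watching how these two numbers move as a classifier or clustering method changes is the kind of comparison the abstract describes.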

First claim

Opening claim text (preview).

What is claimed is:

1. A method for malware classifier evaluation, the method comprising: deriving, using a processor, a collection of malware samples; constructing, using the processor, a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from the reference dataset as an approximate ground truth refinement, wherein: the reference dataset includes at least some data without a reference label; and a reference label or any data in the dataset is a label a classifier is expected to predict; determining, using the processor: a lower bound on precision using the approximate ground truth refinement; and an upper bound on recall and accuracy using the approximate ground truth refinement; evaluating, using the processor, the classifier for at least one task of a classification method or a clustering method by examining changes to the upper bound or changes to the lower bound produced by the approximate ground truth refinement, wherein the evaluation is performed without having to use a reference label or at least some data in the reference dataset; and detecting, using the processor, biased or misleading evaluation results caused by the classifier method or clustering method.

2. The method of claim 1, comprising: altering, using the processor, parameters of the classification method or the clustering method of the classifier by examining changes to the upper bound on recall, changes to the upper bound on accuracy, or changes to the lower bound on precision produced by the approximate ground truth refinement.

3. The method of claim 1, comprising: refining, using the processor, predictions of the classification method or the clustering method using the approximate ground truth refinement to improve accuracy, precision, and/or recall of the classification method or the clustering method.

4. The method of claim 1, comprising: computing, using the processor, a metadata hash digest of at least two malware dataset samples; assigning, using the processor, to a same malware cluster all malware samples of the at least two malware samples that share a metadata hash digest; assigning, using the processor, to a singleton cluster, each malware sample of the at least two malware samples for which the metadata hash digest cannot be computed.

5. The method of claim 4, wherein: the metadata hash is a peHash.

6. A method for improving a malware classifier for an input data file, the method comprising: deriving, using a processor, a collection of malware samples; constructing, using the processor, a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from the reference dataset as an approximate ground truth refinement, wherein: the reference dataset includes at least some data without a reference label; and a reference label or any data in the dataset is a label a classifier is expected to predict; determining, using the processor: a lower bound on precision using the approximate ground truth refinement; and an upper bound on recall and accuracy using the approximate ground truth refinement; evaluating, using the processor, the classifier for at least one task of a classification method or clustering method by examining changes to the upper bound or changes to the lower bound produced by the approximate ground truth refinement, wherein the evaluation is performed without having to use a reference label or at least some data in the reference dataset; classifying, using the processor, an input data file using the classifier evaluated by the approximate ground truth refinement having the upper bound and the lower bound; detecting, using the processor, biased or misleading evaluation results caused by the classifier method or the clustering method; and at least one of, using the processor: altering parameters of the classification method or the clustering method used by the classifier; or refining predictions of the classification method or the clustering method used by the classifier.

7. The method according to claim 6, comprising: classifying, using the processor, the input data file as malware being within a known family of malware using the classifier evaluated by the approximate ground truth refinement having the upper bound and the lower bound.

8. The method of claim 6, comprising: computing, using the processor, a metadata hash digest of at least two malware samples; assigning to a same malware cluster all malware samples of the at least two malware samples that share a metadata hash digest; assigning, using the processor, to a singleton cluster, each malware sample of the at least two malware samples for which the metadata hash digest cannot be computed.

9. The method of claim 8, wherein the metadata hash is a peHash.

10. An apparatus for evaluating a malware classifier, the apparatus comprising: a classifier for classifying an input data file; and a hardware processor for evaluating the classifier, wherein the hardware processor includes memory with a computer program which when executed will cause the hardware processor to: derive a collection of malware samples; construct a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from a reference dataset as an approximate ground truth refinement, wherein: the reference data set includes at least some data without a reference label; and a reference label or any data in the dataset is a label a classifier is expected to predict; determine: a lower bound on precision using the approximate ground truth refinement; and an upper bound on recall and accuracy using the approximate ground truth refinement; evaluate the classifier for at least one task of a classification method or a clustering method by examining changes to the upper bound or changes to the lower bound produced by the approximate ground truth refinement, wherein the evaluation is performed without having to use a reference label or at least some data in the reference dataset; and detect biased or misleading evaluation results caused by the classifier method or clustering method.

11. The apparatus of claim 10, wherein: the computer program which when executed will cause the hardware processor to alter parameters of the classification method or the clustering method of the classifier by examining changes to the upper bound or the lower bound produced by the approximate ground truth refinement.

12. The apparatus of claim 10, wherein: the computer program which when executed will cause the hardware processor to refine predictions of the classifier method or the clustering method of the classifier using the ground truth refinement to improve accuracy, precision, and/or recall of the classifier method or the clustering method.

13. The apparatus of claim 10, wherein: the classifier classifies the input data file as malware.

14. A method for malware classifier evaluation, the method comprising: deriving, using a processor, a collection of malware samples; constructing, using the processor, a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from the reference dataset as an approximate ground truth refinement, wherein: the reference dataset includes at least some data without a reference label; and a reference label or any data in the dataset is a label a
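The clustering step recited in claims 4 and 8 (group samples that share a metadata hash digest, and give each sample whose digest cannot be computed its own singleton cluster) might be sketched as follows. This is only an illustration under stated assumptions: `cluster_by_digest` and `digest_fn` are hypothetical names, and the digest function is passed in as a parameter rather than tied to any particular peHash implementation. The sketch treats a return value of `None` or a raised exception as "cannot be computed".

```python
from collections import defaultdict


def cluster_by_digest(sample_ids, digest_fn):
    """Group sample ids by digest; samples without a digest become singletons.

    digest_fn is a caller-supplied function (hypothetical here) that returns
    a hashable digest for a sample id, or returns None / raises an exception
    when no digest can be computed.
    """
    shared = defaultdict(list)   # digest -> sample ids sharing that digest
    singletons = []              # one single-element cluster per failed sample
    for sample_id in sample_ids:
        try:
            digest = digest_fn(sample_id)
        except Exception:
            digest = None
        if digest is None:
            singletons.append([sample_id])
        else:
            shared[digest].append(sample_id)
    return list(shared.values()) + singletons


# Toy digests (hypothetical); a real pipeline would compute them from files.
toy_digests = {"s1": "h1", "s2": "h1", "s3": "h2", "s4": None, "s5": None}
clusters = cluster_by_digest(toy_digests, toy_digests.get)
```

Here `s1` and `s2` share a digest and land in one cluster, `s3` is alone because no other sample shares its digest, and `s4` and `s5` each get a singleton cluster because no digest is available. Keeping failed samples apart rather than pooling them avoids asserting a relationship between samples that the hash gives no evidence for.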

Assignees

Booz Allen Hamilton Inc

Inventors

Classifications

  • G06F21/564 (Primary): by virus signature recognition

  • Clustering techniques

  • Multiple classes

  • involving event detection and direct action

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Booz Allen Hamilton Inc
What technology area does this patent fall under?
Primary CPC classification G06F21/564. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 7, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).