Methods and apparatus for unknown sample classification using agglomerative clustering
US-2021342447-A1 · Nov 4, 2021 · US
US11977632B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11977632-B2 |
| Application number | US-202117238248-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 23, 2021 |
| Priority date | Apr 23, 2020 |
| Publication date | May 7, 2024 |
| Grant date | May 7, 2024 |
Official abstract text for this publication.
Disclosed are methods and apparatuses for classifier evaluation. The evaluation involves constructing a ground truth refinement having a degree of error within specified bounds from a malware reference dataset as an approximate ground truth refinement. The evaluation further involves using the approximate ground truth refinement to determine at least one of: a lower bound on precision or an upper bound on recall and accuracy. The evaluation further involves evaluating a classifier by evaluating at least one of a classification method or clustering method by examining changes to the upper bound and/or the lower bound produced by the approximate ground truth refinement.
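The abstract's two bounds can be illustrated with a small sketch. The excerpt does not state formulas, so this assumes the overlap-based definitions of clustering precision and recall that are common in malware clustering evaluation: each cluster on one side is matched to its best-overlapping cluster on the other. It also assumes the refinement only ever splits true families and never merges them. The names `max_overlap_score`, `precision_lower_bound`, and `recall_upper_bound`, and the toy data, are hypothetical.

```python
from collections import Counter


def max_overlap_score(outer, inner):
    """Match each `outer` cluster to its best-overlapping `inner` cluster and
    return the fraction of samples covered by those matches.

    Both arguments map sample id -> cluster id over the same samples.
    """
    overlaps = {}
    for sample, outer_id in outer.items():
        overlaps.setdefault(outer_id, Counter())[inner[sample]] += 1
    return sum(max(c.values()) for c in overlaps.values()) / len(outer)


def precision_lower_bound(predicted, refinement):
    # Assumed definition: precision measured against the refinement.
    # Refinement clusters are subsets of true families, so overlaps can only shrink.
    return max_overlap_score(predicted, refinement)


def recall_upper_bound(predicted, refinement):
    # Assumed definition: recall measured against the refinement.
    # Splitting a true family into pieces can only raise the summed best overlaps.
    return max_overlap_score(refinement, predicted)


# Toy data (hypothetical): six samples, no family labels needed.
refinement = {"a": "r1", "b": "r1", "c": "r2", "d": "r3", "e": "r3", "f": "r4"}
predicted = {"a": "p1", "b": "p1", "c": "p1", "d": "p1", "e": "p2", "f": "p2"}

print(precision_lower_bound(predicted, refinement))
print(recall_upper_bound(predicted, refinement))
```

Under these assumptions, re-running the two functions after changing a clustering parameter shows how the bounds move, which is the kind of comparison the abstract describes.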
Opening claim text (preview).
What is claimed is:
1. A method for malware classifier evaluation, the method comprising: deriving, using a processor, a collection of malware samples; constructing, using the processor, a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from the reference dataset as an approximate ground truth refinement, wherein: the reference dataset includes at least some data without a reference label; and a reference label or any data in the dataset is a label a classifier is expected to predict; determining, using the processor: a lower bound on precision using the approximate ground truth refinement; and an upper bound on recall and accuracy using the approximate ground truth refinement; evaluating, using the processor, the classifier for at least one task of a classification method or a clustering method by examining changes to the upper bound or changes to the lower bound produced by the approximate ground truth refinement, wherein the evaluation is performed without having to use a reference label or at least some data in the reference dataset; and detecting, using the processor, biased or misleading evaluation results caused by the classifier method or clustering method.
2. The method of claim 1, comprising: altering, using the processor, parameters of the classification method or the clustering method of the classifier by examining changes to the upper bound on recall, changes to the upper bound on accuracy, or changes to the lower bound on precision produced by the approximate ground truth refinement.
3. The method of claim 1, comprising: refining, using the processor, predictions of the classification method or the clustering method using the approximate ground truth refinement to improve accuracy, precision, and/or recall of the classification method or the clustering method.
4. The method of claim 1, comprising: computing, using the processor, a metadata hash digest of at least two malware dataset samples; assigning, using the processor, to a same malware cluster all malware samples of the at least two malware samples that share a metadata hash digest; assigning, using the processor, to a singleton cluster, each malware sample of the at least two malware samples for which the metadata hash digest cannot be computed.
5. The method of claim 4, wherein: the metadata hash is a peHash.
6. A method for improving a malware classifier for an input data file, the method comprising: deriving, using a processor, a collection of malware samples; constructing, using the processor, a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from the reference dataset as an approximate ground truth refinement, wherein: the reference dataset includes at least some data without a reference label; and a reference label or any data in the dataset is a label a classifier is expected to predict; determining, using the processor: a lower bound on precision using the approximate ground truth refinement; and an upper bound on recall and accuracy using the approximate ground truth refinement; evaluating, using the processor, the classifier for at least one task of a classification method or clustering method by examining changes to the upper bound or changes to the lower bound produced by the approximate ground truth refinement, wherein the evaluation is performed without having to use a reference label or at least some data in the reference dataset; classifying, using the processor, an input data file using the classifier evaluated by the approximate ground truth refinement having the upper bound and the lower bound; detecting, using the processor, biased or misleading evaluation results caused by the classifier method or the clustering method; and at least one of, using the processor: altering parameters of the classification method or the clustering method used by the classifier; or refining predictions of the classification method or the clustering method used by the classifier.
7. The method according to claim 6, comprising: classifying, using the processor, the input data file as malware being within a known family of malware using the classifier evaluated by the approximate ground truth refinement having the upper bound and the lower bound.
8. The method of claim 6, comprising: computing, using the processor, a metadata hash digest of at least two malware samples; assigning to a same malware cluster all malware samples of the at least two malware samples that share a metadata hash digest; assigning, using the processor, to a singleton cluster, each malware sample of the at least two malware samples for which the metadata hash digest cannot be computed.
9. The method of claim 8, wherein the metadata hash is a peHash.
10. An apparatus for evaluating a malware classifier, the apparatus comprising: a classifier for classifying an input data file; and a hardware processor for evaluating the classifier, wherein the hardware processor includes memory with a computer program which when executed will cause the hardware processor to: derive a collection of malware samples; construct a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from a reference dataset as an approximate ground truth refinement, wherein: the reference data set includes at least some data without a reference label; and a reference label or any data in the dataset is a label a classifier is expected to predict; determine: a lower bound on precision using the approximate ground truth refinement; and an upper bound on recall and accuracy using the approximate ground truth refinement; evaluate the classifier for at least one task of a classification method or a clustering method by examining changes to the upper bound or changes to the lower bound produced by the approximate ground truth refinement, wherein the evaluation is performed without having to use a reference label or at least some data in the reference dataset; and detect biased or misleading evaluation results caused by the classifier method or clustering method.
11. The apparatus of claim 10, wherein: the computer program which when executed will cause the hardware processor to alter parameters of the classification method or the clustering method of the classifier by examining changes to the upper bound or the lower bound produced by the approximate ground truth refinement.
12. The apparatus of claim 10, wherein: the computer program which when executed will cause the hardware processor to refine predictions of the classifier method or the clustering method of the classifier using the ground truth refinement to improve accuracy, precision, and/or recall of the classifier method or the clustering method.
13. The apparatus of claim 10, wherein: the classifier classifies the input data file as malware.
14. A method for malware classifier evaluation, the method comprising: deriving, using a processor, a collection of malware samples; constructing, using the processor, a ground truth refinement from a reference dataset comprising the malware samples, the ground truth refinement having a degree of error within specified bounds from the reference dataset as an approximate ground truth refinement, wherein: the reference dataset includes at least some data without a reference label; and a reference label or any data in the dataset is a label a
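
The grouping step recited in claims 4 and 8 (samples sharing a metadata hash digest go into one cluster; samples whose digest cannot be computed each get a singleton cluster) might be sketched as follows. This is only an illustration: `cluster_by_metadata_hash` and `toy_digest` are hypothetical names, and `toy_digest` is a stand-in that hashes the first four bytes of a file. It is not peHash, which the claims name as one possible metadata hash.

```python
import hashlib
from collections import defaultdict


def toy_digest(data):
    """Stand-in metadata hash (assumption, NOT peHash): digest of the first
    four bytes, or None when the input is too short to compute one."""
    if len(data) < 4:
        return None
    return hashlib.sha1(data[:4]).hexdigest()


def cluster_by_metadata_hash(samples, digest_fn):
    """Group sample names by digest; samples with no digest become singletons.

    `samples` maps sample name -> raw bytes. `digest_fn` returns a digest
    string, or None when the digest cannot be computed.
    """
    shared = defaultdict(list)
    singletons = []
    for name, data in samples.items():
        digest = digest_fn(data)
        if digest is None:
            singletons.append([name])  # no digest: its own cluster
        else:
            shared[digest].append(name)  # same digest: same cluster
    return list(shared.values()) + singletons


# Toy inputs (hypothetical byte strings, not real malware).
samples = {
    "a": b"MZ\x90\x00aaa",
    "b": b"MZ\x90\x00bbb",
    "c": b"MZ\x91\x00ccc",
    "d": b"MZ",
}
clusters = cluster_by_metadata_hash(samples, toy_digest)
print(clusters)
```

Passing the digest function as a parameter keeps the grouping logic independent of the hash used, so a real metadata hash could be substituted without changing the clustering code.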
by virus signature recognition · CPC title
Clustering techniques · CPC title
Multiple classes · CPC title
involving event detection and direct action · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title