Automatic threat detection of executable files based on static data analysis
US-11409869-B2 · Aug 9, 2022 · US
US12292971B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12292971-B2 |
| Application number | US-202117524930-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 12, 2021 |
| Priority date | Nov 13, 2020 |
| Publication date | May 6, 2025 |
| Grant date | May 6, 2025 |
Official abstract text for this publication.
Statistical properties of known malware distributions may be used to improve estimates of malware detection metrics such as a base rate of malicious events in a target environment or missed detections (also referred to as false negatives). In particular, numerous synthetic sample distributions may be generated based on the statistical properties of a base data set and/or additional observed data, and used to identify malware distributions that produce overall detection statistics corresponding to model output for live target data. The malware detection metrics for the live target data can then be characterized using the observed distributions of malware (and malware detections) for the synthetic sample distributions.
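The estimation loop described in the abstract can be illustrated with a minimal rejection-sampling sketch. It is not the patented implementation. It assumes the detector is fully characterized by a true positive rate and a false positive rate measured on a base data set, that candidate base rates are drawn uniformly from a bounded range, and that synthetic data sets are kept when their detection count falls within a fixed tolerance of the observed count. The function names, the prior range, the tolerance, and the example numbers are all hypothetical.

```python
import random


def simulate_data_set(n, base_rate, tpr, fpr, rng):
    """Simulate one synthetic data set; return (detections, malware, missed)."""
    detections = malware = missed = 0
    for _ in range(n):
        if rng.random() < base_rate:          # sample is malicious
            malware += 1
            if rng.random() < tpr:
                detections += 1               # true positive
            else:
                missed += 1                   # false negative (missed detection)
        elif rng.random() < fpr:
            detections += 1                   # false positive on a benign sample
    return detections, malware, missed


def estimate_metrics(observed, n, tpr, fpr, trials=2000, tolerance=5,
                     max_base_rate=0.2, seed=0):
    """Keep synthetic data sets whose detection count is close to `observed`."""
    rng = random.Random(seed)
    kept = []
    for _ in range(trials):
        # Assumption: a uniform prior over candidate base rates.
        base_rate = rng.uniform(0.0, max_base_rate)
        detections, malware, missed = simulate_data_set(n, base_rate, tpr, fpr, rng)
        if abs(detections - observed) <= tolerance:
            kept.append((malware / n, missed))
    if not kept:
        raise ValueError("no synthetic data set matched; widen the tolerance")
    rates = sorted(rate for rate, _ in kept)
    low = rates[int(0.025 * (len(rates) - 1))]
    high = rates[int(0.975 * (len(rates) - 1))]
    return {
        "accepted": len(kept),
        "base_rate_mean": sum(rates) / len(rates),
        "base_rate_interval": (low, high),   # empirical 95% percentile interval
        "missed_mean": sum(m for _, m in kept) / len(kept),
    }


# Hypothetical example: 40 detections observed among 500 live samples.
result = estimate_metrics(observed=40, n=500, tpr=0.9, fpr=0.01)
print(result)
```

The accepted synthetic data sets play the role of the representative group: their malware fractions give an empirical distribution for the base rate, and their missed-detection counts give an estimate of false negatives for the live data.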
Opening claim text (preview).
What is claimed is:

1. A computer program product comprising computer executable code embodied in a non-transitory computer readable medium that, when executing on one or more computing devices, performs the steps of: evaluating a true positive rate and a false positive rate for a malware detection system, the true positive rate corresponding to an accurate detection of malware by the malware detection system in a base data set and the false positive rate corresponding to an erroneous detection of malware in the base data set by the malware detection system, the base data set labeled with a known composition of malicious code instances and the base data set having a base rate of malware instances; applying the malware detection system to a new data set to determine a first number of detections within the new data set; generating a number of synthetic data sets with an estimation engine based on a distribution of malware instances within the base data set; selecting a representative group from the number of synthetic data sets that produce a corresponding set of numbers of detection when analyzed with the malware detection system similar to the first number of detections produced within the new data set when analyzed with the malware detection system, wherein the corresponding set of numbers of detection are each within a predetermined threshold of the first number of detections, and wherein the predetermined threshold is a relative threshold scaled according to a ratio of a size of the new data set to the size of each of the synthetic data sets; and determining a malware detection metric for the new data set based on a statistical composition of the representative group.
2. The computer program product of claim 1, wherein the predetermined threshold is an absolute numerical threshold.
3. The computer program product of claim 1, wherein the new data set includes live samples analyzed for an enterprise by the malware detection system.
4. The computer program product of claim 1, wherein evaluating the true positive rate and the false positive rate for the malware detection system includes measuring the true positive rate and the false positive rate for the malware detection system when applied to a base data set having a known composition of malware instances and benign instances.
5. The computer program product of claim 1, wherein the malware detection system is a machine learning model trained to detect malware based on a training data set, further wherein each software instance in the training data set is labeled to indicate a malware status.
6. The computer program product of claim 1, further comprising computer executable code that, when executed, performs the step of updating the true positive rate and the false positive rate based on additional software instances received by the malware detection system and automatically labeled by the malware detection system as safe or malicious.
7. A method comprising: evaluating a true detection rate and a false detection rate for a malware detection system when applied to a base data set having a known composition of malicious code instances; applying the malware detection system to a new data set to determine a first detection rate for the new data set; generating a number of synthetic data sets based on one or more properties of the base data set; selecting a representative group from the number of synthetic data sets that produce a corresponding detection rate when analyzed with the malware detection system similar to the first detection rate within the new data set when analyzed with the malware detection system; and determining a malware detection metric for the new data set based on a statistical composition of the representative group selected from the number of synthetic data sets, wherein the malware detection metric includes at least one of a probability distribution for an estimated base rate of malware instances for the new data set and a confidence interval for the estimated base rate of malware instances.
8. The method of claim 7, further comprising adjusting a security parameter used by a threat management facility to manage security of an enterprise network based on the malware detection metric.
9. The method of claim 7, wherein the malware detection metric includes an estimated base rate of malware instances for the new data set.
10. The method of claim 7, wherein the malware detection metric includes at least one of an estimated true positive rate for the new data set and an estimated false positive for the new data set.
11. The method of claim 7, wherein the malware detection metric includes an estimated number of missed detections for the new data set.
12. The method of claim 7, wherein the malware detection metric includes a ratio of true positives to false positives for the new data set.
13. The method of claim 7, wherein the malware detection system includes a machine learning model trained to detect malware based on a training data set.
14. The method of claim 13, wherein each software instance in the training data set is labeled to indicate a malware status.
15. A system comprising: a memory storing a detection model having a true detection rate and a false detection rate for identifying malware when applied to a base data set having a known malware composition; a malware detection system configured to apply the detection model to a new data set to determine a rate of malware first number of detections occurring within the new data set; an estimation engine configured to synthesize a number of synthetic data sets based on properties of the base data set, and to select a representative group from the number of synthetic data sets that produces a similar rate of malware when analyzed with the malware detection system to the new data set when analyzed with the malware detection system, wherein the estimation engine synthesizes the number of synthetic data sets using a Sequential Monte Carlo simulation to randomly draw samples from the base data set and beta-weighting a result with an increasing beta until an explained sum of squares is within a predetermined threshold of a target; and a scoring engine to calculate one or more malware metrics for the new data set based on the representative group.
16. The system of claim 15, wherein the estimation engine synthesizes the number of synthetic data sets using a Metropolis-Hastings algorithm to randomly draw candidates from a proposal distribution and conditionally include each randomly drawn candidate using a probability function.
17. The system of claim 15, wherein the detection model includes a machine learning model trained to detect malware using malware labels for a training data set.
18. The system of claim 15, further comprising a threat management facility configured to adjust a tuning parameter to control a sensitivity for detection of or response to threats based on the one or more malware metrics.
19. The system of claim 15, wherein the one or more malware metrics for the new data set includes an estimated true positive rate for the new data set.
20. The system of claim 15, wherein the one or more malware metrics for the new data set includes an estimated false positive rate for the new data set.
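Claim 16 refers to a Metropolis-Hastings algorithm that draws candidates from a proposal distribution and conditionally includes each one using a probability function. The sketch below shows that accept/reject mechanism in a deliberately simplified setting, and it should not be read as the claimed system: it samples candidate base rates rather than whole synthetic data sets, it assumes a binomial likelihood for the observed detection count with detection probability `base_rate * tpr + (1 - base_rate) * fpr`, a flat prior on [0, 1], and a Gaussian random-walk proposal. All names, step sizes, and example numbers are hypothetical, and it assumes 0 < fpr < tpr < 1.

```python
import math
import random


def log_likelihood(base_rate, detections, n, tpr, fpr):
    """Binomial log-likelihood (up to a constant) of the observed detection count."""
    p = base_rate * tpr + (1.0 - base_rate) * fpr   # chance that any sample is flagged
    return detections * math.log(p) + (n - detections) * math.log(1.0 - p)


def metropolis_hastings(detections, n, tpr, fpr, steps=20000, burn_in=2000,
                        step_size=0.02, seed=0):
    """Random-walk Metropolis-Hastings over the base rate."""
    rng = random.Random(seed)
    current = 0.5
    current_ll = log_likelihood(current, detections, n, tpr, fpr)
    samples = []
    for step in range(steps):
        # Draw a candidate from a symmetric Gaussian proposal around the current state.
        candidate = current + rng.gauss(0.0, step_size)
        if 0.0 <= candidate <= 1.0:                 # flat prior: reject outside [0, 1]
            candidate_ll = log_likelihood(candidate, detections, n, tpr, fpr)
            # Accept with probability min(1, likelihood ratio).
            accept_prob = math.exp(min(0.0, candidate_ll - current_ll))
            if rng.random() < accept_prob:
                current, current_ll = candidate, candidate_ll
        if step >= burn_in:
            samples.append(current)                 # a rejected move repeats the state
    return samples


# Hypothetical example: 40 detections observed among 500 samples.
samples = metropolis_hastings(detections=40, n=500, tpr=0.9, fpr=0.01)
posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 3))
```

Because the proposal is symmetric, the acceptance probability reduces to the likelihood ratio. The retained samples approximate a probability distribution for the base rate, from which an interval can be read off by taking percentiles.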
for managing network security; network security policies in general (filtering policies H04L63/0227) · CPC title
Machine learning · CPC title
Test or assess a computer or a system · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Computer malware detection or handling, e.g. anti-virus arrangements · CPC title