Automated malware family signature generation

US12170679B2 · US · B2

Patent metadata
Publication number: US-12170679-B2
Application number: US-202318141789-A
Country: US
Kind code: B2
Filing date: May 1, 2023
Priority date: Aug 28, 2017
Publication date: Dec 17, 2024
Grant date: Dec 17, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A set of metadata associated with a plurality of samples is received. The samples are clustered. For members of a first cluster, a set of similarities shared among at least a portion of the members of the first cluster is determined. An outlier cluster member is identified within the first cluster, and in response, additional analysis is caused to be performed on the outlier cluster member.
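The flow in the abstract can be illustrated with a minimal sketch. It assumes each sample's metadata has already been reduced to a set of (name, value) pairs and that clustering has already produced one cluster. The function names, the 80% sharing threshold, and the 50% overlap cutoff are hypothetical choices made for illustration; the abstract does not say how an outlier is identified.

```python
from collections import Counter

def shared_features(cluster, min_fraction=0.8):
    """Return features present in at least min_fraction of the cluster members."""
    counts = Counter(f for feats in cluster.values() for f in feats)
    needed = min_fraction * len(cluster)
    return {f for f, n in counts.items() if n >= needed}

def find_outliers(cluster, shared, min_overlap=0.5):
    """Flag members that hold less than min_overlap of the shared features."""
    if not shared:
        return []
    return sorted(sid for sid, feats in cluster.items()
                  if len(feats & shared) / len(shared) < min_overlap)

# Hypothetical cluster: sample id -> set of (name, value) metadata pairs.
cluster = {
    "s1": {("mutex", "abc"), ("dns", "evil.example"), ("file", "a.tmp")},
    "s2": {("mutex", "abc"), ("dns", "evil.example"), ("file", "b.tmp")},
    "s3": {("mutex", "abc"), ("dns", "evil.example")},
    "s4": {("mutex", "abc"), ("dns", "evil.example"), ("reg", "Run")},
    "s5": {("mutex", "zzz"), ("file", "c.tmp")},
}

shared = shared_features(cluster)
outliers = find_outliers(cluster, shared)  # candidates for additional analysis
```

Here four of the five members share the same mutex and DNS values, so those two pairs form the shared set, and the fifth member, which has neither, is the one a system of this kind might route to further analysis.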

First claim

Opening claim text (preview).

What is claimed is:
1. A system, comprising: a processor configured to: receive a set of metadata associated with a plurality of samples; cluster the plurality of samples; determine, for members of a first cluster, a set of similarities shared among at least a portion of the members of the first cluster, including by removing metadata that could negatively affect similarity measurements; and identify an outlier cluster member within the first cluster, and in response to the identifying, cause additional analysis to be performed on the outlier cluster member; and a memory coupled to the processor and configured to provide the processor with instructions.
2. The system of claim 1, wherein the processor is further configured to determine, for a first sample included in the plurality of samples, a set of features comprising name-value pairs.
3. The system of claim 2, wherein determining the set of features includes performing a tokenization.
4. The system of claim 1, wherein the processor is further configured to assign weights to a set of tokens.
5. The system of claim 4, wherein the weights are assigned using term frequency-inverse document frequency analysis.
6. The system of claim 1, wherein the processor is further configured to generate a vector list that indicates, for a given sample, a set of tokens hit by the sample.
7. The system of claim 1, wherein clustering the plurality of samples includes performing multiple rounds of k-means clustering and selecting as output those clusters with consistent membership across the multiple rounds.
8. The system of claim 1, wherein determining the set of similarities includes determining a portion of metadata that is present in all members of the first cluster.
9. The system of claim 8, further comprising comparing a size of the first cluster to a number of samples in a corpus that also includes the portion of metadata.
10. The system of claim 1, wherein the processor is further configured to iteratively perform (1) the clustering, (2) the determining, and (3) evaluating the set of similarities for suitability as a malware family signature until a low-quality threshold is reached.
11. The system of claim 10, wherein the processor is further configured to exclude metadata associated with samples for which malware signatures were assigned in a previous iteration, prior to performing a current iteration.
12. The system of claim 1, wherein the processor is further configured to provide as output a list of malware samples matching a generated malware family signature.
13. A method, comprising: receiving a set of metadata associated with a plurality of samples; clustering the plurality samples; determining, for members of a first cluster, a set of similarities shared among at least a portion of the members of the first cluster, including by removing metadata that could negatively affect similarity measurements; and identifying an outlier cluster member within the first cluster, and in response to the identifying, causing additional analysis to be performed on the outlier cluster member.
14. A computer program product embodied in a tangible computer readable storage medium and comprising computer instructions for: receiving a set of metadata associated with a plurality of samples; clustering the plurality of samples; determining, for members of a first cluster, a set of similarities shared among at least a portion of the members of the first cluster, including by removing metadata that could negatively affect similarity measurements; and identifying an outlier cluster member within the first cluster, and in response to the identifying, causing additional analysis to be performed on the outlier cluster member.
15. The method of claim 13, further comprising determining, for a first sample included in the plurality of samples, a set of features comprising name-value pairs.
16. The method of claim 15, wherein determining the set of features includes performing a tokenization.
17. The method of claim 13, further comprising assigning weights to a set of tokens.
18. The method of claim 17, wherein the weights are assigned using term frequency-inverse document frequency analysis.
19. The method of claim 13, further comprising generating a vector list that indicates, for a given sample, a set of tokens hit by the sample.
20. The method of claim 13, wherein clustering the plurality of samples includes performing multiple rounds of k-means clustering and selecting as output those clusters with consistent membership across the multiple rounds.
21. The method of claim 13, wherein determining the set of similarities includes determining a portion of metadata that is present in all members of the first cluster.
22. The method of claim 21, further comprising comparing a size of the first cluster to a number of samples in a corpus that also includes the portion of metadata.
23. The method of claim 13, wherein: (1) the clustering, (2) the determining, and (3) evaluating the set of similarities for suitability as a malware family signature is iteratively performed until a low-quality threshold is reached.
24. The method of claim 23, further comprising excluding metadata associated with samples for which malware signatures were assigned in a previous iteration, prior to performing a current iteration.
25. The method of claim 13, further comprising providing as output a list of malware samples matching a generated malware family signature.
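Several dependent claims describe steps that are easier to follow with a small example: weighting tokens with term frequency-inverse document frequency (claims 4 and 5), keeping only clusters whose membership is consistent across clustering rounds (claim 7), taking the metadata present in all members of a cluster (claim 8), and comparing the cluster size with the number of corpus samples that contain that metadata (claim 9). The sketch below is one possible reading, not the patented implementation. All function names, the "name=value" token format, and the sample data are hypothetical, and the three rounds are hard-coded stand-ins for the output of separate k-means runs rather than computed here.

```python
import math
from collections import Counter

def tfidf_weights(samples):
    """Weight each token per sample; tf is 1 because a token is either hit or not."""
    n = len(samples)
    df = Counter(tok for toks in samples.values() for tok in toks)
    return {sid: {tok: math.log(n / df[tok]) for tok in toks}
            for sid, toks in samples.items()}

def consistent_clusters(rounds):
    """Keep clusters whose exact membership appears in every clustering round."""
    per_round = [{frozenset(c) for c in r} for r in rounds]
    return set.intersection(*per_round)

def candidate_signature(cluster, samples):
    """Tokens present in all members of the cluster."""
    return set.intersection(*(samples[sid] for sid in cluster))

def corpus_hits(signature, corpus):
    """Count corpus samples that contain every token of the signature."""
    return sum(1 for toks in corpus.values() if signature <= toks)

# Hypothetical samples: id -> set of "name=value" tokens.
samples = {
    "a1": {"mutex=abc", "dns=evil.example", "file=a.tmp"},
    "a2": {"mutex=abc", "dns=evil.example", "file=b.tmp"},
    "a3": {"mutex=abc", "dns=evil.example"},
    "b1": {"ua=bot/1.0", "port=4444"},
    "b2": {"ua=bot/1.0", "port=4444", "file=a.tmp"},
    "c1": {"dns=evil.example", "port=8080"},
}

# Assumed results of three k-means runs (k=3) with different initializations.
rounds = [
    [{"a1", "a2", "a3"}, {"b1", "b2"}, {"c1"}],
    [{"a1", "a2", "a3"}, {"b1"}, {"b2", "c1"}],
    [{"a1", "a2", "a3"}, {"b1", "b2"}, {"c1"}],
]

weights = tfidf_weights(samples)
stable = consistent_clusters(rounds)
cluster = next(iter(stable))
signature = candidate_signature(cluster, samples)
hits = corpus_hits(signature, samples)  # the sample set doubles as the corpus here
```

In this toy data only the a1/a2/a3 cluster survives all three rounds. Its candidate signature is the mutex and DNS pair, and the corpus count equals the cluster size, which is the kind of agreement the comparison in claim 9 would check for. A count much larger than the cluster would suggest the shared metadata is too generic to describe one family.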

Assignees

Palo Alto Networks Inc

Inventors

Classifications

  • Machine learning · CPC title

  • H04L63/145 (Primary)

    the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms · CPC title

  • Anti-malware arrangements, e.g. protection against SMS fraud or mobile malware · CPC title

  • Event detection, e.g. attack signature detection · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12170679B2 cover?
A set of metadata associated with a plurality of samples is received. The samples are clustered. For members of a first cluster, a set of similarities shared among at least a portion of the members of the first cluster is determined. An outlier cluster member is identified within the first cluster, and in response, additional analysis is caused to be performed on the outlier cluster member.
Who is the assignee on this patent?
Palo Alto Networks Inc
What technology area does this patent fall under?
Primary CPC classification H04L63/145. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Dec 17, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).