Malware analysis and detection using graph-based characterization and machine learning
US-2017068816-A1 · Mar 9, 2017 · US
US10917415B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10917415-B2 |
| Application number | US-201815867251-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jan 10, 2018 |
| Priority date | Jan 10, 2018 |
| Publication date | Feb 9, 2021 |
| Grant date | Feb 9, 2021 |
Official abstract text for this publication.
A technique includes processing a plurality of sets of program code to extract call graphs; determining similarities between the call graphs; applying unsupervised machine learning to an input formed from the determined similarities to determine latent features of the input; clustering the determined latent features; and determining a characteristic of a given program code set of the plurality of program code sets based on a result of the clustering.
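As a rough illustration of the last steps in the abstract (normalizing distances, clustering, and characterizing one sample from its cluster-mates), the following Python sketch may help. It is not the patented implementation: the distance values, labels, and function names are hypothetical, and the unsupervised latent-feature step is skipped here, so each normalized matrix row is used directly as a feature vector. Initial k-means centers are picked farthest-first, an assumption made only to keep the run deterministic.

```python
def normalize(dist):
    """Scale every distance into [0, 1] by dividing by the largest entry."""
    peak = max(max(row) for row in dist)
    return [[d / peak for d in row] for row in dist] if peak else dist

def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    # Farthest-first initialization (an assumption, chosen for determinism).
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def characterize(assign, labels, target):
    """Majority label among the target's labelled cluster-mates, or None."""
    mates = [labels[i] for i, a in enumerate(assign)
             if a == assign[target] and i != target and labels[i] is not None]
    return max(set(mates), key=mates.count) if mates else None

# Hypothetical pairwise distances between five program code sets.
dist = [
    [0, 2, 9, 8, 1],
    [2, 0, 8, 9, 2],
    [9, 8, 0, 1, 9],
    [8, 9, 1, 0, 8],
    [1, 2, 9, 8, 0],
]
# Hypothetical known labels; the last sample is the one to characterize.
labels = ["malicious", "malicious", "benign", "benign", None]

assign = kmeans(normalize(dist), k=2)
verdict = characterize(assign, labels, target=4)
print(assign, verdict)
```

In a fuller pipeline, the rows passed to `kmeans` would be the latent features produced by the unsupervised learning step, not the raw normalized distances.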
Opening claim text (preview).
What is claimed is:
1. A method comprising: processing a plurality of program code sets to extract call graphs; determining similarities between the call graphs; applying unsupervised machine learning to an input formed from the determined similarities to determine latent features of the input; clustering the determined latent features; and determining a characteristic of a given program code set of the plurality of program code sets based on a result of the clustering.
2. The method of claim 1, wherein determining similarities between the call graphs comprises applying seeded graph matching to the plurality of program code sets to determine distances between pairs of the plurality of program code sets.
3. The method of claim 2, wherein determining distances between the program code sets comprises generating a matrix.
4. The method of claim 3, wherein generating the matrix comprises generating a similarity matrix.
5. The method of claim 3, wherein generating the matrix comprises generating a matrix in which each row of the matrix is associated with a program code set of the plurality of program code sets, each columns of the matrix is associated with a program code set of the plurality of program code sets, a given element of the matrix is associated a pair of the program code sets of the plurality of program code sets and represents a distance between the pair.
6. The method of claim 2, wherein applying seeded graph matching comprises applying a Fast Approximate Quadratic (FAQ) assignment algorithm.
7. The method of claim 1, wherein determining the similarities comprises determining distances between the call graphs, and the method further comprises normalizing the distances to generate the input for the unsupervised machine learning.
8. The method of claim 1, wherein applying the unsupervised machine learning comprises applying deep neural network learning.
9. The method of claim 1, wherein clustering the determined latent features comprises applying k-means clustering.
10. The method of claim 1, wherein determining the characteristic comprises identifying a characteristic associated with malicious software.
11. The method of claim 10, further comprising taking corrective action against the given program code set in response to identifying the characteristic.
12. The method of claim 11, wherein taking corrective action comprises quarantining the given program code set.
13. A non-transitory storage medium storing instructions that, when executed by a processor-based machine, cause a processor to: access data representing control flow graphs, wherein each control flow graph represents a set of machine executable instructions of a plurality of sets of machine executable instructions; determine a similarity matrix based on the control flow graphs; apply neural network-based machine learning to, based on the similarity matrix, determine features of the plurality of sets of machine executable instructions shared in common; cluster the features; and determine a characteristic of a given set of machine executable instructions of the plurality of sets of machine executable instructions based on a result of the clustering.
14. The storage medium of claim 13, wherein the instructions, when executed by the processor, cause the processor to identify the given set of machine executable instructions of the plurality of sets of machine executable instructions as associated with malicious activity based on the determined features.
15. The storage medium of claim 13, wherein the instructions, when executed by the processor, cause the processor to determine the similarity matrix based on seeded graph matching.
16. The storage medium of claim 13, wherein the instructions, when executed by the processor, cause the processor to: train a sparse autoencoder to determine the features; and cluster the sets of machine executable instructions based on the determined features.
17. An apparatus comprising: a processor; and a storage medium to store instructions that, when executed by the processor, cause the processor to: apply seeded graph matching to call graphs associated with a plurality of program code sets to determine distances among the call graphs; apply unsupervised machine learning to the distances to determine latent features of the call graphs; cluster the determined latent features to form a plurality of clusters, wherein each cluster is associated with at least one of the plurality of program code sets, a first program code set is associated with a given cluster of the plurality of clusters, and the given cluster is associated with at least one other program code set of the plurality of program code sets; and characterize the first program code set based on the least one other program code set of the plurality of program code sets.
18. The apparatus of claim 17, wherein the instructions, when executed by the processor, cause the processor to selectively take corrective action based on the characterization of the first program code set.
19. The apparatus of claim 17, wherein the instructions, when executed by the processor, cause the processor to: build a sparse autoencoder; and use back propagation to train the sparse autoencoder to determine the latent features of the call graphs.
20. The apparatus of claim 19, wherein the instructions, when executed by the processor, cause the processor to: determine hidden layers of the sparse autoencoder to reconstruct state of inputs to the hidden layers.
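To make the distance matrix of claims 2 through 5 more concrete, here is a minimal Python sketch of seeded graph matching on tiny call graphs. It is an illustration under stated assumptions, not the claimed method: the graphs, the seed pairing of vertex 0 (imagined as each program's entry point), and all function names are hypothetical, every graph is assumed to have the same number of vertices, and exhaustive search stands in for an approximate solver such as FAQ, which would be needed for graphs of realistic size.

```python
from itertools import permutations

def edge_disagreements(a, b, mapping):
    """Count ordered vertex pairs whose adjacency differs once A is mapped onto B."""
    n = len(a)
    return sum(a[i][j] != b[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n))

def seeded_match_distance(a, b, seeds):
    """Smallest disagreement count over all matchings that respect the seeds.

    `seeds` maps a vertex of A to its known counterpart in B. Brute force
    is only viable for tiny graphs (an assumption of this sketch).
    """
    n = len(a)
    free_a = [v for v in range(n) if v not in seeds]
    free_b = [v for v in range(n) if v not in seeds.values()]
    best = None
    for perm in permutations(free_b):
        mapping = dict(seeds)
        mapping.update(zip(free_a, perm))
        cost = edge_disagreements(a, b, mapping)
        best = cost if best is None else min(best, cost)
    return best

# Hypothetical call graphs as adjacency matrices: g[i][j] == 1 means
# function i calls function j.
g1 = [[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
g2 = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]  # g1 relabeled
g3 = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]  # a call chain

graphs = [g1, g2, g3]
seeds = {0: 0}  # assume the entry points are known to correspond

# Row r and column c both index program code sets; each element is the
# distance between that pair.
matrix = [[seeded_match_distance(a, b, seeds) for b in graphs] for a in graphs]
print(matrix)  # [[0, 0, 2], [0, 0, 2], [2, 2, 0]]
```

Here `g1` and `g2` differ only in vertex numbering, so their distance is 0, while the chain `g3` disagrees with either of them on two edges.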
Architecture, e.g. interconnection topology · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Quantised networks; Sparse networks; Compressed networks · CPC title
Machine learning · CPC title
Static detection · CPC title