Machine learning-based determination of program code characteristics

US10917415B2 · US · B2

Patent metadata
Publication number: US-10917415-B2
Application number: US-201815867251-A
Country: US
Kind code: B2
Filing date: Jan 10, 2018
Priority date: Jan 10, 2018
Publication date: Feb 9, 2021
Grant date: Feb 9, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A technique includes processing a plurality of sets of program code to extract call graphs; determining similarities between the call graphs; applying unsupervised machine learning to an input formed from the determined similarities to determine latent features of the input; clustering the determined latent features; and determining a characteristic of a given program code set of the plurality of program code sets based on a result of the clustering.
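The "determining similarities" step and the formation of the machine learning input can be sketched in a few lines. This is a minimal illustration, not the patented method: the claims name seeded graph matching for computing distances, whereas the sketch swaps in a much simpler stand-in (counting call edges that appear in only one of two graphs), which assumes function names are comparable across program code sets. The call graphs, the function names, and the max-based normalization are all hypothetical choices made for the example.

```python
from itertools import combinations


def edge_set(call_graph):
    """Flatten {caller: [callees]} into a set of (caller, callee) edges."""
    return {(caller, callee)
            for caller, callees in call_graph.items()
            for callee in callees}


def edge_distance(graph_a, graph_b):
    """Stand-in distance (assumption): edges present in only one graph.

    The patent's claims name seeded graph matching instead; this simpler
    measure only works when node names line up across graphs.
    """
    return len(edge_set(graph_a) ^ edge_set(graph_b))


def distance_matrix(call_graphs):
    """Square matrix: row i and column j each index a program code set."""
    n = len(call_graphs)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = float(edge_distance(call_graphs[i], call_graphs[j]))
        matrix[i][j] = matrix[j][i] = d
    return matrix


def normalize(matrix):
    """Scale distances into [0, 1] by dividing by the largest entry."""
    largest = max(max(row) for row in matrix)
    if largest == 0:
        return [row[:] for row in matrix]
    return [[d / largest for d in row] for row in matrix]


# Hypothetical call graphs for three program code sets.
graphs = [
    {"main": ["read", "encrypt"], "encrypt": ["send"]},
    {"main": ["read", "encrypt"], "encrypt": ["send", "log"]},
    {"main": ["draw"], "draw": ["flush"]},
]

dist = distance_matrix(graphs)
norm = normalize(dist)  # each row could serve as one input vector
```

Each row of the normalized matrix describes one program code set by its distance to every other set, which is the kind of input the abstract describes feeding to unsupervised learning.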

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: processing a plurality of program code sets to extract call graphs; determining similarities between the call graphs; applying unsupervised machine learning to an input formed from the determined similarities to determine latent features of the input; clustering the determined latent features; and determining a characteristic of a given program code set of the plurality of program code sets based on a result of the clustering.

2. The method of claim 1, wherein determining similarities between the call graphs comprises applying seeded graph matching to the plurality of program code sets to determine distances between pairs of the plurality of program code sets.

3. The method of claim 2, wherein determining distances between the program code sets comprises generating a matrix.

4. The method of claim 3, wherein generating the matrix comprises generating a similarity matrix.

5. The method of claim 3, wherein generating the matrix comprises generating a matrix in which each row of the matrix is associated with a program code set of the plurality of program code sets, each columns of the matrix is associated with a program code set of the plurality of program code sets, a given element of the matrix is associated a pair of the program code sets of the plurality of program code sets and represents a distance between the pair.

6. The method of claim 2, wherein applying seeded graph matching comprises applying a Fast Approximate Quadratic (FAQ) assignment algorithm.

7. The method of claim 1, wherein determining the similarities comprises determining distances between the call graphs, and the method further comprises normalizing the distances to generate the input for the unsupervised machine learning.

8. The method of claim 1, wherein applying the unsupervised machine learning comprises applying deep neural network learning.

9. The method of claim 1, wherein clustering the determined latent features comprises applying k-means clustering.

10. The method of claim 1, wherein determining the characteristic comprises identifying a characteristic associated with malicious software.

11. The method of claim 10, further comprising taking corrective action against the given program code set in response to identifying the characteristic.

12. The method of claim 11, wherein taking corrective action comprises quarantining the given program code set.

13. A non-transitory storage medium storing instructions that, when executed by a processor-based machine, cause a processor to: access data representing control flow graphs, wherein each control flow graph represents a set of machine executable instructions of a plurality of sets of machine executable instructions; determine a similarity matrix based on the control flow graphs; apply neural network-based machine learning to, based on the similarity matrix, determine features of the plurality of sets of machine executable instructions shared in common; cluster the features; and determine a characteristic of a given set of machine executable instructions of the plurality of sets of machine executable instructions based on a result of the clustering.

14. The storage medium of claim 13, wherein the instructions, when executed by the processor, cause the processor to identify the given set of machine executable instructions of the plurality of sets of machine executable instructions as associated with malicious activity based on the determined features.

15. The storage medium of claim 13, wherein the instructions, when executed by the processor, cause the processor to determine the similarity matrix based on seeded graph matching.

16. The storage medium of claim 13, wherein the instructions, when executed by the processor, cause the processor to: train a sparse autoencoder to determine the features; and cluster the sets of machine executable instructions based on the determined features.

17. An apparatus comprising: a processor; and a storage medium to store instructions that, when executed by the processor, cause the processor to: apply seeded graph matching to call graphs associated with a plurality of program code sets to determine distances among the call graphs; apply unsupervised machine learning to the distances to determine latent features of the call graphs; cluster the determined latent features to form a plurality of clusters, wherein each cluster is associated with at least one of the plurality of program code sets, a first program code set is associated with a given cluster of the plurality of clusters, and the given cluster is associated with at least one other program code set of the plurality of program code sets; and characterize the first program code set based on the least one other program code set of the plurality of program code sets.

18. The apparatus of claim 17, wherein the instructions, when executed by the processor, cause the processor to selectively take corrective action based on the characterization of the first program code set.

19. The apparatus of claim 17, wherein the instructions, when executed by the processor, cause the processor to: build a sparse autoencoder; and use back propagation to train the sparse autoencoder to determine the latent features of the call graphs.

20. The apparatus of claim 19, wherein the instructions, when executed by the processor, cause the processor to: determine hidden layers of the sparse autoencoder to reconstruct state of inputs to the hidden layers.
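The clustering and characterization steps (k-means in claim 9, and characterizing one program code set from the other members of its cluster in claim 17) can be illustrated with a small sketch. Everything concrete here is an assumption made for the example: the two-dimensional latent feature vectors stand in for the output of a trained autoencoder, the labels are invented, the first k points seed the centroids so the run is deterministic, and a majority vote is just one plausible way to characterize a set from its cluster co-members.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment


def characterize(target, assignment, known_labels):
    """Majority label among the target's labeled cluster co-members.

    Returns None when no other member of the cluster has a known label.
    """
    votes = {}
    for i, cluster in enumerate(assignment):
        if i != target and cluster == assignment[target] and i in known_labels:
            label = known_labels[i]
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical latent features, one vector per program code set.
features = [[0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.85, 0.9], [0.12, 0.18]]
# Hypothetical prior knowledge about some of the program code sets.
known_labels = {0: "malicious", 1: "benign", 2: "malicious"}

clusters = kmeans(features, k=2)
verdict = characterize(4, clusters, known_labels)
```

In this toy data, program code set 4 shares a cluster with sets 0 and 2, so it inherits their label; a corrective action such as the quarantining mentioned in claim 12 would be triggered from a verdict like this one.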

Assignees

Intel Corp

Inventors

Classifications

  • Architecture, e.g. interconnection topology

  • Auto-encoder networks; Encoder-decoder networks

  • Quantised networks; Sparse networks; Compressed networks

  • Machine learning

  • G06F21/562 (Primary)

    Static detection

Frequently asked questions

What does patent US10917415B2 cover?
A technique includes processing a plurality of sets of program code to extract call graphs; determining similarities between the call graphs; applying unsupervised machine learning to an input formed from the determined similarities to determine latent features of the input; clustering the determined latent features; and determining a characteristic of a given program code set of the plurality …
Who is the assignee on this patent?
Intel Corp
What technology area does this patent fall under?
Primary CPC classification G06F21/562. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 9, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).