Classifying software scripts utilizing deep learning networks

US10581888B1 · US · B1

Patent metadata
Publication number: US-10581888-B1
Application number: US-201715664925-A
Country: US
Kind code: B1
Filing date: Jul 31, 2017
Priority date: Jul 31, 2017
Publication date: Mar 3, 2020
Grant date: Mar 3, 2020

Abstract

Official abstract text for this publication.

A method includes generating a tokenized representation of a given software script, the tokenized representation comprising two or more tokens representing two or more commands in the given software script. The method also includes mapping the tokens of the tokenized representation to a vector space providing contextual representation of the tokens utilizing an embedding layer of a deep learning network, detecting sequences of the mapped tokens representing sequences of commands associated with designated types of script behavior utilizing at least one hidden layer of the deep learning network, and classifying the given software script based on the detected sequences of the mapped tokens utilizing one or more classification layers of the deep learning network. The method further includes modifying access by a given client device to the given software script responsive to classifying the given software script as a given software script type.
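
The stages named in the abstract (tokenize, embed, detect sequences, classify, modify access) can be illustrated with a minimal pure-Python sketch. Everything specific here is a hypothetical assumption made for illustration: the vocabulary, the delimiter pattern, the hand-picked embedding and filter weights, and the 0.5 threshold. The patent text on this page gives none of these values, and a real deep learning network would learn its weights from labeled scripts rather than use fixed numbers.

```python
import math
import re

# Hypothetical vocabulary of known commands; 0 is the designated "unknown" value.
VOCAB = {"powershell": 1, "-enc": 2, "invoke-expression": 3, "write-host": 4}
UNKNOWN = 0

# Hypothetical 2-dimensional embeddings, one row per token value (row 0 = unknown).
EMBED = [
    [0.0, 0.0],   # unknown
    [0.9, 0.1],   # powershell
    [0.8, 0.9],   # -enc
    [0.7, 1.0],   # invoke-expression
    [0.1, -0.5],  # write-host
]

# One hypothetical convolution filter spanning two consecutive tokens.
FILTER = [[0.5, 1.0], [0.5, 1.0]]
FILTER_BIAS = -1.0
# Hypothetical single-neuron classification layer.
DENSE_WEIGHT, DENSE_BIAS = 4.0, -2.0


def tokenize(script):
    """Split on whitespace and a few delimiters, then map words to token values."""
    words = [w for w in re.split(r"[\s;|()]+", script.lower()) if w]
    return [VOCAB.get(w, UNKNOWN) for w in words]


def embed(tokens):
    """Embedding layer: look up one vector per token value."""
    return [EMBED[t] for t in tokens]


def conv_relu(vectors):
    """Hidden layer: slide the filter over adjacent token pairs, then apply ReLU."""
    width = len(FILTER)
    out = []
    for i in range(len(vectors) - width + 1):
        total = FILTER_BIAS
        for k in range(width):
            total += sum(w * x for w, x in zip(FILTER[k], vectors[i + k]))
        out.append(max(0.0, total))
    return out


def classify(script):
    """Return a confidence in [0, 1] that the script is of the flagged type."""
    activations = conv_relu(embed(tokenize(script)))
    pooled = max(activations, default=0.0)  # global max pooling
    return 1.0 / (1.0 + math.exp(-(DENSE_WEIGHT * pooled + DENSE_BIAS)))


def modify_access(script, threshold=0.5):
    """Decide an access action from the classification confidence."""
    return "block" if classify(script) >= threshold else "allow"


print(modify_access("powershell -enc invoke-expression payload"))
print(modify_access("write-host hello"))
```

With these made-up weights, the filter responds to the adjacent pair `-enc invoke-expression`, so the first script is blocked and the second is allowed. The point of the sketch is only the data flow between layers, not the particular numbers.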

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising:
    generating a tokenized representation of a given software script, the tokenized representation comprising two or more tokens representing two or more commands in the given software script;
    mapping the tokens of the tokenized representation to a vector space providing contextual representation of the tokens utilizing an embedding layer of a deep learning network;
    detecting sequences of the mapped tokens representing sequences of commands associated with designated types of script behavior utilizing at least one hidden layer of the deep learning network;
    classifying the given software script based on the detected sequences of the mapped tokens utilizing one or more classification layers of the deep learning network; and
    modifying access by a given client device to the given software script responsive to classifying the given software script as a given software script type;
    wherein generating the tokenized representation comprises:
    generating an array comprising a set of ordered token values corresponding to an order of the tokens in the given software script, wherein a given one of the token values comprises either (i) an index value representing a known script command in a vocabulary of known script commands of a scripting language utilized by the given software script or (ii) a designated value representing an unknown script command not in the vocabulary of known script commands;
    determining whether the array comprises a representation of a first type, the representation of the first type comprising at least a threshold number of consecutive instances of the designated value representing unknown script commands between a first token of the array comprising a first index value representing one of the known script commands and a second token of the array comprising a second index value representing one of the known script commands; and
    responsive to determining that the array comprises the representation of the first type, converting the representation of the first type to a representation of a second type different than the first type by altering at least one of an ordering and a number of token values in the array having the designated value representing unknown script commands such that there is less than the threshold number of consecutive instances of the designated value representing unknown script commands between the first token and the second token;
    wherein the method is performed by at least one processing device comprising a processor coupled to a memory.

2. The method of claim 1 wherein generating the tokenized representation of the given software script comprises tokenizing the given software script into the two or more tokens based on delimiters of a scripting language utilized by the given software script.

3. The method of claim 1 wherein converting the representation of the first type to the representation of the second type comprises: removing one or more of the tokens having the designated value between the first token and the second token; and padding the tokenized representation with an instance of the designated value for each removed token before the first token or after the second token.

4. The method of claim 1 wherein converting the representation of the first type to the representation of the second type comprises: identifying a sequence of two or more tokens having the designated value between the first token and the second token; removing one or more of the tokens having the designated value between the first token and the second token while leaving at least one token having the designated value between the first token and the second token; and padding the tokenized representation with an instance of the designated value for each removed token before the first token or after the second token.

5. The method of claim 1 wherein converting the representation of the first type to the representation of the second type comprises: moving tokens represented by index values in the vocabulary to one of a beginning and an end of the tokenized representation; and moving tokens having the designated value to the other one of the beginning and the end of the tokenized representation.

6. The method of claim 1 wherein the embedding layer of the deep learning network is configured to map the tokens to the vector space such that a distance between a given vector value of a given token representing a given command in the vocabulary of known script commands and an additional vector value of an additional token representing an additional command in the vocabulary of known script commands is based on a similarity of the given command and the additional command.

7. The method of claim 1 wherein the at least one hidden layer comprises a sequence of two or more hidden layers each comprising a convolutional layer.

8. The method of claim 1 wherein the at least one hidden layer comprises a convolutional layer comprising two or more convolution filters configured to activate in response to detecting a feature in the given software script.

9. The method of claim 8 wherein the convolutional layer applies Rectified Linear Units (ReLU) activation functions to its output.

10. The method of claim 8 wherein the given hidden layer further comprises a dropout layer configured to dropout random sets of activations in the convolutional layer.

11. The method of claim 8 wherein the given hidden layer further comprises a pooling layer configured to provide non-linear down-sampling of the output of the convolutional layer.

12. The method of claim 1 wherein the one or more classification layers comprise: a fully connected layer comprising neurons with connections to all activations in the at least one hidden layer; and an output layer comprising at least one neuron that generates a representation of a confidence level of the deep neural network in classifying the given software script as the given software script type.

13. The method of claim 12 wherein the designated software script type comprises malicious software scripts.

14. The method of claim 1 wherein modifying access by the given client device to the given software script comprises at least one of: removing the given software script from a memory or storage of the given client device; preventing the given client device from obtaining the given software script; and causing the given software script to be opened in a sandboxed application environment on the given client device.

15. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device cause the at least one processing device: to generate a tokenized representation of a given software script, the tokenized representation comprising two or more tokens representing two or more commands in the given software script; to map the tokens of the tokenized representation to a vector space providing contextual representation of the tokens utilizing an embedding layer of a deep learning network; to detect sequences of the mapped tokens representing sequences of commands associated with designated types of script behavior utilizing at least one hidden layer of the deep learning network; to classify the given software script based on the detected sequences of the mapped tokens utilizing one or more classification layers of the deep learning network; and to modify access by a given client device to the given software script responsive to classifying the given software script as a given softwa
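
The array conversion described in claims 1 and 3 to 5 can be sketched in a few lines of Python. This is one possible reading, not the patented implementation: the designated unknown value `0`, the function names, the `keep` parameter, and the choice to place all padding at the very end of the array (one way of padding "after the second token") are all assumptions made for illustration.

```python
UNKNOWN = 0  # hypothetical designated value for commands outside the vocabulary


def collapse_unknown_runs(tokens, threshold, keep=1):
    """Shorten long runs of UNKNOWN that sit between two known tokens.

    A run of `threshold` or more UNKNOWN values bounded by known tokens on
    both sides is cut down to `keep` values. One UNKNOWN is appended to the
    end of the array for each removed value, so the length is unchanged.
    """
    if not 0 <= keep < threshold:
        raise ValueError("keep must be non-negative and below threshold")
    result, removed = [], 0
    i, n = 0, len(tokens)
    while i < n:
        if tokens[i] != UNKNOWN:
            result.append(tokens[i])
            i += 1
            continue
        # Find the end of the current run of UNKNOWN values.
        j = i
        while j < n and tokens[j] == UNKNOWN:
            j += 1
        run = j - i
        bounded = i > 0 and j < n  # a known token on both sides of the run
        if bounded and run >= threshold:
            result.extend([UNKNOWN] * keep)
            removed += run - keep
        else:
            result.extend([UNKNOWN] * run)
        i = j
    return result + [UNKNOWN] * removed


def move_known_first(tokens):
    """Claim 5 style: known tokens to the beginning, UNKNOWN values to the end."""
    known = [t for t in tokens if t != UNKNOWN]
    return known + [UNKNOWN] * (len(tokens) - len(known))


tokens = [5, 0, 0, 0, 0, 7, 0, 9]
print(collapse_unknown_runs(tokens, threshold=3))  # [5, 0, 7, 0, 9, 0, 0, 0]
print(move_known_first(tokens))                    # [5, 7, 9, 0, 0, 0, 0, 0]
```

Keeping the array length fixed means a fixed-size network input still fits, while the known commands end up close enough together for a short convolution filter to see them in one window. Setting `keep=0` corresponds to removing the whole run (claim 3), and `keep=1` leaves one unknown token in place (claim 4).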

Assignees

EMC IP Holding Co LLC

Inventors

Classifications

  • H04L63/02 (Primary): for separating internal from external traffic, e.g. firewalls

  • Active attacks involving interception, injection, modification, spoofing of data unit addresses, e.g. hijacking, packet injection or TCP sequence number attacks

  • for managing network security; network security policies in general (filtering policies H04L63/0227)

  • Traffic logging, e.g. anomaly detection

  • Learning methods

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10581888B1 cover?
A method includes generating a tokenized representation of a given software script, the tokenized representation comprising two or more tokens representing two or more commands in the given software script. The method also includes mapping the tokens of the tokenized representation to a vector space providing contextual representation of the tokens utilizing an embedding layer of a deep learnin…
Who is the assignee on this patent?
EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification H04L63/02. Mapped technology areas include Electricity.
When was this patent published?
Publication date Mar 3, 2020 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).