Automatically generating malware definitions using word-level analysis
US-11222113-B1 · Jan 11, 2022 · US
US11853415B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11853415-B1 |
| Application number | US-202017116419-A |
| Country | US |
| Kind code | B1 |
| Filing date | Dec 9, 2020 |
| Priority date | Dec 12, 2019 |
| Publication date | Dec 26, 2023 |
| Grant date | Dec 26, 2023 |
Official abstract text for this publication.
Disclosed herein are methods, systems, and processes for context-based identification of anomalous log data. Log data with multiple original logs is received at an anomalous log data identification system. A context associated training dataset is generated by splitting a string in a log into multiple split strings, generating a context association between each split string and a unique key that corresponds to the log, and generating an input/output (I/O) string data batch that includes I/O string data for each split string in the log by training each split string against every other split string in the log. A context-based anomalous log data identification model is then trained according to a machine learning technique using the I/O string data batch that includes a list of unique strings in the context associated training dataset. The training tunes the context-based anomalous log data identification model to classify or cluster a vector associated with a new string in a new log that is not part of the multiple original logs as anomalous.
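The training-dataset step in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the whitespace tokenizer, the function names `split_log` and `build_io_batch`, and the sample logs are all assumptions made for illustration. The sketch splits each log string, keeps the log's unique key with every split string, and pairs each split string with every other split string from the same log.

```python
def split_log(log_text):
    # Assumption: split strings are whitespace-delimited tokens.
    return log_text.split()


def build_io_batch(logs):
    """Build (vocab, pairs) from a mapping of unique key -> log string.

    Each pair is (log_key, input_string, output_string), so every split
    string keeps its context association with the log it came from.
    """
    vocab = []   # unique split strings, in first-seen order
    seen = set()
    pairs = []
    for key, text in logs.items():
        tokens = split_log(text)
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)
        # Pair each split string with every other split string in the log.
        for i, src in enumerate(tokens):
            for j, dst in enumerate(tokens):
                if i != j:
                    pairs.append((key, src, dst))
    return vocab, pairs


# Hypothetical sample logs keyed by a unique identifier.
logs = {
    "log-1": "cmd.exe /c whoami",
    "log-2": "powershell.exe -enc",
}
vocab, pairs = build_io_batch(logs)
```

Pairs of this shape resemble the input/output examples used to train word-embedding models, which is one plausible reading of "training each split string against every other split string"; the abstract itself only says "a machine learning technique".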
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method, comprising: performing, by one or more hardware processors with associated memory that implement a context-based anomalous log data identification system: receiving log data comprising a plurality of logs; generating a context associated training dataset, comprising splitting a string in a log of the plurality of logs into a plurality of split strings, generating a context association between each of the plurality of split strings and a unique key that corresponds to the log, and generating an input/output (I/O) string data batch comprising I/O string data for each split string in the log by training each split string against every other split string of the plurality of split strings in the log; and training a context-based anomalous log data identification model using the I/O string data batch comprising a list of unique strings in the context associated training dataset and according to a machine learning technique, wherein the training tunes the context-based anomalous log data identification model to classify or cluster a vector associated with a new string in a new log that is not part of the plurality of logs as anomalous, training the context-based anomalous log data identification model to perform cluster analysis is based on whether an executable that is part of the process information is a good executable that is part of a bad path, and the good executable and the bad path are pre-identified based at least on a classifier prior to performing the cluster analysis.
2. The computer-implemented method of claim 1, further comprising: generating a dense vector for the log.
3. The computer-implemented method of claim 2, wherein generating the dense vector for the log comprises: accessing the list of unique split strings, and averaging a plurality of vectors comprising at least one vector for each unique split string in the list of unique split strings, and the dense vector indicates a mapping of each unique split string in the list of unique split strings to the dense vector being trained.
4. The computer-implemented method of claim 3, further comprising: training the context-based anomalous log data identification model with additional I/O string data generated by the context-based anomalous log data identification system for each log of the plurality of logs.
5. The computer-implemented method of claim 1, wherein the log data comprises process information associated with one or more computing systems generating the log data, and the process information comprises a plurality of process names/hashes.
6. The computer-implemented method of claim 5, wherein training the context-based anomalous log data identification model to perform cluster analysis is based at least on a number of occurrences of a process name/hash of the plurality of process names/hashes in the log.
7. A non-transitory computer readable storage medium comprising program instructions executable to: perform, by one or more hardware processors with associated memory that implement a context-based anomalous log data identification system: receive log data comprising a plurality of logs; generate a context associated training dataset, comprising splitting a string in a log of the plurality of logs into a plurality of split strings, generating a context association between each of the plurality of split strings and a unique key that corresponds to the log, and generating an input/output (I/O) string data batch comprising I/O string data for each split string in the log by training each split string against every other split string of the plurality of split strings in the log; and train a context-based anomalous log data identification model using the I/O string data batch comprising a list of unique strings in the context associated training dataset and according to a machine learning technique, wherein the training tunes the context-based anomalous log data identification model to classify or cluster a vector associated with a new string in a new log that is not part of the plurality of logs as anomalous, training the context-based anomalous log data identification model to perform cluster analysis is based on whether an executable that is part of the process information is a good executable that is part of a bad path, and the good executable and the bad path are pre-identified based at least on a classifier prior to performing the cluster analysis.
8. The non-transitory computer readable storage medium of claim 7, further comprising: generating a dense vector for the log.
9. The non-transitory computer readable storage medium of claim 8, wherein generating the dense vector for the log comprises: accessing the list of unique split strings, and averaging a plurality of vectors comprising at least one vector for each unique split string in the list of unique split strings, and the dense vector indicates a mapping of each unique split string in the list of unique split strings to the dense vector being trained.
10. The non-transitory computer readable storage medium of claim 9, further comprising: training the context-based anomalous log data identification model with additional I/O string data generated by the context-based anomalous log data identification system for each log of the plurality of logs.
11. The non-transitory computer readable storage medium of claim 7, wherein the log data comprises process information associated with one or more computing systems generating the log data, and the process information comprises a plurality of process names/hashes.
12. The non-transitory computer readable storage medium of claim 11, wherein training the context-based anomalous log data identification model to perform cluster analysis is further based at least on a number of occurrences of a process name/hash of the plurality of process names/hashes in the log.
13. A system comprising: one or more processors; and a memory coupled to the one or more processors, wherein the memory stores program instructions executable by the one or more processors to: perform, by one or more hardware processors with associated memory that implement a context-based anomalous log data identification system: receive log data comprising a plurality of logs; generate a context associated training dataset, comprising splitting a string in a log of the plurality of logs into a plurality of split strings, generating a context association between each of the plurality of split strings and a unique key that corresponds to the log, and generating an input/output (I/O) string data batch comprising I/O string data for each split string in the log by training each split string against every other split string of the plurality of split strings in the log; and train a context-based anomalous log data identification model using the I/O string data batch comprising a list of unique strings in the context associated training dataset and according to a machine learning technique, wherein the training tunes the context-based anomalous log data identification model to classify or cluster a vector associated with a new string in a new log that is not part of the plurality of logs as anomalous, training the context-based anomalous log data identification model to perform cluster analysis is based on whether an executable that is part of the process information is a good executable that is part of a bad path, and the good executable and the bad path are pre-identified based at least on a classifier prior to performing the cluster analysis.
14. The system of claim 13, fu
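The dense-vector step recited in claims 3 and 9 (averaging one vector per unique split string) can be sketched as follows. This is an illustrative assumption, not the claimed implementation: the function name `dense_vector`, the two-dimensional embedding values, and the choice to skip split strings that have no trained vector are all hypothetical.

```python
def dense_vector(tokens, embeddings):
    """Average the vectors of the unique split strings in one log.

    `embeddings` maps a split string to its trained vector (a list of
    floats). Strings without a vector are skipped, which is an assumption.
    """
    unique = list(dict.fromkeys(tokens))  # de-duplicate, keep order
    vectors = [embeddings[t] for t in unique if t in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical trained vectors for three split strings.
embeddings = {
    "cmd.exe": [1.0, 0.0],
    "/c": [0.0, 1.0],
    "whoami": [1.0, 1.0],
}
vec = dense_vector(["cmd.exe", "/c", "whoami", "/c"], embeddings)
```

The repeated `/c` contributes only once because the average is taken over unique split strings. A log-level vector of this kind is the sort of input a classifier or clustering step could then label as anomalous.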
based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Transformation · CPC title
by using string matching techniques · CPC title