Computer-Security Violation Detection using Coordinate Vectors
US-2020311262-A1 · Oct 1, 2020 · US
US12499228B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12499228-B2 |
| Application number | US-202118003318-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 21, 2021 |
| Priority date | Jun 24, 2020 |
| Publication date | Dec 16, 2025 |
| Grant date | Dec 16, 2025 |
Official abstract text for this publication.
An APT organization identification method, system, and storage medium based on a stacking ensemble are provided. The method comprises: using a TF-IDF algorithm combined with an n-gram to extract and vectorize behavior features from malware samples to form a malicious behavior vector feature set; based on the malicious behavior vector feature set, calculating correlations between features and chi-square values between the features and categories, and performing screening twice on the malicious behavior vector feature set to obtain an improved low-dimensional feature subset; and constructing a multi-model fusion stacking ensemble, learning an APT organization identification model, and using the APT organization identification model to identify new APT attacks. Feature selection on the high-dimensional behavior vector features reduces the complexity of the data set; the imbalance of samples in the data set is also considered, and multi-model integrated training is adopted to improve recognition accuracy; in addition, the APT organization identification model for malicious samples is obtained through machine learning training, which improves the efficiency of automatically identifying new samples.
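The first step named in the abstract, n-gram extraction followed by TF-IDF weighting, can be illustrated with a minimal sketch. It uses the weighting given in claim 2 of this publication (TF as a relative frequency, IDF with 1 added to the denominator). The function names, the bigram setting, and the API-call sequences are hypothetical and are not taken from the patent; this is an illustration under those assumptions, not the patented implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token sequence, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(samples, n=2):
    """Weight each n-gram text i in each sample j.

    TF_ij  = n_ij / sum_k n_kj
    IDF_i  = log(|D| / (number of samples containing i + 1))
    weight = TF_ij * IDF_i
    """
    counts = [Counter(ngrams(tokens, n)) for tokens in samples]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each sample counts once per text
    vocab = sorted(doc_freq)
    total_docs = len(samples)
    vectors = []
    for c in counts:
        total = sum(c.values())
        row = []
        for text in vocab:
            tf = c[text] / total if total else 0.0
            idf = math.log(total_docs / (doc_freq[text] + 1))
            row.append(tf * idf)
        vectors.append(row)
    return vocab, vectors


# Hypothetical behavior traces (API-call sequences), one list per malware sample.
samples = [
    ["CreateFile", "WriteFile", "CloseHandle"],
    ["CreateFile", "ReadFile", "CloseHandle"],
    ["RegOpenKey", "RegSetValue", "RegCloseKey"],
    ["Connect", "Send", "Recv"],
]
vocab, vectors = tfidf_vectors(samples, n=2)
```

Because 1 is added to the denominator, a text that occurs in every sample receives a slightly negative IDF under this formula; the sketch keeps that behavior rather than clipping it.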
Opening claim text (preview).
The invention claimed is:

1. An APT organization identification method based on a stacking ensemble, comprising the following steps: using a TF-IDF algorithm combined with an n-gram to extract and vectorize behavior features from malware samples to form a malicious behavior vector feature set; based on the malicious behavior vector feature set and an APT organization tag, calculating correlations between features and chi-square values between the features and categories, performing screening twice on the malicious behavior vector feature set to obtain an improved low-dimensional feature subset data; and constructing a multi-model fusion stacking ensemble, learning an APT organization identification model, using the APT organization identification model to perform an organization identification on new APT attacks; wherein calculating of the chi-square values between the features and the categories comprises: for $S_m$ categories in a feature subset $S$, calculating a chi-square value of each feature in each category, and arranging the features in a descending order of the chi-square values; from feature sets of each category, selecting top $N$ feature text and put them into a new feature subset $S'$; keeping one of the repeated features in $S'$, and deleting the rest; and outputting the new feature subset $S'$.

2. The APT organization identification method based on a stacking ensemble according to claim 1, wherein using the TF-IDF algorithm combined with the n-gram to extract and vectorize the behavior features from the malware samples to form a behavior data set comprises: for behavior text features of the malware samples, first generating n-gram texts, then counting a text frequency TF of each text separately, then attaching a weight parameter IDF to each text;

$$TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

wherein $TF_{i,j}$: a frequency of a text $i$ in a sample $j$; $n_{i,j}$: a number of times the text $i$ appears in the sample $j$; $\sum_k n_{k,j}$: a total number of text appearing in the sample $j$; then calculating the weight parameters:

$$IDF_{i,j} = \log \frac{|D|}{|j : i \in d_j| + 1}$$

wherein $|D|$ represents a total number of samples, $|j : i \in d_j|$ represents a number of samples containing the text $i$, in order to prevent a denominator from being zero, 1 is added, a final weight calculation formula of each text is:

$$TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_{i,j}$$

and through a TF-IDF method for calculating malicious sample behavior feature text combined with the n-gram, preprocessing data, calculating text frequency features, performing a feature vectorization on behavior text data to form a semantic matrix, forming the malicious behavior vector feature set.

3. The APT organization identification method based on a stacking ensemble according to claim 1, wherein feature data extracted by the n-gram combined with a TF-IDF method comprises more feature attributes, first, performing a first primary selection of the malicious behavior vector feature set, calculating correlations between features and features, and filtering out features with information redundancy between features.

4. The APT organization identification method based on a stacking ensemble according to claim 3, wherein calculating the correlations between features and features comprises: inputting: a behavior vector dataset $F$, a number of features $F_n$, and thresholds $\varepsilon_1, \varepsilon_2$; randomly selecting a feature $X_1$, calculating its information entropy $H(X_1)$, if $H(X_1) > \varepsilon_1$ is satisfied, then adding it to a feature set to be selected $S$, otherwise, continue to select; for $i = 2, \ldots, F_n$, calculating an information entropy of the feature $X_i$, if $H(X_i) > \varepsilon_1$ is satisfied, then judging correlations between this feature and all other features $X_j$ in $S$: $\rho_{X_i,X_j} = \frac{\operatorname{cov}(X_i, X_j)}{\sigma_{X_i}\sigma_{X_j}} = E((X_i - \mu_{X_i})$
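The chi-square screening recited in claim 1 (score every feature per category, sort in descending order, take the top N per category, keep one copy of repeated features) can be sketched as follows. The claim text shown here does not state how the chi-square value is computed, so this sketch assumes the standard 2x2 contingency statistic on feature presence (nonzero weight) versus category membership. The function names, feature names, and labels are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table (assumed form, not from the claim).

    a: feature present, sample in category
    b: feature present, sample not in category
    c: feature absent,  sample in category
    d: feature absent,  sample not in category
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # feature or category is constant: no association measurable
    return n * (a * d - b * c) ** 2 / denom


def select_top_n(X, y, feature_names, top_n):
    """Build S': the top_n features per category, with repeats kept only once."""
    selected = []
    for cat in sorted(set(y)):
        scores = []
        for f, name in enumerate(feature_names):
            a = sum(1 for row, lab in zip(X, y) if row[f] != 0 and lab == cat)
            b = sum(1 for row, lab in zip(X, y) if row[f] != 0 and lab != cat)
            c = sum(1 for row, lab in zip(X, y) if row[f] == 0 and lab == cat)
            d = sum(1 for row, lab in zip(X, y) if row[f] == 0 and lab != cat)
            scores.append((chi_square(a, b, c, d), name))
        # Descending order of chi-square; the sort is stable, so ties keep input order.
        scores.sort(key=lambda s: s[0], reverse=True)
        for _, name in scores[:top_n]:
            if name not in selected:  # keep one of the repeated features
                selected.append(name)
    return selected


# Hypothetical feature matrix (rows = samples) and organization tags.
feature_names = ["f0", "f1", "f2", "f3"]
X = [
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]
y = ["group_a", "group_a", "group_b", "group_b"]
subset = select_top_n(X, y, feature_names, top_n=2)
```

In this toy data both categories rank `f0` and `f1` highest, so the duplicate-removal step leaves each of them in the output subset once.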
Test or assess software · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title
relating to the classification model, e.g. parametric or non-parametric approaches · CPC title
by source code analysis · CPC title