Stacking-ensemble-based APT organization identification method and system, and storage medium

US12499228B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12499228-B2
Application number: US-202118003318-A
Country: US
Kind code: B2
Filing date: Jun 21, 2021
Priority date: Jun 24, 2020
Publication date: Dec 16, 2025
Grant date: Dec 16, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An APT organization identification method, system and storage medium based on a stacking ensemble are provided, the method comprising: using a TF-IDF algorithm combined with an n-gram to extract and vectorize behavior features from malware samples to form a malicious behavior vector feature set; based on the malicious behavior vector feature set, calculating correlations between features and chi-square values between the features and categories, and performing screening twice on the malicious behavior vector feature set to obtain an improved low-dimensional feature subset; constructing a multi-model fusion stacking ensemble, learning an APT organization identification model, and using the APT organization identification model to identify new APT attacks. Feature selection on the high-dimensional behavior vector features reduces the complexity of the data set; the imbalance of samples in the data set is also considered, and multi-model integrated training is adopted to improve the recognition accuracy; in addition, the APT organization identification model for malicious samples is obtained through machine learning training, which improves the efficiency of automatically identifying new samples.
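The first step named in the abstract, TF-IDF weighting over n-grams, can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses the TF and IDF definitions recited in claim 2 of this publication (term count divided by the total number of n-grams in the sample, and log of |D| over the document frequency plus 1). The function names `ngrams` and `tfidf_vectors`, the choice of bigrams, and the API-call traces are hypothetical stand-ins for real sandbox behavior reports.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(samples, n=2):
    """Vectorize token sequences with TF-IDF over n-grams.

    TF  = count of the n-gram in the sample / total n-grams in the sample
    IDF = log(|D| / (number of samples containing the n-gram + 1))
    """
    grams = [ngrams(tokens, n) for tokens in samples]
    vocab = sorted({g for doc in grams for g in doc})
    num_samples = len(samples)
    # Document frequency: in how many samples each n-gram occurs.
    doc_freq = Counter(g for doc in grams for g in set(doc))
    vectors = []
    for doc in grams:
        counts = Counter(doc)
        total = len(doc)
        row = []
        for g in vocab:
            tf = counts[g] / total if total else 0.0
            idf = math.log(num_samples / (doc_freq[g] + 1))
            row.append(tf * idf)
        vectors.append(row)
    return vocab, vectors


# Hypothetical behavior traces (assumed data, for illustration only).
samples = [
    ["CreateFile", "WriteFile", "CloseHandle"],
    ["CreateFile", "WriteFile", "RegSetValue"],
    ["Connect", "Send", "Recv"],
    ["Connect", "Send", "CloseHandle"],
]
vocab, vectors = tfidf_vectors(samples, n=2)
```

Because 1 is added to the denominator, an n-gram that occurs in all but one sample gets an IDF of zero, and one that occurs in every sample gets a negative IDF. Such features carry little discriminating weight in this formulation.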

First claim

Opening claim text (preview).

The invention claimed is:

1. An APT organization identification method based on a stacking ensemble, comprising the following steps: using a TF-IDF algorithm combined with an n-gram to extract and vectorize behavior features from malware samples to form a malicious behavior vector feature set; based on the malicious behavior vector feature set and an APT organization tag, calculating correlations between features and chi-square values between the features and categories, performing screening twice on the malicious behavior vector feature set to obtain an improved low-dimensional feature subset data; and constructing a multi-model fusion stacking ensemble, learning an APT organization identification model, using the APT organization identification model to perform an organization identification on new APT attacks; wherein calculating of the chi-square values between the features and the categories comprises: for $S_m$ categories in a feature subset $S$, calculating a chi-square value of each feature in each category, and arranging the features in a descending order of the chi-square values; from feature sets of each category, selecting top $N$ feature text and put them into a new feature subset $S'$; keeping one of the repeated features in $S'$, and deleting the rest; and outputting the new feature subset $S'$.

2. The APT organization identification method based on a stacking ensemble according to claim 1, wherein using the TF-IDF algorithm combined with the n-gram to extract and vectorize the behavior features from the malware samples to form a behavior data set comprises: for behavior text features of the malware samples, first generating n-gram texts, then counting a text frequency TF of each text separately, then attaching a weight parameter IDF to each text;

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

wherein $\mathrm{TF}_{i,j}$: a frequency of a text $i$ in a sample $j$; $n_{i,j}$: a number of times the text $i$ appears in the sample $j$; $\sum_{k} n_{k,j}$: a total number of text appearing in the sample $j$; then calculating the weight parameters:

$$\mathrm{IDF}_{i,j} = \log \frac{|D|}{|j : i \in d_j| + 1}$$

wherein $|D|$ represents a total number of samples, $|j : i \in d_j|$ represents a number of samples containing the text $i$, in order to prevent a denominator from being zero, 1 is added, a final weight calculation formula of each text is:

$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i,j}$$

and through a TF-IDF method for calculating malicious sample behavior feature text combined with the n-gram, preprocessing data, calculating text frequency features, performing a feature vectorization on behavior text data to form a semantic matrix, forming the malicious behavior vector feature set.

3. The APT organization identification method based on a stacking ensemble according to claim 1, wherein feature data extracted by the n-gram combined with a TF-IDF method comprises more feature attributes, first, performing a first primary selection of the malicious behavior vector feature set, calculating correlations between features and features, and filtering out features with information redundancy between features.

4. The APT organization identification method based on a stacking ensemble according to claim 3, wherein calculating the correlations between features and features comprises: inputting: a behavior vector dataset $F$, a number of features $F_n$, and thresholds $\varepsilon_1, \varepsilon_2$; randomly selecting a feature $X_1$, calculating its information entropy $H(X_1)$, if $H(X_1) > \varepsilon_1$ is satisfied, then adding it to a feature set to be selected $S$, otherwise, continue to select; for $i = 2, \ldots, F_n$ calculating an information entropy of the feature $X_i$, if $H(X_i) > \varepsilon_1$ is satisfied, then judging correlations between this feature and all other features $X_j$ in $S$: $\rho_{X_i, X_j} = \frac{\operatorname{cov}(X_i, X_j)}{\sigma_{X_i} \sigma_{X_j}} = E((X_i - \mu_{X_i})$ …
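The chi-square screening recited in claim 1 (score each feature per category, sort in descending order, take the top N per category, and keep one copy of any repeated feature) can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the claim preview does not say how the chi-square value is computed, so the sketch assumes the common one-vs-rest 2x2 statistic on feature presence or absence. The names `chi_square` and `select_features`, the tie-breaking by feature index, and the toy data are all hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 table [[a, b], [c, d]].

    a: feature present, sample in category
    b: feature present, sample not in category
    c: feature absent,  sample in category
    d: feature absent,  sample not in category
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(rows, labels, top_n):
    """Return feature indices chosen by per-category top-N chi-square.

    rows: feature vectors (assumption: a feature is "present" when > 0).
    labels: category of each row, e.g. an APT organization tag.
    """
    num_features = len(rows[0])
    selected = []  # the new subset S', in order of first selection
    for category in sorted(set(labels)):
        scores = []
        for f in range(num_features):
            a = b = c = d = 0
            for row, label in zip(rows, labels):
                present = row[f] > 0
                in_category = label == category
                if present and in_category:
                    a += 1
                elif present:
                    b += 1
                elif in_category:
                    c += 1
                else:
                    d += 1
            scores.append((chi_square(a, b, c, d), f))
        # Descending chi-square; ties broken by feature index (assumption).
        scores.sort(key=lambda s: (-s[0], s[1]))
        for _, f in scores[:top_n]:
            if f not in selected:  # keep one copy of repeated features
                selected.append(f)
    return selected


# Toy data (assumed): feature 0 marks group A, feature 1 marks group B,
# feature 2 is absent only in group C, feature 3 is present everywhere.
rows = [
    [1, 0, 1, 1], [1, 0, 1, 1],
    [0, 1, 1, 1], [0, 1, 1, 1],
    [0, 0, 0, 1], [0, 0, 0, 1],
]
labels = ["A", "A", "B", "B", "C", "C"]
subset = select_features(rows, labels, top_n=1)
```

In this toy data the always-present feature 3 scores zero for every category and is never selected, while features picked by more than one category appear only once in the output.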

Assignees

Inventors

Classifications

  • Test or assess software · CPC title

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title

  • relating to the classification model, e.g. parametric or non-parametric approaches · CPC title

  • G06F21/563 · Primary

    by source code analysis · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12499228B2 cover?
An APT organization identification method, system and storage medium based on a stacking ensemble are provided, the method comprising: using a TF-IDF algorithm combined with an n-gram to extract and vectorize behavior features from malware samples to form a malicious behavior vector feature set; based on the malicious behavior vector feature set, calculating correlations between features and ch…
Who is the assignee on this patent?
Univ Guangzhou
What technology area does this patent fall under?
Primary CPC classification G06F21/563. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 16, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).