Systems and methods for contrastive graphing
US-2024160890-A1 · May 16, 2024 · US
US2026003825A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2026003825-A1 |
| Application number | US-202418756540-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jun 27, 2024 |
| Priority date | Jun 27, 2024 |
| Publication date | Jan 1, 2026 |
| Grant date | — |
Official abstract text for this publication.
Techniques for detecting file similarity based on the characteristics and semantics of files are disclosed. A machine learning (ML) model may be trained to recognize and group files based on a hierarchy of file characteristics. The trained ML model may be used to process a set of files to generate a feature vector database comprising a set of feature vectors that are grouped based on the hierarchy of characteristics. In response to receiving a query file to be compared to the set of files, the ML model may be used to process the query file to generate a query feature vector. The query feature vector may be used to search the feature vector database to identify feature vectors that are similar to the query feature vector. A file corresponding to each feature vector that is similar to the query feature vector may be retrieved and presented to a user.
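The query step described in the abstract can be illustrated with a small sketch. This is not the patent's implementation: the file names, the three-dimensional vectors, the choice of cosine similarity, and the helper names `cosine_similarity` and `nearest_files` are all assumptions made for illustration. In the disclosed system the vectors would come from the trained ML model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_files(feature_db, query_vector, k=2):
    """Return the names of the k files whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(vector, query_vector), name)
        for name, vector in feature_db.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical feature vector database: one vector per file.
FEATURE_DB = {
    "a.exe": [1.0, 0.0, 0.0],
    "b.exe": [0.9, 0.1, 0.0],
    "c.exe": [0.0, 1.0, 0.0],
    "d.exe": [0.0, 0.0, 1.0],
}

# Hypothetical query feature vector for the query file.
QUERY = [1.0, 0.05, 0.0]

result_set = nearest_files(FEATURE_DB, QUERY, k=2)
print(result_set)
```

A brute-force scan like this is only practical for small databases. A production system would more likely use an indexed or approximate nearest neighbors search, which the text here does not specify.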
Opening claim text (preview).
1. A method comprising: training, over a plurality of steps, a machine learning (ML) model to group files based on a hierarchy of characteristics, wherein at each of the plurality of steps, the ML model is trained to group files iteratively, and wherein at each iteration the ML model learns to group files based on a particular characteristic from the hierarchy of characteristics; providing to a machine learning (ML) model, a set of files, wherein the ML model is configured to generate, based on the set of files, a feature vector database comprising a set of feature vectors, wherein each of the set of feature vectors corresponds to a particular file of the set of files and wherein the set of feature vectors is grouped based on the hierarchy of characteristics; in response to receiving a query file to be compared to the set of files, processing the query file using the ML model to generate a query feature vector; and querying, by a processing device, the feature vector database using the query feature vector to identify one or more of the set of files that are similar to the query file.

2. The method of claim 1, wherein the ML model is trained using training data comprising a plurality of training data batches, wherein each of the plurality of training data batches comprises a set of training files with a label for each characteristic in the hierarchy of characteristics, and wherein training the ML model comprises: at each of the plurality of steps: grouping, using the ML model, a respective training data batch iteratively based on the hierarchy of characteristics to generate an output for each iteration; and for each iteration: analyzing the output with a hierarchical contrastive learning (HCL) loss function to determine a loss value; and adjusting one or more weights of the ML model based at least in part on the loss value.

3. The method of claim 2, wherein training the ML model further comprises: for each iteration: analyzing the output with a focal loss function to determine a second loss value; and adding the loss value and the second loss value to generate a total loss value, wherein the one or more weights of the ML model are adjusted based on the total loss value.

4. The method of claim 1, wherein querying the feature vector database using the query feature vector comprises: using a nearest neighbors algorithm to identify from the feature vector database, one or more of the set of feature vectors that are similar to the query feature vector.

5. The method of claim 4, further comprising: for each of the identified one or more feature vectors, retrieving a file from the set of files corresponding to the identified feature vector to obtain the one or more of the set of files that are similar to the query file; and providing the one or more of the set of files that are similar to the query file as a result set.

6. The method of claim 1, wherein the hierarchy of characteristics comprises: threat type, malware family, subtype, compiler, packer, and library.

7. The method of claim 1, wherein each of the set of files and the query file are portable executable files.

8. A system comprising: a memory; and a processing device operatively coupled to the memory, the processing device to: train, over a plurality of steps, a machine learning (ML) model to group files based on a hierarchy of characteristics, wherein at each of the plurality of steps, the ML model is trained to group files iteratively, and wherein at each iteration the ML model learns to group files based on a particular characteristic from the hierarchy of characteristics; provide to a machine learning (ML) model, a set of files, wherein the ML model is configured to generate, based on the set of files, a feature vector database comprising a set of feature vectors, wherein each of the set of feature vectors corresponds to a particular file of the set of files and wherein the set of feature vectors is grouped based on a hierarchy of characteristics; in response to receiving a query file to be compared to the set of files, process the query file using the ML model to generate a query feature vector; and query the feature vector database using the query feature vector to identify one or more of the set of files that are similar to the query file.

9. The system of claim 8, wherein the processing device trains the ML model using training data comprising a plurality of training data batches, wherein each of the plurality of training data batches comprises a set of training files with a label for each characteristic in the hierarchy of characteristics, and wherein to train the ML model, the processing device is to: at each of the plurality of steps: group, using the ML model, a respective training data batch iteratively based on the hierarchy of characteristics to generate an output for each iteration; and for each iteration: analyze the output with a hierarchical contrastive learning (HCL) loss function to determine a loss value; and adjust one or more weights of the ML model based at least in part on the loss value.

10. The system of claim 9, wherein to train the ML model, the processing device is further to: for each iteration: analyze the output with a focal loss function to determine a second loss value; and add the loss value and the second loss value to generate a total loss value, wherein the one or more weights of the ML model are adjusted based on the total loss value.

11. The system of claim 8, wherein to query the feature vector database using the query feature vector, the processing device is to: use a nearest neighbors algorithm to identify from the feature vector database, one or more of the set of feature vectors that are similar to the query feature vector.

12. The system of claim 11, wherein the processing device is further to: for each of the identified one or more feature vectors, retrieve a file from the set of files corresponding to the identified feature vector to obtain the one or more of the set of files that are similar to the query file; and provide the one or more of the set of files that are similar to the query file as a result set.

13. The system of claim 8, wherein the hierarchy of characteristics comprises: threat type, malware family, subtype, compiler, packer, and library.

14. The system of claim 8, wherein each of the set of files and the query file are portable executable files.

15. A non-transitory computer-readable medium having instructions stored thereon which, when executed by a processing device, cause the processing device to: train, over a plurality of steps, a machine learning (ML) model to group files based on a hierarchy of characteristics, wherein at each of the plurality of steps, the ML model is trained to group files iteratively, and wherein at each iteration the ML model learns to group files based on a particular characteristic from the hierarchy of characteristics; provide to a machine learning (ML) model, a set of files, wherein the ML model is configured to generate, based on the set of files, a feature vector database comprising a set of feature vectors, wherein each of the set of feature vectors corresponds to a particular file of the set of files and wherein the set of feature vectors is grouped based on a hierarchy of characteristics; in response to receiving a query file to be compared to the set of files, process the query file using the ML model to generate a query feature vector; and query, by the processing device, the feature vector database using the query feature vector to identify one or more of
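
Claims 3 and 10 recite adding an HCL loss value and a focal loss value to form a total loss at each iteration over the hierarchy. The sketch below shows only that arithmetic, under stated assumptions: the focal loss uses the commonly cited form -(1 - p_t)^gamma * log(p_t), which the claim text does not spell out; the HCL value is a placeholder constant because the HCL formula does not appear in this excerpt; and the names `focal_loss`, `total_loss`, and the sample probabilities are hypothetical.

```python
import math

# Hierarchy levels as listed in claims 6 and 13.
HIERARCHY = ["threat type", "malware family", "subtype", "compiler", "packer", "library"]

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    probs is a list of predicted class probabilities and target is the index
    of the true class. With gamma = 0 this reduces to plain cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def total_loss(hcl_loss_value, focal_loss_value):
    """Sum of the two loss values, as recited in claims 3 and 10."""
    return hcl_loss_value + focal_loss_value

# Placeholder HCL loss value (assumption, not computed here).
placeholder_hcl = 0.40
# Hypothetical predicted probabilities for one training file; class 0 is the true label.
predicted = [0.7, 0.2, 0.1]

# One iteration per characteristic in the hierarchy.
for level in HIERARCHY:
    loss = total_loss(placeholder_hcl, focal_loss(predicted, target=0))
    print(f"{level}: total loss = {loss:.4f}")
```

In an actual training loop the predicted probabilities and the HCL value would differ per level, and the total loss would drive the weight update. Here they are held fixed to keep the example short.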
Machine learning · CPC title
File search processing · CPC title