What technology area does this patent fall under?

Primary CPC classification G06F16/148. Mapped technology areas include Physics.

When was this patent published?

Publication date Jan 1, 2026 (A1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Techniques for detecting file similarity

US2026003825A1 · US · A1

Patent metadata
Field	Value
Publication number	US-2026003825-A1
Application number	US-202418756540-A
Country	US
Kind code	A1
Filing date	Jun 27, 2024
Priority date	Jun 27, 2024
Publication date	Jan 1, 2026
Grant date	—

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for detecting file similarity based on the characteristics and semantics of files are disclosed. A machine learning (ML) model may be trained to recognize and group files based on a hierarchy of file characteristics. The trained ML model may be used to process a set of files to generate a feature vector database comprising a set of feature vectors that are grouped based on the hierarchy of characteristics. In response to receiving a query file to be compared to the set of files, the ML model may be used to process the query file to generate a query feature vector. The query feature vector may be used to search the feature vector database to identify feature vectors that are similar to the query feature vector. A file corresponding to each feature vector that is similar to the query feature vector may be retrieved and presented to a user.
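The query step described in the abstract can be sketched as a small nearest-neighbor search. This is a minimal illustration, not the method the publication discloses: the `Entry` type, the file names, the three-dimensional vectors, and the choice of cosine similarity are all assumptions made for the example, and the vectors are taken as given rather than produced by a trained ML model.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    file_id: str         # hypothetical identifier of a stored file
    vector: list[float]  # feature vector assumed to come from the trained model


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_similar(database, query_vector, k=2):
    """Return the ids of the k stored files most similar to the query vector."""
    scored = [(cosine_similarity(e.vector, query_vector), e.file_id) for e in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [file_id for _, file_id in scored[:k]]


# Toy feature vector database (made-up values for illustration only).
database = [
    Entry("a.exe", [1.0, 0.0, 0.0]),
    Entry("b.exe", [0.9, 0.1, 0.0]),
    Entry("c.exe", [0.0, 1.0, 0.0]),
]

# The query vector stands in for the model's output on the query file.
result = query_similar(database, [1.0, 0.0, 0.1], k=2)
```

A brute-force scan like this is only practical for small collections; a production system would more likely use an approximate nearest-neighbor index, which the abstract does not specify.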

First claim

Opening claim text (preview).

1. A method comprising: training, over a plurality of steps, a machine learning (ML) model to group files based on a hierarchy of characteristics, wherein at each of the plurality of steps, the ML model is trained to group files iteratively, and wherein at each iteration the ML model learns to group files based on a particular characteristic from the hierarchy of characteristics; providing to a machine learning (ML) model, a set of files, wherein the ML model is configured to generate, based on the set of files, a feature vector database comprising a set of feature vectors, wherein each of the set of feature vectors corresponds to a particular file of the set of files and wherein the set of feature vectors is grouped based on the hierarchy of characteristics; in response to receiving a query file to be compared to the set of files, processing the query file using the ML model to generate a query feature vector; and querying, by a processing device, the feature vector database using the query feature vector to identify one or more of the set of files that are similar to the query file.

2. The method of claim 1, wherein the ML model is trained using training data comprising a plurality of training data batches, wherein each of the plurality of training data batches comprises a set of training files with a label for each characteristic in the hierarchy of characteristics, and wherein training the ML model comprises: at each of the plurality of steps: grouping, using the ML model, a respective training data batch iteratively based on the hierarchy of characteristics to generate an output for each iteration; and for each iteration: analyzing the output with a hierarchical contrastive learning (HCL) loss function to determine a loss value; and adjusting one or more weights of the ML model based at least in part on the loss value.

3. The method of claim 2, wherein training the ML model further comprises: for each iteration: analyzing the output with a focal loss function to determine a second loss value; and adding the loss value and the second loss value to generate a total loss value, wherein the one or more weights of the ML model are adjusted based on the total loss value.

4. The method of claim 1, wherein querying the feature vector database using the query feature vector comprises: using a nearest neighbors algorithm to identify from the feature vector database, one or more of the set of feature vectors that are similar to the query feature vector.

5. The method of claim 4, further comprising: for each of the identified one or more feature vectors, retrieving a file from the set of files corresponding to the identified feature vector to obtain the one or more of the set of files that are similar to the query file; and providing the one or more of the set of files that are similar to the query file as a result set.

6. The method of claim 1, wherein the hierarchy of characteristics comprises: threat type, malware family, subtype, compiler, packer, and library.

7. The method of claim 1, wherein each of the set of files and the query file are portable executable files.

8. A system comprising: a memory; and a processing device operatively coupled to the memory, the processing device to: train, over a plurality of steps, a machine learning (ML) model to group files based on a hierarchy of characteristics, wherein at each of the plurality of steps, the ML model is trained to group files iteratively, and wherein at each iteration the ML model learns to group files based on a particular characteristic from the hierarchy of characteristics; provide to a machine learning (ML) model, a set of files, wherein the ML model is configured to generate, based on the set of files, a feature vector database comprising a set of feature vectors, wherein each of the set of feature vectors corresponds to a particular file of the set of files and wherein the set of feature vectors is grouped based on a hierarchy of characteristics; in response to receiving a query file to be compared to the set of files, process the query file using the ML model to generate a query feature vector; and query the feature vector database using the query feature vector to identify one or more of the set of files that are similar to the query file.

9. The system of claim 8, wherein the processing device trains the ML model using training data comprising a plurality of training data batches, wherein each of the plurality of training data batches comprises a set of training files with a label for each characteristic in the hierarchy of characteristics, and wherein to train the ML model, the processing device is to: at each of the plurality of steps: group, using the ML model, a respective training data batch iteratively based on the hierarchy of characteristics to generate an output for each iteration; and for each iteration: analyze the output with a hierarchical contrastive learning (HCL) loss function to determine a loss value; and adjust one or more weights of the ML model based at least in part on the loss value.

10. The system of claim 9, wherein to train the ML model, the processing device is further to: for each iteration: analyze the output with a focal loss function to determine a second loss value; and add the loss value and the second loss value to generate a total loss value, wherein the one or more weights of the ML model are adjusted based on the total loss value.

11. The system of claim 8, wherein to query the feature vector database using the query feature vector, the processing device is to: use a nearest neighbors algorithm to identify from the feature vector database, one or more of the set of feature vectors that are similar to the query feature vector.

12. The system of claim 11, wherein the processing device is further to: for each of the identified one or more feature vectors, retrieve a file from the set of files corresponding to the identified feature vector to obtain the one or more of the set of files that are similar to the query file; and provide the one or more of the set of files that are similar to the query file as a result set.

13. The system of claim 8, wherein the hierarchy of characteristics comprises: threat type, malware family, subtype, compiler, packer, and library.

14. The system of claim 8, wherein each of the set of files and the query file are portable executable files.

15. A non-transitory computer-readable medium having instructions stored thereon which, when executed by a processing device, cause the processing device to: train, over a plurality of steps, a machine learning (ML) model to group files based on a hierarchy of characteristics, wherein at each of the plurality of steps, the ML model is trained to group files iteratively, and wherein at each iteration the ML model learns to group files based on a particular characteristic from the hierarchy of characteristics; provide to a machine learning (ML) model, a set of files, wherein the ML model is configured to generate, based on the set of files, a feature vector database comprising a set of feature vectors, wherein each of the set of feature vectors corresponds to a particular file of the set of files and wherein the set of feature vectors is grouped based on a hierarchy of characteristics; in response to receiving a query file to be compared to the set of files, process the query file using the ML model to generate a query feature vector; and query, by the processing device, the feature vector database using the query feature vector to identify one or more of
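Claims 2 and 3 describe computing, at each iteration over the hierarchy, a contrastive loss and a focal loss and adding them into a total loss. The sketch below illustrates that summation only. The claim text does not give either formula, so everything here is an assumption: a plain supervised contrastive loss stands in for the HCL loss, the focal loss uses the common form -(1 - p_t)^gamma * log(p_t), and the embeddings, labels, probabilities, `gamma`, and `temperature` values are made up. The weight update is not shown.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(embeddings, labels, temperature=0.5):
    """Supervised contrastive loss for one batch at one hierarchy level.

    Stand-in for the HCL loss named in the claims. Embeddings are assumed
    to be L2-normalised, so the dot product acts as a similarity score.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        denom = sum(math.exp(dot(embeddings[i], embeddings[a]) / temperature)
                    for a in range(n) if a != i)
        loss_i = 0.0
        for p in positives:
            num = math.exp(dot(embeddings[i], embeddings[p]) / temperature)
            loss_i += -math.log(num / denom)
        total += loss_i / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def total_loss(embeddings, probs_per_sample, labels):
    """Sum of the contrastive loss and the mean focal loss, as in claim 3."""
    contrastive = contrastive_loss(embeddings, labels)
    focal = sum(focal_loss(p, t) for p, t in zip(probs_per_sample, labels)) / len(labels)
    return contrastive + focal


# Two levels taken from the hierarchy listed in claim 6; data is hypothetical.
hierarchy = ["threat type", "malware family"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
labels_by_level = {
    "threat type": [0, 0, 1, 1],
    "malware family": [0, 1, 0, 1],
}

# One iteration per characteristic; each total would drive a weight update.
losses = {level: total_loss(embeddings, probs, labels_by_level[level])
          for level in hierarchy}
```

In this toy batch the embeddings already cluster by the "threat type" labels, so that level yields the smaller total loss; the "malware family" labels cut across the clusters and are penalized more heavily.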

Assignees

Crowdstrike Inc

Inventors

Classifications

G06N20/00
Machine learning · CPC title
G06F16/148 (Primary)
File search processing · CPC title

Patent family

Related publications grouped by family.

View patent family 98368021

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2026003825A1 cover?: Techniques for detecting file similarity based on the characteristics and semantics of files are disclosed. A machine learning (ML) model may be trained to recognize and group files based on a hierarchy of file characteristics. The trained ML model may be used to process a set of files to generate a feature vector database comprising a set of feature vectors that are grouped based on the hierar…
Who is the assignee on this patent?: Crowdstrike Inc

Related patents

Systems and methods for contrastive graphing

Machine learning techniques for generating string-based database mapping prediction

Searching for Music

Analytics based on scalable hierarchical categorization of web content

Mitigation of malware
