System and method for file type identification using machine learning

US12436920B2 · US · B2

Patent metadata
Field: Value
Publication number: US-12436920-B2
Application number: US-202318449617-A
Country: US
Kind code: B2
Filing date: Aug 14, 2023
Priority date: Oct 30, 2017
Publication date: Oct 7, 2025
Grant date: Oct 7, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for file type identification involving extraction of a file-print of a file, the file-print being a unique or practically-unique representation of statistical characteristics associated with the distribution of bits in the binary contents of the file, similar to a fingerprint. The file-print is then passed to a machine learning algorithm that has been trained to recognize file types from their file-prints. The machine learning algorithm returns a predicted file type and, in some cases, a probability of correctness of the prediction. The file may then be encoded using an encoding algorithm chosen based on the predicted file type.
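
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`extract_file_print`, `train_centroids`, `classify`) are hypothetical, the file-print is reduced to just a mean and a variance of byte-group values, and a nearest-centroid rule stands in for whatever trained machine learning classifier the patent contemplates.

```python
from statistics import fmean, pvariance


def extract_file_print(data: bytes, group_size: int = 1) -> tuple[float, float]:
    """Return (mean, variance) of byte-group values over the whole input.

    Hypothetical, minimal file-print: a real one would likely carry
    more statistical characteristics than these two.
    """
    if not data:
        raise ValueError("cannot build a file-print from empty data")
    # Segment the entire input into groups of bytes (the last group may be short).
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    # Read each group as an unsigned big-endian integer.
    values = [int.from_bytes(group, "big") for group in groups]
    return fmean(values), pvariance(values)


def train_centroids(samples: dict[str, list[bytes]]) -> dict[str, tuple[float, float]]:
    """Average the file-prints of known-type samples into one centroid per type."""
    centroids = {}
    for file_type, blobs in samples.items():
        prints = [extract_file_print(blob) for blob in blobs]
        centroids[file_type] = (
            fmean([p[0] for p in prints]),
            fmean([p[1] for p in prints]),
        )
    return centroids


def classify(data: bytes, centroids: dict[str, tuple[float, float]]) -> str:
    """Predict the type whose centroid is closest to this input's file-print."""
    mean, variance = extract_file_print(data)

    def squared_distance(file_type: str) -> float:
        c_mean, c_variance = centroids[file_type]
        return (mean - c_mean) ** 2 + (variance - c_variance) ** 2

    return min(centroids, key=squared_distance)
```

In this sketch the variance dominates the distance because the two features are not normalized. A practical classifier would scale its features, use many more of them, and could also report a probability alongside the predicted type, as the abstract mentions.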

First claim

Opening claim text (preview).

What is claimed is:

1. A system for identifying a file type comprising: a computing device comprising a processor, a memory, and a non-volatile data storage device; a file-print extractor comprising a first plurality of programming instructions stored in the memory and operable on the processor, wherein the first plurality of programming instructions, when operating on the processor, causes the computing device to: segment an entire file into groups of bytes; generate a statistical file-print for file type identification of the entire file, the statistical file-print comprising a plurality of statistical characteristics of a distribution of the groups of bytes across the entire file, wherein the statistical file-print comprises at least a mean value and a variance value for the distribution of bytes in the file; and a file classifier comprising a second plurality of programming instructions stored in the memory and operable on the processor, wherein the second plurality of programming instructions, when operating on the processor, causes the processor to; process the statistical file-print through a trained machine learning classifier to identify a file type of the file; wherein the trained machine learning classifier is specifically trained using a plurality of training datasets comprising statistical file-prints derived from files of known types; and wherein the trained machine learning classifier determines a file type based on statistical patterns in the file-print that correspond to patterns previously identified in files of known type during training.

2. The system of claim 1, further comprising a codebook database stored on the non-volatile data storage device, the codebook database comprising a plurality of codebooks which may be used to encode or decode files, wherein the identification of the file type is used to select a codebook for encoding or decoding the file.

3. The system of claim 2, wherein one or more encoding or decoding parameters are configured based on the identification of the file type.

4. The system of claim 2, further comprising a file signature database stored on the non-volatile data storage device, the file signature database comprising known file signatures for a plurality of file types, wherein: the file is checked for a file signature prior to file-print extraction; if a file signature is found, the file signature is compared to known file signatures in the file signature database; and if the file signature matches a known file signature, a codebook is selected based on the file signature, and the file-print extractor and file classifier are instructed to cease operations on the file.

5. A method for identifying a file type comprising the steps of: segmenting an entire file into groups of bytes; generating a statistical file-print for file type identification of the entire file, the statistical file-print comprising a plurality of statistical characteristics of a distribution of the groups of bytes across the entire file, wherein the statistical file-print comprises at least a mean value and a variance value for the distribution of bytes in the file; and processing the statistical file-print through a trained machine learning classifier specifically trained to recognize file types based on statistical characteristics of entire files to identify a file type of the file.

6. The method of claim 5, further comprising the steps of using the identification of the file type to select a codebook for encoding or decoding the file from a codebook database stored on the non-volatile data storage device, the codebook database comprising a plurality of codebooks which may be used to encode or decode files.

7. The method of claim 6, further comprising the step of configuring one or more encoding or decoding parameters based on the identification of the file type.

8. The method of claim 6, further comprising the steps of: checking the file for a file signature prior to file print extraction; if a file signature is found, comparing the file signature to known file signatures in a file signature database stored on the non-volatile data storage device, the file signature database comprising known file signatures for a plurality of file types; and if the file signature matches a known file signature, selecting a codebook based on the file signature, and instructing the file print extractor and file classifier to cease operations on the file.
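
The signature-first flow described in claims 4 and 8 can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed system: the `SIGNATURES` and `CODEBOOKS` tables, the codebook names, the `DEFAULT_CODEBOOK` fallback, and the function names are all hypothetical, and the classifier is passed in as a plain callable so the sketch stays independent of any particular model. The three magic numbers used (PNG, PDF, ZIP) are standard file signatures.

```python
from typing import Callable, Optional

# Hypothetical signature database: leading magic bytes -> file type.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
}

# Hypothetical codebook database: file type -> codebook identifier.
CODEBOOKS = {
    "png": "codebook_png",
    "pdf": "codebook_pdf",
    "zip": "codebook_zip",
    "text": "codebook_text",
}

# Assumed fallback for types without a dedicated codebook (not from the claims).
DEFAULT_CODEBOOK = "codebook_generic"


def match_signature(data: bytes) -> Optional[str]:
    """Return the file type whose known signature prefixes the data, if any."""
    for magic, file_type in SIGNATURES.items():
        if data.startswith(magic):
            return file_type
    return None


def select_codebook(data: bytes, classify: Callable[[bytes], str]) -> tuple[str, str]:
    """Return (codebook, method), where method is 'signature' or 'classifier'."""
    file_type = match_signature(data)
    if file_type is not None:
        # Signature matched: file-print extraction and classification are skipped.
        return CODEBOOKS[file_type], "signature"
    # No known signature: fall back to the file-print classifier.
    file_type = classify(data)
    return CODEBOOKS.get(file_type, DEFAULT_CODEBOOK), "classifier"
```

Checking the signature first avoids the cost of scanning the whole file when a cheap prefix comparison already settles the type. The classifier only runs for files with a missing or unrecognized signature.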

Assignees

Atombeam Technologies Inc

Inventors

Classifications

  • Saving storage space on storage systems · CPC title

  • De-duplication techniques · CPC title

  • Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title

  • Denial of Service · CPC title

  • according to the data type · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12436920B2 cover?
A system and method for file type identification involving extraction of a file-print of a file, the file-print being a unique or practically-unique representation of statistical characteristics associated with the distribution of bits in the binary contents of the file, similar to a fingerprint. The file-print is then passed to a machine learning algorithm that has been trained to recognize fi…
Who is the assignee on this patent?
Atombeam Technologies Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/1752. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 7, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).