Representing and comparing files based on segmented similarity

US2017193230A1 · US · A1

Patent metadata
Publication number: US-2017193230-A1
Application number: US-201514702750-A
Country: US
Kind code: A1
Filing date: May 3, 2015
Priority date: May 3, 2015
Publication date: Jul 6, 2017
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed herein is a system and method for determining whether two files are similar or an unknown file contains malware or other malicious activity. The system takes a suspect file and generates a hash for the file. The hash represents segments of a file that may be compared with segments of other hashes. This hash is then compared with the hash of another file. The comparison measures the distance between the two hashes and if the two hashes are close enough to each other then the two files are considered similar to each other.
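The comparison step in the abstract can be illustrated with a small Python sketch. It borrows the terms "area", "structural distance", and "weighting factor" from claims 10 to 12 and 18 on this page, but the page does not define them, so every formula here is an assumption: a hash is modeled as a pair `(transitions, levels)` describing a step function over byte positions, the area is the summed absolute level difference normalized by file length, and the structural distance is the difference in transition counts. The function names, the `length` parameter, and the default `weight` and `threshold` values are hypothetical.

```python
from bisect import bisect_left

# A hash is modeled as (transitions, levels): transitions[k] is the last byte
# index of segment k, and levels has one more entry than transitions.

def level_at(hash_, i):
    """Level of the step function described by hash_ at byte position i."""
    transitions, levels = hash_
    return levels[bisect_left(transitions, i)]

def area_between(a, b, length):
    """Assumed 'area': summed absolute level difference over all positions."""
    return sum(abs(level_at(a, i) - level_at(b, i)) for i in range(length))

def structural_distance(a, b):
    """Assumed 'structural distance': difference in number of transitions."""
    return abs(len(a[0]) - len(b[0]))

def distance(a, b, length, weight=1.0):
    """Combine normalized area and weighted structural distance (assumption)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return area_between(a, b, length) / length + weight * structural_distance(a, b)

def similar(a, b, length, weight=1.0, threshold=10.0):
    """Two hashes count as similar when their distance is within threshold."""
    return distance(a, b, length, weight) <= threshold

a = ([7], [10.0, 200.0])
b = ([7], [12.0, 198.0])
print(distance(a, b, 16))  # prints 2.0
print(similar(a, b, 16))   # prints True
```

Both example hashes assume files of the same length (16 bytes). How hashes of different lengths would be aligned is not stated on this page and is left out of the sketch.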

First claim

Opening claim text (preview).

1. A system for determining similarity between two files comprising: at least one processor and at least one memory device; a representation component configured to receive a file and generate a hash of the file, the hash including a list of transitions and a list of levels; and a distance component configured to determine a distance between the received file and a second file based on a comparison of the hash and a hash for the second file.

2. The system of claim 1 wherein the representation component further comprises: a preprocessing component, the preprocessing component configured to convert the file to a signal representative of the file.

3. The system of claim 2 wherein the preprocessing component applies a Huffman code to the file to generate the signal.

4. The system of claim 1 wherein the representation component further comprises: a segmentation component configured to divide a signal associated with the file into at least two segments and provided the segments as the list of transitions.

5. The system of claim 4 wherein the segmentation component is configured to identify a transition point, the transition point representative of a boundary between two segments.

6. The system of claim 4 wherein the segmentation component is further configured to generate a first window having a first size and a second window having a second size, the segmentation component further configured to place the first window at a first byte in the signal and place the second window at a byte following a last byte of the first window.

7. The system of claim 6 wherein the segmentation component is further configured to calculate a first statistical property for the first window and calculate a second statistical property for the second window and compare the first statistical property with the second statistical property and determine if a difference between the first statistical property and the second statistical property exceeds a threshold value.

8. The system of claim 7 wherein the segmentation component is further configured to enlarge the size of the first window when the difference does not exceed the threshold and move the second window to a location following the last byte of the enlarged first window.

9. The system of claim 1 wherein the representation component further comprises: a represent component configured to identify a statistical property for each transition in the list of transitions.

10. The system of claim 1 wherein the distance component is further configured to calculate the distance based on a calculated area between segments of the hash and segments of the hash of the second file.

11. The system of claim 10 wherein the distance component is further configured to calculate a structural distance between the hash and the hash of the second file.

12. The system of claim 11 wherein the distance component applies a weighting factor to the structural distance.

13. A method of generating a hash for a file comprising: receiving a file; preprocessing the file to convert the file to a signal representative of the bytes in the file; identifying a list of segments in the preprocessed file based on statistical property differences with other portions of the preprocessed file; representing the preprocessed file by generating a level value for each segment in the list of segments as a list of levels; and generating a hash of the file, wherein the hash comprises the list of segments and the list of levels.

14. The method of claim 13 wherein identifying the list of segments further comprises: determining a size of a first window; placing the first window on a first byte of the preprocessed file; placing a second window at a first byte position after an end byte of the first window; calculating a first statistical property for the first window and a second statistical property for the second window; and determining if a difference between the first statistical property and the second statistical property exceeds a threshold value; and noting as a transition point the end byte when the difference exceeds the threshold value.

15. The method of claim 14, when the difference does not exceed the threshold value, further comprising: increasing the size of the first window; moving the second window to the first byte position after a new end byte of the first window; and repeating the steps of calculating, determining and noting.

16. The method of claim 14 when the difference exceeds the threshold value, further comprising: moving the first window to the first byte position of the second window; resetting the size of the first window to an original size; and repeating the steps of placing, calculating, determining and noting for the first window and the second window for the new location.

17. The method of claim 13 wherein the level value is generated by calculating a statistical property for each segment in the list of segments.

18. A computer readable storage device having computer executable instructions that when executed by at least one computer cause the at least one computer to: receive a hash of a file to analyze; obtain a second hash, the second hash representative of a second file to compare with the file; determine an area between the hash and the second hash; determine a structural distance between the hash and the second hash; calculate a distance between the hash and the second hash based on the area and the structural distance; determine if the two hashes are similar or dissimilar based on a comparison of the calculated distance to a threshold value.

19. The computer readable storage device of claim 18 wherein calculate the distance between the hash and the second hash further comprises instructions to applying a weighting factor to the structural distance.

20. The computer readable storage device of claim 18 wherein receive a hash of a file further comprises instructions to: receive the file; provide the file to a representation component; and receive from the representation component a hash of the file.
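The two-window segmentation recited in claims 13 to 17 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the claims leave the "statistical property" open, so the mean byte value is used here; preprocessing is reduced to reading raw byte values (claim 3 mentions a Huffman code, which is omitted); and the second window is given the same size as the initial first window. The function name `segment_hash` and the default `window` and `threshold` values are hypothetical.

```python
def _mean(values):
    """Mean byte value; stands in for the unspecified statistical property."""
    return sum(values) / len(values)

def segment_hash(data: bytes, window: int = 4, threshold: float = 100.0):
    """Return (transitions, levels) for data using a growing first window."""
    signal = list(data)  # assumed preprocessing: raw byte values as the signal
    if not signal:
        return [], []
    transitions = []
    start, size = 0, window
    # Continue while the second window still fits entirely inside the signal.
    while start + size + window <= len(signal):
        first = signal[start:start + size]
        second = signal[start + size:start + size + window]
        if abs(_mean(first) - _mean(second)) > threshold:
            # Note the end byte of the first window as a transition point,
            # then restart at the second window with the original size.
            transitions.append(start + size - 1)
            start, size = start + size, window
        else:
            # No transition yet: enlarge the first window by one byte, which
            # also shifts the second window one byte to the right.
            size += 1
    # One level (mean value) per segment delimited by the transition points.
    bounds = [0] + [t + 1 for t in transitions] + [len(signal)]
    levels = [_mean(signal[a:b]) for a, b in zip(bounds, bounds[1:])]
    return transitions, levels

sample = bytes([10] * 8 + [200] * 8)
print(segment_hash(sample, window=4, threshold=150.0))  # prints ([7], [10.0, 200.0])
```

In this sketch the threshold controls how early a boundary is declared: with a lower threshold the second window triggers a transition while it still straddles the true boundary, so the recorded transition point lands a few bytes early.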

Assignees

Microsoft Technology Licensing LLC

Classifications

  • Test or assess a computer or a system · CPC title

  • G06F21/565 · Primary

    by checking file integrity · CPC title

  • G06F21/564 · Primary

    by virus signature recognition · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F21/565. Mapped technology areas include Physics.
When was this patent published?
Publication date Jul 6, 2017 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).