Software code analysis using fuzzy fingerprinting

US11972256B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11972256-B2
Application number: US-202217651270-A
Country: US
Kind code: B2
Filing date: Feb 16, 2022
Priority date: Feb 16, 2022
Publication date: Apr 30, 2024
Grant date: Apr 30, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system for determining code ancestry. The system includes: a memory; and a processor communicatively coupled to the memory. The processor is configured to perform a method comprising: receiving a source code file; parsing a plurality of functions out of the source code file; generating fuzzy fingerprints from the plurality of functions; and storing the fuzzy fingerprints in a graph database.
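The four steps in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not name a fingerprint algorithm or a graph database, so the sketch assumes Python input parsed with the standard `ast` module, a hypothetical fingerprint made of hashed character shingles, and a plain dictionary standing in for the graph database. The function names `parse_functions`, `fuzzy_fingerprint`, and `build_graph` are invented for this example.

```python
import ast
import hashlib


def parse_functions(source: str) -> dict[str, str]:
    """Parse every function out of a Python source file as {name: source text}."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def fuzzy_fingerprint(text: str, k: int = 5) -> frozenset[str]:
    """Assumed fingerprint: short hashes of every k-character shingle.

    Whitespace is normalized first, so small edits change only a few
    shingles and most of the fingerprint survives.
    """
    normalized = " ".join(text.split())
    shingles = {
        normalized[i:i + k] for i in range(max(1, len(normalized) - k + 1))
    }
    return frozenset(hashlib.sha1(s.encode()).hexdigest()[:8] for s in shingles)


def build_graph(source: str) -> dict[str, dict]:
    """Store one node per function; a dict stands in for the graph database."""
    return {
        name: {"fingerprint": fuzzy_fingerprint(body), "edges": {}}
        for name, body in parse_functions(source).items()
    }


if __name__ == "__main__":
    src = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
    for name, node in build_graph(src).items():
        print(name, len(node["fingerprint"]), "shingle hashes")
```

Functions that share a name (for example, methods on different classes) would overwrite each other in this dictionary; a real store would need a fuller key such as file path plus qualified name.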

First claim

Opening claim text (preview).

What is claimed is:

1. A system for determining code ancestry, comprising: a memory; and a processor communicatively coupled to the memory, wherein the processor is configured to perform a method comprising: receiving a source code file; parsing a plurality of functions out of the source code file; sanitizing each of the plurality of functions by applying specific filtering to each of the plurality of functions, stripping unnecessary components from the source code file, and producing structure data and sanitized data; generating a plurality of fuzzy fingerprints from the plurality of functions, including generating separate fuzzy fingerprint signatures for both the structure data and the sanitized data of each of the plurality of functions that represent the plurality of functions; storing the plurality of fuzzy fingerprints in a graph database; comparing each of the plurality of fuzzy fingerprints to another one of the plurality of fuzzy fingerprints by applying fuzzy matching to determine a similarity; determining whether the plurality of fuzzy fingerprints include portions of the source code file that are a direct match, a variation, a derivative, or not a match by comparing the similarity to a threshold; and establishing an ancestry of the portions of the source code file based on the determining.

2. The system of claim 1, wherein the sanitizing step includes stripping unnecessary components from the functions.

3. The system of claim 1, wherein the processor is further configured to perform the method further comprising: using a graph-traversing algorithm to identify temporal and spatial relationships between the functions based on their corresponding fuzzy fingerprints in the graph database.

4. The system of claim 3, wherein the processor is further configured to perform the method further comprising: storing the temporal and spatial relationships between the functions identified by the graph database in a secondary database.

5. The system of claim 2, wherein the processor is further configured to perform the method further comprising: comparing the temporal and spatial relationships between the functions to the threshold to determine the similarity.

6. The system of claim 1, wherein the parsing step includes recognizing a filetype and a programming language of the source code file.

7. The system of claim 1, wherein the graph database is configured to generate a graph from temporal and spatial relationships between the fuzzy fingerprints using a graph-traversing algorithm.

8. A computer program product for software analysis using fuzzy fingerprinting to determine code ancestry, the computer program product comprising one or more computer readable storage media having program instructions embodied therewith, the program instructions executable by a device to cause the device to: receive a source code file of a computer program; parse a plurality of functions out of the source code file; sanitize each of the plurality of functions by applying specific filtering to each of the plurality of functions, stripping unnecessary components from the source code file, and producing structure data and sanitized data; generate a plurality of fuzzy fingerprints from the plurality of functions, including generating separate fuzzy fingerprint signatures for both the structure data and the sanitized data of each of the plurality of functions that represent the plurality of functions; store the plurality of fuzzy fingerprints in a graph database; compare each of the plurality of fuzzy fingerprints to another one of the plurality of fuzzy fingerprints by applying fuzzy matching to determine a similarity; determine whether the plurality of fuzzy fingerprints include portions of the source code file that are a direct match, a variation, a derivative, or not a match by comparing the similarity to a threshold; and establish an ancestry of the portions of the source code file based on the determining.

9. The computer program product of claim 8, wherein the program instructions cause the device to use a graph-traversing algorithm to identify temporal and spatial relationships between the functions based on their corresponding fuzzy fingerprints in the graph database.

10. The computer program product of claim 9, wherein the program instructions cause the device to store the temporal and spatial relationships between the functions identified by the graph database in a secondary database.

11. The computer program product of claim 9, wherein the program instructions cause the device to compare the temporal and spatial relationships between the functions to the threshold to determine the similarity.

12. A method for determining code ancestry, comprising: receiving a source code file; parsing a plurality of functions out of the source code file; sanitizing each of the plurality of functions by applying specific filtering to each of the plurality of functions, stripping unnecessary components from the source code file, and producing structure data and sanitized data; generating a plurality of fuzzy fingerprints from the plurality of functions, including generating separate fuzzy fingerprint signatures for both the structure data and the sanitized data of each of the plurality of functions that represent the plurality of functions; storing the plurality of fuzzy fingerprints in a graph database; comparing each of the plurality of fuzzy fingerprints to another one of the plurality of fuzzy fingerprints by applying fuzzy matching to determine a similarity; determining whether the plurality of fuzzy fingerprints include portions of the source code file that are a direct match, a variation, a derivative, or not a match by comparing the similarity to a threshold; and establishing an ancestry of the portions of the source code file based on the determining.

13. The method of claim 12, wherein the sanitizing step includes stripping unnecessary components from the functions.

14. The method of claim 12, further comprising: using a graph-traversing algorithm to identify temporal and spatial relationships between the functions based on their corresponding fuzzy fingerprints in the graph database.

15. The method of claim 14, further comprising: storing the temporal and spatial relationships between the functions identified by the graph database in a secondary database.

16. The method of claim 14, further comprising: comparing the temporal and spatial relationships between the functions to the threshold to determine the similarity.

17. The method of claim 12, wherein the parsing step includes recognizing a filetype and a programming language of the source code file.
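The sanitizing, comparing, and classifying steps recited in the independent claims can be illustrated with a short sketch. Everything specific here is an assumption made for illustration, since the claims do not fix these details: sanitizing drops comments and layout using Python's `tokenize` module, "structure data" is modeled as the token stream with identifiers and literals replaced by placeholders, fingerprints are hashed token trigrams compared by Jaccard similarity, the three cutoffs in `classify` are hypothetical values, and ancestry is inferred by linking each function to its most similar earlier one using a caller-supplied timestamp.

```python
import io
import keyword
import tokenize
import zlib

# Token types that carry no content: comments and layout.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def sanitize(function_source: str) -> tuple[str, str]:
    """Return (structure data, sanitized data) for one function."""
    structure, sanitized = [], []
    reader = io.StringIO(function_source).readline
    for tok in tokenize.generate_tokens(reader):
        if tok.type in SKIP:
            continue
        sanitized.append(tok.string)
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            structure.append("ID")    # identifier placeholder
        elif tok.type == tokenize.NUMBER:
            structure.append("NUM")   # numeric literal placeholder
        elif tok.type == tokenize.STRING:
            structure.append("STR")   # string literal placeholder
        else:
            structure.append(tok.string)
    return " ".join(structure), " ".join(sanitized)


def fingerprint(data: str, n: int = 3) -> frozenset[int]:
    """Assumed fingerprint: CRC32 of every n-token window."""
    tokens = data.split()
    grams = {" ".join(tokens[i:i + n])
             for i in range(max(1, len(tokens) - n + 1))}
    return frozenset(zlib.crc32(g.encode()) for g in grams)


def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0


def similarity(src_a: str, src_b: str) -> float:
    """Average the structure and sanitized fingerprint similarities."""
    struct_a, clean_a = sanitize(src_a)
    struct_b, clean_b = sanitize(src_b)
    return (jaccard(fingerprint(struct_a), fingerprint(struct_b))
            + jaccard(fingerprint(clean_a), fingerprint(clean_b))) / 2


def classify(score: float, direct: float = 0.95,
             variation: float = 0.75, derivative: float = 0.5) -> str:
    """Map a similarity score to a label using hypothetical cutoffs."""
    if score >= direct:
        return "direct match"
    if score >= variation:
        return "variation"
    if score >= derivative:
        return "derivative"
    return "not a match"


def ancestry(functions: list[tuple[str, int, str]]) -> dict[str, tuple[str, str]]:
    """Link each (name, timestamp, source) to its most similar earlier function."""
    links = {}
    ordered = sorted(functions, key=lambda f: f[1])
    for i, (name, _, src) in enumerate(ordered):
        best = max(ordered[:i], key=lambda f: similarity(src, f[2]), default=None)
        if best is not None:
            label = classify(similarity(src, best[2]))
            if label != "not a match":
                links[name] = (best[0], label)
    return links
```

Keeping two fingerprints per function lets the comparison separate two cases: a function whose identifiers were renamed still matches fully on structure data, while one that was only reformatted or re-commented matches fully on both.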

Assignees

IBM

Inventors

Classifications

  • G06F8/75 · Primary

    Structural analysis for program understanding · CPC title

  • G06F8/433 · Primary

    Dependency analysis; Data or control flow analysis · CPC title

  • Version control (security arrangements therefor G06F21/57); Configuration management · CPC title

  • Graphs; Linked lists (G06F16/9027 takes precedence) · CPC title

  • Matching criteria, e.g. proximity measures · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11972256B2 cover?
It covers a system for determining code ancestry. A processor coupled to a memory receives a source code file, parses a plurality of functions out of it, generates fuzzy fingerprints from those functions, and stores the fingerprints in a graph database.
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
The primary CPC classification is G06F8/75 (structural analysis for program understanding). The mapped technology area is Physics.
When was this patent published?
It was published on Apr 30, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).