Systems and methods for detecting code duplication in codebases

US2023185550A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2023185550-A1
Application number  US-202218064620-A
Country             US
Kind code           A1
Filing date         Dec 12, 2022
Priority date       Dec 13, 2021
Publication date    Jun 15, 2023
Grant date          (not listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for detecting code duplication are disclosed. In one embodiment, a method for detecting exact code snippet duplicates may include: (1) representing, by a code duplication detection computer program, each of a plurality of code snippets in a codebase as an abstract syntax trees; (2) featurizing, by the code duplication detection computer program, the abstract syntax trees into corpus feature vectors by converting the abstract syntax tree into vector representations; (3) generating, by the code duplication detection computer program, dense feature vectors from the corpus feature vectors using a dimension reduction technique; (4) identifying, by the code duplication detection computer program, exact duplicate code snippet matches by apply density-based clustering to the dense feature vectors; and (5) tagging, by the code duplication detection computer program, the exact duplicate code snippets.
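The first two steps of the abstract (parse each snippet into an abstract syntax tree, then turn the tree into a vector) can be illustrated with a minimal sketch. It assumes Python snippets and uses the standard-library `ast` module; the bag-of-node-types feature and the helper names `featurize`, `to_vector`, and `group_exact_duplicates` are hypothetical choices for illustration, not the featurization the patent describes. The sketch also skips dimension reduction and replaces density-based clustering with a much simpler stand-in: grouping snippets whose feature vectors are identical.

```python
import ast
from collections import Counter, defaultdict


def featurize(snippet: str) -> Counter:
    """Count AST node types in a snippet (a simple bag-of-nodes feature)."""
    tree = ast.parse(snippet)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def to_vector(features: Counter, vocabulary: list[str]) -> tuple[int, ...]:
    """Project node-type counts onto a fixed, ordered vocabulary."""
    return tuple(features.get(name, 0) for name in vocabulary)


def group_exact_duplicates(snippets: dict[str, str]) -> list[list[str]]:
    """Group snippet names whose feature vectors are identical.

    Stand-in for the clustering step: identical vectors only, no
    dimension reduction and no density-based clustering.
    """
    features = {name: featurize(src) for name, src in snippets.items()}
    vocabulary = sorted({node for f in features.values() for node in f})
    groups = defaultdict(list)
    for name, f in features.items():
        groups[to_vector(f, vocabulary)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical corpus: "add" and "plus" differ only in identifier names.
snippets = {
    "add": "def add(a, b):\n    return a + b\n",
    "plus": "def plus(x, y):\n    return x + y\n",
    "scale": (
        "def scale(v, k):\n"
        "    out = []\n"
        "    for item in v:\n"
        "        out.append(item * k)\n"
        "    return out\n"
    ),
}

duplicate_groups = group_exact_duplicates(snippets)
print(duplicate_groups)
```

Because the feature ignores identifier names, `add` and `plus` land in the same group while `scale` does not. Node-type counts also ignore tree shape, so structurally different snippets with the same counts would be grouped together; a real featurizer would need richer features.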

First claim

Opening claim text (preview).

What is claimed is:

1. A method for detecting exact code snippet duplicates, comprising:
    representing, by a code duplication detection computer program, each of a plurality of code snippets in a codebase as an abstract syntax trees;
    featurizing, by the code duplication detection computer program, the abstract syntax trees into corpus feature vectors by converting the abstract syntax tree into vector representations;
    generating, by the code duplication detection computer program, dense feature vectors from the corpus feature vectors using a dimension reduction technique;
    identifying, by the code duplication detection computer program, exact duplicate code snippet matches by apply density-based clustering to the dense feature vectors; and
    tagging, by the code duplication detection computer program, the exact duplicate code snippets.

2. The method of claim 1, further comprising: applying, by the code duplication detection computer program, Natural Language Processing (NLP) to generate features for the abstract syntax trees.

3. The method of claim 1, further comprising: applying, by the code duplication detection computer program, a de-noising filter to the plurality of code snippets or the abstract syntax trees.

4. The method of claim 3, wherein the de-noising filter filters code snippets or abstract syntax trees that are not actively used.

5. The method of claim 3, wherein the de-noising filter filters code snippets or abstract syntax trees that are irrelevant.

6. The method of claim 3, wherein the de-noising filter is based on a trained neural network.

7. The method of claim 1, wherein the corpus feature vectors comprise a list of featurized abstract syntax trees from a code corpus.

8. The method of claim 1, wherein the dimension reduction technique comprises truncated Singular Value Decomposition.

9. A method for detecting near code snippet duplicates, comprising:
    representing, by a code duplication detection computer program, each of a plurality of code snippets in a codebase as an abstract syntax trees;
    featurizing, by the code duplication detection computer program, the abstract syntax trees into corpus feature vectors by converting the abstract syntax tree into vector representations;
    generating, by the code duplication detection computer program, dense feature vectors from the corpus feature vectors using a dimension reduction technique;
    clustering, by the code duplication detection computer program, the dense feature vectors into dendrograms, each dendrogram having a different value for a cluster distance metric;
    applying, by the code duplication detection computer program, cross-correlation thresholding to identify an optimal value for the cluster distance metric;
    applying, by the code duplication detection computer program, iterative density-based clustering to the dendrogram for the optimal value for the cluster distance metric;
    tracking, by the code duplication detection computer program, data points in the dendrogram that have merged into a large cluster but were also present in small unique clusters, wherein the data points belonging to the same unique small cluster identify code snippets that are near duplicates of each other; and
    tagging, by the code duplication detection computer program, the near duplicate code snippets.

10. The method of claim 8, further comprising: applying, by the code duplication detection computer program, Natural Language Processing (NLP) to generate features for the abstract syntax trees.

11. The method of claim 9, further comprising: applying, by the code duplication detection computer program, a de-noising filter to the plurality of code snippets or the abstract syntax trees.

12. The method of claim 11, wherein the de-noising filter filters code snippets or abstract syntax trees that are not actively used.

13. The method of claim 11, wherein the de-noising filter is based on a trained neural network.

14. The method of claim 9, wherein the corpus feature vectors comprise a list of featurized abstract syntax trees from a code corpus.

15. The method of claim 9, wherein the dimension reduction technique comprises truncated Singular Value Decomposition.

16. A method for detecting exact code snippet duplicates, comprising:
    loading, by a code duplication detection computer program, a near duplicate centroid, an exact duplicate centroid, a vectorizer, and a dimension reduction model;
    producing, by the code duplication detection computer program, dense vectors from incremental functions;
    representing, by a code duplication detection computer program, each of a plurality of incremental functions as an abstract syntax trees;
    featurizing, by the code duplication detection computer program, the abstract syntax trees into incremental function feature vectors by converting the abstract syntax tree into vector representations;
    generating, by the code duplication detection computer program, dense feature vectors from the incremental function feature vectors using the dimension reduction model;
    computing, by the code duplication detection computer program, a cosine similarity between the generated dense vectors and the near duplicate centroid and the exact duplicate centroid;
    ranking, by the code duplication detection computer program, the near duplicate centroid and the exact duplicate centroid in descending order based on the similarity;
    thresholding, by the code duplication detection computer program, the ranked near duplicate centroid and the exact duplicate centroid;
    selecting, by the code duplication detection computer program, a top most ranked centroid; and
    identifying, by the code duplication detection computer program, the incremental function as a duplicate of data points in a cluster of the top-most centroid.

17. The method of claim 16, wherein the dimension reduction model comprises truncated Singular Value Decomposition.

18. The method of claim 16, further comprising: applying, by the code duplication detection computer program, Natural Language Processing (NLP) to generate features for the abstract syntax trees.

19. The method of claim 16, further comprising: applying, by the code duplication detection computer program, a de-noising filter to the plurality of code snippets or the abstract syntax trees.

20. The method of claim 19, wherein the de-noising filter filters code snippets or abstract syntax trees that are irrelevant.
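The matching steps of claim 16 (compute cosine similarity against stored centroids, rank in descending order, threshold, select the top-ranked centroid) can be sketched in a few lines. This is an illustration under stated assumptions, not the claimed implementation: the centroid labels, the three-dimensional vectors, the `0.9` threshold, and the function names `cosine_similarity` and `best_centroid` are all hypothetical, and the dense vector is taken as given rather than produced by a vectorizer and a dimension reduction model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def best_centroid(dense_vector, centroids, threshold=0.9):
    """Rank centroids by similarity, drop those below the threshold,
    and return (label, score) for the top match, or None if none pass."""
    ranked = sorted(
        ((cosine_similarity(dense_vector, c), label)
         for label, c in centroids.items()),
        reverse=True,  # descending similarity
    )
    kept = [(label, score) for score, label in ranked if score >= threshold]
    return kept[0] if kept else None


# Hypothetical centroids for one exact-duplicate and two near-duplicate clusters.
centroids = {
    "exact:cluster_0": [1.0, 0.0, 0.0],
    "near:cluster_1": [0.0, 1.0, 0.0],
    "near:cluster_2": [0.0, 0.0, 1.0],
}

# Hypothetical dense vector for a newly added ("incremental") function.
match = best_centroid([0.9, 0.1, 0.0], centroids)
no_match = best_centroid([1.0, 1.0, 1.0], centroids)
print(match)
print(no_match)
```

The first vector points almost exactly along the `exact:cluster_0` centroid, so that label is returned; the second is equally distant from all three centroids and falls below the threshold, so the function is not flagged as a duplicate.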

Assignees

  • JPMorgan Chase Bank NA

Inventors

Classifications

  • G06F8/4435 · Primary

    Detection or removal of dead or redundant code · CPC title

  • Parsing · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023185550A1 cover?
Systems and methods for detecting code duplication are disclosed. In one embodiment, a method for detecting exact code snippet duplicates may include: (1) representing, by a code duplication detection computer program, each of a plurality of code snippets in a codebase as an abstract syntax trees; (2) featurizing, by the code duplication detection computer program, the abstract syntax trees int…
Who is the assignee on this patent?
JPMorgan Chase Bank NA
What technology area does this patent fall under?
Primary CPC classification G06F8/4435. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 15, 2023 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).