Semantic code search based on augmented programming language corpus

US11609748B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11609748-B2
Application number  US-202117161545-A
Country             US
Kind code           B2
Filing date         Jan 28, 2021
Priority date       Jan 28, 2021
Publication date    Mar 21, 2023
Grant date          Mar 21, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method may include obtaining machine-readable source code. The method may include parsing the source code for one or more code descriptions and identifying a section of the source code corresponding to each of the code descriptions. The method may include determining a description-code pair including a first element representing the code description and a second element representing the section of the source code corresponding to the code description. The method may include generating an augmented programming language corpus based on the description-code pair, the one or more code descriptions, and the source code. The method may include receiving a natural language search query for source-code recommendations, identifying source code from the augmented programming language corpus responsive to the natural language search query, and responding to the natural language search query with the identified source code.
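The pairing step described in the abstract can be illustrated with a minimal sketch. It assumes Python source, full-line `#` comments as the code descriptions, and one simple location heuristic: a comment describes the contiguous non-blank code lines directly below it. The names `DescriptionCodePair` and `extract_pairs` are hypothetical and do not come from the patent, which does not prescribe a specific heuristic here.

```python
from dataclasses import dataclass


@dataclass
class DescriptionCodePair:
    description: str  # first element: the code description (comment text)
    code: str         # second element: the section of source code it describes


def extract_pairs(source: str) -> list[DescriptionCodePair]:
    """Pair each full-line '#' comment block with the code directly below it.

    Assumed heuristic (illustrative only): a section ends at the next
    blank line or at the next comment line.
    """
    pairs: list[DescriptionCodePair] = []
    comment_lines: list[str] = []
    code_lines: list[str] = []

    def flush() -> None:
        # Emit a pair only when both a description and code were collected.
        if comment_lines and code_lines:
            pairs.append(
                DescriptionCodePair(" ".join(comment_lines), "\n".join(code_lines))
            )
        comment_lines.clear()
        code_lines.clear()

    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            if code_lines:  # a new comment closes the previous section
                flush()
            comment_lines.append(stripped.lstrip("#").strip())
        elif not stripped:
            flush()  # a blank line closes the current section
        elif comment_lines:
            code_lines.append(line)  # code that belongs to the open comment
    flush()
    return pairs


# The sample is held in a string, so the code inside it is never executed.
SAMPLE = """\
# Read a file and return its lines
with open(path) as fh:
    lines = fh.readlines()

# Sort the numbers in descending order
numbers.sort(reverse=True)
"""

for pair in extract_pairs(SAMPLE):
    print(repr(pair.description), "->", repr(pair.code))
```

Code with no preceding comment is skipped in this sketch, and a comment separated from its code by a blank line is dropped; a production corpus builder would likely need additional heuristics (trailing comments, docstrings, block comments in other languages).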

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: obtaining, by a processor, machine-readable source code; parsing, by the processor, the source code for one or more code descriptions; identifying, by the processor, a section of the source code corresponding to a code description of the one or more code descriptions; determining, by the processor, a description-code pair, the description-code pair including a first element representing the code description and a second element representing the section of the source code corresponding to the code description; generating, by the processor, an augmented programming language corpus using the description-code pair, the one or more code descriptions, and the source code; training, by the processor, a machine learning model to provide source-code recommendations based on the augmented programming language corpus; receiving, by the processor, a natural language search query for a source-code recommendation; identifying, by the processor using the machine learning model, the source code responsive to the search query; and responding, by the processor, to the natural language search query with the source code identified from the augmented programming language corpus.

2. The method of claim 1, wherein the one or more code descriptions are code comments and identifying the section of the source code corresponding to a code description of the one or more code descriptions comprises: determining, by the processor, one or more heuristics relating a location of the code comment in a piece of source code to the section of the source code; determining, by the processor, the location of the code comment in the piece of source code; and locating, by the processor, the section of the source code to which the code comment corresponds based on the one or more heuristics and the location of the code comment in the piece of source code.

3. The method of claim 1, wherein the natural language search query is received via a text-input field in an integrated development environment (IDE), the IDE including an interface for software development.

4. The method of claim 1, wherein obtaining source code comprises: obtaining, by the processor, a source-code package; parsing, by the processor, the source-code package to identify one or more files, each file of the one or more files including at least a portion of the source code; and parsing, by the processor, the one or more files to identify files written in a target programming language.

5. The method of claim 1, further comprising: generating, by the processor, a negatively classified example based on the description-code pair; and training, by the processor, the machine learning model to provide source-code recommendations based on the augmented programming language corpus and the negatively classified example.

6. The method of claim 1, wherein responding to the natural language search query with the source code identified from the augmented programming language corpus comprises: mapping, by the processor, the natural language search query to a search vector; comparing, by the processor, the search vector to each description-code pair; determining, by the processor, a similarity score between the search vector and each description-code pair based on a cosine similarity between the search vector and each description-code pair; and returning, by the processor, the source code corresponding to the description-code pair based on the similarity score between the search vector and each description-code pair.

7. The method of claim 6, wherein returning the source code corresponding to the description-code pair based on the similarity score between the search vector and each description-code pair comprises: ranking, by the processor, description-code pairs based on the similarity score between the search vector and each description-code pair; and returning, by the processor, one or more pieces of the source code corresponding to the description-code pairs based on the ranking.

8. One or more non-transitory computer-readable storage media configured to store instructions that, in response to being executed, cause a system to perform operations, the operations comprising: obtaining machine-readable source code; parsing the source code for one or more code descriptions; identifying a section of the source code corresponding to a code description of the one or more code descriptions; determining a description-code pair, the description-code pair including a first element representing the code description and a second element representing the section of the source code corresponding to the code description; generating an augmented programming language corpus using the description-code pair, the one or more code descriptions, and the source code; training a machine learning model to provide source-code recommendations based on the augmented programming language corpus; receiving a natural language search query for a source-code recommendation; identifying, by the machine learning model, the source code from the augmented programming language corpus responsive to the natural language search query; and responding to the natural language search query with the source code identified from the augmented programming language corpus.

9. The one or more non-transitory computer-readable storage media of claim 8, wherein the one or more code descriptions are code comments and identifying the section of the source code corresponding to a code description of the one or more code descriptions comprises: determining one or more heuristics relating a location of the code comment in a piece of source code to the section of the source code; determining the location of the code comment in the piece of source code; and locating the section of the source code to which the code comment corresponds based on the one or more heuristics and the location of the code comment in the piece of source code.

10. The one or more non-transitory computer-readable storage media of claim 8, wherein the natural language search query is received via a text-input field in an integrated development environment (IDE), the IDE including an interface for software development.

11. The one or more non-transitory computer-readable storage media of claim 8, wherein obtaining source code comprises: obtaining a source-code package; parsing the source-code package to identify one or more files, each file of the one or more files including at least a portion of the source code; and parsing the one or more files to identify files written in a target programming language.

12. The one or more non-transitory computer-readable storage media of claim 8, further comprising: generating a negatively classified example based on the description-code pair; and training the machine learning model to provide source-code recommendations based on the augmented programming language corpus and the negatively classified example.

13. The one or more non-transitory computer-readable storage media of claim 8, wherein responding to the natural language search query with the source code identified from the augmented programming language corpus comprises: mapping the natural language search query to a search vector; comparing the search vector to each description-code pair; determining a similarity score between the search vector and each description-code pair based on a cosine similarity between the search vector and each description-code pair; and returning the source code corresponding to the description-code pair based on the similarity score between the search vector and each description-code pair.
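The retrieval steps in claims 6, 7, and 13 (map the query to a search vector, score each description-code pair by cosine similarity, rank, and return code) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the claims rely on a trained machine learning model, whereas the hypothetical `embed` function below is a toy bag-of-words stand-in, and only the description element of each pair is embedded. The names `embed`, `cosine_similarity`, `search`, and `CORPUS` are invented for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def search(
    query: str, pairs: list[tuple[str, str]], top_k: int = 3
) -> list[tuple[float, str]]:
    """Score every (description, code) pair against the query and
    return the top_k (score, code) results, highest score first."""
    query_vector = embed(query)
    scored = [
        (cosine_similarity(query_vector, embed(description)), code)
        for description, code in pairs
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical miniature corpus of description-code pairs.
CORPUS = [
    ("read a file and return its lines", "lines = open(path).readlines()"),
    ("sort the numbers in descending order", "numbers.sort(reverse=True)"),
    ("send an http get request", "response = urllib.request.urlopen(url)"),
]

for score, code in search("how to sort numbers", CORPUS, top_k=2):
    print(f"{score:.3f}  {code}")
```

With a learned encoder in place of `embed`, the same ranking loop would apply unchanged; for large corpora, an approximate nearest-neighbor index would typically replace the exhaustive comparison.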

Assignees

Fujitsu Ltd

Inventors

Classifications

  • Software reuse · CPC title

  • Machine learning · CPC title

  • G06F8/75 (Primary)

    Structural analysis for program understanding · CPC title

  • G06F8/33 (Primary)

    Intelligent editors · CPC title

  • Natural language query formulation or dialogue systems · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11609748B2 cover?
A method may include obtaining machine-readable source code. The method may include parsing the source code for one or more code descriptions and identifying a section of the source code corresponding to each of the code descriptions. The method may include determining a description-code pair including a first element representing the code description and a second element representing the secti…
Who is the assignee on this patent?
Fujitsu Ltd
What technology area does this patent fall under?
Primary CPC classification G06F8/75. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 21, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).