Semantic code retrieval using graph matching
US-11720346-B2 · Aug 8, 2023 · US
US12093671B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12093671-B2 |
| Application number | US-202217731593-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 28, 2022 |
| Priority date | Apr 28, 2022 |
| Publication date | Sep 17, 2024 |
| Grant date | Sep 17, 2024 |
Official abstract text for this publication.
Techniques are described herein for translating source code using sparse self-attention. In various implementations, a source code snippet in a first programming language may be processed to obtain one or more graphs representing snippet tokens and the relationships between them. Based on the graph(s), a subset of snippet token pairs may be identified from a superset of all possible token pairs in the source code snippet. Each token pair of the subset may include snippet tokens that are represented by nodes connected by one or more edges of the one or more graphs. A self-attention network of a translation machine learning model may be adapted to sparsely attend across the identified subset of token pairs. The source code snippet may then be processed based on the adapted translation machine learning model to generate a translation of the source code snippet in a second programming language.
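The pair-selection step in the abstract can be illustrated with a minimal sketch. It is not the patent's implementation: the function names `edge_connected_pairs` and `attention_mask`, the toy token list, and the hand-written data-flow edges are all hypothetical, and letting each token attend to itself (the mask diagonal) is an added assumption.

```python
def edge_connected_pairs(num_tokens, edges):
    """Collect unordered token-index pairs that share a graph edge."""
    pairs = set()
    for a, b in edges:
        if a != b and 0 <= a < num_tokens and 0 <= b < num_tokens:
            pairs.add((min(a, b), max(a, b)))
    return pairs


def attention_mask(num_tokens, pairs):
    """Build a boolean mask: True where attention is allowed.

    Assumption: every token may attend to itself (the diagonal).
    """
    mask = [[i == j for j in range(num_tokens)] for i in range(num_tokens)]
    for a, b in pairs:
        mask[a][b] = True
        mask[b][a] = True
    return mask


# Toy snippet "x = a + b", one token per list entry (hypothetical).
tokens = ["x", "=", "a", "+", "b"]
# Hypothetical data-flow edges: x depends on a and on b.
edges = [(0, 2), (0, 4)]

pairs = edge_connected_pairs(len(tokens), edges)
mask = attention_mask(len(tokens), pairs)

# Size of the superset of all possible unordered token pairs.
all_pairs = len(tokens) * (len(tokens) - 1) // 2
```

Here the selected subset holds 2 of the 10 possible pairs, so a self-attention layer restricted by `mask` would score far fewer token pairs than dense attention over the same snippet.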
Opening claim text (preview).
What is claimed is:

1. A method for translating a source code snippet from a first programming language to a second programming language, the method implemented by one or more processors and comprising: obtaining one or more graphs representing snippet tokens, and relationships between the snippet tokens, contained in the source code snippet written in the first programming language; based on the one or more graphs, selecting, from a superset of all possible pairs of the snippet tokens in the source code snippet, a subset of snippet token pairs that comprises less than the superset of all possible pairs, wherein each token pair of the selected subset includes snippet tokens that are represented by nodes connected by fewer than a threshold number of edges of the one or more graphs; adapting a self-attention network of a large language model to sparsely attend across token pairs of the selected subset of token pairs; and processing the source code snippet based on the adapted large language model to translate the source code snippet from the first programming language to the second programming language.

2. The method of claim 1, wherein the one or more graphs include at least one of a data flow graph (DFG), a control flow graph (CFG), or an abstract syntax tree (AST) representing the source code snippet.

3. The method of claim 1, wherein the edges of one or more of the graphs represent dependencies between the snippet tokens.

4. The method of claim 1, wherein the source code snippet comprises a function, and the method further includes processing an entire source code file that contains the function to identify global tokens defined in a portion of the source code file outside of the function, wherein the self-attention network is adapted based at least in part on the global tokens.

5. The method of claim 4, further comprising adapting the self-attention network of the large language model to attend between each of the global tokens and all other tokens of the source code file.

6. The method of claim 4, further comprising: analyzing the one or more graphs to identify, as inter-function token pairs, tokens from different functions that are connected by one or more edges of one or more of the graphs; and adapting the self-attention network to further sparsely attend across the inter-function token pairs.

7. The method of claim 6, wherein one or more of the inter-function pairs comprises a function definition and a function call.

8. The method of claim 4, further comprising: identifying dependencies between one or more other functions of the source code file and the function defined in the source code snippet; and adapting the self-attention network of the large language model to further sparsely attend based on the identified dependencies.

9. The method of claim 1, further comprising adapting the self-attention network to attend across other randomly-selected token pairs of the superset.

10. A system for translating a source code snippet from a first programming language to a second programming language, the system comprising one or more processors and memory storing instructions that, in response to execution of the instructions, cause the one or more processors to: obtain one or more graphs representing snippet tokens, and relationships between the snippet tokens, contained in the source code snippet written in the first programming language; based on the one or more graphs, select, from a superset of all possible pairs of the snippet tokens in the source code snippet, a subset of snippet token pairs that comprises less than the superset of all possible pairs, wherein each token pair of the selected subset includes snippet tokens that are represented by nodes connected by fewer than a threshold number of edges of the one or more graphs; adapt a self-attention network of a large language model to sparsely attend across token pairs of the selected subset of token pairs; and process the source code snippet based on the adapted large language model to translate the source code snippet from the first programming language to the second programming language.

11. The system of claim 10, wherein the one or more graphs include at least one of a data flow graph (DFG), a control flow graph (CFG), or an abstract syntax tree (AST) representing the source code snippet.

12. The system of claim 10, wherein the edges of one or more of the graphs represent dependencies between the tokens.

13. The system of claim 10, wherein the source code snippet comprises a function, and the instructions further includes instructions to process an entire source code file that contains the function to identify global tokens defined in a portion of the source code file outside of the function, wherein the self-attention network is adapted based at least in part on the global tokens.

14. The system of claim 13, further comprising instructions to adapt the self-attention network of the large language model to attend between each of the global tokens and all other tokens of the source code file.

15. The system of claim 13, further comprising instructions to: analyze the one or more graphs to identify, as inter-function token pairs, tokens from different functions that are connected by one or more edges of one or more of the graphs; and adapt the self-attention network to further sparsely attend across the inter-function token pairs.

16. The system of claim 15, wherein one or more of the inter-function pairs comprises a function definition and a function call.

17. The system of claim 13, further comprising instructions to: identify dependencies between one or more other functions of the source code file and the function defined in the source code snippet; and adapt the self-attention network of the large language model to further sparsely attend based on the identified dependencies.

18. The system of claim 10, further comprising instructions to adapt the self-attention network to attend across other randomly-selected pairs of the superset.

19. A non-transitory computer-readable medium for translating a source code snippet from a first programming language to a second programming language, the medium comprising instructions that, in response to execution of the instructions by a processor, cause the processor to: obtain one or more graphs representing snippet tokens, and relationships between the snippet tokens, contained in the source code snippet written in the first programming language; based on the one or more graphs, select, from a superset of all possible pairs of the snippet tokens in the source code snippet, a subset of snippet token pairs that comprises less than the superset of all possible pairs, wherein each token pair of the selected subset includes snippet tokens that are represented by nodes connected by fewer than a threshold number of edges of the one or more graphs; adapt a self-attention network of a large language model to sparsely attend across token pairs of the selected subset of token pairs; and process the source code snippet based on the adapted large language model to translate the source code snippet from the first programming language to the second programming language.

20. The non-transitory computer-readable medium of claim 19, wherein the one or more graphs include at least one of a data flow graph (DFG), a control flow graph (CFG), or an abstract syntax tree (AST) representing the source code snippet.
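The claims add three refinements to pair selection: a threshold on the number of edges separating two tokens (claims 1, 10, 19), global tokens that attend to every other token (claims 5, 14), and extra randomly selected pairs (claims 9, 18). The sketch below shows one way these could be combined. It is an illustration under stated assumptions, not the claimed implementation: all function names are hypothetical, edges are treated as undirected, the distance between two tokens is taken as the shortest path length found by breadth-first search, and the fixed random seed is only there for reproducibility.

```python
import random
from collections import deque


def pairs_within(num_tokens, edges, threshold):
    """Pairs whose nodes are joined by fewer than `threshold` edges.

    Assumption: edges are undirected and the distance is the
    shortest path length, found with breadth-first search.
    """
    adj = {i: set() for i in range(num_tokens)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    pairs = set()
    for start in range(num_tokens):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] + 1 >= threshold:
                continue  # neighbours would be too far away
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        pairs.update((start, other) for other in dist if other > start)
    return pairs


def global_pairs(num_tokens, global_tokens):
    """Pair each global token with every other token."""
    return {(min(g, t), max(g, t))
            for g in global_tokens for t in range(num_tokens) if t != g}


def random_pairs(num_tokens, count, seed=0):
    """Sample extra pairs uniformly from the superset of all pairs."""
    rng = random.Random(seed)
    superset = [(i, j) for i in range(num_tokens)
                for j in range(i + 1, num_tokens)]
    return set(rng.sample(superset, min(count, len(superset))))


# Hypothetical five-token chain 0-1-2-3-4.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
near = pairs_within(5, chain, threshold=3)   # distance 1 or 2
with_global = global_pairs(5, [4])           # token 4 treated as global
extra = random_pairs(5, count=2)
selected = near | with_global | extra
```

With `threshold=3`, tokens one or two edges apart are paired, while tokens 0 and 3 (three edges apart) are not paired by the graph rule. The union in `selected` could then be turned into an attention mask in whatever form the model's self-attention layer expects.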