Translating large source code using sparse self-attention

US12093671B2 · US · B2

Patent metadata
Publication number: US-12093671-B2
Application number: US-202217731593-A
Country: US
Kind code: B2
Filing date: Apr 28, 2022
Priority date: Apr 28, 2022
Publication date: Sep 17, 2024
Grant date: Sep 17, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are described herein for translating source code using sparse self-attention. In various implementations, a source code snippet in a first programming language may be processed to obtain graph(s) representing snippet tokens, and relationships therebetween. Based on the graph(s), a subset of snippet token pairs may be identified from a superset of all possible token pairs in the source code snippet. Each token pair of the subset may include snippet tokens that are represented by nodes connected by one or more edges of the one or more graphs. A self-attention network of a translation machine learning model may be adapted to sparsely attend across the identified subset of token pairs. The source code snippet may then be processed based on the adapted translation machine learning model to generate a translation of the source code snippet in a second programming language.
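
To make the idea in the abstract concrete, the following is a minimal sketch of attention restricted to graph-connected token pairs. It is not the patented implementation: the function name `sparse_attention_weights`, the use of integer token indices, the symmetric treatment of each pair, and the choice to always let a token attend to itself are all assumptions made for illustration.

```python
import math


def sparse_attention_weights(scores, allowed_pairs):
    """Row-wise softmax computed only over allowed (query, key) pairs.

    scores: square list of lists holding raw attention scores.
    allowed_pairs: set of (i, j) token-index pairs taken from graph edges.
    Assumptions: pairs are symmetric, and each token may attend to itself.
    """
    n = len(scores)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Keys this query may attend to; all other positions keep weight 0.
        keys = [
            j for j in range(n)
            if j == i or (i, j) in allowed_pairs or (j, i) in allowed_pairs
        ]
        peak = max(scores[i][j] for j in keys)  # subtract for numerical stability
        exps = {j: math.exp(scores[i][j] - peak) for j in keys}
        total = sum(exps.values())
        for j, value in exps.items():
            weights[i][j] = value / total
    return weights


# Hypothetical snippet tokens: ["x", "=", "y", "+", "1"].
# Hypothetical graph edges: x-y and y-1.
edges = {(0, 2), (2, 4)}
scores = [[0.0] * 5 for _ in range(5)]  # uniform scores, for illustration only
weights = sparse_attention_weights(scores, edges)
```

With uniform scores, token 0 splits its attention evenly between itself and token 2, while token 1, which has no graph neighbours in this toy example, attends only to itself. A full attention layer would compute all 25 pairs for these five tokens; the sparse variant computes only the pairs the graph supports.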

First claim

Opening claim text (preview).

What is claimed is:

1. A method for translating a source code snippet from a first programming language to a second programming language, the method implemented by one or more processors and comprising: obtaining one or more graphs representing snippet tokens, and relationships between the snippet tokens, contained in the source code snippet written in the first programming language; based on the one or more graphs, selecting, from a superset of all possible pairs of the snippet tokens in the source code snippet, a subset of snippet token pairs that comprises less than the superset of all possible pairs, wherein each token pair of the selected subset includes snippet tokens that are represented by nodes connected by fewer than a threshold number of edges of the one or more graphs; adapting a self-attention network of a large language model to sparsely attend across token pairs of the selected subset of token pairs; and processing the source code snippet based on the adapted large language model to translate the source code snippet from the first programming language to the second programming language.

2. The method of claim 1, wherein the one or more graphs include at least one of a data flow graph (DFG), a control flow graph (CFG), or an abstract syntax tree (AST) representing the source code snippet.

3. The method of claim 1, wherein the edges of one or more of the graphs represent dependencies between the snippet tokens.

4. The method of claim 1, wherein the source code snippet comprises a function, and the method further includes processing an entire source code file that contains the function to identify global tokens defined in a portion of the source code file outside of the function, wherein the self-attention network is adapted based at least in part on the global tokens.

5. The method of claim 4, further comprising adapting the self-attention network of the large language model to attend between each of the global tokens and all other tokens of the source code file.

6. The method of claim 4, further comprising: analyzing the one or more graphs to identify, as inter-function token pairs, tokens from different functions that are connected by one or more edges of one or more of the graphs; and adapting the self-attention network to further sparsely attend across the inter-function token pairs.

7. The method of claim 6, wherein one or more of the inter-function pairs comprises a function definition and a function call.

8. The method of claim 4, further comprising: identifying dependencies between one or more other functions of the source code file and the function defined in the source code snippet; and adapting the self-attention network of the large language model to further sparsely attend based on the identified dependencies.

9. The method of claim 1, further comprising adapting the self-attention network to attend across other randomly-selected token pairs of the superset.

10. A system for translating a source code snippet from a first programming language to a second programming language, the system comprising one or more processors and memory storing instructions that, in response to execution of the instructions, cause the one or more processors to: obtain one or more graphs representing snippet tokens, and relationships between the snippet tokens, contained in the source code snippet written in the first programming language; based on the one or more graphs, select, from a superset of all possible pairs of the snippet tokens in the source code snippet, a subset of snippet token pairs that comprises less than the superset of all possible pairs, wherein each token pair of the selected subset includes snippet tokens that are represented by nodes connected by fewer than a threshold number of edges of the one or more graphs; adapt a self-attention network of a large language model to sparsely attend across token pairs of the selected subset of token pairs; and process the source code snippet based on the adapted large language model to translate the source code snippet from the first programming language to the second programming language.

11. The system of claim 10, wherein the one or more graphs include at least one of a data flow graph (DFG), a control flow graph (CFG), or an abstract syntax tree (AST) representing the source code snippet.

12. The system of claim 10, wherein the edges of one or more of the graphs represent dependencies between the tokens.

13. The system of claim 10, wherein the source code snippet comprises a function, and the instructions further includes instructions to process an entire source code file that contains the function to identify global tokens defined in a portion of the source code file outside of the function, wherein the self-attention network is adapted based at least in part on the global tokens.

14. The system of claim 13, further comprising instructions to adapt the self-attention network of the large language model to attend between each of the global tokens and all other tokens of the source code file.

15. The system of claim 13, further comprising instructions to: analyze the one or more graphs to identify, as inter-function token pairs, tokens from different functions that are connected by one or more edges of one or more of the graphs; and adapt the self-attention network to further sparsely attend across the inter-function token pairs.

16. The system of claim 15, wherein one or more of the inter-function pairs comprises a function definition and a function call.

17. The system of claim 13, further comprising instructions to: identify dependencies between one or more other functions of the source code file and the function defined in the source code snippet; and adapt the self-attention network of the large language model to further sparsely attend based on the identified dependencies.

18. The system of claim 10, further comprising instructions to adapt the self-attention network to attend across other randomly-selected pairs of the superset.

19. A non-transitory computer-readable medium for translating a source code snippet from a first programming language to a second programming language, the medium comprising instructions that, in response to execution of the instructions by a processor, cause the processor to: obtain one or more graphs representing snippet tokens, and relationships between the snippet tokens, contained in the source code snippet written in the first programming language; based on the one or more graphs, select, from a superset of all possible pairs of the snippet tokens in the source code snippet, a subset of snippet token pairs that comprises less than the superset of all possible pairs, wherein each token pair of the selected subset includes snippet tokens that are represented by nodes connected by fewer than a threshold number of edges of the one or more graphs; adapt a self-attention network of a large language model to sparsely attend across token pairs of the selected subset of token pairs; and process the source code snippet based on the adapted large language model to translate the source code snippet from the first programming language to the second programming language.

20. The non-transitory computer-readable medium of claim 19, wherein the one or more graphs include at least one of a data flow graph (DFG), a control flow graph (CFG), or an abstract syntax tree (AST) representing the source code snippet.
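
The pair-selection steps recited in the claims (pairs connected by fewer than a threshold number of edges, global tokens that attend to all other tokens, and additional randomly selected pairs) can be sketched as follows. This is an illustrative reading under stated assumptions, not the claimed implementation: the helper names `pairs_within`, `add_global_pairs`, and `add_random_pairs` are hypothetical, tokens are represented by integer indices, and the graph is assumed to be an undirected adjacency mapping that lists every token as a key.

```python
import random
from collections import deque
from itertools import combinations


def pairs_within(adjacency, threshold):
    """Select pairs whose nodes are connected by fewer than `threshold` edges."""
    selected = set()
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search, cut off below the threshold
            node = queue.popleft()
            if dist[node] + 1 >= threshold:
                continue  # neighbours would be `threshold` or more edges away
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        selected.update((min(start, n), max(start, n)) for n in dist if n != start)
    return selected


def add_global_pairs(selected, global_tokens, num_tokens):
    """Pair each global token with every other token (compare claims 5 and 14)."""
    extra = {
        (min(g, t), max(g, t))
        for g in global_tokens
        for t in range(num_tokens)
        if t != g
    }
    return selected | extra


def add_random_pairs(selected, num_tokens, count, seed=0):
    """Add up to `count` not-yet-selected pairs at random (compare claims 9 and 18)."""
    rng = random.Random(seed)
    remaining = sorted(set(combinations(range(num_tokens), 2)) - selected)
    return selected | set(rng.sample(remaining, min(count, len(remaining))))


# Hypothetical five-token chain graph: 0 - 1 - 2 - 3 - 4.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
local_pairs = pairs_within(adjacency, threshold=3)   # at most 2 edges apart
with_globals = add_global_pairs(local_pairs, [4], 5)  # token 4 treated as global
all_pairs = add_random_pairs(with_globals, 5, count=1)
```

In this toy graph the threshold of 3 keeps 7 of the 10 possible pairs, and treating token 4 as global adds two more. For real snippets with thousands of tokens, the selected subset would typically be a small fraction of the superset, which is where the savings from sparse attention would come from.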

Assignees

Google LLC

Classifications

  • Combinations of networks · CPC title

  • G06F8/51 (Primary)

    Source to source · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12093671B2 cover?
Techniques are described herein for translating source code using sparse self-attention. In various implementations, a source code snippet in a first programming language may be processed to obtain graph(s) representing snippet tokens, and relationships therebetween. Based on the graph(s), a subset of snippet token pairs may be identified from a superset of all possible token pairs in the sourc…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F8/51. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 17, 2024 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).