Conditioning autoregressive language model to improve code migration

US11656867B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11656867-B2
Application number: US-202217945376-A
Country: US
Kind code: B2
Filing date: Sep 15, 2022
Priority date: Dec 29, 2020
Publication date: May 23, 2023
Grant date: May 23, 2023

Abstract

Official abstract text for this publication.

Implementations are described herein for using machine learning to perform various tasks related to migrating source code based on relatively few (“few shots”) demonstrations. In various implementations, an autoregressive language model may be conditioned based on demonstration tuple(s). In some implementations, a demonstration tuple may include a pre-migration version of a first source code snippet and a post-migration version of the first source code snippet. In other implementations, demonstration tuples may include other data, such as intermediate forms (e.g., natural language descriptions or pseudocode), input-output pairs demonstrating intended behavior, etc. The autoregressive language model may be trained on corpora of source code and natural language documentation on the subject of computer programming. A pre-migration version of a source code file may be processed based on the conditioned autoregressive language model, and a post-migration version may be generated based on output generated based on the conditioned autoregressive model.
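
To make the few-shot setup concrete, the following is a minimal Python sketch of one way demonstration tuples could be laid out as text ahead of the file to be migrated, so that an autoregressive model would continue with the post-migration version. The abstract does not specify any prompt format: the `DemonstrationTuple` class, the `build_conditioning_prompt` function, and the `### BEFORE` / `### AFTER` delimiters are all illustrative assumptions, and no model is called here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DemonstrationTuple:
    """One few-shot example: a snippet before and after the migration."""
    pre_migration: str
    post_migration: str


def build_conditioning_prompt(
    demos: Sequence[DemonstrationTuple],
    source_file: str,
    before_tag: str = "### BEFORE",  # hypothetical delimiter, not from the patent
    after_tag: str = "### AFTER",    # hypothetical delimiter, not from the patent
) -> str:
    """Concatenate the demonstrations, then the file to migrate.

    The final AFTER slot is left empty so that an autoregressive model
    would fill it in with the post-migration version.
    """
    parts = []
    for demo in demos:
        parts.append(
            f"{before_tag}\n{demo.pre_migration}\n"
            f"{after_tag}\n{demo.post_migration}\n"
        )
    parts.append(f"{before_tag}\n{source_file}\n{after_tag}\n")
    return "\n".join(parts)


# Two demonstrations of a Python 2 to Python 3 print migration.
demos = [
    DemonstrationTuple("print 'hello'", "print('hello')"),
    DemonstrationTuple("print x, y", "print(x, y)"),
]
prompt = build_conditioning_prompt(demos, "print 'migrate me'")
```

The resulting `prompt` string would be passed to whatever language model is in use. The claims describe conditioning in terms of intermediate embeddings rather than raw text, so this text-level layout should be read only as an approximation of the idea.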

First claim

Opening claim text (preview).

What is claimed is:

1. A method implemented using one or more processors, comprising: conditioning a language model based on one or more demonstration tuples, wherein one or more of the demonstration tuples includes a first version of a first source code snippet that exists prior to a planned migration and a second version of the first source code snippet that is desired after the planned migration, and wherein the language model is trained on one or more corpuses of source code and one or more corpuses of natural language documentation on the subject of computer programming, and wherein the conditioning includes processing one or more of the demonstration tuples to generate one or more intermediate embeddings; processing a pre-migration version of a source code file based on the conditioned language model, wherein processing the pre-migration version of the source code file includes processing one or more of the intermediate embeddings in conjunction with the pre-migration version of the source code file as inputs for the conditioned language model for one or more subsequent iterations; and based on the processing of the pre-migration version of the source code file, generating a post-migration version of the source code file.

2. The method of claim 1, wherein one or more of the demonstration tuples includes a third source code snippet, an example input for the third source code snippet, and a target output of the third source code snippet given the example input.

3. The method of claim 1, wherein: in the second version of the first source code snippet, at least a first token is transformed into a target nomenclature; in the post-migration version of the source code file, at least a second token that is different from the first token is transformed into the target nomenclature.

4. The method of claim 3, wherein the target nomenclature captures a desired coding style used by an entity.

5. The method of claim 3, wherein the target nomenclature captures a desired coding style espoused by computer programming educational literature that is included in one or more of the corpuses of natural language documentation about computer programming.

6. The method of claim 1, comprising receiving one or more of the demonstration tuples via textual input provided by a user.

7. The method of claim 1, comprising selecting one or more of the demonstration tuples from a library of existing demonstration tuples based on user input.

8. The method of claim 7, wherein the user input comprises a free-form natural language input spoken or typed by a user, and the selecting is based on semantic similarity between the free-form natural language input and the selected one or more of the demonstration tuples.

9. The method of claim 1, wherein one or more of the demonstration tuples includes a natural language snippet that describes the first source code snippet, and wherein the method includes, based on the processing, generating another natural language snippet that describes the source code file.

10. The method of claim 1, further comprising: performing a semantic comparison of the pre-migration source code file and the post-migration source code file; and based on the semantic comparison: selecting another post-migration version of the source code file from a distribution generated by the language model; or performing supervised training on the language model based on the pre-migration and post-migration versions of the source code file.

11. A method implemented using one or more processors, comprising: conditioning a language model based on one or more demonstration tuples, wherein one or more of the demonstration tuples includes a first version of a first source code snippet that exists prior to a planned migration, an example input for the first source code snippet, and a target output of the first source code snippet given the example input, wherein the language model is trained exclusively on one or more corpuses of source code and one or more corpuses of natural language documentation on the subject of computer programming, and wherein the conditioning includes processing one or more of the demonstration tuples to generate one or more intermediate embeddings; processing a pre-migration version of a source code file based on the conditioned language model, wherein processing the pre-migration version of the source code file includes processing one or more of the intermediate embeddings in conjunction with the pre-migration version of the source code file as inputs for the conditioned language model for one or more subsequent iterations; and based on the processing of the pre-migration version of the source code file, generating a post-migration version of the source code file.

12. A system comprising one or more processors and memory storing instructions that, in response to execution of the instructions by the one or more processors, cause the one or more processors to: condition a language model based on one or more demonstration tuples, wherein one or more of the demonstration tuples includes a first version of a first source code snippet that exists prior to a planned migration and a second version of the first source code snippet that is desired after the planned migration, wherein the language model is trained on one or more corpuses of source code and one or more corpuses of natural language documentation on the subject of computer programming, and wherein the instructions to condition include instructions to process one or more of the demonstration tuples to generate one or more intermediate embeddings; process a pre-migration version of a source code file based on the conditioned language model, wherein the instructions to process the pre-migration version of the source code file include instructions to process the pre-migration version of the source code file in conjunction with one or more of the intermediate embeddings as inputs for the conditioned language model for one or more subsequent iterations; and based on output generated based on processing the pre-migration version of the source code file using the language model, generate a post-migration version of the source code file.

13. The system of claim 12, wherein one or more of the demonstration tuples includes a third source code snippet, an example input for the third source code snippet, and a target output of the third source code snippet given the example input.

14. The system of claim 12, wherein: in the second version of the first source code snippet, at least a first token is transformed into a target nomenclature; in the post-migration version of the source code file, at least a second token that is different from the first token is transformed into the target nomenclature.

15. The system of claim 14, wherein the target nomenclature captures a desired coding style used by an entity.

16. The system of claim 14, wherein the target nomenclature captures a desired coding style espoused by computer programming educational literature that is included in one or more of the corpuses of natural language documentation about computer programming.

17. The system of claim 12, comprising instructions to receive one or more of the demonstration tuples via textual input provided by a user.

18. The system of claim 12, comprising instructions to select one or more of the demonstration tuples from a library of existing demonstration tuples based on user input.

19. The system of claim 18, wherein the user input comprises a free-form natural language input
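
The selection step recited in claims 7, 8, 18, and 19 (choosing demonstration tuples from a library based on a free-form natural language request) can be sketched as a nearest-neighbor lookup. The claims do not say how semantic similarity is computed. This sketch substitutes a toy bag-of-words cosine similarity for a learned embedding, and the `LIBRARY` contents, `embed`, `cosine`, and `select_demonstrations` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a learned text embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical library: description -> (pre-migration, post-migration) snippet.
LIBRARY = {
    "convert python 2 print statements to python 3 print calls": (
        "print 'hi'",
        "print('hi')",
    ),
    "rename camelCase variables to snake_case": (
        "userName = 1",
        "user_name = 1",
    ),
    "replace xrange with range": (
        "for i in xrange(3): pass",
        "for i in range(3): pass",
    ),
}


def select_demonstrations(request: str, top_k: int = 1) -> List[Tuple[str, str]]:
    """Return the top_k library tuples whose descriptions best match the request."""
    query = embed(request)
    ranked = sorted(LIBRARY, key=lambda d: cosine(query, embed(d)), reverse=True)
    return [LIBRARY[d] for d in ranked[:top_k]]


selected = select_demonstrations("please rename my variables to snake_case")
```

Here `selected` holds the camelCase-to-snake_case tuple, which could then be used to condition the model. A production system would presumably replace `embed` with a trained encoder so that paraphrases with no shared words still match.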

Assignees

Google LLC

Inventors

Classifications

  • Auto-encoder networks; Encoder-decoder networks

  • Generative networks

  • Supervised learning

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • G06F40/20 (Primary): Natural language analysis (semantic analysis of natural language G06F40/30)

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11656867B2 cover?
Implementations are described herein for using machine learning to perform various tasks related to migrating source code based on relatively few (“few shots”) demonstrations. In various implementations, an autoregressive language model may be conditioned based on demonstration tuple(s). In some implementations, a demonstration tuple may include a pre-migration version of a first source code sn…
Who is the assignee on this patent?
Google LLC
What technology area does this patent fall under?
Primary CPC classification G06F40/20. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 23, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).