Code generation through reinforcement learning using code-quality rewards

US2024192927A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2024192927-A1
Application number  US-202418582248-A
Country             US
Kind code           A1
Filing date         Feb 20, 2024
Priority date       Dec 17, 2021
Publication date    Jun 13, 2024
Grant date          (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A deep learning model trained to learn to predict source code is tuned for a target source code generation task through reinforcement learning using a reward score that considers the quality of the source code predicted during the tuning process. The reward score is adjusted to consider code-quality factors and source code metrics. The code-quality factors account for the predicted source code having syntactic correctness, successful compilation, successful execution, successful invocation, readability, functional correctness, and coverage. The source code metrics generate a score based on how close the predicted source code is to a ground truth code.
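The code-quality factors and the ground-truth similarity metric named in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the equal 0.25 weighting per check, the use of Python's own parser and compiler as the checks, and the clipped unigram precision, which is only a simple stand-in in the spirit of BLEU and not BLEU or ROUGE themselves. The patent text shown on this page does not specify these details.

```python
import ast
from collections import Counter


def code_quality_reward(snippet: str, entry_point: str, args: tuple = ()) -> float:
    """Score a predicted snippet on four staged checks.

    Each passed check adds 0.25 (an assumed weighting, not from the patent).
    Later checks run only if the earlier ones pass.
    """
    score = 0.0
    # 1. Syntactic correctness: does the snippet parse?
    try:
        tree = ast.parse(snippet)
    except (SyntaxError, ValueError):
        return score
    score += 0.25
    # 2. Successful compilation to a code object.
    try:
        code = compile(tree, "<predicted>", "exec")
    except (SyntaxError, ValueError):
        return score
    score += 0.25
    # 3. Successful execution of the module body.
    #    NOTE: exec() is not a sandbox; real systems would isolate this step.
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return score
    score += 0.25
    # 4. Successful invocation of a hypothetical entry-point function.
    fn = namespace.get(entry_point)
    if not callable(fn):
        return score
    try:
        fn(*args)
    except Exception:
        return score
    return score + 0.25


def unigram_precision(predicted: str, reference: str) -> float:
    """Clipped unigram precision against a ground-truth snippet.

    A simplified stand-in for a BLEU/ROUGE-style metric (assumption).
    """
    pred_tokens = predicted.split()
    if not pred_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(reference.split())).values())
    return overlap / len(pred_tokens)


def total_reward(snippet: str, reference: str, entry_point: str, args: tuple = ()) -> float:
    """Combine the quality checks with the similarity metric (assumed: plain sum)."""
    return code_quality_reward(snippet, entry_point, args) + unigram_precision(snippet, reference)
```

For instance, calling `code_quality_reward` on a snippet that parses, compiles, and runs but whose entry point raises when called stops at the third check, so the score reflects how far the prediction got rather than a single pass/fail bit.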

First claim

Opening claim text (preview).

What is claimed:

1. A system comprising: a processor; and a memory that stores a program configured to be executed by the processor, the program comprising instructions to perform actions that: access a first deep learning model previously trained to generate source code for a first source code task, wherein the first deep learning model comprises parameters learned through cross-entropy loss; tune the parameters of the first deep learning model to train a second deep learning model to learn to generate source code for a second source code task, wherein tune the parameters of the first deep learning model to train the second deep learning model comprises instructions to perform actions that: input a training sample to the first deep learning model and to the second deep learning model, wherein the first deep learning model predicts a first predicted source code snippet over T timesteps, wherein the second deep learning model predicts a second predicted source code snippet over T timesteps; compute a code-quality reward for the second predicted source code snippet, wherein the code-quality reward is based on syntax correctness of the second predicted source code snippet, successful execution of the second predicted source code snippet, successful compilation of the second predicted source code snippet, and successful invocation of the second predicted source code snippet; compute a reward for the second predicted source code snippet at each timestep t based on a divergence between an output distribution from the first deep learning model at each time step t and an output distribution from the second deep learning model at each time step t; add the code-quality reward to the reward of the last timestep; compute a policy loss based on the rewards of each timestep t; and backpropagate the policy loss to the second deep learning model to adjust the parameters of the second deep learning model; and deploy the second deep learning model in an inference system to generate source code for the second source code task.

2. The system of claim 1, wherein the code-quality reward further comprises a metric score based on a similarity between the second predicted source code snippet and a ground truth source code snippet associated with the training sample.

3. The system of claim 2, wherein the metric score is based on a Bilingual Evaluation Understudy (BLEU) score and/or a Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score.

4. The system of claim 1, wherein the code-quality reward further comprises a score for functional correctness and readability of the second predicted source code snippet.

5. The system of claim 1, wherein compute the policy loss based on the rewards of each timestep t further comprises instructions to perform actions that: compute a generalized advantage estimation at each timestep t based on the reward at each respective timestep t and a value function output at the respective timestep t from the second deep learning model.

6. The system of claim 5, wherein the program comprises further instructions to perform actions that: compute a state-value function at each timestep t based on the generalized advantage estimation at each respective timestep t and the value function output at each respective timestep t.

7. The system of claim 6, wherein the program comprises further instructions to perform actions that: apply a clipped surrogate objective function to the generalized advantage estimation at each timestep t; and compute a value estimate error loss for each value function output at each timestep t.

8. The system of claim 6, wherein the program comprises further instructions to perform actions that: compute the policy loss as a sum of the clipped surrogate objective function over all T timesteps and the value estimate error loss over all T timesteps.

9. The system of claim 1, wherein the first deep learning model is a neural transformer model with attention.

10. The system of claim 1, wherein the first deep learning model is a decoder-only neural transformer model with attention.

11. A computer-implemented method, comprising: selecting a first deep learning model trained to generate source code for a first source code generation task, wherein parameters of the first deep learning model are determined from a cross-entropy loss; generating a second deep learning model having parameters of the first deep learning model; updating the parameters of the second deep learning model for the second deep learning model to learn to generate source code for a second code generation task, wherein updating the parameters of the second deep learning model further comprises: applying a training sample to the first deep learning model for the first deep learning model to predict a first source code snippet over T timesteps; applying the training sample to the second deep learning model for the second deep learning model to predict a second source code snippet over T timesteps; generating a code reward score for the second source code snippet based on syntax correctness of the second predicted source code snippet, successful execution of the second predicted source code snippet, successful compilation of the second predicted source code snippet, and/or successful invocation of the second predicted source code snippet; determining a reward at each timestep t of the T timesteps, wherein at each timestep t of the T timesteps, the second deep learning model predicts a token of the second predicted source code snippet; augmenting the reward at a last timestep with the code reward score; computing a policy loss based on the rewards for each timestep t of the T timesteps; and backpropagating the policy loss to the second deep learning model to adjust the parameters of the second deep learning model; and deploying the second deep learning model to perform the second code generation task.

12. The computer-implemented method of claim 11, comprising: computing a similarity score for the second predicted source code snippet with respect to a ground truth code snippet of the training sample; and incorporating the similarity score into the code reward score.

13. The computer-implemented method of claim 11, wherein the similarity score is based on a Bilingual Evaluation Understudy (BLEU) metric or a Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric.

14. The computer-implemented method of claim 11, further comprising: computing a generalized advantage estimation at each timestep t of the T timesteps based on a reward at each respective timestep t of the T timesteps and a value function output at each respective timestep t of the T timesteps from the second deep learning model.

15. The computer-implemented method of claim 11, further comprising: computing a state-value function at each timestep t of the T timesteps based on the generalized advantage estimation at each respective timestep t of the T timesteps and the value function output at each respective timestep t of the T timesteps.

16. The computer-implemented method of claim 12, further comprising: applying a clipped surrogate objective function to the generalized advantage estimation at each respective timestep t of the T timesteps; and computing a value estimate error loss for each value function output at each respective timestep t of the T timesteps.

17. The computer-implemented method of claim 16, further comprising: computing the policy loss as a sum of the clipped surrogate objective function over all the …
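The tuning loop recited in the claims (a divergence-based reward at each timestep, the code-quality reward added at the last timestep, generalized advantage estimation, and a clipped surrogate objective plus a value error loss) can be sketched in plain Python. This is a rough illustration under stated assumptions, not the patented implementation: the function names, the KL direction (tuned model against the frozen first model), the negative sign and `beta` scale on the divergence, and the `gamma`, `lam`, `eps`, and `vf_coef` defaults are all hypothetical choices in the style of common PPO setups.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def shaped_rewards(tuned_dists, reference_dists, quality_reward, beta=0.1):
    """Per-timestep reward from the divergence between the two models' outputs.

    Assumed form: reward_t = -beta * KL(tuned_t || reference_t), with the
    code-quality reward added to the last timestep only.
    """
    rewards = [-beta * kl_divergence(p, q) for p, q in zip(tuned_dists, reference_dists)]
    rewards[-1] += quality_reward
    return rewards


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over T timesteps (episode ends at T)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_loss(new_logps, old_logps, advantages, values, returns, eps=0.2, vf_coef=0.5):
    """Sum over timesteps of the clipped surrogate term and the value error term."""
    policy_term = 0.0
    value_term = 0.0
    for new_lp, old_lp, adv, value, ret in zip(new_logps, old_logps, advantages, values, returns):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Negated because optimizers minimize, while the surrogate is maximized.
        policy_term += -min(ratio * adv, clipped * adv)
        value_term += (value - ret) ** 2
    return policy_term + vf_coef * value_term


# Usage: two timesteps, identical output distributions, quality reward of 1.0.
dists = [[0.5, 0.5], [0.9, 0.1]]
rewards = shaped_rewards(dists, dists, quality_reward=1.0)
values = [0.0, 0.0]  # value-head outputs from the tuned model (toy numbers)
advantages = gae(rewards, values, gamma=1.0, lam=1.0)
# State-value targets: advantage plus the value output at each timestep.
returns = [a + v for a, v in zip(advantages, values)]
loss = ppo_loss([0.0, 0.0], [0.0, 0.0], advantages, values, returns)
```

Because the code-quality reward arrives only at the final timestep, the advantage estimate is what carries that signal back to earlier tokens, while the divergence term discourages the tuned model from drifting far from the first model at each step.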

Assignees

Microsoft Technology Licensing LLC

Inventors

Classifications

  • Architecture, e.g. interconnection topology

  • Validation; Performance evaluation; Active pattern learning techniques

  • Target code generation

  • Software metrics

  • Non-supervised learning, e.g. competitive learning

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2024192927A1 cover?
A deep learning model trained to learn to predict source code is tuned for a target source code generation task through reinforcement learning using a reward score that considers the quality of the source code predicted during the tuning process. The reward score is adjusted to consider code-quality factors and source code metrics. The code-quality factors account for the predicted source code …
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F8/33. Mapped technology areas include Physics.
When was this patent published?
Published Jun 13, 2024 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).