Representing source code in vector space to detect errors

US11334467B2 · US · B2

Patent metadata
Publication number: US-11334467-B2
Application number: US-201916402965-A
Country: US
Kind code: B2
Filing date: May 3, 2019
Priority date: May 3, 2019
Publication date: May 17, 2022
Grant date: May 17, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method, system and computer program product for representing source code in vector space. The source code is parsed into an abstract syntax tree, which is then traversed to produce a sequence of tokens. Token embeddings may then be constructed for a subset of the sequence of tokens, which are inputted into an encoder artificial neural network (“encoder”) for encoding the token embeddings. A decoder artificial neural network (“decoder”) is initialized with a final internal cell state of the encoder. The decoder is run the same number of steps as the encoding performed by the encoder. After running the decoder and completing the training of the decoder to learn the inputted token embeddings, the final internal cell state of the encoder is used as the code representation vector which may be used to detect errors in the source code.
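The first steps of the abstract (parse to an abstract syntax tree, traverse it into a token sequence, keep a subset of the tokens) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes Python source parsed with the standard-library `ast` module, it uses AST node type names as the tokens, and the helper names `ast_tokens` and `frequent_subset` as well as the `min_count` threshold are hypothetical. The subset rule follows the frequency filter described in claim 4.

```python
import ast
from collections import Counter


def ast_tokens(source: str) -> list[str]:
    """Parse source into an AST and emit node type names in depth-first pre-order.

    Assumption: a node's type name stands in for the "token"; the patent page
    does not say what a token looks like.
    """
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def frequent_subset(tokens: list[str], min_count: int) -> list[str]:
    """Drop tokens whose frequency is below min_count, preserving order."""
    counts = Counter(tokens)
    return [tok for tok in tokens if counts[tok] >= min_count]


# Hypothetical input used only for illustration.
SOURCE = "def add(a, b):\n    return a + b\n"

tokens = ast_tokens(SOURCE)
subset = frequent_subset(tokens, min_count=2)

print(tokens)
print(subset)
```

Pre-order traversal is only one choice here; the claims also mention a structure-based traversal, which this sketch does not attempt.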

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method for representing source code in vector space, the method comprising: parsing source code into an abstract syntax tree; traversing said abstract syntax tree to produce a sequence of tokens; constructing token embeddings for a subset of said sequence of tokens; inputting said token embeddings into an encoder artificial neural network for encoding said token embeddings; initializing a decoder artificial neural network with a final internal cell state of said encoder artificial neural network when encoding said token embeddings; running said decoder artificial neural network a same number of steps as encoding performed by said encoder artificial neural network; using said final internal cell state of said encoder artificial neural network as a code representation vector in response to completing said running of said decoder artificial neural network; and using said code representation vector to detect errors in said source code.

2. The computer-implemented method as recited in claim 1, wherein said abstract syntax tree is traversed using a depth-first traversal.

3. The computer-implemented method as recited in claim 1, wherein said abstract syntax tree is traversed using a structure-based traversal.

4. The computer-implemented method as recited in claim 1 further comprising: constructing a list of frequently occurring tokens found in said abstract syntax tree; and removing tokens from said sequence of tokens with a frequency below a frequency threshold to form said subset of said sequence of tokens.

5. The computer-implemented method as recited in claim 1, wherein said token embeddings are randomly constructed.

6. The computer-implemented method as recited in claim 1, wherein pretrained embeddings are used to construct said token embeddings.

7. The computer-implemented method as recited in claim 1 further comprising: computing a loss function based on a quality of reconstruction from running said decoder artificial neural network; updating internal parameters of said encoder artificial neural network and said decoder artificial neural network based on said computed loss function; and using said final internal cell state of said encoder artificial neural network as said code representation vector in response to completing said running of said decoder artificial neural network and in response to convergence of said updated internal parameters of said encoder artificial neural network and said decoder artificial neural network.

8. The computer-implemented method as recited in claim 1, wherein said artificial neural network is a recurrent neural network.

9. A computer program product for representing source code in vector space, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code comprising the programming instructions for: parsing source code into an abstract syntax tree; traversing said abstract syntax tree to produce a sequence of tokens; constructing token embeddings for a subset of said sequence of tokens; inputting said token embeddings into an encoder artificial neural network for encoding said token embeddings; initializing a decoder artificial neural network with a final internal cell state of said encoder artificial neural network when encoding said token embeddings; running said decoder artificial neural network a same number of steps as encoding performed by said encoder artificial neural network; using said final internal cell state of said encoder artificial neural network as a code representation vector in response to completing said running of said decoder artificial neural network; and using said code representation vector to detect errors in said source code.

10. The computer program product as recited in claim 9, wherein said abstract syntax tree is traversed using a depth-first traversal.

11. The computer program product as recited in claim 9, wherein said abstract syntax tree is traversed using a structure-based traversal.

12. The computer program product as recited in claim 9, wherein the program code further comprises the programming instructions for: constructing a list of frequently occurring tokens found in said abstract syntax tree; and removing tokens from said sequence of tokens with a frequency below a frequency threshold to form said subset of said sequence of tokens.

13. The computer program product as recited in claim 9, wherein said token embeddings are randomly constructed.

14. The computer program product as recited in claim 9, wherein pretrained embeddings are used to construct said token embeddings.

15. The computer program product as recited in claim 9, wherein the program code further comprises the programming instructions for: computing a loss function based on a quality of reconstruction from running said decoder artificial neural network; updating internal parameters of said encoder artificial neural network and said decoder artificial neural network based on said computed loss function; and using said final internal cell state of said encoder artificial neural network as said code representation vector in response to completing said running of said decoder artificial neural network and in response to convergence of said updated internal parameters of said encoder artificial neural network and said decoder artificial neural network.

16. The computer program product as recited in claim 9, wherein said artificial neural network is a recurrent neural network.

17. A system, comprising: a memory for storing a computer program for representing source code in vector space; and a processor connected to said memory, wherein said processor is configured to execute the program instructions of the computer program comprising: parsing source code into an abstract syntax tree; traversing said abstract syntax tree to produce a sequence of tokens; constructing token embeddings for a subset of said sequence of tokens; inputting said token embeddings into an encoder artificial neural network for encoding said token embeddings; initializing a decoder artificial neural network with a final internal cell state of said encoder artificial neural network when encoding said token embeddings; running said decoder artificial neural network a same number of steps as encoding performed by said encoder artificial neural network; using said final internal cell state of said encoder artificial neural network as a code representation vector in response to completing said running of said decoder artificial neural network; and using said code representation vector to detect errors in said source code.

18. The system as recited in claim 17, wherein said abstract syntax tree is traversed using a depth-first traversal.

19. The system as recited in claim 17, wherein said abstract syntax tree is traversed using a structure-based traversal.

20. The system as recited in claim 17, wherein the program instructions of the computer program further comprise: constructing a list of frequently occurring tokens found in said abstract syntax tree; and removing tokens from said sequence of tokens with a frequency below a frequency threshold to form said subset of said sequence of tokens.
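The encoder and decoder steps recited in the independent claims can be illustrated with a forward pass only. The sketch below is a toy under stated assumptions, not the claimed system: it assumes an LSTM-style recurrent cell written in plain Python, randomly constructed embeddings (as in claim 5), a decoder seeded with both the final hidden and cell state of the encoder, and a mean-squared reconstruction loss. All names (`LSTMCell`, `encode`, `decode`, `w_out`, the sizes `EMBED` and `HIDDEN`) are hypothetical, and the parameter updates of claim 7 are omitted, so the resulting vector is untrained.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix with small uniform entries."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Bias-free LSTM cell; each gate acts on [input ; previous hidden]."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.w = {gate: make_matrix(hidden_size, input_size + hidden_size, rng)
                  for gate in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation
        i = [sigmoid(v) for v in matvec(self.w["i"], z)]    # input gate
        f = [sigmoid(v) for v in matvec(self.w["f"], z)]    # forget gate
        o = [sigmoid(v) for v in matvec(self.w["o"], z)]    # output gate
        g = [math.tanh(v) for v in matvec(self.w["g"], z)]  # candidate state
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def encode(cell, embeddings):
    """Run the encoder over all embeddings; return final hidden and cell state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in embeddings:
        h, c = cell.step(x, h, c)
    return h, c


def decode(cell, w_out, h, c, steps, embed_size):
    """Run the decoder for `steps` steps, feeding each output back as input."""
    outputs = []
    x = [0.0] * embed_size  # assumed all-zero start input
    for _ in range(steps):
        h, c = cell.step(x, h, c)
        x = matvec(w_out, h)  # project hidden state back to embedding space
        outputs.append(x)
    return outputs


def reconstruction_loss(targets, outputs):
    """Mean squared error between input embeddings and decoder outputs."""
    total = sum((t - o) ** 2
                for tv, ov in zip(targets, outputs) for t, o in zip(tv, ov))
    return total / (len(targets) * len(targets[0]))


rng = random.Random(0)
EMBED, HIDDEN = 4, 6
token_sequence = ["arg", "arg", "Name", "Load", "Name", "Load"]  # illustrative
table = {tok: [rng.uniform(-1, 1) for _ in range(EMBED)]
         for tok in set(token_sequence)}
embeddings = [table[tok] for tok in token_sequence]

encoder = LSTMCell(EMBED, HIDDEN, rng)
decoder = LSTMCell(EMBED, HIDDEN, rng)
w_out = make_matrix(EMBED, HIDDEN, rng)

final_h, final_c = encode(encoder, embeddings)
reconstruction = decode(decoder, w_out, final_h, final_c,
                        steps=len(embeddings), embed_size=EMBED)
loss = reconstruction_loss(embeddings, reconstruction)
code_vector = final_c  # final internal cell state of the encoder

print(len(code_vector), round(loss, 4))
```

In a trained setting the loss would drive updates to both networks until their parameters converge, and only then would the final cell state be taken as the code representation vector. How that vector is then used to detect errors is not described on this page, so the sketch stops at producing it.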

Assignees

Inventors

Classifications

  • Combinations of networks · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11334467B2 cover?
A computer-implemented method, system and computer program product for representing source code in vector space. The source code is parsed into an abstract syntax tree, which is then traversed to produce a sequence of tokens. Token embeddings may then be constructed for a subset of the sequence of tokens, which are inputted into an encoder artificial neural network (“encoder”) for encoding the …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F11/3608. Mapped technology areas include Physics.
When was this patent published?
Published May 17, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).