Techniques for unsupervised learning embeddings on source code tokens from non-local contexts

US10901708B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-10901708-B1
Application numberUS-201816198969-A
CountryUS
Kind codeB1
Filing dateNov 23, 2018
Priority dateNov 23, 2018
Publication dateJan 26, 2021
Grant dateJan 26, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for unsupervised learning of embeddings on source code from non-local contexts are described. Code can be processed to generate an abstract syntax tree (AST) which represents syntactic paths between tokens in the code. Once the AST(s) have been generated, the paths in the AST(s) can be crawled to identify terminals (e.g., leaf nodes in the AST) and paths between terminals can be identified. The pairs of tokens identified at the ends of each path can then be used to generate a cooccurrence matrix. For example, if X number of unique terminals are identified, a matrix of size X by X can be generated to indicate a frequency at which pairs of terminals cooccur. This cooccurrence matrix can then be used as input to existing techniques for learning vector-space embeddings, such as word2vec, GloVe, Swivel, etc.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method comprising: receiving a request to perform code analysis, the request include a reference to a plurality of source code files stored in a storage service; retrieving the plurality of source code files from the storage service using the reference; generating an abstract syntax tree corresponding to the plurality of source code files, the abstract syntax tree including a plurality of paths, each path defined by a first terminal node and a second terminal node; filtering the plurality of paths based on at least one hyperparameter to determine a subset of paths; determining a number of unique terminal node values in the subset of paths; identifying pairs of terminal node values corresponding to the first terminal node and the second terminal node of each path of the subset of paths; counting a number of times each pair of terminal node values co-occur; generating a cooccurrence matrix for the abstract syntax tree using the number of unique terminal values in the subset of paths and the number of times each pair of terminal node values co-occur; and generating one or more word embeddings using the cooccurrence matrix. 2. The computer-implemented method of claim 1 , wherein filtering the plurality of paths based on at least one hyperparameter to determine a subset of paths, further comprises: identifying the subset of paths having a number of hops less than or equal to a number defined by a path hop hyperparameter. 3. The computer-implemented method of claim 1 , wherein filtering the plurality of paths based on at least one hyperparameter to determine a subset of paths, further comprises: identifying the subset of paths which do not include a node type defined by a node type hyperparameter. 4. A computer-implemented method comprising: obtaining a plurality of code files; generating an abstract syntax tree corresponding to the plurality of code files; identifying a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node; identifying pairs of terminal node values corresponding to the first terminal node and the second terminal node of each of the plurality of paths; counting a number of times each pair of terminal node values co-occur; generating a cooccurrence matrix for the abstract syntax tree using the number of times each pair of terminal node values co-occur; and generating one or more word embeddings using the cooccurrence matrix. 5. The computer-implemented method of claim 4 , wherein generating a cooccurrence matrix for the abstract syntax tree using one or more hyperparameters further comprises: dividing each terminal node value into subtokens; and identifying a number of unique subtokens, wherein the cooccurrence matrix is a square matrix having an order corresponding to the number of unique subtokens. 6. The computer-implemented method of claim 4 , wherein identifying a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, further comprises: filtering the plurality of paths to identify a subset of the plurality of paths having a number of hops less than or equal to a number defined by a path hop hyperparameter. 7. The computer-implemented method of claim 4 , wherein identifying a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, further comprises: filtering the plurality of paths to identify a subset of the plurality of paths having a starting point defined by a starting point hyperparameter. 8. The computer-implemented method of claim 4 , wherein identifying a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, further comprises: filtering the plurality of paths to identify a subset of the plurality of paths which do not include a node type defined by a node type hyperparameter. 9. The computer-implemented method of claim 4 , wherein identifying a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, further comprises: sampling a random subset of the plurality of paths using a subsampling hyperparameter. 10. The computer-implemented method of claim 4 , wherein the plurality of code files includes a plurality of source code files. 11. The computer-implemented method of claim 4 , wherein obtaining a plurality of code files further comprises: decompiling byte code to generate the plurality of code files. 12. A system comprising: a first one or more electronic devices implementing a storage service; and a second one or more electronic devices implementing a code embedding training service, the code embedding training service including instructions that upon execution cause the code embedding training service to: obtain a plurality of code files; generate an abstract syntax tree corresponding to the plurality of code files; identify a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node; identify pairs of terminal node values corresponding to the first terminal node and the second terminal node of each of the plurality of paths; count a number of times each pair of terminal node values co-occur; generate a cooccurrence matrix for the abstract syntax tree using the number of times each pair of terminal node values co-occur; and generate one or more word embeddings using the cooccurrence matrix. 13. The system of claim 12 , wherein to generate a cooccurrence matrix for the abstract syntax tree using one or more hyperparameters, the instructions when executed further cause the code embedding training service to: dividing each terminal node value into subtokens; and identify a number of unique subtokens, wherein the cooccurrence matrix is a square matrix having an order corresponding to the number of unique subtokens. 14. The system of claim 12 , wherein to identify a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, the instructions when executed further cause the code embedding training service to: filter the plurality of paths to identify a subset of the plurality of paths having a number of hops less than or equal to a number defined by a path hop hyperparameter. 15. The system of claim 12 , wherein to identify a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, the instructions when executed further cause the code embedding training service to: filter the plurality of paths to identify a subset of the plurality of paths having a starting point defined by a starting point hyperparameter. 16. The system of claim 12 , wherein to identify a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, the instructions when executed further cause the code embedding training service to: filter the plurality of paths to identify a subset of the plurality of paths which do not include a node type defined by a node type hyperparameter. 17. The system of claim 12 , wherein to identify a plurality of paths in the abstract syntax tree, each path having a first terminal node and a second terminal node, the instructions when executed further cause the code embedding training service to: sample a random subset of the plurality of paths using a subsampling hyperparameter. 18. The system of claim 12 , wherein the

Assignees

Inventors

Classifications

  • Knowledge engineering; Knowledge acquisition · CPC title

  • Machine learning · CPC title

  • Parsing · CPC title

  • G06F8/42Primary

    Syntactic analysis · CPC title

  • Programming languages or programming paradigms · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10901708B1 cover?
Techniques for unsupervised learning of embeddings on source code from non-local contexts are described. Code can be processed to generate an abstract syntax tree (AST) which represents syntactic paths between tokens in the code. Once the AST(s) have been generated, the paths in the AST(s) can be crawled to identify terminals (e.g., leaf nodes in the AST) and paths between terminals can be iden…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F8/42. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 26 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).