Context-based word embedding for programming artifacts

US11422798B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11422798-B2
Application number  US-202016801218-A
Country             US
Kind code           B2
Filing date         Feb 26, 2020
Priority date       Feb 26, 2020
Publication date    Aug 23, 2022
Grant date          Aug 23, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for context-based word embedding for programming artifacts are described herein. An aspect includes determining a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts including source code corresponding to a software project. Another aspect includes determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets includes a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts. Another aspect includes constructing a word embedding matrix based on the plurality of context/keyword pair sets.
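
The abstract's notion of a context/keyword pair set can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the patented method: the toy statements, the keyword list, the substring matching, and the function names `context_keyword_pairs` and `cooccurrence_matrix` are all hypothetical, and a plain co-occurrence count matrix stands in for the trained word embedding matrix, which the abstract does not specify in detail.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: (context type, statement text) pairs.
STATEMENTS = [
    ("assignment", "customer_balance = account_total - pending_fee"),
    ("condition", "if customer_balance < account_limit"),
    ("call", "update_account(customer_id, account_total)"),
]
# Hypothetical keyword list; in the abstract, keywords are derived from the corpus.
KEYWORDS = ["customer", "balance", "account", "total", "limit"]


def context_keyword_pairs(statements, keywords):
    """Count (first keyword, second keyword, context type) co-occurrences.

    Assumption: a keyword "occurs" in a statement if it appears as a substring.
    """
    counts = Counter()
    for context_type, text in statements:
        present = sorted(k for k in keywords if k in text)
        for first, second in combinations(present, 2):
            counts[(first, second, context_type)] += 1
    return counts


def cooccurrence_matrix(pair_counts, keywords):
    """Build a symmetric keyword-by-keyword count matrix, one row per keyword.

    This count matrix is a simple stand-in for a learned embedding matrix.
    """
    index = {k: i for i, k in enumerate(keywords)}
    matrix = [[0] * len(keywords) for _ in keywords]
    for (first, second, _context), n in pair_counts.items():
        i, j = index[first], index[second]
        matrix[i][j] += n
        matrix[j][i] += n
    return matrix


pairs = context_keyword_pairs(STATEMENTS, KEYWORDS)
matrix = cooccurrence_matrix(pairs, KEYWORDS)
```

Each key of `pairs` mirrors the triple named in the abstract (first keyword, second keyword, context type), so the same two keywords are tracked separately per context type before being aggregated into rows of `matrix`.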

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: determining, by a processor, a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts comprising source code corresponding to a software project; determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets comprises a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts; and constructing a word embedding matrix based on the plurality of context/keyword pair sets, wherein constructing the word embedding matrix based on the plurality of context/keyword pair sets comprises: training a latent embedding matrix based on the plurality of context/keyword pair sets; creating a manifest feature vector based on the plurality of context/keyword pair sets and the corpus of programming artifacts; and combining the latent embedding matrix with the manifest feature vector corresponding to the corpus of programming artifacts to construct the word embedding matrix, wherein the combining includes ranking the plurality of context/keyword pair sets based on the manifest feature vector; and wherein the word embedding matrix is used to train a recurrent neural network (RNN) to process source code.

2. The method of claim 1, wherein determining the plurality of keywords comprises: determining a naming convention of the corpus of programming artifacts; determining a plurality of tokens based on the determined naming convention; constructing a manifest feature vector based on the plurality of tokens and the corpus of programming artifacts; ranking the plurality of tokens based on the manifest feature vector; and selecting a subset of the plurality of tokens as keywords based on the manifest feature vector.

3. The method of claim 2, wherein the naming convention comprises one of camel case, kebab case, and snake case.

4. The method of claim 1, wherein the context type corresponds to a type of a statement in the source code of the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the statement.

5. The method of claim 1, wherein the context type corresponds to a business rule corresponding to the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the business rule.

6. The method of claim 1, wherein the context type corresponds to a common prefix or suffix of the first keyword and the second keyword in the corpus of programming artifacts.

7. A system comprising: a memory having computer readable instructions; and one or more processors for executing the computer readable instructions, the computer readable instructions controlling the one or more processors to perform operations comprising: determining a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts comprising source code corresponding to a software project; determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets comprises a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts; and constructing a word embedding matrix based on the plurality of context/keyword pair sets, wherein constructing the word embedding matrix based on the plurality of context/keyword pair sets comprises: training a latent embedding matrix based on the plurality of context/keyword pair sets; and creating a manifest feature vector based on the plurality of context/keyword pair sets and the corpus of programming artifacts; and combining the latent embedding matrix with the manifest feature vector corresponding to the corpus of programming artifacts to construct the word embedding matrix, wherein the combining includes ranking the plurality of context/keyword pair sets based on the manifest feature vector; and wherein the word embedding matrix is used to train a recurrent neural network (RNN) to process source code.

8. The system of claim 7, wherein determining the plurality of keywords comprises: determining a naming convention of the corpus of programming artifacts; determining a plurality of tokens based on the determined naming convention; constructing a manifest feature vector based on the plurality of tokens and the corpus of programming artifacts; ranking the plurality of tokens based on the manifest feature vector; and selecting a subset of the plurality of tokens as keywords based on the manifest feature vector.

9. The system of claim 8, wherein the naming convention comprises one of camel case, kebab case, and snake case.

10. The system of claim 7, wherein the context type corresponds to a type of a statement in the source code of the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the statement.

11. The system of claim 7, wherein the context type corresponds to a business rule corresponding to the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the business rule.

12. The system of claim 7, wherein the context type corresponds to a common prefix or suffix of the first keyword and the second keyword in the corpus of programming artifacts.

13. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by one or more processors to cause the one or more processors to perform operations comprising: determining a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts comprising source code corresponding to a software project; determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets comprises a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts; and constructing a word embedding matrix based on the plurality of context/keyword pair sets, wherein constructing the word embedding matrix based on the plurality of context/keyword pair sets comprises: training a latent embedding matrix based on the plurality of context/keyword pair sets; and creating a manifest feature vector based on the plurality of context/keyword pair sets and the corpus of programming artifacts; and combining the latent embedding matrix with the manifest feature vector corresponding to the corpus of programming artifacts to construct the word embedding matrix, wherein the combining includes ranking the plurality of context/keyword pair sets based on the manifest feature vector; and wherein the word embedding matrix is used to train a recurrent neural network (RNN) to process source code.

14. The computer program product of claim 13, wherein determining the plurality of keywords comprises: determining a naming convention of the corpus of programming artifacts; determining a plurality of tokens based on the determined naming convention; constructing a manifest feature vector based on the plurality of tokens and the corpus of programming a
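
The keyword-selection steps recited in claims 2 and 3 (split identifiers according to a naming convention, rank the resulting tokens, keep a subset) can be sketched roughly as follows. This is a minimal sketch under assumptions, not the claimed implementation: the convention is guessed per identifier rather than for the whole corpus, raw token frequency stands in for the manifest feature vector (whose contents the claims do not spell out), and the names `detect_convention`, `split_identifier`, `select_keywords`, and `IDENTIFIERS` are hypothetical.

```python
import re
from collections import Counter


def detect_convention(identifier):
    """Guess the naming convention of one identifier (simplified heuristic)."""
    if "_" in identifier:
        return "snake"
    if "-" in identifier:
        return "kebab"
    return "camel"


def split_identifier(identifier):
    """Split an identifier into lowercase tokens based on its convention."""
    convention = detect_convention(identifier)
    if convention == "snake":
        parts = identifier.split("_")
    elif convention == "kebab":
        parts = identifier.split("-")
    else:
        # Insert a space before each uppercase letter that follows a
        # lowercase letter or digit, then split on whitespace.
        parts = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", identifier).split()
    return [p.lower() for p in parts if p]


def select_keywords(identifiers, top_k):
    """Rank tokens by frequency (assumed ranking feature) and keep the top_k."""
    counts = Counter(t for ident in identifiers for t in split_identifier(ident))
    return [token for token, _ in counts.most_common(top_k)]


# Hypothetical identifiers, mixing the three conventions named in claim 3.
IDENTIFIERS = [
    "customerBalance",
    "customer_id",
    "account-total",
    "accountLimit",
    "customer_name",
]
keywords = select_keywords(IDENTIFIERS, top_k=2)
```

In this toy input, `customer` appears in three identifiers and `account` in two, so those two tokens are the ones selected when `top_k=2`.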

Classifications

  • Combinations of networks

  • Recurrent networks, e.g. Hopfield networks

  • characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]

  • Supervised learning

  • G06F8/73 (Primary): Program documentation

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11422798B2 cover?
Techniques for context-based word embedding for programming artifacts are described herein. An aspect includes determining a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts including source code corresponding to a software project. Another aspect includes determining a plurality of context/keyword pair sets based on the plurality of keywords…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F8/73. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 23, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).