Neural network system for text classification
US-2021034707-A1 · Feb 4, 2021 · US
US11422798B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11422798-B2 |
| Application number | US-202016801218-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 26, 2020 |
| Priority date | Feb 26, 2020 |
| Publication date | Aug 23, 2022 |
| Grant date | Aug 23, 2022 |
Official abstract text for this publication.
Techniques for context-based word embedding for programming artifacts are described herein. An aspect includes determining a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts including source code corresponding to a software project. Another aspect includes determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets includes a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts. Another aspect includes constructing a word embedding matrix based on the plurality of context/keyword pair sets.
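The first two steps of the abstract (choosing keywords, then recording which keyword pairs co-occur under which context type) can be illustrated with a small sketch. This is not the patented implementation: the toy `CORPUS`, the context-type labels `assignment` and `condition`, the `min_count` frequency threshold, and all function names are assumptions introduced for illustration only.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: (context_type, statement_text) pairs.
CORPUS = [
    ("assignment", "customer_balance = account_balance - order_total"),
    ("condition", "if order_total > credit_limit:"),
    ("assignment", "credit_limit = customer_balance * 2"),
]

def tokens(text):
    """Split each identifier in a statement into lowercase snake_case parts."""
    parts = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        parts.extend(p.lower() for p in ident.split("_") if p)
    return parts

def select_keywords(corpus, min_count=2):
    """Keep tokens seen at least min_count times (an assumed criterion)."""
    counts = Counter(t for _, text in corpus for t in tokens(text))
    return {t for t, n in counts.items() if n >= min_count}

def context_keyword_pair_sets(corpus, keywords):
    """Count (first keyword, second keyword, context type) co-occurrences."""
    pair_sets = Counter()
    for context_type, text in corpus:
        # Sorting gives each unordered pair one canonical ordering.
        present = sorted(set(tokens(text)) & keywords)
        for first, second in combinations(present, 2):
            pair_sets[(first, second, context_type)] += 1
    return pair_sets

keywords = select_keywords(CORPUS)
pair_sets = context_keyword_pair_sets(CORPUS, keywords)
for triple, count in sorted(pair_sets.items()):
    print(triple, count)
```

Keeping the context type inside the key means the same two keywords can yield distinct entries when they co-occur in different kinds of statements, which mirrors the three-part pair set the abstract describes. Building the word embedding matrix from these counts is a separate step that this sketch does not attempt.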
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method comprising: determining, by a processor, a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts comprising source code corresponding to a software project; determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets comprises a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts; and constructing a word embedding matrix based on the plurality of context/keyword pair sets, wherein constructing the word embedding matrix based on the plurality of context/keyword pair sets comprises: training a latent embedding matrix based on the plurality of context/keyword pair sets; creating a manifest feature vector based on the plurality of context/keyword pair sets and the corpus of programming artifacts; and combining the latent embedding matrix with the manifest feature vector corresponding to the corpus of programming artifacts to construct the word embedding matrix, wherein the combining includes ranking the plurality of context/keyword pair sets based on the manifest feature vector; and wherein the word embedding matrix is used to train a recurrent neural network (RNN) to process source code.
2. The method of claim 1, wherein determining the plurality of keywords comprises: determining a naming convention of the corpus of programming artifacts; determining a plurality of tokens based on the determined naming convention; constructing a manifest feature vector based on the plurality of tokens and the corpus of programming artifacts; ranking the plurality of tokens based on the manifest feature vector; and selecting a subset of the plurality of tokens as keywords based on the manifest feature vector.
3. The method of claim 2, wherein the naming convention comprises one of camel case, kebab case, and snake case.
4. The method of claim 1, wherein the context type corresponds to a type of a statement in the source code of the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the statement.
5. The method of claim 1, wherein the context type corresponds to a business rule corresponding to the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the business rule.
6. The method of claim 1, wherein the context type corresponds to a common prefix or suffix of the first keyword and the second keyword in the corpus of programming artifacts.
7. A system comprising: a memory having computer readable instructions; and one or more processors for executing the computer readable instructions, the computer readable instructions controlling the one or more processors to perform operations comprising: determining a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts comprising source code corresponding to a software project; determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets comprises a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts; and constructing a word embedding matrix based on the plurality of context/keyword pair sets, wherein constructing the word embedding matrix based on the plurality of context/keyword pair sets comprises: training a latent embedding matrix based on the plurality of context/keyword pair sets; and creating a manifest feature vector based on the plurality of context/keyword pair sets and the corpus of programming artifacts; and combining the latent embedding matrix with the manifest feature vector corresponding to the corpus of programming artifacts to construct the word embedding matrix, wherein the combining includes ranking the plurality of context/keyword pair sets based on the manifest feature vector; and wherein the word embedding matrix is used to train a recurrent neural network (RNN) to process source code.
8. The system of claim 7, wherein determining the plurality of keywords comprises: determining a naming convention of the corpus of programming artifacts; determining a plurality of tokens based on the determined naming convention; constructing a manifest feature vector based on the plurality of tokens and the corpus of programming artifacts; ranking the plurality of tokens based on the manifest feature vector; and selecting a subset of the plurality of tokens as keywords based on the manifest feature vector.
9. The system of claim 8, wherein the naming convention comprises one of camel case, kebab case, and snake case.
10. The system of claim 7, wherein the context type corresponds to a type of a statement in the source code of the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the statement.
11. The system of claim 7, wherein the context type corresponds to a business rule corresponding to the corpus of programming artifacts, wherein the first keyword and the second keyword co-occur in the business rule.
12. The system of claim 7, wherein the context type corresponds to a common prefix or suffix of the first keyword and the second keyword in the corpus of programming artifacts.
13. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by one or more processors to cause the one or more processors to perform operations comprising: determining a plurality of keywords based on a corpus of programming artifacts, the corpus of programming artifacts comprising source code corresponding to a software project; determining a plurality of context/keyword pair sets based on the plurality of keywords and the corpus of programming artifacts, wherein each context/keyword pair set of the plurality of context/keyword pair sets comprises a first keyword, a second keyword, and a context type corresponding to a co-occurrence of the first keyword and the second keyword in the corpus of programming artifacts; and constructing a word embedding matrix based on the plurality of context/keyword pair sets, wherein constructing the word embedding matrix based on the plurality of context/keyword pair sets comprises: training a latent embedding matrix based on the plurality of context/keyword pair sets; and creating a manifest feature vector based on the plurality of context/keyword pair sets and the corpus of programming artifacts; and combining the latent embedding matrix with the manifest feature vector corresponding to the corpus of programming artifacts to construct the word embedding matrix, wherein the combining includes ranking the plurality of context/keyword pair sets based on the manifest feature vector; and wherein the word embedding matrix is used to train a recurrent neural network (RNN) to process source code.
14. The computer program product of claim 13, wherein determining the plurality of keywords comprises: determining a naming convention of the corpus of programming artifacts; determining a plurality of tokens based on the determined naming convention; constructing a manifest feature vector based on the plurality of tokens and the corpus of programming a
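Claims 2 and 3 describe keyword selection as detecting a naming convention (camel case, kebab case, or snake case), splitting identifiers into tokens accordingly, and ranking those tokens. The sketch below shows one way that flow could look. It is an illustration under stated assumptions, not the claimed method: the majority-vote detection heuristic, the sample `IDENTIFIERS`, and every function name are hypothetical, and raw token frequency stands in for the manifest feature vector, whose contents the claim text here does not specify.

```python
import re
from collections import Counter

# Hypothetical identifiers harvested from a code base.
IDENTIFIERS = ["customerAccount", "accountBalance", "orderTotal",
               "customerOrder", "max_retry"]

def detect_convention(identifiers):
    """Guess the dominant naming convention by a simple majority vote."""
    votes = Counter()
    for name in identifiers:
        if "_" in name:
            votes["snake"] += 1
        elif "-" in name:
            votes["kebab"] += 1
        elif re.search(r"[a-z][A-Z]", name):
            votes["camel"] += 1
    return votes.most_common(1)[0][0] if votes else "camel"

def split_identifier(name, convention):
    """Split one identifier into lowercase tokens for the given convention."""
    if convention == "snake":
        parts = name.split("_")
    elif convention == "kebab":
        parts = name.split("-")
    else:
        # Camel case: a word is an optional capital plus lowercase/digits,
        # or a run of capitals (an acronym) not followed by a lowercase letter.
        parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)
    return [p.lower() for p in parts if p]

def select_keywords(identifiers, top_k):
    """Rank tokens by frequency (assumed stand-in feature) and keep top_k."""
    convention = detect_convention(identifiers)
    counts = Counter(
        tok for name in identifiers for tok in split_identifier(name, convention)
    )
    # Break frequency ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ranked[:top_k]

print(detect_convention(IDENTIFIERS))
print(select_keywords(IDENTIFIERS, 3))
```

A real corpus would likely mix conventions across files, so per-file or per-language detection may be more robust than the single corpus-wide vote used here.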
Combinations of networks · CPC title
Recurrent networks, e.g. Hopfield networks · CPC title
characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU] · CPC title
Supervised learning · CPC title
Program documentation · CPC title