Method and system for translation of codes based on semantic similarity

US2023034984A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2023034984-A1
Application number  US-202217743511-A
Country             US
Kind code           A1
Filing date         May 13, 2022
Priority date       Jun 29, 2021
Publication date    Feb 2, 2023
Grant date

Abstract

Official abstract text for this publication.

Code translation is an evolving field, driven by advancements in infrastructure and compute power. Existing methods for code translation are time- and effort-intensive. A method and system for translating code based on semantic similarity are provided. A machine learning model is developed that understands and encapsulates the semantics of the source code and translates it into semantically equivalent code that is more maintainable and efficient than a one-to-one translation. The system is configured to group a plurality of statements in the source code into blocks and to comprehend the semantics of each block. The system is also trained to recognize syntactically different but semantically similar statements. While the semantics of a block are understood and translated, unused and duplicate code is eliminated. The translated code is better architected and native to the target environment.
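As a hypothetical illustration of "syntactically different but semantically similar" code (this example is not taken from the patent; the function names are invented for the sketch), the two Python functions below compute the same result. The first contains an unused assignment of the kind the abstract says is eliminated during translation.

```python
def total_loop(values):
    """Sum the values with an explicit loop (hypothetical 'source-style' code)."""
    total = 0
    unused = len(values)  # assigned but never read: a candidate for elimination
    for v in values:
        total = total + v
    return total


def total_builtin(values):
    """Same semantics, different syntax: rely on the built-in sum()."""
    return sum(values)


# Both forms agree on every sample input, which is what "semantically
# equivalent" means for this toy case.
samples = [[], [5], [1, 2, 3], [-4, 4, 10]]
agree = all(total_loop(s) == total_builtin(s) for s in samples)
```

A translator working at the level of semantics, as the abstract describes, would be free to emit the shorter form in the target language instead of reproducing the loop statement by statement.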

First claim

Opening claim text (preview).

What is claimed is:

1. A processor implemented method for translation of codes based on a semantic similarity, the method comprising:
    providing, via a user interface, a source code for translation as an input;
    parsing, via one or more hardware processors, the source code using a parser;
    creating, via the one or more hardware processors, a program flow graph using the parsed source code, wherein the program flow graph is used to establish a set of relations between a plurality of statements present in the source code;
    creating, via the one or more hardware processors, a matrix of the established set of relations between the plurality of statements;
    splitting, via the one or more hardware processors, the source code into a plurality of blocks using the created matrix, wherein the statements in one block out of the plurality of blocks are closer to each other as compared to other statements irrespective of their physical presence in the source code;
    vectorizing, via the one or more hardware processors, the plurality of blocks to get a plurality of vectorized blocks;
    training, via the one or more hardware processors, a model to understand a semantic equivalence between the plurality of blocks of the source code irrespective of a manner in which the plurality of blocks is syntactically coded;
    identifying, via the one or more hardware processors, semantically similar statements out of the plurality of vectorized blocks using the trained model;
    selecting, via the user interface, a target language in which the source code needs to be translated; and
    translating, via the one or more hardware processors, the source code into the target language using a decoder based on the identified semantically similar statements, wherein the decoder is a pretrained machine learning model and configured to choose semantically same, but syntactically different statements.

2. The method of claim 1 wherein the decoder is configured to ensure a translation of functionally equivalent code in the target language that is native to a target architecture.

3. The method of claim 1, wherein the set of relations comprises: whether two statements amongst the plurality of statements in a program flow are part of the same path, whether there is a definition of a data element in a first statement which is used in the second statement, and whether the first statement contains a definition of the data element that influences the second statement.

4. The method of claim 1, further comprising eliminating one or more of unused and duplicate codes from the source code using the set of relations.

5. The method of claim 1, wherein the plurality of vectorized blocks are generated based on a plurality of properties of the source code comprising at least one of usage, value associated, scope of usage, data types, data structure, data size, and relations with other data elements.

6. A system for translation of codes based on a semantic similarity, the system comprises:
    a user interface for providing a source code for translation as an input and a target language in which the source code needs to be translated;
    one or more hardware processors;
    a memory in communication with the one or more hardware processors, wherein the one or more first hardware processors are configured to execute programmed instructions stored in the one or more first memories, to:
    parse the source code using a parser;
    create a program flow graph using the parsed source code, wherein the program flow graph is used to establish a set of relations between a plurality of statements present in the source code;
    create a matrix of the established set of relations between the plurality of statements;
    split the source code into a plurality of blocks using the created matrix, wherein the statements in one block out of the plurality of blocks are closer to each other as compared to other statements irrespective of their physical presence in the source code;
    vectorize the plurality of blocks to get a plurality of vectorized blocks;
    train a model to understand a semantic equivalence between the plurality of blocks of the source code irrespective of a manner in which the plurality of blocks is syntactically coded;
    identify semantically similar statements out of the plurality of vectorized blocks using the trained model; and
    translate the source code into the target language using a decoder based on the identified semantically similar statements, wherein the decoder is a pretrained machine learning model and configured to choose semantically same, but syntactically different statements.

7. The system of claim 6, wherein the decoder is configured to ensure a translation of functionally equivalent code in the target language that is native to a target architecture.

8. The system of claim 6, wherein the set of relations comprises: whether two statements amongst the plurality of statements in a program flow are part of the same path, whether there is a definition of a data element in a first statement which is used in the second statement, and whether the first statement contains a definition of the data element that influences the second statement.

9. The system of claim 6 further configured to eliminate one or more of unused and duplicate codes from the source code using the set of relations.

10. The system of claim 6, wherein the plurality of vectorized blocks are generated based on a plurality of properties of the source code comprising at least one of usage, value associated, scope of usage, data types, data structure, data size, and relations with other data elements.

11. One or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause:
    providing, via a user interface, a source code for translation as an input;
    parsing, the source code using a parser;
    creating, via the one or more hardware processors, a program flow graph using the parsed source code, wherein the program flow graph is used to establish a set of relations between a plurality of statements present in the source code;
    creating, via the one or more hardware processors, a matrix of the established set of relations between the plurality of statements;
    splitting, via the one or more hardware processors, the source code into a plurality of blocks using the created matrix, wherein the statements in one block out of the plurality of blocks are closer to each other as compared to other statements irrespective of their physical presence in the source code;
    vectorizing, via the one or more hardware processors, the plurality of blocks to get a plurality of vectorized blocks;
    training, via the one or more hardware processors, a model to understand a semantic equivalence between the plurality of blocks of the source code irrespective of a manner in which the plurality of blocks is syntactically coded;
    identifying, via the one or more hardware processors, semantically similar statements out of the plurality of vectorized blocks using the trained model;
    selecting, via the user interface, a target language in which the source code needs to be translated; and
    translating, via the one or more hardware processors, the source code into the target language using a decoder based on the identified semantically similar statements, wherein the decoder is a pretrained machine learning model and configured to choose semantically same, but syntactically different statements.

12. The one or more non-transitory machine-readable information storage mediums of claim 11 wherein the decoder is configured to ensure a translation of functionally equi
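The relation-matrix and block-splitting steps recited in claim 1 can be pictured with a minimal Python sketch. It rests on loud assumptions that the claims do not specify: statements are hand-annotated with the variables they define and use (standing in for the parser and program flow graph), the only relation recorded is "an earlier statement defines a data element that a later statement uses", and blocks are formed as connected components of that relation. The names `Stmt`, `relation_matrix`, and `split_into_blocks` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Hypothetical stand-in for a parsed statement."""
    text: str
    defs: frozenset  # variables the statement defines
    uses: frozenset  # variables the statement reads


def relation_matrix(stmts):
    """m[i][j] = 1 when statement i defines a variable that a later
    statement j uses (a simplified def-use relation; redefinitions
    between i and j are ignored in this sketch)."""
    n = len(stmts)
    m = [[0] * n for _ in range(n)]
    for i, a in enumerate(stmts):
        for j in range(i + 1, n):
            if a.defs & stmts[j].uses:
                m[i][j] = 1
    return m


def split_into_blocks(matrix):
    """Group statement indices into blocks: connected components of the
    relation matrix, found with a small union-find."""
    n = len(matrix)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if matrix[i][j]:
                parent[find(j)] = find(i)

    blocks = {}
    for i in range(n):
        blocks.setdefault(find(i), []).append(i)
    return sorted(blocks.values())


# Two independent computations interleaved in the source text.
program = [
    Stmt("a = 1", frozenset({"a"}), frozenset()),
    Stmt("x = 10", frozenset({"x"}), frozenset()),
    Stmt("b = a + 2", frozenset({"b"}), frozenset({"a"})),
    Stmt("y = x * 3", frozenset({"y"}), frozenset({"x"})),
    Stmt("print(b)", frozenset(), frozenset({"b"})),
    Stmt("print(y)", frozenset(), frozenset({"x", "y"}) - {"x"}),
]

matrix = relation_matrix(program)
blocks = split_into_blocks(matrix)
print(blocks)  # [[0, 2, 4], [1, 3, 5]]
```

Statements 0, 2, and 4 land in one block and 1, 3, and 5 in the other even though they alternate in the source, which mirrors the claim language that statements in a block are closer to each other "irrespective of their physical presence in the source code". The vectorizing, training, and decoding steps are outside the scope of this sketch.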

Assignees

  • Tata Consultancy Services Ltd

Inventors

Classifications

  • Dependency analysis; Data or control flow analysis · CPC title

  • G06F8/51 (Primary)

    Source to source · CPC title

  • Parsing · CPC title

  • Auto-encoder networks; Encoder-decoder networks · CPC title

  • Supervised learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023034984A1 cover?
Code translation is an evolving field, driven by advancements in infrastructure and compute power. Existing methods for code translation are time- and effort-intensive. A method and system for translating code based on semantic similarity are provided. A machine learning model is developed that understands and encapsulates the semantics of the source code and t…
Who is the assignee on this patent?
Tata Consultancy Services Ltd
What technology area does this patent fall under?
Primary CPC classification G06F8/51. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 2, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).