Automatic data domain identification

US12333253B2 · US · B2

Patent metadata
Field               Value
Publication number  US-12333253-B2
Application number  US-202117529899-A
Country             US
Kind code           B2
Filing date         Nov 18, 2021
Priority date       Nov 18, 2021
Publication date    Jun 17, 2025
Grant date          Jun 17, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

An apparatus is disclosed which includes at least one processing device comprising a processor coupled to a memory. The at least one processing device, when executing program code, is configured to: extract one or more entities identified in a plurality of data artifacts based at least in part on one or more datasets, extract one or more entities identified in a plurality of code artifacts based at least in part on the one or more datasets, extract one or more entities identified in a plurality of user interface artifacts based at least in part on the one or more datasets, generate a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities, and perform one or more of a lexical analysis and a semantic analysis on the set of dependency graphs to identify a data domain of the one or more datasets.
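
The steps named in the abstract (entity extraction from three artifact types, dependency graphs, then lexical analysis) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the patented implementation: the `DOMAIN_TERMS` vocabularies, the rule that two entities are related when they share a word token, and the hit-counting score are all hypothetical stand-ins.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical domain vocabularies; the patent text shown here lists none.
DOMAIN_TERMS = {
    "healthcare": {"patient", "diagnosis", "physician", "claim"},
    "banking": {"account", "loan", "balance", "branch"},
}

def tokenize(identifier):
    """Split snake_case or camelCase identifiers into lowercase word tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return [t.lower() for t in re.split(r"[^A-Za-z]+", spaced) if t]

def extract_entities(artifacts):
    """Treat every identifier-like word in an artifact as a candidate entity."""
    entities = set()
    for text in artifacts:
        entities.update(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))
    return entities

def dependency_graph(left, right):
    """Assumed relationship rule: link two entities that share a word token."""
    edges = set()
    for a, b in product(left, right):
        if set(tokenize(a)) & set(tokenize(b)):
            edges.add((a, b))
    return edges

def identify_domain(graphs):
    """Lexical scoring: count domain-term hits over the nodes of every edge."""
    scores = Counter()
    for edges in graphs:
        for edge in edges:
            for node in edge:
                for tok in tokenize(node):
                    for domain, terms in DOMAIN_TERMS.items():
                        if tok in terms:
                            scores[domain] += 1
    return scores.most_common(1)[0][0] if scores else None

# Toy artifacts (invented for this example).
data_entities = extract_entities(["patient_id diagnosis_code"])
code_entities = extract_entities(["def loadPatient(patientId): ..."])
ui_entities = extract_entities(["Patient name", "Diagnosis"])

# One graph per pair of artifact types, mirroring the first, second,
# and third dependency graphs described in the dependent claims.
graphs = [
    dependency_graph(data_entities, code_entities),
    dependency_graph(data_entities, ui_entities),
    dependency_graph(code_entities, ui_entities),
]
print(identify_domain(graphs))
```

A semantic analysis step (for example, comparing entity names by meaning rather than by shared tokens) would replace or supplement the token-matching rule; the sketch covers only the lexical branch.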

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus, comprising: at least one processing device comprising a processor coupled to a memory, the at least one processing device, when executing program code, is configured to: extract one or more entities identified in a plurality of data artifacts based at least in part on one or more datasets; extract one or more entities identified in a plurality of code artifacts based at least in part on the one or more datasets; extract one or more entities identified in a plurality of user interface artifacts based at least in part on the one or more datasets; generate a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities; and perform one or more of a lexical analysis and a semantic analysis on the set of dependency graphs to identify a data domain of the one or more datasets.

2. The apparatus of claim 1, wherein the plurality of data artifacts comprises one or more of (a) one or more schemas with their table names and associated column names, (b) index, trigger, and stored procedures associated with the schemas, (c) relationships between the different tables and databases, (d) table data, and (e) documentation, logs, performance and operational profile of datasets.

3. The apparatus of claim 1, wherein the plurality of code artifacts comprises source code and associated libraries.

4. The apparatus of claim 1, wherein the plurality of user interface artifacts comprises one or more of (a) user interface screens with natural language text, (b) user interface form objects and formatting, and (c) user interface modalities.

5. The apparatus of claim 1, wherein generating a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities comprises generating a first dependency graph of the set of dependency graphs based at least in part on one or more relationships between the extracted one or more entities of the data artifacts and the extracted one or more entities of the code artifacts.

6. The apparatus of claim 5, wherein generating a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities further comprises generating a second dependency graph of the set of dependency graphs based at least in part on one or more relationships between the extracted one or more entities of the data artifacts and the extracted one or more entities of the user interface artifacts.

7. The apparatus of claim 6, wherein generating a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities further comprises generating a third dependency graph of the set of dependency graphs based at least in part on one or more relationships between the extracted one or more entities of the code artifacts and the extracted one or more entities of the user interface artifacts.

8. The apparatus of claim 1, wherein the at least one processing device, when executing program code, is further configured to: retrieve the data artifacts from at least one of a database and a file system; and apply one of a data definition language operation and a data manipulation language operation to identify the one or more entities in each data artifact and to determine one or more relationships between the one or more entities.

9. A computer-implemented method, comprising: extracting one or more entities identified in a plurality of data artifacts based at least in part on one or more datasets; extracting one or more entities identified in a plurality of code artifacts based at least in part on the one or more datasets; extracting one or more entities identified in a plurality of user interface artifacts based at least in part on the one or more datasets; generating a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities; and performing one or more of a lexical analysis and a semantic analysis on the set of dependency graphs to identify a data domain of the one or more datasets; wherein the method is carried out by at least one computing device.

10. The computer-implemented method of claim 9, wherein the plurality of data artifacts comprises one or more of (a) one or more schemas with their table names and associated column names, (b) index, trigger, and stored procedures associated with the schemas, (c) relationships between the different tables and databases, (d) table data, and (e) documentation, logs, performance and operational profile of datasets.

11. The computer-implemented method of claim 9, wherein the plurality of code artifacts comprises source code and associated libraries.

12. The computer-implemented method of claim 9, wherein the plurality of user interface artifacts comprises one or more of (a) user interface screens with natural language text, (b) user interface form objects and formatting, and (c) user interface modalities.

13. The computer-implemented method of claim 9, wherein generating a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities comprises generating a first dependency graph of the set of dependency graphs based at least in part on one or more relationships between the extracted one or more entities of the data artifacts and the extracted one or more entities of the code artifacts.

14. The computer-implemented method of claim 13, wherein generating a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities further comprises generating a second dependency graph of the set of dependency graphs based at least in part on one or more relationships between the extracted one or more entities of the data artifacts and the extracted one or more entities of the user interface artifacts.

15. The computer-implemented method of claim 14, wherein generating a set of dependency graphs each based at least in part on one or more relationships among the respective extracted one or more entities further comprises generating a third dependency graph of the set of dependency graphs based at least in part on one or more relationships between the extracted one or more entities of the code artifacts and the extracted one or more entities of the user interface artifacts.

16. The computer-implemented method of claim 9, further comprising: retrieving the data artifacts from at least one of a database and a file system; and applying one of a data definition language operation and a data manipulation language operation to identify the one or more entities in each data artifact and determine one or more relationships between the one or more entities.

17. A computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computing device to cause the computing device to: extract one or more entities identified in a plurality of data artifacts based at least in part on one or more datasets; extract one or more entities identified in a plurality of code artifacts based at least in part on the one or more datasets; extract one or more entities identified in a plurality of user interface artifacts based at least in part on the one or more datasets; generate a set of dependency graphs each based at least in part on one or more relationships among the respec
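
Claims 8 and 16 describe retrieving data artifacts from a database and applying a data definition or data manipulation language operation to find entities and their relationships. The following is a rough sketch of that idea, assuming an SQLite database and using its catalog table and `PRAGMA` queries as a stand-in for the operation the claims describe; the `customer` and `account` tables and the function name `extract_schema_entities` are invented for illustration.

```python
import sqlite3

def extract_schema_entities(conn):
    """Return ({table: [columns]}, [(child_table, child_col, parent_table, parent_col)])."""
    tables = {}
    relationships = []
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in rows:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        tables[table] = [c[1] for c in cols]
        # PRAGMA foreign_key_list rows:
        # (id, seq, table, from, to, on_update, on_delete, match)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall():
            relationships.append((table, fk[3], fk[2], fk[4]))
    return tables, relationships

# Toy schema (hypothetical) created with ordinary DDL statements.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        balance REAL
    );
""")
tables, relationships = extract_schema_entities(conn)
print(tables)
print(relationships)
```

The table and column names become candidate entities, and each foreign key becomes a relationship that could feed an edge in a dependency graph. Other database engines expose the same information through their own catalogs, so the queries would differ.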

Assignees

IBM

Inventors

Classifications

  • Dictionaries · CPC title

  • G06F40/30 · Primary

    Semantic analysis · CPC title

  • G06F40/284 · Primary

    Lexical analysis, e.g. tokenisation or collocates · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12333253B2 cover?
An apparatus is disclosed which includes at least one processing device comprising a processor coupled to a memory. The at least one processing device, when executing program code, is configured to: extract one or more entities identified in a plurality of data artifacts based at least in part on one or more datasets, extract one or more entities identified in a plurality of code artifacts base…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 17, 2025 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).