Boundary detection for synthetic data generation

US12399912B1 · US · B1

Patent metadata
Publication number: US-12399912-B1
Application number: US-202418762609-A
Country: US
Kind code: B1
Filing date: Jul 2, 2024
Priority date: Jul 2, 2024
Publication date: Aug 26, 2025
Grant date: Aug 26, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method can determine a linguistic boundary condition for synthetic data generation in a multi-class classification problem. The method includes analyzing empirical labelled data using linguistic and vector representation techniques. The method further includes deriving a synthetic boundary conditional (SBC) model based on the analysis of the empirical labelled data and identifying a boundary location for performant synthetic data generation using the SBC model.
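The claims name corpus-linguistic collocation analysis as one of the techniques used to analyze the labelled data. The following is a minimal sketch of that step, not the patented implementation: it ranks adjacent word pairs within each class by pointwise mutual information (PMI), which is one common collocation measure chosen here as an assumption. The function name `key_collocations`, its parameters, and the toy data are hypothetical.

```python
import math
from collections import Counter

def key_collocations(docs_by_class, top_n=3, min_count=2):
    """Return the top-scoring bigrams per class, ranked by PMI (an assumed measure)."""
    result = {}
    for label, docs in docs_by_class.items():
        unigrams, bigrams = Counter(), Counter()
        for doc in docs:
            tokens = doc.lower().split()  # naive whitespace tokenizer
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        total_uni = sum(unigrams.values())
        total_bi = sum(bigrams.values())
        scored = []
        for (w1, w2), count in bigrams.items():
            if count < min_count:
                continue  # skip rare pairs, whose PMI is unreliable
            p_pair = count / total_bi
            p_w1 = unigrams[w1] / total_uni
            p_w2 = unigrams[w2] / total_uni
            scored.append((math.log2(p_pair / (p_w1 * p_w2)), (w1, w2)))
        # Highest PMI first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        result[label] = [pair for _, pair in scored[:top_n]]
    return result

# Hypothetical labelled data for a two-class problem.
toy_data = {
    "billing": [
        "refund my credit card",
        "credit card was charged twice",
        "charged twice for one order",
    ],
    "shipping": [
        "track my parcel",
        "my parcel is late",
        "parcel is late again",
    ],
}
collocations = key_collocations(toy_data)
```

A production pipeline would more likely rely on an established collocation toolkit and a proper tokenizer; the point here is only that each class yields its own ranked list of key collocations.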

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for determining a linguistic boundary condition for synthetic data generation in a multi-class classification problem, the method comprising: analyzing empirical labelled data using linguistic and vector representation techniques; deriving a synthetic boundary conditional (SBC) model based on the analyzing of the empirical labelled data; and identifying a boundary location for performant synthetic data generation using the SBC model, comprising: determining, by the SBC model, key collocations and key entities within each dataset class of the empirical labelled data; and calculating 2D coordinates of the key collocations and the key entities.

2. The computer-implemented method of claim 1, wherein the boundary location is defined by cartesian coordinates generated by the SBC model.

3. The computer-implemented method of claim 1, wherein the analyzing of the empirical labelled data includes using corpus linguistics to determine key collocations of the empirical labelled data.

4. The computer-implemented method of claim 3, wherein the analyzing of the empirical labelled data includes using named entity recognition to determine key entities within each label category.

5. The computer-implemented method of claim 4, wherein the analyzing of the empirical labelled data includes representing key collocations and key entities in a multi-dimension space using word embeddings.

6. A computer-implemented method for determining a linguistic boundary condition for synthetic data generation in a multi-class classification problem, the method comprising: analyzing empirical labelled data using linguistic and vector representation techniques; deriving a synthetic boundary conditional (SBC) model based on the analyzing of the empirical labelled data, comprising: performing corpus linguistic analysis to determine an order and arrangement of terms in the empirical labelled data; performing name entity recognition (NER) to determine key entities of the empirical labelled data; and mapping all words from multi-dimension word embeddings into a 2-dimensional cartesian space; and identifying a boundary location for performant synthetic data generation using the SBC model.

7. The computer-implemented method of claim 1, further comprising using the SBC model to measure a quality of synthetic data by calculating a similarity metric using the boundary location.

8. The computer-implemented method of claim 1, further comprising using the SBC model to pass synthetic data requirements to a large language model for synthetic data generation.

9. A system comprising: a processor; a memory coupled to the processor; and a computer readable storage embodying a computer program code, the computer program code comprising instructions for determining a linguistic boundary condition for synthetic data generation in a multi-class classification problem, the instructions executable by the processor and configured to: analyze empirical labelled data using linguistic and vector representation techniques; derive a synthetic boundary conditional (SBC) model based on the analyzing of the empirical labelled data; and identify a boundary location for performant synthetic data generation using the SBC model, wherein the instructions are further configured to derive the SBC model by: performing corpus linguistic analysis to determine an order and arrangement of terms in the empirical labelled data; performing name entity recognition (NER) to determine key entities of the empirical labelled data; and mapping all words using word embeddings into a 2-dimensional cartesian space.

10. The system of claim 9, wherein the boundary location is defined by cartesian coordinates generated by the SBC model.

11. The system of claim 9, wherein the instructions are further configured to analyze the empirical labelled data by: using corpus linguistics to determine key collocations of the empirical labelled data; using named entity recognition to determine key entities within each label category; and representing key collocations and key entities in a 2D cartesian space mapped from higher dimension word embeddings.

12. The system of claim 9, wherein the instructions are further configured identify the boundary location using the SBC model by: determining, by the SBC model, key collocations and key entities within each dataset class of the empirical labelled data; and calculating 2D coordinates of the key collocations and the key entities.

13. The system of claim 9, wherein the instructions are further configured to use the SBC model to measure a quality of synthetic data by calculating a similarity metric using the boundary location.

14. The system of claim 9, wherein the instructions are further configured to use the SBC model to pass synthetic data requirements to a large language model for synthetic data generation.

15. A computer program product for determining a linguistic boundary condition for synthetic data generation in a multi-class classification problem, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to: analyze empirical labelled data using linguistic and vector representation techniques; derive a synthetic boundary conditional (SBC) model based on the analyzing of the empirical labelled data; and identify a boundary location, defined by 2D cartesian coordinates, for performant synthetic data generation using the SBC model by: determining, by the SBC model, key collocations and key entities within each dataset class of the empirical labelled data; and calculating the 2D coordinates of the key collocations and the key entities.

16. The computer program product of claim 15, wherein the program instructions are further configured to analyze the empirical labelled data by: using corpus linguistics to determine key collocations of the empirical labelled data; using named entity recognition to determine key entities within each label category; and representing key collocations and key entities in a 2D cartesian space mapped from higher dimension word embeddings.

17. The computer program product of claim 15, wherein the program instructions are further configured to derive the SBC model by: performing corpus linguistic analysis to determine an order and arrangement of terms in the empirical labelled data; performing name entity recognition (NER) to determine key entities of the empirical labelled data; and mapping all words using higher-dimension word embeddings into a 2-dimensional cartesian space.
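The claims describe mapping higher-dimension word embeddings into a 2-dimensional cartesian space, defining a boundary location by cartesian coordinates, and measuring synthetic data quality with a similarity metric that uses that boundary. They do not say which projection, boundary shape, or metric is used. The sketch below fills those gaps with labelled assumptions: principal component analysis (via power iteration) as the projection, an axis-aligned bounding box as the boundary, and the share of synthetic points falling inside the box as the metric. All function names and the toy vectors are hypothetical.

```python
import math

def _top_eigenvector(matrix, iters=200):
    """Approximate the dominant eigenvector of a symmetric matrix by power iteration."""
    dim = len(matrix)
    vec = [float(i + 1) for i in range(dim)]  # fixed start keeps runs deterministic
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iters):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec

def fit_2d_projection(vectors):
    """Fit a PCA-style projection (assumed technique): returns the mean and two components."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    centered = [[v[k] - mean[k] for k in range(dim)] for v in vectors]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(dim)]
           for i in range(dim)]
    components = []
    for _ in range(2):
        vec = _top_eigenvector(cov)
        eigenvalue = sum(vec[i] * cov[i][j] * vec[j]
                         for i in range(dim) for j in range(dim))
        components.append(vec)
        # Deflate: remove the found direction so the next pass finds the second one.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dim)]
               for i in range(dim)]
    return mean, components

def to_2d(vector, mean, components):
    """Project one embedding onto the two fitted components."""
    centered = [x - m for x, m in zip(vector, mean)]
    return tuple(sum(c * x for c, x in zip(comp, centered)) for comp in components)

def bounding_box(points):
    """Boundary location as (min_x, min_y, max_x, max_y); an assumed boundary shape."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def share_inside(points, box):
    """Assumed similarity metric: fraction of points that fall inside the box."""
    inside = sum(1 for x, y in points
                 if box[0] <= x <= box[2] and box[1] <= y <= box[3])
    return inside / len(points)

# Hypothetical 3-dimensional embeddings of one class's key collocations and entities.
class_vectors = [(0, 0, 5), (4, 0, 5), (4, 2, 5), (0, 2, 5)]
mean, components = fit_2d_projection(class_vectors)
box = bounding_box([to_2d(v, mean, components) for v in class_vectors])

# Hypothetical embeddings of terms drawn from generated synthetic data.
synthetic_vectors = [(1, 1, 5), (3, 1, 5), (9, 1, 5)]
quality = share_inside([to_2d(v, mean, components) for v in synthetic_vectors], box)
```

Real word embeddings have hundreds of dimensions, so a numerical library would replace the hand-written power iteration; the structure (fit on empirical data, project both empirical and synthetic terms, compare against the boundary) is what the sketch is meant to convey.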

Assignees

IBM

Inventors

Classifications

  • G06F16/285 (Primary)

    Clustering or classification · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12399912B1 cover?
A computer-implemented method can determine a linguistic boundary condition for synthetic data generation in a multi-class classification problem. The method includes analyzing empirical labelled data using linguistic and vector representation techniques. The method further includes deriving a synthetic boundary conditional (SBC) model based on the analysis of the empirical labelled data and id…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date: Aug 26, 2025 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).