Semi-supervised learning with group constraints

US11880755B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11880755-B2
Application number  US-202015931706-A
Country             US
Kind code           B2
Filing date         May 14, 2020
Priority date       May 14, 2020
Publication date    Jan 23, 2024
Grant date          Jan 23, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer-implemented method for classification of data by a machine learning system using a logic constraint for reducing a data labeling requirement. The computer-implemented method includes: generating a first embedding space from a first partially labeled training data set, wherein in the first embedding space, content-wise related training data of the first partially labeled training data are clustered together, determining at least two clusters in the first embedding space formed from the first partially labeled training data, and training a machine learning model based, at least in part, on a second partially labeled training data set and the at least two clusters, wherein the at least two clusters are used as training constraints.
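The abstract does not say how the clusters act as training constraints. The sketch below shows one possible reading, offered as an assumption rather than the patented method: clusters are found in an embedding of the first data set, and each unlabeled item in the second data set inherits the majority label of the cluster it falls into. The points are assumed to be already embedded (for example by an auto-encoder), and the names `kmeans`, `nearest`, and `propagate_labels` are hypothetical helpers written for this illustration.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))

def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def propagate_labels(points, labels, centers):
    """Assumed constraint: items in the same cluster share a label.

    Unlabeled items (None) take the majority label of their cluster.
    """
    votes = {}
    for p, y in zip(points, labels):
        if y is not None:
            votes.setdefault(nearest(p, centers), Counter())[y] += 1
    filled = []
    for p, y in zip(points, labels):
        c = nearest(p, centers)
        if y is None and c in votes:
            y = votes[c].most_common(1)[0][0]
        filled.append(y)
    return filled

# First data set (already embedded): used only to find the clusters.
first_set = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = kmeans(first_set, k=2)

# Second data set: partially labeled, related to the first but not identical.
second_set = [(0.3, 0.1), (0.0, 0.3), (5.1, 5.2), (4.8, 5.1)]
second_labels = ["cat", None, "dog", None]

completed = propagate_labels(second_set, second_labels, centers)
print(completed)  # ['cat', 'cat', 'dog', 'dog']
```

The completed labels could then be passed to any classifier. Other readings of "training constraints" (for example, penalty terms in the loss) are equally compatible with the abstract.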

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for classification of data by a machine learning system, the computer-implemented-method comprising: generating a first and a second partially labelled training data set from a plurality of obtained data, wherein said first and second partially labelled data set are a subset of said plurality of obtained data and said first and second data set have a content and a context that are not identical but related to one another; generating a plurality of logic constraints based on said content and context of said first and second partially labelled data set using at least one statistical analysis model; generating a first embedding space from said first partially labeled training data set, wherein in the first embedding space, content relates to the first partially labeled training data; determining at least two clusters in the first embedding space formed from the first partially labeled training data, wherein the at least two clusters are determined using at least one hyper-parameter associated with a clustering algorithm; training a support vector machine based, at least in part, on a second partially labeled training data set and the at least two clusters wherein the at least two clusters are used as training constraints; determining a parameter value is below a threshold parameter value; and performing one or more repetitions using a predefined performance criterion, wherein the predefined performance criterion changes the at least one hyper-parameter of the clustering algorithm, responsive to determining the parameter value is below the threshold parameter value, until an optimal quality parameter value is reached.

2. The computer-implemented method of claim 1, wherein the first partially labeled training data set and the second partially labeled training data set relate to each other content-wise, but are not identical.

3. The computer-implemented method of claim 1, wherein said statistical analysis model includes algorithm selected from the group consisting of: k-means, Gaussian mixed model, DBSCAN, expectation maximization, and hierarchical clustering.

4. The computer-implemented method of claim 1, wherein the first embedding space is generated based, at least in part, on at least one of an auto-encoder and a word2vec algorithm.

5. The computer-implemented method of claim 1, further comprising: determining the parameter value for the machine learning model from a plurality of labeled validation samples.

6. The computer-implemented method of claim 1, wherein a format of the first partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.

7. The computer-implemented method of claim 1, wherein a format of the second partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.

8. The computer-implemented method of claim 1, wherein generating the first embedding space from the first partially labeled training data set further comprises: generating a second embedding space from the second partially labeled training data set, wherein in the second embedding space, content-wise related training data of the first partially labeled training data are clustered together.

9. The computer-implemented method of claim 1, wherein generating a second embedding space from the second partially labeled training data set further comprises: generating a third embedding space from a third partially labeled training data set, wherein in the third embedding space, content-wise related training data of the second partially labeled training data are clustered together.

10. A computer system for classification of data by a machine learning system, the computer system comprising a processor configured to: generating a first and a second partially labelled training data set from a plurality of obtained data, wherein said first and second partially labelled data set are a subset of said plurality of obtained data and said first and second data set have a content and a context that are not identical but related to one another; generating a plurality of logic constraints based on said content and context of said first and second partially labelled data set using at least one statistical analysis model; generating a first embedding space from said first partially labeled training data set, wherein in the first embedding space, content relates to the first partially labeled training data; determining at least two clusters in the first embedding space formed from the first partially labeled training data, wherein the at least two clusters are determined using at least one hyper-parameter associated with a clustering algorithm; training a support vector machine based, at least in part, on a second partially labeled training data set and the at least two clusters, wherein the at least two clusters are used as training constraints; determining a parameter value is below a threshold parameter value; and performing one or more repetitions using a predefined performance criterion, wherein the predefined performance criterion changes the at least one hyper-parameter of the clustering algorithm, responsive to determining the parameter value is below the threshold parameter value, until an optimal quality parameter value is reached.

11. The computer system of claim 10, wherein the first partially labeled training data set and the second partially labeled training data set relate to each other content-wise, but are not identical.

12. The computer system of claim 10, wherein said statistical analysis model includes algorithm selected from the group consisting of: k-means, Gaussian mixed model, DBSCAN, expectation maximization, and hierarchical clustering.

13. The computer system of claim 10, wherein the first embedding space is generated based, at least in part, on at least one of an auto-encoder and a word2vec algorithm.

14. The computer system according to claim 10, further comprising: determining a parameter value for the machine learning model from a plurality of labeled validation samples.

15. The computer system of claim 10, wherein a format of the first partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.

16. The computer system of claim 10, wherein a format of the second partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.

17. A computer program product for a classification of data by a machine learning system, the computer program product comprising one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the program instructions including instructions to: generate a first and a second partially labelled training data set from a plurality of obtained data, wherein said first and second partially labelled data set are a subset of said plurality of obtained data and said first and second data set have a content and a context that are not identical but related to one another; generate a plurality of logic constraints based on said content and context of said first and second partially labelled data set using at least one statistical analysis model; generate a first embedding space from said first partially labeled training data set, wherein in the first embedding space, content relates to the first partially labeled training data
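The repetition step recited in the independent claims (retrain with a changed clustering hyper-parameter while a validation value stays below a threshold) can be sketched as a simple control loop. This is an illustrative sketch under stated assumptions, not the claimed implementation: the hyper-parameter is assumed to be the number of clusters `k`, the parameter value is assumed to be a validation accuracy, and `tune_clustering` and `train_and_validate` are hypothetical names. The toy score table stands in for a real "cluster, train the support vector machine, score on labeled validation samples" step.

```python
from typing import Callable, Iterable, List, Optional, Tuple

def tune_clustering(
    train_and_validate: Callable[[int], float],
    k_candidates: Iterable[int],
    threshold: float,
) -> Tuple[Optional[int], float, List[int]]:
    """Try clustering hyper-parameter values until the score meets the threshold.

    train_and_validate(k) is a hypothetical callback that clusters with
    hyper-parameter k, trains the model under the cluster constraints,
    and returns a quality value measured on labeled validation samples.
    Returns the best k seen, its score, and the list of k values tried.
    """
    best_k: Optional[int] = None
    best_score = float("-inf")
    tried: List[int] = []
    for k in k_candidates:
        score = train_and_validate(k)
        tried.append(k)
        if score > best_score:
            best_k, best_score = k, score
        if score >= threshold:
            break  # quality is no longer below the threshold: stop repeating
    return best_k, best_score, tried

# Toy stand-in for the real training pipeline (values are made up for the demo).
toy_scores = {2: 0.60, 3: 0.70, 4: 0.92, 5: 0.95}
result = tune_clustering(toy_scores.__getitem__, [2, 3, 4, 5], threshold=0.90)
print(result)  # (4, 0.92, [2, 3, 4])
```

If no candidate reaches the threshold, the loop exhausts the candidates and returns the best value seen, which is one reasonable way to interpret "until an optimal quality parameter value is reached".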

Assignees

Inventors

Classifications

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning

  • Auto-encoder networks; Encoder-decoder networks

  • Hyperparameter optimisation; Meta-learning; Learning-to-learn

  • G06N20/10 (Primary): using kernel methods, e.g. support vector machines [SVM]

  • Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11880755B2 cover?
A computer-implemented method for classification of data by a machine learning system using a logic constraint for reducing a data labeling requirement. The computer-implemented method includes: generating a first embedding space from a first partially labeled training data set, wherein in the first embedding space, content-wise related training data of the first partially labeled training data…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06N20/10. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 23, 2024 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).