Information processing apparatus
US-2021012195-A1 · Jan 14, 2021 · US
US11880755B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11880755-B2 |
| Application number | US-202015931706-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 14, 2020 |
| Priority date | May 14, 2020 |
| Publication date | Jan 23, 2024 |
| Grant date | Jan 23, 2024 |
Official abstract text for this publication.
A computer-implemented method for classification of data by a machine learning system using a logic constraint for reducing a data labeling requirement. The computer-implemented method includes: generating a first embedding space from a first partially labeled training data set, wherein in the first embedding space, content-wise related training data of the first partially labeled training data are clustered together, determining at least two clusters in the first embedding space formed from the first partially labeled training data, and training a machine learning model based, at least in part, on a second partially labeled training data set and the at least two clusters, wherein the at least two clusters are used as training constraints.
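The abstract does not say how the clusters act as training constraints. A minimal sketch of one possible reading follows: points that fall in the same cluster of the embedding space are constrained to share a class, so the few known labels are propagated to the unlabeled members of each cluster before any classifier is trained. The function names, the toy 2-D "embedding", the labels, and the deterministic k-means initialization are all illustrative assumptions, not taken from the patent.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain k-means on equal-length tuples; returns one cluster index per point."""
    # Assumption for reproducibility: pick evenly spaced points as initial centers.
    centers = [points[i * len(points) // k] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

def propagate_labels(assign, labels):
    """Hypothetical constraint: same cluster implies same class.

    Unlabeled points (None) take the majority known label of their cluster;
    clusters without any known label are left unlabeled.
    """
    out = list(labels)
    for c in set(assign):
        known = [l for a, l in zip(assign, labels) if a == c and l is not None]
        if not known:
            continue
        majority = Counter(known).most_common(1)[0][0]
        out = [majority if (a == c and l is None) else l for a, l in zip(assign, out)]
    return out

# Toy embedding: two well-separated groups, one labeled point in each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (0.0, 0.0),
          (5.0, 5.1), (5.2, 5.0), (5.1, 5.2), (5.0, 5.0)]
labels = ["cat", None, None, None, "dog", None, None, None]

assign = kmeans(points, k=2)
expanded = propagate_labels(assign, labels)
```

Under this reading, the expanded label list would then serve as the training signal for the downstream model. A real system would obtain the embedding from a learned encoder rather than hand-written coordinates.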
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for classification of data by a machine learning system, the computer-implemented-method comprising: generating a first and a second partially labelled training data set from a plurality of obtained data, wherein said first and second partially labelled data set are a subset of said plurality of obtained data and said first and second data set have a content and a context that are not identical but related to one another; generating a plurality of logic constraints based on said content and context of said first and second partially labelled data set using at least one statistical analysis model; generating a first embedding space from said first partially labeled training data set, wherein in the first embedding space, content relates to the first partially labeled training data; determining at least two clusters in the first embedding space formed from the first partially labeled training data, wherein the at least two clusters are determined using at least one hyper-parameter associated with a clustering algorithm; training a support vector machine based, at least in part, on a second partially labeled training data set and the at least two clusters wherein the at least two clusters are used as training constraints; determining a parameter value is below a threshold parameter value; and performing one or more repetitions using a predefined performance criterion, wherein the predefined performance criterion changes the at least one hyper-parameter of the clustering algorithm, responsive to determining the parameter value is below the threshold parameter value, until an optimal quality parameter value is reached.
2. The computer-implemented method of claim 1, wherein the first partially labeled training data set and the second partially labeled training data set relate to each other content-wise, but are not identical.
3. The computer-implemented method of claim 1, wherein said statistical analysis model includes algorithm selected from the group consisting of: k-means, Gaussian mixed model, DBSCAN, expectation maximization, and hierarchical clustering.
4. The computer-implemented method of claim 1, wherein the first embedding space is generated based, at least in part, on at least one of an auto-encoder and a word2vec algorithm.
5. The computer-implemented method of claim 1, further comprising: determining the parameter value for the machine learning model from a plurality of labeled validation samples.
6. The computer-implemented method of claim 1, wherein a format of the first partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.
7. The computer-implemented method of claim 1, wherein a format of the second partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.
8. The computer-implemented method of claim 1, wherein generating the first embedding space from the first partially labeled training data set further comprises: generating a second embedding space from the second partially labeled training data set, wherein in the second embedding space, content-wise related training data of the first partially labeled training data are clustered together.
9. The computer-implemented method of claim 1, wherein generating a second embedding space from the second partially labeled training data set further comprises: generating a third embedding space from a third partially labeled training data set, wherein in the third embedding space, content-wise related training data of the second partially labeled training data are clustered together.
10. A computer system for classification of data by a machine learning system, the computer system comprising a processor configured to: generating a first and a second partially labelled training data set from a plurality of obtained data, wherein said first and second partially labelled data set are a subset of said plurality of obtained data and said first and second data set have a content and a context that are not identical but related to one another; generating a plurality of logic constraints based on said content and context of said first and second partially labelled data set using at least one statistical analysis model; generating a first embedding space from said first partially labeled training data set, wherein in the first embedding space, content relates to the first partially labeled training data; determining at least two clusters in the first embedding space formed from the first partially labeled training data, wherein the at least two clusters are determined using at least one hyper-parameter associated with a clustering algorithm; training a support vector machine based, at least in part, on a second partially labeled training data set and the at least two clusters, wherein the at least two clusters are used as training constraints; determining a parameter value is below a threshold parameter value; and performing one or more repetitions using a predefined performance criterion, wherein the predefined performance criterion changes the at least one hyper-parameter of the clustering algorithm, responsive to determining the parameter value is below the threshold parameter value, until an optimal quality parameter value is reached.
11. The computer system of claim 10, wherein the first partially labeled training data set and the second partially labeled training data set relate to each other content-wise, but are not identical.
12. The computer system of claim 10, wherein said statistical analysis model includes algorithm selected from the group consisting of: k-means, Gaussian mixed model, DBSCAN, expectation maximization, and hierarchical clustering.
13. The computer system of claim 10, wherein the first embedding space is generated based, at least in part, on at least one of an auto-encoder and a word2vec algorithm.
14. The computer system according to claim 10, further comprising: determining a parameter value for the machine learning model from a plurality of labeled validation samples.
15. The computer system of claim 10, wherein a format of the first partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.
16. The computer system of claim 10, wherein a format of the second partially labeled training data is selected from the group consisting of a text format, a sound format, an image format, and a video format.
17. A computer program product for a classification of data by a machine learning system, the computer program product comprising one or more computer readable storage media and program instructions stored on the one or more computer readable storage media, the program instructions including instructions to: generate a first and a second partially labelled training data set from a plurality of obtained data, wherein said first and second partially labelled data set are a subset of said plurality of obtained data and said first and second data set have a content and a context that are not identical but related to one another; generate a plurality of logic constraints based on said content and context of said first and second partially labelled data set using at least one statistical analysis model; generate a first embedding space from said first partially labeled training data set, wherein in the first embedding space, content relates to the first partially labeled training data
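The independent claims recite a repetition step: when a parameter value measured for the trained model is below a threshold, the clustering hyper-parameter is changed and the process repeats. The loop below is a hedged sketch of that control flow only. The function `tune_cluster_hyperparameter`, the callback `train_and_score`, the candidate list, and the toy scores are hypothetical stand-ins; the claims do not specify how the hyper-parameter is changed or how the quality value is computed beyond, in the dependent claims, the use of labeled validation samples.

```python
from typing import Callable, Iterable, Optional, Tuple

def tune_cluster_hyperparameter(
    candidates: Iterable[int],
    train_and_score: Callable[[int], float],
    threshold: float,
) -> Tuple[Optional[int], float]:
    """Try hyper-parameter values in order until the score reaches the threshold.

    Returns the best (value, score) pair seen. If no candidate reaches the
    threshold, the best below-threshold pair is returned instead.
    """
    best_value, best_score = None, float("-inf")
    for value in candidates:
        # Hypothetical callback: re-cluster with this value, retrain the
        # classifier under the cluster constraints, score on validation data.
        score = train_and_score(value)
        if score > best_score:
            best_value, best_score = value, score
        if score >= threshold:
            break  # quality is no longer below the threshold; stop repeating
    return best_value, best_score

# Made-up scores for illustration only; not results from the patent.
def toy_score(k: int) -> float:
    return {2: 0.61, 3: 0.78, 4: 0.93, 5: 0.90}[k]

best_k, best_acc = tune_cluster_hyperparameter([2, 3, 4, 5], toy_score, threshold=0.9)
```

Stopping at the first candidate that meets the threshold is one design choice among several. An exhaustive sweep over all candidates would match the phrase "optimal quality parameter value" more literally, at the cost of extra training runs.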
Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title
Auto-encoder networks; Encoder-decoder networks · CPC title
Hyperparameter optimisation; Meta-learning; Learning-to-learn · CPC title
using kernel methods, e.g. support vector machines [SVM] · CPC title
Generating training patterns; Bootstrap methods, e.g. bagging or boosting · CPC title