Feature reweighting in text classifier generation using unlabeled data

US11216619B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11216619-B2
Application number  US-202016860565-A
Country             US
Kind code           B2
Filing date         Apr 28, 2020
Priority date       Apr 28, 2020
Publication date    Jan 4, 2022
Grant date          Jan 4, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A mechanism is provided to implement a text classifier training augmentation mechanism for incorporating unlabeled data into the generation of a text classifier. For each term of a plurality of terms in each document of a plurality of documents in a set of unlabeled data, a term frequency value is determined. The term is normalized by dividing the term frequency value by a total number of terms in the document. An inverse document frequency (idf) value is determined for each term based on the term frequency value. A subset of terms is filtered from the plurality of terms based on the determined idf values. The idf values for the remaining terms are transformed into feature weights. Terms from a set of labeled data are re-weighted based on the feature weights determined from the set of unlabeled data. The text classifier is then generated using the re-weighted labeled data.
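
The pipeline in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the idf formula is the natural-log form given in claim 2, while the function names, the `min_idf` filter threshold, the scaling of idf values by the largest kept value, and the choice to give zero weight to terms that were filtered out or never seen in the unlabeled data are all assumptions made for the example.

```python
import math
from collections import Counter


def normalized_tf(doc_tokens):
    """Term frequency divided by the total number of terms in the document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}


def idf_values(unlabeled_docs):
    """IDF(t) = log_e(total documents / documents containing t)."""
    n_docs = len(unlabeled_docs)
    doc_freq = Counter()
    for doc in unlabeled_docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def feature_weights(idf, min_idf=0.1):
    """Filter out low-idf terms, then scale the rest into (0, 1].

    Both the threshold and the max-scaling are assumptions for this sketch.
    """
    kept = {term: value for term, value in idf.items() if value >= min_idf}
    if not kept:
        return {}
    top = max(kept.values())
    return {term: value / top for term, value in kept.items()}


def reweight(labeled_doc, weights):
    """Scale each labeled term's normalized frequency by its feature weight.

    Assumption: terms without a feature weight get weight 0.0.
    """
    tf = normalized_tf(labeled_doc)
    return {term: freq * weights.get(term, 0.0) for term, freq in tf.items()}


# Hypothetical toy data
unlabeled = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
weights = feature_weights(idf_values(unlabeled))
features = reweight(["the", "cat", "sat", "sat"], weights)
```

In this toy run, "the" appears in every unlabeled document, so its idf is zero and it is filtered out, while the rarer "sat" keeps its full normalized frequency. The re-weighted features would then be passed to whatever classifier trainer is in use.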

First claim

Opening claim text (preview).

What is claimed is:

1. A method, in a data processing system, comprising at least one processor and at least one memory, wherein the at least one memory comprises instructions that are executed by the at least one processor to configure the at least one processor to implement a text classifier training augmentation mechanism for incorporating unlabeled data in addition to labeled data into the generation of a text classifier, the method comprising:
    determining, by the text classifier training augmentation mechanism, an inverse document frequency (idf) value for each term in a plurality of terms in a set of unlabeled data;
    re-weighting, by the text classifier training augmentation mechanism, terms from a set of labeled data based on the idf values for the plurality of terms in the set of unlabeled data;
    generating, by the text classifier training augmentation mechanism, a set of normalized sample reweights based on a similarity between each sentence in the set of labeled data and each sentence in the set of unlabeled data;
    generating, by the text classifier training augmentation mechanism, a set of augmented sentences based on the plurality of sentences in the set of unlabeled data;
    performing, by the text classifier training augmentation mechanism, an inter-sample agreement check to identify a consistency loss value between the plurality of sentences in the set of unlabeled data and the set of augmented sentences; and
    generating, by a machine learning mechanism, the text classifier using the re-weighted labeled data, the set of normalized sample reweights, and the consistency loss value, wherein the machine learning mechanism generates the text classifier using the plurality of sentences in the set of unlabeled data, the set of augmented sentences from the set of unlabeled data, and the consistency loss value using the following loss function:

    Loss(original example) + alpha * Loss(weighted example) + gamma * Consistency_loss(unlabeled samples)

    where alpha and gamma are hyperparameters that are user configurable.

2. The method of claim 1, further comprising: weighing down, by the text classifier training augmentation mechanism, frequent terms in the plurality of terms while scaling up rare terms in the plurality of terms by computing the idf value for the term using the following equation:

    $$\mathrm{IDF}(t) = \log_e\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t}\right)$$

3. The method of claim 1, wherein generating the set of normalized sample reweights further comprises:
    generating, by the text classifier training augmentation mechanism, a sentence representation for each sentence of a plurality of sentences in the set of unlabeled data;
    computing, by the text classifier training augmentation mechanism, a cosine similarity between each sentence representation of a plurality of sentences in the set of labeled data and each sentence representation of the set of unlabeled data;
    determining, by the text classifier training augmentation mechanism, a weighted sum of the similarities for each sentence in the set of labeled data; and
    normalizing, by the text classifier training augmentation mechanism, the weighted sums over all the plurality of sentences in the labeled data thereby producing the set of normalized sample reweights.

4. The method of claim 3, wherein the Loss(weighted example) is equal to a learned-weight multiplied by the Loss(original example).

5. The method of claim 1, wherein performing the inter-sample agreement check further comprises:
    generating, by the text classifier training augmentation mechanism, a sentence representation for each sentence in the set of unlabeled data thereby generating the set of augmented sentences; and
    identifying, by the text classifier training augmentation mechanism, a prediction distribution between the plurality of sentences in the set of unlabeled data and the set of augmented sentences from the set of unlabeled data.

6. The method of claim 5, wherein the Consistency_loss is an inverse to a similarity between at least one generated sentence and at least one associated unlabeled sentence.

7. The method of claim 1, wherein determining the inverse document frequency (idf) value for each term in the plurality of terms comprises: for each term of the plurality of terms in the set of unlabeled data, determining a term frequency value; normalizing the term by dividing the term frequency value by a total number of terms in the document.

8. The method of claim 7, further comprising: filtering a subset of terms from the plurality of terms based on the determined idf values; and transforming the idf values for the remaining terms into feature weights.

9. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program, when executed on a data processing system, causes the data processing system to implement a text classifier training augmentation mechanism for incorporating unlabeled data in addition to labeled data into the generation of a text classifier, and further causes the data processing system to: determine an inverse document frequency (idf) value for each term in a plurality of terms in
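
The sample reweighting of claim 3 and the combined loss of claim 1 can be illustrated with a short Python sketch. It is a reading of the claim language under stated assumptions, not the patented implementation: sentences are represented as bag-of-words count vectors, the "weighted sum" of similarities uses uniform weights, the sums are normalized to add up to 1, the sample reweights stand in for the learned weight of claim 4, and the consistency loss is taken as 1 minus the cosine similarity between the prediction distributions of each unlabeled sentence and its augmented counterpart. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(key, 0.0) for key, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sample_reweights(labeled, unlabeled):
    """One normalized weight per labeled sentence (uniform sum assumed)."""
    lab = [Counter(sentence) for sentence in labeled]
    unl = [Counter(sentence) for sentence in unlabeled]
    sums = [sum(cosine(a, b) for b in unl) for a in lab]
    total = sum(sums)
    if total == 0:
        # No overlap at all: fall back to uniform weights (assumption).
        return [1.0 / len(labeled)] * len(labeled)
    return [s / total for s in sums]


def consistency_loss(pred_original, pred_augmented):
    """Mean of (1 - cosine similarity) over paired prediction distributions."""
    pairs = list(zip(pred_original, pred_augmented))
    losses = [
        1.0 - cosine(dict(enumerate(p)), dict(enumerate(q))) for p, q in pairs
    ]
    return sum(losses) / len(losses)


def total_loss(example_losses, weights, cons, alpha=1.0, gamma=1.0):
    """Loss(original) + alpha * Loss(weighted) + gamma * Consistency_loss."""
    original = sum(example_losses) / len(example_losses)  # mean per-example loss
    weighted = sum(w * l for w, l in zip(weights, example_losses))
    return original + alpha * weighted + gamma * cons


# Hypothetical toy data
labeled = [["reset", "my", "password"], ["weather", "today"]]
unlabeled = [["reset", "password", "please"], ["forgot", "my", "password"]]
weights = sample_reweights(labeled, unlabeled)
cons = consistency_loss([[0.9, 0.1]], [[0.8, 0.2]])
loss = total_loss([2.0, 4.0], weights, cons, alpha=0.5, gamma=0.1)
```

Labeled sentences that resemble the unlabeled pool receive larger weights, so their losses count more in the weighted term; `alpha` and `gamma` are the user-configurable hyperparameters named in claim 1.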

Assignees

Inventors

Classifications

  • Combinations of networks · CPC title

  • Weakly supervised learning, e.g. semi-supervised or self-supervised learning · CPC title

  • Supervised learning · CPC title

  • Learning methods · CPC title

  • G06F40/30 (Primary)

    Semantic analysis · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11216619B2 cover?
A mechanism is provided to implement a text classifier training augmentation mechanism for incorporating unlabeled data into the generation of a text classifier. For each term of a plurality of terms in each document of a plurality of documents in a set of unlabeled data, a term frequency value is determined. The term is normalized by dividing the term frequency value by a total number of terms…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 4, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).