Unsupervised cross-domain data augmentation for long-document based prediction and explanation

US12321841B2 · US · B2

Patent metadata
Publication number: US-12321841-B2
Application number: US-202217972167-A
Country: US
Kind code: B2
Filing date: Oct 24, 2022
Priority date: Oct 24, 2022
Publication date: Jun 3, 2025
Grant date: Jun 3, 2025

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Unsupervised cross-domain data augmentation techniques for long-text document based prediction and explanation are provided. In one aspect, a system for long-document based prediction includes: an encoder for creating embeddings of long-document texts with hierarchical sparse self-attention, and making predictions using the embeddings of the long-document texts; and a multi-source counterfactual augmentation module for generating perturbed long-document texts using unlabeled sentences from at least one external source to train the encoder. A method for long-document based prediction is also provided.
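The perturbation idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `perturb_document` and `prediction_shift`, the toy predictor, and the sample sentences are all hypothetical, chosen only to show one sentence of a long document being replaced by an unlabeled sentence from an external source and the resulting change in the prediction being measured.

```python
def perturb_document(sentences, index, replacement):
    """Return a copy of the document with the sentence at `index` swapped out."""
    if not 0 <= index < len(sentences):
        raise IndexError("sentence index out of range")
    perturbed = list(sentences)  # copy so the original document is untouched
    perturbed[index] = replacement
    return perturbed


def prediction_shift(predict, sentences, index, replacement):
    """Absolute change in the model output caused by the replacement."""
    original = predict(sentences)
    perturbed = predict(perturb_document(sentences, index, replacement))
    return abs(perturbed - original)


def toy_predict(sentences):
    """Stand-in predictor (assumption): fraction of sentences mentioning 'risk'."""
    return sum("risk" in s.lower() for s in sentences) / len(sentences)


# Hypothetical data: a short "transcript" and one external "news" sentence.
transcript = [
    "Revenue grew this quarter.",
    "We see supply risk ahead.",
    "Margins were stable.",
    "Thank you all.",
]
news_sentence = "Analysts expect steady supply conditions."

shift = prediction_shift(toy_predict, transcript, 1, news_sentence)
print(shift)
```

A real system would use a trained encoder in place of `toy_predict`; the sketch only shows the replace-then-compare step.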

First claim

Opening claim text (preview).

What is claimed is:

1. A system for long-document based prediction, comprising: an encoder for creating embeddings of long-document texts with hierarchical sparse self-attention, and making predictions using the embeddings of the long-document texts, wherein the hierarchical sparse self-attention comprises multiple stacked layers of multi-head sparse self attention and one-dimensional convolutional filters with parameterized activation functions to capture long-range sentence-level dependencies; and wherein the encoder implements a sparsity matrix to filter out trivial attention weights and enable focus on attentively important sentences; and a multi-source counterfactual augmentation module for generating perturbed long-document texts using unlabeled sentences from at least one external source to train the encoder, wherein the multi-source counterfactual augmentation module enforces both semantic alignment through topic classification and task alignment through influence function scoring, and wherein a semi-supervised training protocol is used that alternates between supervised learning of the encoder and augmentation using multi-source data through multiple rounds until convergence, with each round comprising forty epochs of supervised encoder training followed by forty epochs of augmentation training, and wherein a bidirectional Kullback-Leibler (KL) regularization component is introduced to reduce model overfitting by enforcing consistency between output distributions of different sub-models generated by dropout, with a hyperparameter a controlling the weight of the KL divergence terms.

2. The system of claim 1, wherein the long-document texts comprise more than 500 sentences.

3. The system of claim 1, wherein the long-document texts comprise earnings call transcripts, and wherein the embeddings comprise embeddings of the earnings call transcripts.

4. The system of claim 3, wherein the encoder comprises a predictor with a fully-connected layer for predicting a significance level of market volatility over n-days following an earnings call using the embeddings of the earnings call transcripts.

5. The system of claim 3, wherein the at least one external source comprises financial news.

6. The system of claim 1, wherein the multi-source counterfactual augmentation module comprises a topic classifier for identifying salient sentences in the long-document texts for perturbation; and linking the salient sentences in the long-document texts to the unlabeled sentences from the at least one other source through topics.

7. The system of claim 6, wherein the multi-source counterfactual augmentation module comprises an unsupervised counterfactual augmentation module for replacing one of the salient sentences of the long-document texts with one of the unlabeled sentences from the at least one external source as a perturbation, and determining a degree by which the replacing changes the predictions.

8. The system of claim 7, wherein the determining is performed using example-based model explanation.

9. A method for long-document based prediction, the method comprising: creating, by an encoder, embeddings of long-document texts with hierarchical sparse self-attention, wherein the hierarchical sparse self-attention comprises multiple stacked layers of multi-head sparse self attention and one-dimensional convolutional filters with parameterized activation functions to capture long-range sentence-level dependencies; and wherein the encoder implements a sparsity matrix to filter out trivial attention weights and enable focus on attentively important sentences; training the encoder using perturbed long-document texts generated by counterfactual augmentation with unlabeled sentences from at least one external source, wherein the multi-source counterfactual augmentation module enforces both semantic alignment through topic classification and task alignment through influence function scoring, and wherein a semi-supervised training protocol is used that alternates between supervised learning of the encoder and augmentation using multi-source data through multiple rounds until convergence, with each round comprising forty epochs of supervised encoder training followed by forty epochs of augmentation training, and wherein a bidirectional Kullback-Leibler (KL) regularization component is introduced to reduce model overfitting by enforcing consistency between output distributions of different sub-models generated by dropout, with a hyperparameter a controlling the weight of the KL divergence terms; and making predictions, by the encoder, using the embeddings of the long-document texts.

10. The method of claim 9, wherein the long-document texts comprise more than 500 sentences.

11. The method of claim 9, wherein the long-document texts comprise earnings call transcripts, and wherein the embeddings comprise embeddings of the earnings call transcripts.

12. The method of claim 11, further comprising: predicting a significance level of market volatility over n-days following an earnings call using the embeddings of the earnings call transcripts.

13. The method of claim 11, wherein the at least one external source comprises financial news.

14. The method of claim 9, further comprising: identifying salient sentences in the long-document texts for perturbation; and linking the salient sentences in the long-document texts to the unlabeled sentences from the at least one other source through topics.

15. The method of claim 14, further comprising: replacing one of the salient sentences of the long-document texts with one of the unlabeled sentences from at least one other source as a perturbation.

16. The method of claim 15, further comprising: determining a degree by which the replacing changes the predictions.

17. The method of claim 16, wherein the determining is performed using example-based model explanation.

18. A computer program product for long-document based prediction, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform: creating, by an encoder, embeddings of long-document texts with hierarchical sparse self-attention, wherein the hierarchical sparse self-attention comprises multiple stacked layers of multi-head sparse self attention and one-dimensional convolutional filters with parameterized activation functions to capture long-range sentence-level dependencies; and wherein the encoder implements a sparsity matrix to filter out trivial attention weights and enable focus on attentively important sentences; training the encoder using perturbed long-document texts generated by counterfactual augmentation with unlabeled sentences from at least one external source, wherein the multi-source counterfactual augmentation module enforces both semantic alignment through topic classification and task alignment through influence function scoring, and wherein a semi-supervised training protocol is used that alternates between supervised learning of the encoder and augmentation using multi-source data through multiple rounds until convergence, with each round comprising forty epochs of supervised encoder training followed by forty epochs of augmentation training, and wherein a bidirectional Kullback-Leibler (KL) regularization component is introduced to reduce model overfitting by enforcing consistency between output distributions of different sub-models generated by dropout, with a hyperparameter a controlling the weight
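The bidirectional KL regularization recited in the independent claims can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: it assumes two forward passes of the same input under different dropout masks produce two sets of logits, averages the two cross-entropy terms, and averages the two KL directions. The function names and the parameter name `alpha` (standing in for the weighting hyperparameter named in the claims) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_kl(p, q):
    """Symmetric consistency term: mean of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def regularized_loss(logits_a, logits_b, label, alpha=1.0):
    """Mean cross-entropy of two dropout passes plus alpha-weighted KL term.

    Averaging the two cross-entropy terms is an assumption of this sketch.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    cross_entropy = -0.5 * (math.log(p[label]) + math.log(q[label]))
    return cross_entropy + alpha * bidirectional_kl(p, q)


# Hypothetical logits from two dropout sub-models for one two-class example.
loss = regularized_loss([2.0, 0.5], [1.0, 1.5], label=0, alpha=1.0)
print(loss)
```

When the two passes agree exactly, the KL term is zero and the loss reduces to the ordinary cross-entropy; a larger `alpha` penalizes disagreement between the sub-models more strongly.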

Assignees

Inventors

Classifications

  • Vector coding (for television signals, see H04N19/94)

  • Character encoding

  • Semantic analysis

  • G06F40/166 (Primary): Editing, e.g. inserting or deleting

  • Market predictions or forecasting for commercial activities

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12321841B2 cover?
Unsupervised cross-domain data augmentation techniques for long-text document based prediction and explanation are provided. In one aspect, a system for long-document based prediction includes: an encoder for creating embeddings of long-document texts with hierarchical sparse self-attention, and making predictions using the embeddings of the long-document texts; and a multi-source counterfactua…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F40/166. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 3, 2025 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).