Independent and dependent reading using recurrent networks for natural language inference
US-2020320387-A1 · Oct 8, 2020 · US
US11250299B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11250299-B2 |
| Application number | US-201916668680-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 30, 2019 |
| Priority date | Nov 1, 2018 |
| Publication date | Feb 15, 2022 |
| Grant date | Feb 15, 2022 |
Official abstract text for this publication.
A method is provided for determining entailment between an input premise and an input hypothesis of different modalities. The method includes extracting features from the input hypothesis and an entirety of and regions of interest in the input premise. The method further includes deriving intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise. The method also includes attaching cross-modal relevant information to the features from the input premise to the features from the input hypothesis to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities. The method additionally includes classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
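The flow in the abstract (intra-modal attention, cross-modal attachment, three-way classification) can be illustrated with a minimal sketch. It is not the patented implementation: random vectors stand in for the image-region and word features, the classifier weights are untrained, mean pooling is used in place of the recurrent step recited in the claims, and the names `attend` and `classify` are hypothetical helpers introduced only for this example.

```python
import math
import random

LABELS = ["entailment", "neutral", "contradiction"]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys_values):
    """Dot-product attention: each query becomes a weighted mix of the rows
    of keys_values, so relevant rows dominate and irrelevant ones fade."""
    dim = len(keys_values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, kv) for kv in keys_values])
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out

def classify(premise_feats, hypothesis_feats, weight_rows):
    # Intra-modal step: self-attention inside each modality.
    prem = attend(premise_feats, premise_feats)
    hyp = attend(hypothesis_feats, hypothesis_feats)
    # Cross-modal step: each hypothesis feature gathers premise information,
    # which is then attached to it by concatenation.
    attached = attend(hyp, prem)
    cross = [h + a for h, a in zip(hyp, attached)]
    # Assumption: mean pooling summarizes the cross-modal representation.
    dim = len(cross[0])
    pooled = [sum(row[d] for row in cross) / len(cross) for d in range(dim)]
    # Softmax over one score per label.
    probs = softmax([dot(w, pooled) for w in weight_rows])
    return LABELS[probs.index(max(probs))], probs

rng = random.Random(0)
d = 4
# Stand-ins: 4 region vectors plus 1 whole-image vector, and 3 word vectors.
premise = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]
hypothesis = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]
weights = [[rng.gauss(0, 1) for _ in range(2 * d)] for _ in LABELS]
label, probs = classify(premise, hypothesis, weights)
print(label, [round(p, 3) for p in probs])
```

Because the weights are random, the printed label carries no meaning; the sketch only shows how the three stages connect and that the output is a probability distribution over the three labels.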
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for determining entailment between an input premise and an input hypothesis of different modalities, comprising: extracting, by a hardware processor applying a Mask Region Convolutional Neural Network (R-CNN), features from the input hypothesis and an entirety of and regions of interest in the input premise; deriving, by the hardware processor, intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise; attaching, by the hardware processor, cross-modal relevant information to the features from the input premise to the features from the input hypothesis by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities; providing labels consisting of entailment, neutral, and contradiction for classification usage; and classifying, by the hardware processor, a relationship between the input premise and the input hypothesis using a label selected from a group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
2. The computer-implemented method of claim 1, wherein the hardware process extracts the features from the entirety of the input image premise using a Convolutional Neural Network.
3. The computer-implemented method of claim 2, wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network.
4. The computer-implemented method of claim 1, wherein said deriving step use a self-attention process to identify the elementary ones of the features from an entirety of the features.
5. The computer-implemented method of claim 1, wherein said extracting step comprises extracting region specific feature vectors for the input premise.
6. The computer-implemented method of claim 1, wherein the regions of interest are specified at a feature map level.
7. The computer-implemented method of claim 1, wherein the regions of interest are specified at a semantic level.
8. The computer-implemented method of claim 1, wherein said extracting step comprises forming a visual corpus from an existing textual corpus that includes textual premises and textual hypothesis by replacing the textual premises in the existing textual corpus with visual premises.
9. The computer-implemented method of claim 1, wherein the intra-modal relevant information is derived by performing a word embedding on the input textual sequence to obtain a vector of real numbers, and subjecting the vector of real numbers to a self-attention process.
10. The computer-implemented method of claim 1, wherein the relationship between the input premise and the input hypothesis is classified using a softmax process.
11. The computer-implemented method of claim 1, wherein the input premise comprises an input image premise, and the input hypothesis comprises an input textual sequence hypothesis.
12. A computer program product for determining entailment between an input premise and an input hypothesis of different modalities, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: extracting, by a hardware processor applying a Mask Region Convolutional Neural Network (R-CNN), features from the input hypothesis and an entirety of and regions of interest in the input premise; deriving, by the hardware processor, intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise; attaching, by the hardware processor, cross-modal relevant information to the features from the input premise to the features from the input hypothesis by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities; providing labels consisting of entailment, neutral, and contradiction for classification usage; and classifying, by the hardware processor, a relationship between the input premise and the input hypothesis using a label selected from a group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
13. The computer program product of claim 12, wherein the hardware process extracts the features from the entirety of the input image premise using a Convolutional Neural Network.
14. The computer program product of claim 13, wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network.
15. The computer program product of claim 12, wherein said deriving step use a self-attention process to identify the elementary ones of the features from an entirety of the features.
16. The computer program product of claim 12, wherein said extracting step comprises extracting region specific feature vectors for the input premise.
17. The computer program product of claim 12, wherein the regions of interest are specified at a feature map level.
18. The computer program product of claim 12, wherein the regions of interest are specified at a semantic level.
19. A computer processing system for determining entailment between an input premise and an input hypothesis of different modalities, comprising: a memory device including program code stored thereon; a hardware processor, operatively coupled to the memory device, and configured to run the program code stored on the memory device to extract, by applying a Mask Region Convolutional Neural Network (R-CNN), features from the input hypothesis and an entirety of and regions of interest in the input premise; derive intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise; attach cross-modal relevant information to the features from the input premise to the features from the input hypothesis by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities; providing labels consisting of entailment, neutral, and contradiction for classification usage; and classify a relationship between the input premise and the input hypothesis using a label selected from a group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
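The independent claims recite deriving overall sentence hypothesis features with a gated recurrent unit from the output of a text self-attention process. A minimal sketch of that one step follows, assuming the self-attended token vectors are already available. The parameter names (`Wz`, `Uz`, and so on), the omission of bias terms, the zero initial state, and the use of the final hidden state as the sentence feature are all assumptions made for illustration; the claims do not specify them. The update rule uses one common GRU convention, and some libraries swap the roles of `z` and `1 - z`.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gru_step(x, h, p):
    """One GRU step without bias terms (assumption for brevity)."""
    # Update gate z and reset gate r.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    # Candidate state computed from the input and the reset-scaled state.
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    # Blend the old state and the candidate using the update gate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def sentence_feature(token_feats, params, hidden_size):
    """Run the GRU over the token features; return the final hidden state."""
    h = [0.0] * hidden_size
    for x in token_feats:
        h = gru_step(x, h, params)
    return h

def random_params(input_size, hidden_size, rng):
    """Untrained parameters, for demonstration only."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = mat(hidden_size, input_size)
        params["U" + gate] = mat(hidden_size, hidden_size)
    return params

rng = random.Random(0)
params = random_params(input_size=4, hidden_size=3, rng=rng)
# Stand-in for the output of the text self-attention process: 5 token vectors.
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
feat = sentence_feature(tokens, params, hidden_size=3)
print([round(v, 4) for v in feat])
```

The resulting vector has one entry per hidden unit and summarizes the whole token sequence in order, which is what lets a single fixed-size feature represent a hypothesis sentence of any length.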
of news video content · CPC title
Overlay text, e.g. embedded captions in a TV programme · CPC title
Classification techniques · CPC title
using neural networks · CPC title