Learning representations of generalized cross-modal entailment tasks

US11250299B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11250299-B2
Application number: US-201916668680-A
Country: US
Kind code: B2
Filing date: Oct 30, 2019
Priority date: Nov 1, 2018
Publication date: Feb 15, 2022
Grant date: Feb 15, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method is provided for determining entailment between an input premise and an input hypothesis of different modalities. The method includes extracting features from the input hypothesis and an entirety of and regions of interest in the input premise. The method further includes deriving intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise. The method also includes attaching cross-modal relevant information to the features from the input premise to the features from the input hypothesis to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities. The method additionally includes classifying a relationship between the input premise and the input hypothesis using a label selected from the group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
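The steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the feature vectors are random stand-ins for word embeddings and CNN region features, the weights `W` and `b` are untrained, mean pooling replaces the gated recurrent unit named in the claims, and the helper names `attend` and `classify` are hypothetical. It only shows how self-attention within each modality, attention across modalities, and a three-way softmax fit together.

```python
import numpy as np

LABELS = ("entailment", "neutral", "contradiction")

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys):
    """Scaled dot-product attention: each query row becomes a weighted mix of key rows."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ keys

def classify(text_feats, image_feats, W, b):
    # Intra-modal step: self-attention within each modality.
    text_sa = attend(text_feats, text_feats)
    image_sa = attend(image_feats, image_feats)
    # Cross-modal step: attach attended image information to each text feature.
    cross = np.concatenate([text_sa, attend(text_sa, image_sa)], axis=-1)
    # Assumption: mean pooling stands in for a learned sentence encoder.
    pooled = cross.mean(axis=0)
    probs = softmax(W @ pooled + b)
    return LABELS[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
d = 8
text_feats = rng.normal(size=(5, d))   # 5 word vectors (stand-in for embeddings)
image_feats = rng.normal(size=(4, d))  # whole image + 3 regions (stand-in for CNN features)
W = rng.normal(size=(3, 2 * d))        # untrained classifier weights
b = np.zeros(3)
label, probs = classify(text_feats, image_feats, W, b)
print(label, probs.round(3))
```

Because the weights are random, the printed label carries no meaning; in a trained system `W`, `b`, and the feature extractors would be learned from labeled premise-hypothesis pairs.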

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for determining entailment between an input premise and an input hypothesis of different modalities, comprising: extracting, by a hardware processor applying a Mask Region Convolutional Neural Network (R-CNN), features from the input hypothesis and an entirety of and regions of interest in the input premise; deriving, by the hardware processor, intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise; attaching, by the hardware processor, cross-modal relevant information to the features from the input premise to the features from the input hypothesis by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities; providing labels consisting of entailment, neutral, and contradiction for classification usage; and classifying, by the hardware processor, a relationship between the input premise and the input hypothesis using a label selected from a group consisting of entailment, neutral, and contradiction based on the cross-modal representation.

2. The computer-implemented method of claim 1, wherein the hardware process extracts the features from the entirety of the input image premise using a Convolutional Neural Network.

3. The computer-implemented method of claim 2, wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network.

4. The computer-implemented method of claim 1, wherein said deriving step use a self-attention process to identify the elementary ones of the features from an entirety of the features.

5. The computer-implemented method of claim 1, wherein said extracting step comprises extracting region specific feature vectors for the input premise.

6. The computer-implemented method of claim 1, wherein the regions of interest are specified at a feature map level.

7. The computer-implemented method of claim 1, wherein the regions of interest are specified at a semantic level.

8. The computer-implemented method of claim 1, wherein said extracting step comprises forming a visual corpus from an existing textual corpus that includes textual premises and textual hypothesis by replacing the textual premises in the existing textual corpus with visual premises.

9. The computer-implemented method of claim 1, wherein the intra-modal relevant information is derived by performing a word embedding on the input textual sequence to obtain a vector of real numbers, and subjecting the vector of real numbers to a self-attention process.

10. The computer-implemented method of claim 1, wherein the relationship between the input premise and the input hypothesis is classified using a softmax process.

11. The computer-implemented method of claim 1, wherein the input premise comprises an input image premise, and the input hypothesis comprises an input textual sequence hypothesis.

12. A computer program product for determining entailment between an input premise and an input hypothesis of different modalities, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising: extracting, by a hardware processor applying a Mask Region Convolutional Neural Network (R-CNN), features from the input hypothesis and an entirety of and regions of interest in the input premise; deriving, by the hardware processor, intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise; attaching, by the hardware processor, cross-modal relevant information to the features from the input premise to the features from the input hypothesis by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities; providing labels consisting of entailment, neutral, and contradiction for classification usage; and classifying, by the hardware processor, a relationship between the input premise and the input hypothesis using a label selected from a group consisting of entailment, neutral, and contradiction based on the cross-modal representation.

13. The computer program product of claim 12, wherein the hardware process extracts the features from the entirety of the input image premise using a Convolutional Neural Network.

14. The computer program product of claim 13, wherein the hardware process extracts the features from the regions of interest in the input image premise using a Mask Region-based Convolutional Neural Network.

15. The computer program product of claim 12, wherein said deriving step use a self-attention process to identify the elementary ones of the features from an entirety of the features.

16. The computer program product of claim 12, wherein said extracting step comprises extracting region specific feature vectors for the input premise.

17. The computer program product of claim 12, wherein the regions of interest are specified at a feature map level.

18. The computer program product of claim 12, wherein the regions of interest are specified at a semantic level.

19. A computer processing system for determining entailment between an input premise and an input hypothesis of different modalities, comprising: a memory device including program code stored thereon; a hardware processor, operatively coupled to the memory device, and configured to run the program code stored on the memory device to extract, by applying a Mask Region Convolutional Neural Network (R-CNN), features from the input hypothesis and an entirety of and regions of interest in the input premise; derive intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal interactions between elementary ones of the features of the input hypothesis and between elementary ones of the features of the input premise; attach cross-modal relevant information to the features from the input premise to the features from the input hypothesis by deriving, using a gated recurrent unit, overall sentence hypothesis features from an output of a text self-attention process to form a cross-modal representation, based on cross-modal interactions between pairs of different elementary features from different modalities; providing labels consisting of entailment, neutral, and contradiction for classification usage; and classify a relationship between the input premise and the input hypothesis using a label selected from a group consisting of entailment, neutral, and contradiction based on the cross-modal representation.
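The independent claims recite deriving overall sentence hypothesis features with a gated recurrent unit from the output of a text self-attention process. The sketch below shows one common GRU formulation in NumPy, purely as an illustration: the dimensions, the random untrained weights, the omission of bias terms, and the function name `gru_sentence_vector` are all assumptions, and the input array is a random stand-in for self-attended word features rather than anything specified in the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_sentence_vector(word_feats, params):
    """Run a single-layer GRU over word features and return the final hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    h = np.zeros(Uz.shape[0])
    for x in word_feats:
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1.0 - z) * h + z * h_tilde           # blend old state and candidate
    return h

rng = np.random.default_rng(1)
d_in, d_hid = 8, 6
# Hypothetical untrained weights: even slots map inputs, odd slots map the hidden state.
params = tuple(
    rng.normal(scale=0.1, size=(d_hid, d_in if i % 2 == 0 else d_hid))
    for i in range(6)
)
attended_words = rng.normal(size=(5, d_in))  # stand-in for text self-attention output
sentence = gru_sentence_vector(attended_words, params)
print(sentence.shape)  # (6,)
```

The final hidden state serves as a fixed-length summary of a variable-length hypothesis, which is what lets a downstream classifier compare it against image features regardless of sentence length.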

Assignees

Inventors

Classifications

  • G06V20/43 · Primary

    of news video content · CPC title

  • Overlay text, e.g. embedded captions in a TV programme · CPC title

  • Classification techniques · CPC title

  • using neural networks · CPC title

  • Classification techniques · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11250299B2 cover?
A method is provided for determining entailment between an input premise and an input hypothesis of different modalities. The method includes extracting features from the input hypothesis and an entirety of and regions of interest in the input premise. The method further includes deriving intra-modal relevant information while suppressing intra-modal irrelevant information, based on intra-modal…
Who is the assignee on this patent?
NEC Lab America Inc, NEC Corp
What technology area does this patent fall under?
Primary CPC classification G06V20/43. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 15, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).