US2022019744A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2022019744-A1 |
| Application number | US-202117319189-A |
| Country | US |
| Kind code | A1 |
| Filing date | May 13, 2021 |
| Priority date | Jul 14, 2020 |
| Publication date | Jan 20, 2022 |
| Grant date | — |
Official abstract text for this publication.
A multi-modal pre-training model acquisition method, an electronic device and a storage medium, which relate to the fields of deep learning and natural language processing, are disclosed. The method may include: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.
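The masking step in the abstract can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the `[MASK]` placeholder, the whitespace tokenization, and the function name `mask_fine_grained_words` are assumptions made for illustration only, and the image side of the image-text pair is omitted.

```python
from typing import List, Set, Tuple

# Assumed placeholder symbol; the abstract does not name a mask token.
MASK_TOKEN = "[MASK]"


def mask_fine_grained_words(
    tokens: List[str], target_words: Set[str]
) -> Tuple[List[str], List[int]]:
    """Replace each token that appears in target_words with MASK_TOKEN.

    Returns the masked token list and the masked positions, which a
    training loop could use as prediction targets.
    """
    masked: List[str] = []
    positions: List[int] = []
    for i, tok in enumerate(tokens):
        if tok.lower() in target_words:
            masked.append(MASK_TOKEN)
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions


# Hypothetical caption from an image-text pair.
caption = "a white cat on a brown car".split()
# Hypothetical fine-grained words: an entity, an attribute, a relationship.
targets = {"cat", "white", "on"}
masked_caption, masked_positions = mask_fine_grained_words(caption, targets)
print(masked_caption)
print(masked_positions)
```

In this sketch the ordinary words ("a", "brown", "car") stay visible, so a model would have to recover the masked entity, attribute and relationship words from the remaining text and, in the disclosed method, the corresponding image.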
Opening claim text (preview).
What is claimed is:
1. A multi-modal pre-training model acquisition method, comprising: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.
2. The method according to claim 1, wherein the to-be-processed fine-grained semantic words comprise: entity words, attribute words and relationship words, wherein the attribute represents the attribute of each entity, and the relationship represents the relationship between entities.
3. The method according to claim 2, wherein the determining to-be-processed fine-grained semantic words in the text comprises: acquiring a scene graph corresponding to the text, and determining the to-be-processed fine-grained semantic words according to the scene graph.
4. The method according to claim 3, wherein the scene graph comprises: entity nodes, attribute tuples and relationship triples, each attribute tuple is composed of one entity node and one attribute node, and each relationship triple is composed of two entity nodes and one relationship node; and wherein the determining the to-be-processed fine-grained semantic words according to the scene graph comprises: selecting a predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph, and taking entity words in the text corresponding to the selected entity nodes, attribute words in the text corresponding to attribute nodes in the selected attribute tuples, and relationship words in the text corresponding to relationship nodes in the selected relationship triples, as the to-be-processed fine-grained semantic words.
5. The method according to claim 4, wherein the selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph comprises: determining a number of nodes to be selected according to a total number of nodes comprised in the scene graph as the predetermined number; and randomly selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph.
6. The method according to claim 2, wherein training tasks of the multi-modal pre-training model comprise: entity prediction, attribute prediction and relationship prediction; and wherein the multi-modal pre-training model predicts masked words in the text according to a context of the text and corresponding image content.
7. The method according to claim 1, further comprising: after completion of the training of the multi-modal pre-training model, fine-tuning, for any downstream task, the multi-modal pre-training model according to the training data corresponding to the downstream task.
8. An electronic device, comprising: at least one processor; and a memory in communication connection with the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to carry out a multi-modal pre-training model acquisition method, which comprises: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.
9. The electronic device according to claim 8, wherein the to-be-processed fine-grained semantic words comprise: entity words, attribute words and relationship words, wherein the attribute represents the attribute of each entity, and the relationship represents the relationship between entities.
10. The electronic device according to claim 9, wherein the determining to-be-processed fine-grained semantic words in the text comprises: acquiring a scene graph corresponding to the text, and determining the to-be-processed fine-grained semantic words according to the scene graph.
11. The electronic device according to claim 10, wherein the scene graph comprises: entity nodes, attribute tuples and relationship triples, each attribute tuple is composed of one entity node and one attribute node, and each relationship triple is composed of two entity nodes and one relationship node; and wherein the determining the to-be-processed fine-grained semantic words according to the scene graph comprises: selecting a predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph, and taking entity words in the text corresponding to the selected entity nodes, attribute words in the text corresponding to attribute nodes in the selected attribute tuples, and relationship words in the text corresponding to relationship nodes in the selected relationship triples, as the to-be-processed fine-grained semantic words.
12. The electronic device according to claim 11, wherein the selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph comprises: determining a number of nodes to be selected according to a total number of nodes comprised in the scene graph as the predetermined number; and randomly selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph.
13. The electronic device according to claim 9, wherein training tasks of the multi-modal pre-training model comprise: entity prediction, attribute prediction and relationship prediction; and wherein the multi-modal pre-training model predicts masked words in the text according to a context of the text and corresponding image content.
14. The electronic device according to claim 8, wherein the method further comprises: after completion of the training of the multi-modal pre-training model, fine-tuning, for any downstream task, the multi-modal pre-training model according to the training data corresponding to the downstream task.
15. A non-transitory computer-readable storage medium comprising instructions, which, when executed by a computer, cause the computer to carry out a multi-modal pre-training model acquisition method, which comprises: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.
16. The non-transitory computer-readable storage medium according to claim 15, wherein the to-be-processed fine-grained semantic words comprise: entity words, attribute words and relationship words, wherein the attribute represents the attribute of each entity, and the relationship represents the relationship between entities.
17. The non-transitory computer-readable storage medium according to claim 16, wherein the determining to-be-processed fine-grained semantic words in the text comprises: acquiring a scene graph corresponding to the text, and determining the to-be-processed fine-grained semantic words according to the scene graph.
18. The non-transitory computer-readable storage medium according to claim 17, wherein the scene graph comprises: entity nodes, attribute tuples and relationship triples, each attribute tuple is composed of one entity node and one attribute node, and each relationship triple is composed of two ent
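
The scene-graph selection described in claims 4 and 5 could look roughly like the Python sketch below. The `SceneGraph` class, the `select_words_to_mask` function, the single shared candidate pool, and the fixed `ratio` parameter are all hypothetical choices made for illustration. The claims say only that the number selected is determined from the total number of nodes and that the selection is random; they do not give a formula.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SceneGraph:
    """Hypothetical container mirroring the structure named in claim 4."""
    entities: List[str]                         # entity nodes
    attributes: List[Tuple[str, str]]           # (entity, attribute) tuples
    relationships: List[Tuple[str, str, str]]   # (entity, relationship, entity) triples


def select_words_to_mask(
    graph: SceneGraph, ratio: float = 0.5, seed: Optional[int] = None
) -> List[str]:
    """Randomly pick words to mask, with the count tied to the graph size.

    Assumption: the count is a fixed fraction of all candidate nodes,
    with at least one word selected whenever the graph is non-empty.
    """
    rng = random.Random(seed)
    # One candidate word per entity node, attribute node and relationship node.
    candidates = (
        list(graph.entities)
        + [attr for _, attr in graph.attributes]
        + [rel for _, rel, _ in graph.relationships]
    )
    if not candidates:
        return []
    count = max(1, int(len(candidates) * ratio))
    return rng.sample(candidates, count)


# Hypothetical scene graph for the caption "a woman puts a white cat on a brown car".
graph = SceneGraph(
    entities=["woman", "cat", "car"],
    attributes=[("cat", "white"), ("car", "brown")],
    relationships=[("cat", "on", "car")],
)
selected = select_words_to_mask(graph, ratio=0.5, seed=0)
print(selected)
```

The selected words would then be replaced by mask symbols in the caption before training. Sampling per category instead of from one pool would be an equally plausible reading of claim 5.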
Lexical analysis, e.g. tokenisation or collocates · CPC title
in albums, collections or shared content, e.g. social network photos or video · CPC title
the classifiers operating on different input data, e.g. multi-modal recognition · CPC title
Semantic analysis · CPC title
of results relating to different input data, e.g. multimodal recognition · CPC title