Multi-modal pre-training model acquisition method, electronic device and storage medium

US2022019744A1 · US · A1

Patent metadata
Publication number: US-2022019744-A1
Application number: US-202117319189-A
Country: US
Kind code: A1
Filing date: May 13, 2021
Priority date: Jul 14, 2020
Publication date: Jan 20, 2022
Grant date: not listed

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection. Read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A multi-modal pre-training model acquisition method, an electronic device and a storage medium, which relate to the fields of deep learning and natural language processing, are disclosed. The method may include: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.
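The masking step in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `mask_semantic_words`, the `[MASK]` placeholder, whitespace tokenization, and the example caption are all assumptions made for illustration, and the target words are supplied by hand rather than determined automatically.

```python
from typing import Iterable, List

MASK_TOKEN = "[MASK]"  # assumed placeholder; the abstract does not name one


def mask_semantic_words(tokens: List[str], targets: Iterable[str],
                        mask_token: str = MASK_TOKEN) -> List[str]:
    """Replace every token that appears in `targets` with `mask_token`."""
    target_set = {t.lower() for t in targets}
    return [mask_token if tok.lower() in target_set else tok for tok in tokens]


# Hypothetical caption from an image-text pair, split on whitespace.
caption = "a woman in a blue dress stands near a brown car".split()

# Hand-picked fine-grained semantic words: an entity word ("woman"),
# an attribute word ("blue") and a relationship word ("near").
masked = mask_semantic_words(caption, {"woman", "blue", "near"})
print(" ".join(masked))
```

A model trained on such input would have to recover the masked words from the remaining text and the paired image, which is the training signal the abstract describes.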

First claim

Opening claim text (preview).

What is claimed is:

1. A multi-modal pre-training model acquisition method, comprising: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.

2. The method according to claim 1, wherein the to-be-processed fine-grained semantic words comprise: entity words, attribute words and relationship words, wherein the attribute represents the attribute of each entity, and the relationship represents the relationship between entities.

3. The method according to claim 2, wherein the determining to-be-processed fine-grained semantic words in the text comprises: acquiring a scene graph corresponding to the text, and determining the to-be-processed fine-grained semantic words according to the scene graph.

4. The method according to claim 3, wherein the scene graph comprises: entity nodes, attribute tuples and relationship triples, each attribute tuple is composed of one entity node and one attribute node, and each relationship triple is composed of two entity nodes and one relationship node; and wherein the determining the to-be-processed fine-grained semantic words according to the scene graph comprises: selecting a predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph, and taking entity words in the text corresponding to the selected entity nodes, attribute words in the text corresponding to attribute nodes in the selected attribute tuples, and relationship words in the text corresponding to relationship nodes in the selected relationship triples, as the to-be-processed fine-grained semantic words.

5. The method according to claim 4, wherein the selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph comprises: determining a number of nodes to be selected according to a total number of nodes comprised in the scene graph as the predetermined number; and randomly selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph.

6. The method according to claim 2, wherein training tasks of the multi-modal pre-training model comprise: entity prediction, attribute prediction and relationship prediction; and wherein the multi-modal pre-training model predicts masked words in the text according to a context of the text and corresponding image content.

7. The method according to claim 1, further comprising: after completion of the training of the multi-modal pre-training model, fine-tuning, for any downstream task, the multi-modal pre-training model according to the training data corresponding to the downstream task.

8. An electronic device, comprising: at least one processor; and a memory in communication connection with the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to carry out a multi-modal pre-training model acquisition method, which comprises: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.

9. The electronic device according to claim 8, wherein the to-be-processed fine-grained semantic words comprise: entity words, attribute words and relationship words, wherein the attribute represents the attribute of each entity, and the relationship represents the relationship between entities.

10. The electronic device according to claim 9, wherein the determining to-be-processed fine-grained semantic words in the text comprises: acquiring a scene graph corresponding to the text, and determining the to-be-processed fine-grained semantic words according to the scene graph.

11. The electronic device according to claim 10, wherein the scene graph comprises: entity nodes, attribute tuples and relationship triples, each attribute tuple is composed of one entity node and one attribute node, and each relationship triple is composed of two entity nodes and one relationship node; and wherein the determining the to-be-processed fine-grained semantic words according to the scene graph comprises: selecting a predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph, and taking entity words in the text corresponding to the selected entity nodes, attribute words in the text corresponding to attribute nodes in the selected attribute tuples, and relationship words in the text corresponding to relationship nodes in the selected relationship triples, as the to-be-processed fine-grained semantic words.

12. The electronic device according to claim 11, wherein the selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph comprises: determining a number of nodes to be selected according to a total number of nodes comprised in the scene graph as the predetermined number; and randomly selecting the predetermined number of entity nodes, attribute tuples and relationship triples from the scene graph.

13. The electronic device according to claim 9, wherein training tasks of the multi-modal pre-training model comprise: entity prediction, attribute prediction and relationship prediction; and wherein the multi-modal pre-training model predicts masked words in the text according to a context of the text and corresponding image content.

14. The electronic device according to claim 8, wherein the method further comprises: after completion of the training of the multi-modal pre-training model, fine-tuning, for any downstream task, the multi-modal pre-training model according to the training data corresponding to the downstream task.

15. A non-transitory computer-readable storage medium comprising instructions, which, when executed by a computer, cause the computer to carry out a multi-modal pre-training model acquisition method, which comprises: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the multi-modal pre-training model using the training data with the fine-grained semantic words masked.

16. The non-transitory computer-readable storage medium according to claim 15, wherein the to-be-processed fine-grained semantic words comprise: entity words, attribute words and relationship words, wherein the attribute represents the attribute of each entity, and the relationship represents the relationship between entities.

17. The non-transitory computer-readable storage medium according to claim 16, wherein the determining to-be-processed fine-grained semantic words in the text comprises: acquiring a scene graph corresponding to the text, and determining the to-be-processed fine-grained semantic words according to the scene graph.

18. The non-transitory computer-readable storage medium according to claim 17, wherein the scene graph comprises: entity nodes, attribute tuples and relationship triples, each attribute tuple is composed of one entity node and one attribute node, and each relationship triple is composed of two ent
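The scene-graph selection described in claims 3 to 5 can be sketched in Python as follows. This is an illustrative reading of the claim language, not the applicant's code: the `SceneGraph` class, the `select_words` function, the 30% `ratio` used to derive the number of nodes from the graph size, and the example graph are all assumptions. The claims say only that the number is determined from the total number of nodes and that the selection is random.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class SceneGraph:
    """Hypothetical container mirroring the structure named in claim 4."""
    entities: List[str] = field(default_factory=list)
    # Attribute tuple: (entity node, attribute node)
    attributes: List[Tuple[str, str]] = field(default_factory=list)
    # Relationship triple: (entity node, relationship node, entity node)
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)


def select_words(graph: SceneGraph, ratio: float = 0.3,
                 rng: Optional[random.Random] = None) -> Set[str]:
    """Randomly pick words to mask; the count depends on the graph size.

    `ratio` is an assumed parameter, not a value taken from the claims.
    """
    rng = rng or random.Random()
    # One candidate per entity node, attribute node and relationship node.
    candidates = (
        [entity for entity in graph.entities]
        + [attribute for _, attribute in graph.attributes]
        + [relation for _, relation, _ in graph.relationships]
    )
    if not candidates:
        return set()
    # Number of nodes to select, derived from the total number of nodes.
    k = min(len(candidates), max(1, round(ratio * len(candidates))))
    return set(rng.sample(candidates, k))


# Hypothetical graph for "a woman in a blue dress stands near a brown car".
graph = SceneGraph(
    entities=["woman", "dress", "car"],
    attributes=[("dress", "blue"), ("car", "brown")],
    relationships=[("woman", "in", "dress"), ("woman", "near", "car")],
)
words_to_mask = select_words(graph, ratio=0.3, rng=random.Random(0))
print(sorted(words_to_mask))
```

With seven nodes and the assumed 30% ratio, two words are selected. The selected words would then be masked in the caption before training, so that the entity, attribute and relationship prediction tasks of claim 6 have targets to recover.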

Assignees

Beijing Baidu Netcom Sci & Tech Co Ltd

Inventors

Classifications

  • G06F40/284 (Primary)

    Lexical analysis, e.g. tokenisation or collocates · CPC title

  • in albums, collections or shared content, e.g. social network photos or video · CPC title

  • the classifiers operating on different input data, e.g. multi-modal recognition · CPC title

  • G06F40/30 (Primary)

    Semantic analysis · CPC title

  • of results relating to different input data, e.g. multimodal recognition · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2022019744A1 cover?
A multi-modal pre-training model acquisition method, an electronic device and a storage medium, which relate to the fields of deep learning and natural language processing, are disclosed. The method may include: determining, for each image-text pair as training data, to-be-processed fine-grained semantic words in the text; masking the to-be-processed fine-grained semantic words; and training the…
Who is the assignee on this patent?
Beijing Baidu Netcom Sci & Tech Co Ltd
What technology area does this patent fall under?
Primary CPC classification G06F40/284. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 20, 2022 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).