Systems and methods for multimodal layout designs of digital publications

US12536720B2 · US · B2

Patent metadata
Publication number: US-12536720-B2
Application number: US-202318161680-A
Country: US
Kind code: B2
Filing date: Jan 30, 2023
Priority date: Sep 16, 2022
Publication date: Jan 27, 2026
Grant date: Jan 27, 2026

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments described herein provide systems and methods for multimodal layout generations for digital publications. The system may receive as inputs, a background image, one or more foreground texts, and one or more foreground images. Feature representations of the background image may be generated. The foreground inputs may be input to a layout generator which has cross attention to the background image feature representations in order to generate a layout comprising of bounding box parameters for each input item. A composite layout may be generated based on the inputs and generated bounding boxes. The resulting composite layout may then be displayed on a user interface.
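
The last two steps of the abstract, placing each input item at its generated bounding box and composing the result, can be illustrated with a minimal sketch. It assumes, purely for illustration, that each box is a normalized (center x, center y, width, height) tuple; the abstract does not specify the parameterization, and the names `to_pixel_rect` and `compose_layout` are hypothetical. A character grid stands in for the background image.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: normalized (cx, cy, w, h)
Rect = Tuple[int, int, int, int]         # pixel (left, top, right, bottom)


def to_pixel_rect(box: Box, canvas_w: int, canvas_h: int) -> Rect:
    """Convert a normalized center-format box to a clamped pixel rectangle."""
    cx, cy, w, h = box
    left = round((cx - w / 2) * canvas_w)
    top = round((cy - h / 2) * canvas_h)
    right = round((cx + w / 2) * canvas_w)
    bottom = round((cy + h / 2) * canvas_h)
    # Clamp so an element never extends past the background.
    return (max(0, left), max(0, top), min(canvas_w, right), min(canvas_h, bottom))


def compose_layout(canvas_w: int, canvas_h: int, elements: Dict[str, Box]) -> List[List[str]]:
    """Paint each element's first letter over a blank grid; later elements draw on top."""
    canvas = [["." for _ in range(canvas_w)] for _ in range(canvas_h)]
    for label, box in elements.items():
        left, top, right, bottom = to_pixel_rect(box, canvas_w, canvas_h)
        for y in range(top, bottom):
            for x in range(left, right):
                canvas[y][x] = label[0]
    return canvas


# Hypothetical boxes for a title text and a logo image on an 8x4 background.
layout = compose_layout(
    8, 4,
    {"title": (0.5, 0.25, 0.5, 0.5), "logo": (0.875, 0.875, 0.25, 0.25)},
)
for row in layout:
    print("".join(row))
```

In a real pipeline the grid would be an image and the painting step would paste the foreground text or image into the rectangle; only the box-to-pixel arithmetic carries over directly.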

First claim

Opening claim text (preview).

What is claimed is:

1. A method of generating a visual layout for presenting content elements, the method comprising: receiving, via a data interface, a background image and a plurality of multimodal foreground elements including at least an image and a text; generating, by an image encoder, an image representation of the image with areas covered by the plurality of multimodal foreground elements inpainted; generating, by a text encoder, a text representation of the text; generating, by a visual transformer encoder, tokenized feature representations from of the background image; generating, by attention layers of a transformer decoder that is trained by layout parameters of prior layout samples, cross attention between the image representation and a concatenation of the text representation and the feature representations; generating, by the transformer decoder, layout bounding box parameters for the foreground elements based on attention weights from the cross attention; and generating, via a user interface, the layout by overlaying the foreground elements over the background image according to the layout bounding box parameters.

2. The method of claim 1, further comprising: generating variations of the layout bounding box parameters based on ensuring the foreground elements do not overlap; and generating variations of the layout by overlaying the foreground elements over the background image according to the variations of the layout bounding box parameters.

3. The method of claim 1, wherein the text includes any combination of a category label, a length, and a natural language text, and wherein the generating, by a text encoder, a text representation of the text comprises concatenating representations of the category label, the length, and the natural language text.

4. The method of claim 1, further comprising: sampling a vector based on a gaussian noise distribution; encoding the sampled vector; and concatenating the encoded vector with the text representation, wherein the cross attention with the text representation is cross attention with a representation based on the concatenated encoded vector and text representation.

5. The method of claim 4, further comprising: training the transformer decoder together with a layout encoder, wherein the layout encoder generates the gaussian noise distribution based on a bounding box parameter of a training layout.

6. The method of claim 1, further comprising: training a conditional discriminator to predict if a layout is a layout from a training dataset or a generated layout; and training the transformer decoder to minimize an accuracy of the conditional discriminator.

7. The method of claim 6, further comprising: training an auxiliary decoder to reconstruct the text based on a final feature layer of the conditional discriminator; and further training the conditional discriminator to maximize an accuracy of the auxiliary decoder.

8. The method of claim 1, further comprising: training a conditional reconstructor to reconstruct the text and the image based on a final feature layer of the transformer decoder; and training the transformer decoder to maximize an accuracy of the conditional reconstructor.

9. A system for generating a visual layout for presenting content elements, the system comprising: a memory that stores a transformer decoder and a plurality of processor executable instructions; a communication interface that receives a background image and a plurality of multimodal foreground elements including at least an image and a text; and one or more hardware processors that read and execute the plurality of processor-executable instructions from the memory to perform operations comprising: generating, by an image encoder, an image representation of the image with areas covered by the plurality of multimodal foreground elements inpainted; generating, by a text encoder, a text representation of the text; generating, by a visual transformer encoder, tokenized feature representations from the background image; generating, by attention layers of the transformer decoder that is trained by layout parameters of prior layout samples, cross attention between the image representation and a concatenation of the text representation and the feature representations; generating, by the transformer decoder, layout bounding box parameters for the foreground elements based on attention weights from the cross attention; and generating, via a user interface, the layout by overlaying the foreground elements over the background image according to the layout bounding box parameters.

10. The system of claim 9, the operations further comprising: generating variations of the layout bounding box parameters based on ensuring the foreground elements do not overlap; and generating variations of the layout by overlaying the foreground elements over the background image according to the variations of the layout bounding box parameters.

11. The system of claim 9, wherein the text includes any combination of a category label, a length, and a natural language text, and wherein the generating, by a text encoder, a text representation of the text comprises concatenating representations of the category label, the length, and the natural language text.

12. The system of claim 9, the operations further comprising: sampling a vector based on a gaussian noise distribution; encoding the sampled vector; and concatenating the encoded vector with the text representation, wherein the cross attention with the text representation is cross attention with a representation based on the concatenated encoded vector and text representation.

13. The system of claim 12, the operations further comprising: training the transformer decoder together with a layout encoder, wherein the layout encoder generates the gaussian noise distribution based on a bounding box parameter of a training layout.

14. The system of claim 9, the operations further comprising: training a conditional discriminator to predict if a layout is a layout from a training dataset or a generated layout; and training the transformer decoder to minimize an accuracy of the conditional discriminator.

15. The system of claim 14, the operations further comprising: training an auxiliary decoder to reconstruct the text based on a final feature layer of the conditional discriminator; and further training the conditional discriminator to maximize an accuracy of the auxiliary decoder.

16. The system of claim 9, the operations further comprising: training a conditional reconstructor to reconstruct the text and the image based on a final feature layer of the transformer decoder; and training the transformer decoder to maximize an accuracy of the conditional reconstructor.

17. A non-transitory machine-readable medium comprising a plurality of machine-executable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform operations comprising: receiving, via a data interface, a background image and a plurality of multimodal foreground elements including at least an image and a text; generating, by an image encoder, an image representation of the image with areas covered by the plurality of multimodal foreground elements inpainted; generating, by a text encoder, a text representation of the text; generating, by a visual transformer encoder, tokenized feature representations of from the background image; generating, by attention
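
The cross-attention step recited in claim 1 can be sketched with plain scaled dot-product attention. This is an illustration only: the claims do not say which side supplies the queries, so treating the foreground image representation as the query and the concatenated text and background features as keys and values is an assumption. The toy vectors, the function name `cross_attention`, and the absence of learned projection matrices are likewise simplifications, not details from the patent.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], context: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Scaled dot-product attention of each query over the context tokens.

    Returns (attended outputs, attention weights). Keys and values are the
    raw context tokens; a trained model would apply learned projections.
    """
    dim = len(context[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        w = softmax(scores)
        # Weighted sum of the context tokens, one output dimension at a time.
        out = [sum(w_i * v[d] for w_i, v in zip(w, context)) for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs (hypothetical values, 2-dimensional for readability).
image_repr = [[1.0, 0.0]]                    # one foreground image token
text_repr = [[0.0, 1.0]]                     # one text token
background_feats = [[1.0, 0.0], [0.5, 0.5]]  # two background patch tokens
context = text_repr + background_feats       # concatenation along the token axis

attended, attn = cross_attention(image_repr, context)
print(attn[0])  # one weight per context token; the weights sum to 1
```

In the claimed method a transformer decoder would map such attended features to layout bounding box parameters; that learned mapping is omitted here.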

Assignees

Salesforce Inc

Inventors

Classifications

  • Bounding box · CPC title

  • involving graphical user interfaces [GUIs] · CPC title

  • G06T9/00 · Primary

    Image coding (bandwidth or redundancy reduction for static pictures H04N1/41; coding or decoding of static colour picture signals H04N1/64; methods or arrangements for coding, decoding, compressing or decompressing digital video signals H04N19/00) · CPC title

  • Machine learning · CPC title

  • Character encoding · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US12536720B2 cover?
Embodiments described herein provide systems and methods for multimodal layout generations for digital publications. The system may receive as inputs, a background image, one or more foreground texts, and one or more foreground images. Feature representations of the background image may be generated. The foreground inputs may be input to a layout generator which has cross attention to the backg…
Who is the assignee on this patent?
Salesforce Inc
What technology area does this patent fall under?
Primary CPC classification G06T9/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 27, 2026 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).