Facilitating dynamic document layout by determining reading order using document content stream cues
US-11176310-B2 · Nov 16, 2021 · US
US12400384B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-12400384-B2 |
| Application number | US-202318460401-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 1, 2023 |
| Priority date | Sep 1, 2023 |
| Publication date | Aug 26, 2025 |
| Grant date | Aug 26, 2025 |
Official abstract text for this publication.
Embodiments are disclosed for reflowing documents to display semantically related content. The method may include receiving a request to view a document that includes body text and one or more images. A trimodal document relationship model identifies relationships between segments of the body text and the one or more images. A linearized view of the document is generated based on the relationships and the linearized view is caused to be displayed on a user device.
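As a rough illustration of the last two steps in the abstract, the sketch below builds a linearized view from text-image relationships that have already been predicted. The names `LinearizedEntry` and `linearize`, the dictionary shape of `relationships`, and the sample data are all hypothetical and are not taken from the patent; the relationship model itself is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class LinearizedEntry:
    """One body-text segment in the linear presentation (hypothetical structure)."""
    segment_index: int
    text: str
    # A non-empty list means a viewer could render a UI element next to
    # this segment that opens the associated image(s).
    image_ids: list = field(default_factory=list)


def linearize(segments, relationships):
    """Build a linear view of the body text with image associations attached.

    segments: list of body-text strings, assumed to be in reading order.
    relationships: dict mapping an image id to the indices of the segments
        predicted to be associated with that image (assumed input format).
    """
    entries = [LinearizedEntry(i, text) for i, text in enumerate(segments)]
    for image_id, segment_indices in relationships.items():
        for idx in segment_indices:
            entries[idx].image_ids.append(image_id)
    return entries


# Illustrative data only.
segments = [
    "Intro paragraph.",
    "Results are shown in the chart.",
    "Conclusion.",
]
relationships = {"fig1": [1]}
view = linearize(segments, relationships)
```

In this sketch only the second segment carries an image id, so a viewer would render the image-opening control on that segment alone.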
Opening claim text (preview).
We claim:

1. A method, comprising: receiving a request to view a document that includes body text, one or more images, and one or more associated captions; identifying, using a trimodal document relationship model, relationships between segments of the body text and the one or more images, wherein the trimodal document relationship model: generates a contextual embedding for each segment of the body text, image, and associated caption, and predicts at least one segment of the body text associated with each image from the one or more images based on a similarity score determined between a plurality of image-caption pairs and segments of the body text based on their contextual embeddings which encode a combination of image embeddings, text embeddings, segment embeddings, and position embeddings; generating a linearized view of the document based on the relationships; and causing the linearized view to be displayed on a user device.

2. The method of claim 1, wherein identifying, using a trimodal document relationship model, relationships between segments of the body text and the one or more images, further comprises: receiving, by the trimodal document relationship model, a plurality of segments of the body text, the one or more images, and one or more associated captions from the document.

3. The method of claim 2, wherein the trimodal document relationship model includes a transformer encoder.

4. The method of claim 3, wherein each segment embedding defines a segment type and each position embedding indicates a position of the segment of body text, image, or associated caption.

5. The method of claim 1, wherein a segment of body text includes a section, a paragraph, or a sentence.

6. The method of claim 1, wherein the linearized view is a linear presentation of the segments of the body text, wherein each segment of the body text determined to be associated with an image from the one or more images has an associated user interface element rendered in the linearized view.

7. The method of claim 6, further comprising: receiving a selection of a first user interface element associated with a first segment of the body text in the linearized view; and causing an adjustable split screen to be displayed on the user device, wherein a first pane of the split screen displays a first image determined to be associated with the first segment, and a second pane of the split screen displays at least some of the first segment of the body text.

8. The method of claim 7, wherein multiple images are associated with the first segment of the body text, and wherein the first screen of the split screen includes a second user interface element which, when selected, causes a different image from the multiple images to be displayed in the first screen.

9. The method of claim 7, wherein the first pane and the second pane of the adjustable split screen are resizable using an interactive user interface element.

10. A non-transitory computer-readable medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations comprising: receiving a request to view a document that includes body text, one or more images, and one or more associated captions; identifying, using a trimodal document relationship model, relationships between segments of the body text and the one or more images, wherein the trimodal document relationship model: generates a contextual embedding for each segment of the body text, image, and associated caption, and predicts at least one segment of the body text associated with each image from the one or more images based on a similarity score determined between a plurality of image-caption pairs and segments of the body text based on their contextual embeddings which encode a combination of image embeddings, text embeddings, segment embeddings, and position embeddings; generating a linearized view of the document based on the relationships; and causing the linearized view to be displayed on a user device.

11. The non-transitory computer-readable medium of claim 10, wherein the operation of identifying, using a trimodal document relationship model, relationships between segments of the body text and the one or more images, further comprises: receiving, by the trimodal document relationship model, a plurality of segments of the body text, the one or more images, and one or more associated captions from the document.

12. The non-transitory computer-readable medium of claim 11, wherein the trimodal document relationship model includes a transformer encoder.

13. The non-transitory computer-readable medium of claim 12, wherein each segment embedding defines a segment type and each position embedding indicates a position of the segment of body text, image, or associated caption.

14. The non-transitory computer-readable medium of claim 10, wherein the linearized view is a linear presentation of the segments of the body text, wherein each segment of the body text determined to be associated with an image from the one or more images has an associated user interface element rendered in the linearized view.

15. The non-transitory computer-readable medium of claim 14, wherein the operations further comprise: receiving a selection of a first user interface element associated with a first segment of the body text in the linearized view; and causing an adjustable split screen to be displayed on the user device, wherein a first pane of the split screen displays a first image determined to be associated with the first segment, and a second pane of the split screen displays at least some of the first segment of the body text.

16. The non-transitory computer-readable medium of claim 15, wherein multiple images are associated with the first segment of the body text, and wherein the first screen of the split screen includes a second user interface element which, when selected, causes a different image from the multiple images to be displayed in the first screen.

17. The non-transitory computer-readable medium of claim 15, wherein the first pane and the second pane of the adjustable split screen are resizable using an interactive user interface element.

18. A system, comprising: a memory component; and a processing device coupled to the memory component, the processing device to perform operations comprising: receiving, by a trimodal document relationship model, a plurality of elements of a document, wherein the elements include segments of body text, images, and image captions; generating, by a feature extractor of the trimodal document relationship model, an element embedding for each element of the document; generating a segment embedding, indicating an element type, and a position embedding, indicating an element position within the document, for each element of the document; combining each element embedding, segment embedding, and position to create a combined embedding for each element of the document; generating, by a transformer encoder of the trimodal document relationship model, a contextual embedding for each element of the document corresponding to each combined embedding; and determining semantic relationships between the segments of the body text and the images in the document using their contextual embeddings.

19. The system of claim 18, wherein the operations further comprise: generating a reflowed document based on the semantic relationships.

20. The system of claim 18, wherein the operation of determining semantic relationships bet
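The embedding pipeline recited in claim 18 and the similarity scoring recited in claim 1 can be sketched with toy vectors. This is a minimal illustration under several assumptions that the claims do not specify: the three embeddings are combined by element-wise sum, the transformer encoder is replaced by an identity stand-in (`contextualize`), an image-caption pair is represented by the mean of its two contextual embeddings, and the similarity score is cosine similarity. All names, vectors, and constants are hypothetical.

```python
import math

# Hypothetical 3-dimensional "segment embeddings", one per element type.
TYPE_EMBEDDING = {
    "text": [0.1, 0.0, 0.0],
    "image": [0.0, 0.1, 0.0],
    "caption": [0.0, 0.0, 0.1],
}


def position_embedding(position, dim=3):
    # Toy position embedding (assumption): a small constant scaled by position.
    return [0.01 * position] * dim


def add_vectors(*vectors):
    # Element-wise sum; assumed combination rule, not stated in the claims.
    return [sum(values) for values in zip(*vectors)]


def contextualize(combined):
    # Stand-in for the transformer encoder: returns its input unchanged.
    # A real encoder would mix information across all elements.
    return [list(vec) for vec in combined]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# (element type, element embedding from a hypothetical feature extractor)
elements = [
    ("text", [1.0, 0.0, 0.0]),     # body-text segment 0
    ("text", [0.0, 1.0, 0.0]),     # body-text segment 1
    ("image", [0.0, 0.9, 0.1]),    # image
    ("caption", [0.1, 0.8, 0.0]),  # caption of that image
]

combined = [
    add_vectors(vec, TYPE_EMBEDDING[kind], position_embedding(pos))
    for pos, (kind, vec) in enumerate(elements)
]
contextual = contextualize(combined)

# Represent the image-caption pair as the mean of its two embeddings (assumption).
pair = [(i + c) / 2 for i, c in zip(contextual[2], contextual[3])]

# Score each body-text segment against the pair and pick the best match.
scores = [cosine(pair, contextual[idx]) for idx in (0, 1)]
best_segment = max(range(len(scores)), key=scores.__getitem__)
```

With these toy vectors the pair aligns most closely with body-text segment 1, so `best_segment` is 1. In a real system the scores would come from learned embeddings, and the selection rule (top match, threshold, or several matches per image) is a design choice the sketch does not settle.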
for image manipulation, e.g. dragging, rotation, expansion or change of colour · CPC title
Split screen, i.e. subdividing the display area or the window area into separate subareas · CPC title
involving graphical user interfaces [GUIs] · CPC title
Selection of displayed objects or displayed text elements (G06F3/0482 takes precedence) · CPC title
Annotation, e.g. comment data or footnotes · CPC title