Automated caption generation from a dataset

US11775756B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11775756-B2
Application number: US-202017094435-A
Country: US
Kind code: B2
Filing date: Nov 10, 2020
Priority date: Nov 10, 2020
Publication date: Oct 3, 2023
Grant date: Oct 3, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A dataset captioning system is described that generates captions of text to describe insights identified from a dataset, automatically and without user intervention. To do so, given an input of a dataset the dataset captioning system determines which data insights are likely to support potential visualizations of the dataset, generates text based on these insights, orders the text, processes the ordered text for readability, and then outputs the text as a caption. These techniques also include adjustments made to the complexity of the text, globalization of the text, inclusion of links to outside sources of information, translation of the text, and so on as part of generating the caption.
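The steps named in the abstract (find insights, generate text, order it, edit for readability, output a caption) can be illustrated with a minimal sketch. This is not the patented implementation: the `Insight` class, the two toy insight detectors, the integer `specificity` score, and the pronoun substitution used as the readability edit are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Insight:
    kind: str         # hypothetical label, e.g. "average" or "extreme"
    text: str         # sentence generated for this insight
    specificity: int  # hypothetical score: lower means more general


def find_insights(label, values):
    """Toy insight detection: the maximum entry and the overall average."""
    average = sum(values) / len(values)
    peak_index = max(range(len(values)), key=values.__getitem__)
    return [
        Insight("extreme",
                f"{label} peaked at {values[peak_index]} in period {peak_index + 1}.",
                2),
        Insight("average",
                f"{label} averaged {average:.1f} across {len(values)} periods.",
                1),
    ]


def order_general_to_specific(insights):
    """Rank the generated sentences from general to specific."""
    return sorted(insights, key=lambda insight: insight.specificity)


def make_readable(insights, label):
    """Edit later sentences based on earlier ones: reuse of the subject
    is replaced with a pronoun after the first sentence."""
    sentences = []
    for position, insight in enumerate(insights):
        text = insight.text
        if position > 0 and text.startswith(label):
            text = "It" + text[len(label):]
        sentences.append(text)
    return " ".join(sentences)


def generate_caption(label, values):
    insights = find_insights(label, values)
    ordered = order_general_to_specific(insights)
    return make_readable(ordered, label)


caption = generate_caption("Revenue", [10, 12, 30, 12])
print(caption)
# Revenue averaged 16.0 across 4 periods. It peaked at 30 in period 3.
```

Sorting on a specificity score places the broad statement first, so the pronoun in the second sentence has a clear antecedent.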

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: generating, by a processing device automatically and without user intervention, a caption that textually describes a dataset having a plurality of data entries organized as a plurality of data subsets, the generating including: determining which datatypes are included in the plurality of data subsets, respectively; identifying a composition including a visualization of the dataset from a plurality of compositions based on the datatypes and a set of pre-defined heuristics that detect whether the data subsets include an outlier; determining which data insights correspond to the composition by detecting cyclic patterns in the dataset and determining which cyclic patterns are statistically significant by comparing a correlation coefficient associated with a data insight to a threshold significance value; detecting semantic datatypes in the data subsets from a pre-defined taxonomy; generating text, based on the data insights that correspond to the composition and based on the semantic datatypes, from the plurality of data entries of the dataset; editing the text for readability based on a relationship between the data insights; and forming the caption based at least in part on the edited text.

2. The method as described in claim 1, wherein the forming includes: generating scores based on the text generated for the data insights; and ranking the text generated for the data insights based on the scores from general to specific.

3. The method as described in claim 2, wherein the forming of the caption includes ordering the text based on the ranking.

4. The method as described in claim 2, wherein the scores are based on degrees of specificity.

5. The method as described in claim 1, wherein the plurality of datatypes includes quantitative, nominal, ordinal, temporal, or semantic.

6. The method as described in claim 1, wherein the data insights include anomaly, cyclic pattern, derived value, relative value, threshold amount of change, or extremes based on a minimum amount or a maximum amount.

7. The method as described in claim 1, wherein the forming of the caption includes adjusting language complexity of the text.

8. The method as described in claim 1, wherein the forming of the caption includes editing text generated for a first said data insight based on text generated for a second said data insight as part of the caption.

9. The method as described in claim 1, wherein the forming of the caption includes generating a link, to a network address, included as part of the caption, the link generated based on at least a portion of the text.

10. The method as described in claim 1, wherein the identifying of the composition is based on which combination of the datatypes is included in the dataset.

11. The method as described in claim 10, wherein the composition is: temporal based on inclusion of a temporal datatype and a quantitative datatype as part of the datatypes of the plurality of data subsets; or segment comparison based on inclusion of a quantitative datatype and a quantitative datatype as part of the datatypes of the plurality of data subsets.

12. The method as described in claim 1, further comprising receiving a user input specifying the dataset via a user interface, the dataset including a portion of a table of a larger dataset in a user interface and the data subsets are configured as rows or columns of the table, the rows or the columns share a characteristic of the datatypes.

13. A system comprising: a dataset input module implemented at least partially in hardware of a processing device to receive a dataset having a plurality of data entries and identify a composition including a visualization of the dataset from a plurality of compositions based on datatypes of the composition and a set of pre-defined heuristics that detect whether the plurality of data entries include an outlier; a text generation module implemented at least partially in hardware of the processing device to generate text based on a plurality of data insights from the plurality of data entries of the dataset by determining which data insights correspond to the composition by detecting cyclic patterns in the dataset and determining which cyclic patterns are statistically significant by comparing a correlation coefficient associated with a data insight to a threshold significance value; detecting semantic datatypes in the dataset from a pre-defined taxonomy; and a caption formation module implemented at least partially in hardware of the processing device to generate a caption including text based on the data insights that correspond to the composition and based on the semantic datatypes, the caption formation module including a complexity adjustment module configured to adjust language complexity of the text as part of the caption.

14. The system as described in claim 13, wherein the caption formation module further comprises: a score generation module to generate scores corresponding to the data insights, respectively; a ranking module configured to rank the text based on the scores corresponding to respective said data insights; and a text ordering module configured to order the text as part of the caption based on respective said scores.

15. The system as described in claim 14, wherein the scores are based on degrees of specificity.

16. The system as described in claim 13, wherein the caption formation module further comprises a readability module to edit the text generated for a first said data insight based on text generated for a second said data insight.

17. The system as described in claim 13, wherein the caption formation module further comprises a readability module to edit the text for safety.

18. The system as described in claim 13, wherein the caption formation module further comprises: a link generation module configured to generate a link as part of the caption, the link generated based on at least a portion of the text; and a translation module configured to translate the text.

19. A system comprising: means for generating, automatically and without user intervention, a caption that textually describes a dataset having a plurality of data entries, the generating means including: means for receiving a dataset having a plurality of data entries: means for identifying a composition including a visualization of the dataset from a plurality of compositions based on datatypes of the composition and a set of pre-defined heuristics that detect whether the plurality of data entries include an outlier; means for determining which data insights correspond to the composition by detecting cyclic patterns in the dataset and determining which cyclic patterns are statistically significant by comparing a correlation coefficient associated with a data insight to a threshold significance value; detecting semantic datatypes in the dataset from a pre-defined taxonomy; means for generating text based on the data insights that correspond to the composition and based on the semantic datatypes; means for ordering the text based on a ranking; and means for editing the ordered text for readability such that text generated for a first said data insight is edited based on text generated for a second said data insight.

20. The system as described in claim 19, further comprising: means for adjusting language complexity of the text as part of the caption; means for checking safety of the text as part of the caption; means for translating the text as part of the c
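The independent claims recite detecting cyclic patterns and keeping only those whose correlation coefficient meets a threshold significance value. The claim text shown here does not say which coefficient is used, so the sketch below assumes a lagged Pearson autocorrelation. The function names and the default threshold of 0.8 are hypothetical.

```python
import math


def autocorrelation(values, lag):
    """Pearson correlation between a series and a copy of itself shifted by `lag`."""
    x = values[:-lag]
    y = values[lag:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant segment carries no cyclic signal
    return covariance / (spread_x * spread_y)


def significant_cycles(values, max_lag, threshold=0.8):
    """Return (lag, coefficient) pairs whose coefficient meets the threshold.

    The threshold of 0.8 is an assumed value, not one taken from the patent.
    """
    cycles = []
    for lag in range(2, max_lag + 1):
        coefficient = autocorrelation(values, lag)
        if coefficient >= threshold:
            cycles.append((lag, coefficient))
    return cycles


# A series that repeats every three entries.
series = [1, 5, 2, 1, 5, 2, 1, 5, 2, 1, 5, 2]
cycles = significant_cycles(series, max_lag=4)
print([lag for lag, _ in cycles])  # [3]
```

Only the lag of 3 passes here: shifting the series by 2 or 4 entries misaligns the peaks, which yields a negative coefficient.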

Assignees

Adobe Inc

Inventors

Classifications

  • G06F40/216 · Primary

    using statistical methods · CPC title

  • Annotation, e.g. comment data or footnotes · CPC title

  • Named entity recognition · CPC title

  • G06F40/56 · Primary

    Natural language generation · CPC title

  • Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11775756B2 cover?
A dataset captioning system is described that generates captions of text to describe insights identified from a dataset, automatically and without user intervention. To do so, given an input of a dataset the dataset captioning system determines which data insights are likely to support potential visualizations of the dataset, generates text based on these insights, orders the text, processes th…
Who is the assignee on this patent?
Adobe Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/216. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 3, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 11 related publications on this page (citations in our corpus or others sharing the same primary CPC).