Generating insights based on numeric and categorical data
US-2021365471-A1 · Nov 25, 2021 · US
US11775756B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11775756-B2 |
| Application number | US-202017094435-A |
| Country | US |
| Kind code | B2 |
| Filing date | Nov 10, 2020 |
| Priority date | Nov 10, 2020 |
| Publication date | Oct 3, 2023 |
| Grant date | Oct 3, 2023 |
Official abstract text for this publication.
A dataset captioning system is described that generates captions of text to describe insights identified from a dataset, automatically and without user intervention. To do so, given an input of a dataset the dataset captioning system determines which data insights are likely to support potential visualizations of the dataset, generates text based on these insights, orders the text, processes the ordered text for readability, and then outputs the text as a caption. These techniques also include adjustments made to the complexity of the text, globalization of the text, inclusion of links to outside sources of information, translation of the text, and so on as part of generating the caption.
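The stages named in the abstract (generate text per insight, order it, edit it for readability, output a caption) can be illustrated with a minimal Python sketch. It is not the patented implementation: the `InsightText` fields, the numeric `specificity` score, the pronoun-substitution rule, and the sample sentences are all hypothetical, chosen only to show how ordering and a readability pass might compose.

```python
from dataclasses import dataclass


@dataclass
class InsightText:
    kind: str           # illustrative label, e.g. "extreme" or "anomaly"
    subject: str        # what the sentence is about
    predicate: str      # rest of the sentence, without the subject
    specificity: float  # hypothetical score: lower means more general


def order_general_to_specific(items):
    """Order insight texts by ascending specificity score."""
    return sorted(items, key=lambda item: item.specificity)


def edit_for_readability(items):
    """Swap a repeated subject for a pronoun so adjacent sentences flow."""
    sentences = []
    previous_subject = None
    for item in items:
        subject = "It" if item.subject == previous_subject else item.subject
        sentences.append(f"{subject} {item.predicate}.")
        previous_subject = item.subject
    return sentences


def form_caption(items):
    """Order the texts, edit them, and join them into one caption string."""
    ordered = order_general_to_specific(items)
    return " ".join(edit_for_readability(ordered))


# Made-up insight texts, for illustration only.
insights = [
    InsightText("extreme", "Revenue", "peaked at 120 in December", 0.8),
    InsightText("trend", "Revenue", "rose steadily through the year", 0.2),
    InsightText("anomaly", "Returns", "spiked unexpectedly in July", 0.9),
]
caption = form_caption(insights)
print(caption)
```

In this sketch the second sentence is edited based on the first (the repeated subject becomes "It"), which is one simple way a readability pass could depend on neighboring insight text.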
Opening claim text (preview).
What is claimed is:
1. A method comprising: generating, by a processing device automatically and without user intervention, a caption that textually describes a dataset having a plurality of data entries organized as a plurality of data subsets, the generating including: determining which datatypes are included in the plurality of data subsets, respectively; identifying a composition including a visualization of the dataset from a plurality of compositions based on the datatypes and a set of pre-defined heuristics that detect whether the data subsets include an outlier; determining which data insights correspond to the composition by detecting cyclic patterns in the dataset and determining which cyclic patterns are statistically significant by comparing a correlation coefficient associated with a data insight to a threshold significance value; detecting semantic datatypes in the data subsets from a pre-defined taxonomy; generating text, based on the data insights that correspond to the composition and based on the semantic datatypes, from the plurality of data entries of the dataset; editing the text for readability based on a relationship between the data insights; and forming the caption based at least in part on the edited text.
2. The method as described in claim 1, wherein the forming includes: generating scores based on the text generated for the data insights; and ranking the text generated for the data insights based on the scores from general to specific.
3. The method as described in claim 2, wherein the forming of the caption includes ordering the text based on the ranking.
4. The method as described in claim 2, wherein the scores are based on degrees of specificity.
5. The method as described in claim 1, wherein the plurality of datatypes includes quantitative, nominal, ordinal, temporal, or semantic.
6. The method as described in claim 1, wherein the data insights include anomaly, cyclic pattern, derived value, relative value, threshold amount of change, or extremes based on a minimum amount or a maximum amount.
7. The method as described in claim 1, wherein the forming of the caption includes adjusting language complexity of the text.
8. The method as described in claim 1, wherein the forming of the caption includes editing text generated for a first said data insight based on text generated for a second said data insight as part of the caption.
9. The method as described in claim 1, wherein the forming of the caption includes generating a link, to a network address, included as part of the caption, the link generated based on at least a portion of the text.
10. The method as described in claim 1, wherein the identifying of the composition is based on which combination of the datatypes is included in the dataset.
11. The method as described in claim 10, wherein the composition is: temporal based on inclusion of a temporal datatype and a quantitative datatype as part of the datatypes of the plurality of data subsets; or segment comparison based on inclusion of a quantitative datatype and a quantitative datatype as part of the datatypes of the plurality of data subsets.
12. The method as described in claim 1, further comprising receiving a user input specifying the dataset via a user interface, the dataset including a portion of a table of a larger dataset in a user interface and the data subsets are configured as rows or columns of the table, the rows or the columns share a characteristic of the datatypes.
13. A system comprising: a dataset input module implemented at least partially in hardware of a processing device to receive a dataset having a plurality of data entries and identify a composition including a visualization of the dataset from a plurality of compositions based on datatypes of the composition and a set of pre-defined heuristics that detect whether the plurality of data entries include an outlier; a text generation module implemented at least partially in hardware of the processing device to generate text based on a plurality of data insights from the plurality of data entries of the dataset by determining which data insights correspond to the composition by detecting cyclic patterns in the dataset and determining which cyclic patterns are statistically significant by comparing a correlation coefficient associated with a data insight to a threshold significance value; detecting semantic datatypes in the dataset from a pre-defined taxonomy; and a caption formation module implemented at least partially in hardware of the processing device to generate a caption including text based on the data insights that correspond to the composition and based on the semantic datatypes, the caption formation module including a complexity adjustment module configured to adjust language complexity of the text as part of the caption.
14. The system as described in claim 13, wherein the caption formation module further comprises: a score generation module to generate scores corresponding to the data insights, respectively; a ranking module configured to rank the text based on the scores corresponding to respective said data insights; and a text ordering module configured to order the text as part of the caption based on respective said scores.
15. The system as described in claim 14, wherein the scores are based on degrees of specificity.
16. The system as described in claim 13, wherein the caption formation module further comprises a readability module to edit the text generated for a first said data insight based on text generated for a second said data insight.
17. The system as described in claim 13, wherein the caption formation module further comprises a readability module to edit the text for safety.
18. The system as described in claim 13, wherein the caption formation module further comprises: a link generation module configured to generate a link as part of the caption, the link generated based on at least a portion of the text; and a translation module configured to translate the text.
19. A system comprising: means for generating, automatically and without user intervention, a caption that textually describes a dataset having a plurality of data entries, the generating means including: means for receiving a dataset having a plurality of data entries: means for identifying a composition including a visualization of the dataset from a plurality of compositions based on datatypes of the composition and a set of pre-defined heuristics that detect whether the plurality of data entries include an outlier; means for determining which data insights correspond to the composition by detecting cyclic patterns in the dataset and determining which cyclic patterns are statistically significant by comparing a correlation coefficient associated with a data insight to a threshold significance value; detecting semantic datatypes in the dataset from a pre-defined taxonomy; means for generating text based on the data insights that correspond to the composition and based on the semantic datatypes; means for ordering the text based on a ranking; and means for editing the ordered text for readability such that text generated for a first said data insight is edited based on text generated for a second said data insight.
20. The system as described in claim 19, further comprising: means for adjusting language complexity of the text as part of the caption; means for checking safety of the text as part of the caption; means for translating the text as part of the c
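The independent claims recite detecting cyclic patterns and keeping those whose correlation coefficient meets a threshold significance value. One way this could look is sketched below in Python. The claims do not say which coefficient or threshold is used, so the choice of lag-based Pearson autocorrelation, the `threshold=0.8` default, the function names, and the sample series are all assumptions made for illustration.

```python
from math import sqrt


def autocorrelation(values, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    x = values[:-lag]
    y = values[lag:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no measurable cycle
    return cov / sqrt(var_x * var_y)


def significant_cycles(values, max_lag, threshold=0.8):
    """Return (lag, coefficient) pairs whose coefficient meets the threshold."""
    cycles = []
    for lag in range(2, max_lag + 1):
        r = autocorrelation(values, lag)
        if r >= threshold:  # assumed significance test: coefficient vs. threshold
            cycles.append((lag, r))
    return cycles


# Made-up series that repeats every 4 points, for illustration only.
series = [1, 5, 9, 5] * 6
cycles = significant_cycles(series, max_lag=6)
print(cycles)
```

Here only the lag that matches the repeat length of the sample series passes the threshold; lags that are out of phase produce coefficients near zero or negative and are discarded.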
using statistical methods · CPC title
Annotation, e.g. comment data or footnotes · CPC title
Named entity recognition · CPC title
Natural language generation · CPC title
Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation · CPC title