Selection of a representative data subset of a set of unstructured data

US11232124B2 · US · B2

Patent metadata
Field | Value
Publication number | US-11232124-B2
Application number | US-202016751063-A
Country | US
Kind code | B2
Filing date | Jan 23, 2020
Priority date | Jan 22, 2013
Publication date | Jan 25, 2022
Grant date | Jan 25, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are directed towards generating a representative sampling as a subset from a larger dataset that includes unstructured data. A graphical user interface enables a user to provide various data selection parameters, including specifying a data source and one or more subset types desired, including one or more of latest records, earliest records, diverse records, outlier records, and/or random records. Diverse and/or outlier subset types may be obtained by generating clusters from an initial selection of records obtained from the larger dataset. An iteration analysis is performed to determine whether a sufficient number of clusters and/or cluster types have been generated that exceed at least one threshold and when not exceeded, additional clustering is performed on additional records. From the resultant clusters, and/or other subtype results, a subset of records is obtained as the representative sampling subset.
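
The iteration described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the abstract does not name a clustering algorithm or a threshold, so the greedy Jaccard clustering, the digit masking, and the names `cluster`, `sample_clusters`, `batch_size`, `min_clusters`, `diverse`, and `outliers` are all assumptions made for this example.

```python
import re

def tokens(record):
    """Lowercased token set with digit runs masked so similar log lines match."""
    return frozenset(re.sub(r"\d+", "N", record.lower()).split())

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster(records, min_similarity=0.6):
    """Greedy single-pass clustering (assumed algorithm): a record joins the
    first cluster whose seed record is similar enough, else starts a new one."""
    clusters = []  # each cluster is a list of records; item 0 is the seed
    for rec in records:
        t = tokens(rec)
        for c in clusters:
            if jaccard(t, tokens(c[0])) >= min_similarity:
                c.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def sample_clusters(source, batch_size=4, min_clusters=3):
    """Cluster an initial selection; while the cluster count is below the
    threshold and records remain, pull in more records and cluster again."""
    taken = batch_size
    clusters = cluster(source[:taken])
    while len(clusters) < min_clusters and taken < len(source):
        taken += batch_size
        clusters = cluster(source[:taken])
    return clusters

def diverse(clusters):
    """One record per cluster."""
    return [c[0] for c in clusters]

def outliers(clusters, max_size=1):
    """Records from clusters that stayed very small."""
    return [c[0] for c in clusters if len(c) <= max_size]

logs = [
    "user 101 logged in from 10.0.0.1",
    "user 102 logged in from 10.0.0.2",
    "user 103 logged in from 10.0.0.3",
    "user 104 logged in from 10.0.0.4",
    "disk usage at 91 percent on /var",
    "user 105 logged in from 10.0.0.5",
    "kernel panic: unable to mount root fs",
    "disk usage at 95 percent on /var",
]
clusters = sample_clusters(logs)
print(diverse(clusters))
print(outliers(clusters))
```

In this run the first four records form a single cluster, so the loop pulls in the remaining records before the threshold of three clusters is met. A real system would likely cluster incrementally instead of re-clustering the whole prefix; the re-clustering here only keeps the sketch short.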

First claim

Opening claim text (preview).

We claim:
1. A computer-implemented method comprising: retrieving a plurality of events from a data source in accordance with a selected data subset type of a plurality of defined, user-selectable data subset types, wherein the selected data subset type is a combination of at least two of the plurality of defined, user-selectable data subset types; identifying similarity between two or more of the retrieved events; determining whether any of the retrieved events form a group, based on the identified similarity; selecting, based on a determination that two or more of the retrieved events form a group, at least a subset of the retrieved events that form the group, as a representative data subset; and causing display of the selected events in a user interface as the representative data subset, wherein said causing display comprises causing display of the selected events in a user interface that enables development of a field extraction rule that specifies how to extract a value for a field from information in one or more events.
2. The method of claim 1, wherein said determining whether any of the retrieved events form a group comprises applying a clustering algorithm to the retrieved events.
3. The method of claim 1, further comprising: receiving, from a user, a selection of a data source type from which to generate the representative data subset.
4. The method of claim 1, further comprising: receiving, from a user, a selection of the data subset type, of the plurality of defined data subset types, for identifying an event to include in the representative data subset.
5. The method of claim 1, further comprising: receiving, from a user, a selection of a number of desired representative events to be included in the representative data subset.
6. The method of claim 1, further comprising: receiving, from a user, selections of: a data source type from which to generate the representative data subset, one or a combination of subset types, of a plurality of defined data subset types, for identifying an event to include in the representative data subset, and a number of desired representative events to be included in the representative data subset.
7. The method of claim 1, wherein each of the retrieved events includes raw data indicative of performance or activity of one or more components of an information technology environment.
8. The method of claim 1, wherein the plurality of defined data subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
9. The method of claim 1, wherein determining whether any of the retrieved events form a group is part of applying a clustering algorithm to the plurality of events to form a plurality of clusters; the method further comprising: determining that a number of clusters in the plurality of clusters is not of a sufficiently large number; and clustering a larger group of events in the plurality of events than the group of events.
10. The method of claim 1, wherein each event in the plurality of events is associated with a time stamp.
11. The method of claim 1, wherein each event in the plurality of events is associated with a time stamp that has been extracted from data in that event.
12. The method of claim 1, wherein retrieving events from the data source includes using a process to identify outlier events.
13. The method of claim 1, wherein retrieving events from the data source includes using a process to identify events associated with earliest events in the plurality of events.
14. The method of claim 1, wherein retrieving events from the data source includes using a process to identify events associated with latest events in the plurality of events.
15. A non-transitory, machine-readable storage medium storing instructions, execution of which in a computer system causes the computer system to perform operations comprising: retrieving a plurality of events from a data source in accordance with a selected data subset type of a plurality of defined, user-selectable data subset types, wherein the selected data subset type is a combination of at least two of the plurality of defined, user-selectable data subset types; identifying similarity between two or more of the retrieved events; determining whether any of the retrieved events form a group, based on the identified similarity; selecting, based on a determination that two or more of the retrieved events form a group, at least a subset of the retrieved events that form the group, as a representative data subset; and causing display of the selected events in a user interface as the representative data subset, wherein said causing display comprises causing display of the selected events in a user interface that enables development of a field extraction rule that specifies how to extract a value for a field from information in one or more events.
16. The machine-readable storage medium of claim 15, wherein said determining whether any of the retrieved events form a group comprises applying a clustering algorithm to the retrieved events.
17. The machine-readable storage medium of claim 15, storing further instructions, execution of which in a computer system causes the computer system to perform operations comprising: receiving, from a user, a selection of a data source type from which to generate the representative data subset.
18. The machine-readable storage medium of claim 15, storing further instructions, execution of which in a computer system causes the computer system to perform operations comprising: receiving, from a user, a selection of the data subset type, of the plurality of defined data subset types, for identifying an event to include in the representative data subset.
19. The machine-readable storage medium of claim 15, storing further instructions, execution of which in a computer system causes the computer system to perform operations comprising: receiving, from a user, a selection of a number of desired representative events to be included in the representative data subset.
20. The machine-readable storage medium of claim 15, storing further instructions, execution of which in a computer system causes the computer system to perform operations comprising: receiving, from a user, selections of: a data source type from which to generate the representative data subset, one or a combination of subset types, of a plurality of defined data subset types, for identifying an event to include in the representative data subset, and a number of desired representative events to be included in the representative data subset.
21. The machine-readable storage medium of claim 15, wherein each of the retrieved events includes raw data indicative of performance or activity of one or more components of an information technology environment.
22. The machine-readable storage medium of claim 15, wherein the plurality of defined data subset types corresponds to a plurality of subtype processes that include one or more of a diverse event-identification process, an outlier event-identification process, a random event identification process, an earlier event-identification process, or a later event-identification process.
23. The machine-readable storage medium of claim 15, wherein determinin
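
The steps recited in claim 1 (retrieve by a combination of subset types, group by similarity, select grouped events, then apply a field extraction rule) can be sketched roughly as below. This is an illustrative reading under stated assumptions, not the claimed implementation: the event dictionaries, the `SUBSET_TYPES` table, the use of `difflib.SequenceMatcher` as the similarity measure, the 0.7 threshold, and the regular-expression form of the extraction rule are all hypothetical choices made for this example.

```python
import random
import re
from difflib import SequenceMatcher

# Hypothetical subset-type retrievers: each takes (events, n), returns up to n events.
SUBSET_TYPES = {
    "latest": lambda events, n: sorted(events, key=lambda e: e["time"])[-n:],
    "earliest": lambda events, n: sorted(events, key=lambda e: e["time"])[:n],
    "random": lambda events, n: random.Random(0).sample(events, min(n, len(events))),
}

def retrieve(events, subset_types, n):
    """Retrieve events using a combination of at least two subset types."""
    if len(subset_types) < 2:
        raise ValueError("expected a combination of at least two subset types")
    seen, out = set(), []
    for name in subset_types:
        for e in SUBSET_TYPES[name](events, n):
            if id(e) not in seen:  # skip events already picked by another type
                seen.add(id(e))
                out.append(e)
    return out

def group_similar(events, threshold=0.7):
    """Group events whose raw text is similar to a group's first event."""
    groups = []
    for e in events:
        for g in groups:
            if SequenceMatcher(None, e["raw"], g[0]["raw"]).ratio() >= threshold:
                g.append(e)
                break
        else:
            groups.append([e])
    return groups

def representative_subset(events, subset_types, n):
    """Keep only retrieved events that form a group of two or more."""
    retrieved = retrieve(events, subset_types, n)
    return [e for g in group_similar(retrieved) if len(g) >= 2 for e in g]

def extract_field(rule, event):
    """Apply a regex-based extraction rule; group 1 is the field value."""
    m = re.search(rule, event["raw"])
    return m.group(1) if m else None

events = [
    {"time": 1, "raw": "GET /index.html status=200"},
    {"time": 2, "raw": "GET /about.html status=200"},
    {"time": 3, "raw": "GET /index.html status=500"},
    {"time": 4, "raw": "GET /login.html status=404"},
    {"time": 5, "raw": "disk failure on sda"},
]
subset = representative_subset(events, ["earliest", "latest"], n=2)
statuses = [extract_field(r"status=(\d+)", e) for e in subset]
print([e["time"] for e in subset], statuses)
```

Here the "earliest" and "latest" types together retrieve four events; the lone disk-failure event forms no group and is dropped, and the remaining events are the ones a user would see while developing the `status` extraction rule.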

Assignees

Splunk Inc

Inventors

Classifications

  • G06F16/35 (Primary)

    Clustering; Classification · CPC title

  • Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers {sorting methods in general} (G06F7/36 takes precedence) · CPC title

  • Visualization; Browsing · CPC title

  • Selection of displayed objects or displayed text elements (G06F3/0482 takes precedence) · CPC title

  • G06F16/254 (Primary)

    Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11232124B2 cover?
Embodiments are directed towards generating a representative sampling as a subset from a larger dataset that includes unstructured data. A graphical user interface enables a user to provide various data selection parameters, including specifying a data source and one or more subset types desired, including one or more of latest records, earliest records, diverse records, outlier records, and/or…
Who is the assignee on this patent?
Splunk Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/35. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 25, 2022 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).