Field extraction rules from clustered data samples

US11216491B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11216491-B2
Application number  US-201615143563-A
Country             US
Kind code           B2
Filing date         Apr 30, 2016
Priority date       Mar 31, 2016
Publication date    Jan 4, 2022
Grant date          Jan 4, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The operation of an automatic data input and query system is controlled by well-defined control data. Certain control data may relate to data schemas and direct operations performed by the system to extract fields from machine data. Automatic methods may determine proper field extraction control information by analyzing a sample of data from a source, breaking the sample data into event segments, classifying the segments into groups based on a measure of similarity, determining an operable extraction rule for each group, and storing the resulting extraction model. Data patterns known by the system can be leveraged to perform the event breaking and field identification for the classifying. Embodiments may provide a user interface to view, interact with, and approve the computer-generated extraction model.
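
The pipeline in the abstract (break a sample into events, group similar events, derive one extraction rule per group, store the model) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the patented implementation: the timestamp delimiter `EVENT_START`, the `key=value` field convention, and the helper names `break_events`, `shape`, `rule_for`, and `build_model` are hypothetical, and grouping by an exact field-name signature is a simple stand-in for the similarity-based classification the abstract describes.

```python
import re
from collections import defaultdict

# Assumed delimiter: an event starts at a line that begins with a timestamp.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", re.MULTILINE)


def break_events(sample: str) -> list[str]:
    """Split the sample at every delimiter match (text before the first match is dropped)."""
    starts = [m.start() for m in EVENT_START.finditer(sample)]
    if not starts:
        return [sample.strip()] if sample.strip() else []
    bounds = starts + [len(sample)]
    return [sample[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def shape(event: str) -> tuple[str, ...]:
    """Coarse signature: the distinct key names of key=value pairs, in order of appearance."""
    keys = re.findall(r"([A-Za-z_]\w*)=", event)
    return tuple(dict.fromkeys(keys))


def rule_for(keys: tuple[str, ...]) -> re.Pattern:
    """Build one regex with a named capture group per field name."""
    parts = [rf"{re.escape(k)}=(?P<{k}>\S+)" for k in keys]
    return re.compile(r".*?".join(parts), re.DOTALL)


def build_model(sample: str) -> dict[tuple[str, ...], re.Pattern]:
    """Group events by signature and keep one extraction rule per group."""
    groups = defaultdict(list)
    for event in break_events(sample):
        groups[shape(event)].append(event)
    return {keys: rule_for(keys) for keys in groups if keys}


sample = (
    "2016-03-31 10:00:00 user=alice action=login\n"
    "2016-03-31 10:00:05 user=bob action=logout\n"
    "2016-03-31 10:00:09 status=500 path=/api\n"
    "  trace=abc\n"
)
events = break_events(sample)
model = build_model(sample)
first = model[("user", "action")].search(events[0]).groupdict()
```

In this toy run the sample yields three events and two groups, so the stored model holds two rules. A real system would replace the exact-signature grouping with a proper similarity measure and would let a user review the rules before they are stored.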

First claim

Opening claim text (preview).

What is claimed:

1. A method, comprising:
    receiving a sample of machine data in a form produced by a data source;
    predicting breakpoints in the sample by performing pattern recognition, the breakpoints identifying boundaries between distinct event segments of a plurality of event segments of the sample, each event segment corresponding to an individual event, wherein the pattern recognition compares patterns in the sample with a plurality of event delimiter patterns from a plurality of data source types;
    generating, using field prediction, a parsed events view of the sample, the parsed events view comprising a plurality of predicted fields for each of the distinct events segments;
    classifying the event segments into at least a first group of event segments and a second group of event segments based at least in part on a measure of similarity within each of the first group and the second group, wherein the first group of event segments are in the plurality of event segments, wherein the second group of event segments are in the plurality of event segments, and wherein the measure of similarity includes a cluster analysis based at least in part on one or more from among connectivity-based clustering, centroid-based clustering, distribution based clustering, density-based clustering, canopy clustering, K-means clustering, subspace clustering, and correlation clustering;
    determining, using field data in the first group of event segments, an extraction rule for extracting a common set of one or more predicted fields from each event segment of the first group of event segments;
    determining, using field data in the second group of event segments, an extraction rule for extracting a common set of one or more predicted fields from each event segment of the second group of event segments;
    reclassifying the event segments into at least a third group of event segments and a fourth group of event segments until a successful set of extraction rules are identified, wherein the successful set of extraction rules are identified by examining, for the third group and the fourth group of event segments, a location, size, and content of the field data for a predefined criterion that extracts all of the field data from all of the plurality of predicted fields in the third group and the fourth group of event segments; and
    storing the extraction rules in computer memory, wherein the method is performed by a computing system comprising one or more processors.

2. The method of claim 1, wherein classifying includes automatically identifying one or more fields in an event segment by matching patterns associated with one or more known fields.

3. The method of claim 1, wherein classifying includes automatically identifying one or more fields in an event segment by matching patterns associated with one or more known fields of a late-binding schema.

4. The method of claim 1, wherein classifying includes automatically identifying one or more fields in an event segment by matching patterns associated with one or more known fields of a late-binding schema, the known fields having an association with a domain category.

5. The method of claim 1, wherein the measure of similarity includes a statistical classification.

6. The method of claim 1, wherein the measure of similarity includes a cluster analysis.

7. The method of claim 1, wherein: classifying includes automatically identifying one or more fields in an event segment; and the measure of similarity includes a statistical classification.

8. The method of claim 1, wherein storing the successful set of extraction rules includes storing the successful set of extraction rules in association with a data sourcetype of the data source.

9. The method of claim 1, wherein storing the successful set of extraction rules includes storing the successful set of extraction rules in association with the data source.

10. The method of claim 1, wherein storing the successful set of extraction rules includes storing the successful set of extraction rules as an extraction model.

11. The method of claim 1, wherein storing the successful set of extraction rules includes storing the successful set of extraction rules as an extraction model of a data source.

12. The method of claim 1, wherein storing the successful set of extraction rules includes storing the successful set of extraction rules as an extraction model of a data sourcetype.

13. The method of claim 1, wherein storing the successful set of extraction rules is conditioned, at least in part, on receiving an indication of acceptance based on user interaction with a user interface displaying a representation of one or more of the extraction rules in the successful set of extraction rules.

14. The method of claim 1, wherein storing the successful set of extraction rules is conditioned, at least in part, on receiving an indication of acceptance based on user interaction with a user interface displaying a representation of one or more of the extraction rules in the successful set of extraction rules, the representation including a depiction of an event segment.

15. The method of claim 1, wherein storing the successful set of extraction rules is conditioned, at least in part, on receiving an indication of acceptance based on user interaction with a user interface displaying a representation of one or more of the extraction rules, the representation including a depiction of an event segment having one or more field portions color coded in accordance with a particular extraction rule.

16. The method of claim 1, wherein storing the successful set of extraction rules is conditioned, at least in part, on receiving an indication of acceptance based on user interaction with a user interface displaying a representation of one or more of the extraction rules, the representation including a depiction of an event segment having one or more field portions each substituted with a field identifier in accordance with a particular extraction rule.

17. The method of claim 1, wherein the method is further performed by the computing system to deliver a response time of about 20 seconds or less.

18. The method of claim 1, wherein the method is further performed by the computing system to deliver a response time of about 5 seconds or less.

19. A system comprising:
    a memory; and
    a processing device coupled with the memory to perform operations comprising:
    receiving a sample of machine data in a form produced by a data source;
    predicting breakpoints in the sample by performing pattern recognition, the breakpoints identifying boundaries between distinct event segments of a plurality of event segments of the sample, each event segment corresponding to an individual event, wherein the pattern recognition compares the patterns in the sample with a plurality of event delimiter patterns from a plurality of data source types;
    generating, using field prediction, a parsed events view of the sample, the parsed events view comprising a plurality of predicted fields for each of the distinct events segments;
    classifying the event segments into at least a first group of event segments and a second group of event segments based at least in part on a measure of similarity within each of the first group and the second group, wherein the first group of event segments are in the plurality of event segments, wherein the second group of event segments are in the plurality of event segments, and wherein the measure of similarity includes a cluster analysis based at least in part on one or more from among connec
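
The "reclassifying ... until a successful set of extraction rules are identified" step of claim 1 can be pictured as a loop that regroups the events at a stricter similarity threshold whenever some group's rule fails to cover all of that group's predicted fields. The sketch below is only one possible reading under stated assumptions: fields are predicted as `key=value` pairs, similarity is Jaccard overlap of field names, grouping is a greedy single pass against each group's first member (a simple stand-in, not one of the clustering methods the claim lists), and the success test checks field names only, whereas the claim also examines location, size, and content. The names `fields`, `jaccard`, `cluster`, `rule_for`, `succeeds`, and `fit` are hypothetical.

```python
import re


def fields(event: str) -> dict[str, str]:
    """Toy field prediction: every key=value pair in the event."""
    return dict(re.findall(r"([A-Za-z_]\w*)=(\S+)", event))


def jaccard(a: set, b: set) -> float:
    """Overlap of two field-name sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster(events: list[str], threshold: float) -> list[list[str]]:
    """Greedy grouping: join the first group whose seed event is similar enough."""
    groups: list[list[str]] = []
    for event in events:
        keys = set(fields(event))
        for group in groups:
            if jaccard(keys, set(fields(group[0]))) >= threshold:
                group.append(event)
                break
        else:
            groups.append([event])
    return groups


def rule_for(group: list[str]) -> list[str]:
    """Assumed rule form: the field names shared by every event in the group."""
    return sorted(set.intersection(*(set(fields(e)) for e in group)))


def succeeds(group: list[str], rule: list[str]) -> bool:
    """Assumed criterion: the rule covers every predicted field of every event."""
    return all(set(fields(e)) == set(rule) for e in group)


def fit(events: list[str], thresholds=(0.3, 0.6, 1.0)):
    """Regroup at stricter thresholds until every group's rule succeeds."""
    for t in thresholds:
        groups = cluster(events, t)
        rules = [rule_for(g) for g in groups]
        if all(succeeds(g, r) for g, r in zip(groups, rules)):
            return t, rules
    return None


events = [
    "user=a action=login",
    "user=b action=logout ip=1.2.3.4",
    "status=500 path=/x",
]
result = fit(events)
```

With these three events the two looser thresholds put the first two events in one group, whose shared rule misses the `ip` field, so the loop only succeeds at the strictest threshold, where each event forms its own group.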

Assignees

Splunk Inc

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11216491B2 cover?
The operation of an automatic data input and query system is controlled by well-defined control data. Certain control data may relate to data schemas and direct operations performed by the system to extract fields from machine data. Automatic methods may determine proper field extraction control information by analyzing a sample of data from a source, breaking the sample data into event segment…
Who is the assignee on this patent?
Splunk Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/287. Mapped technology areas include Physics.
When was this patent published?
Publication date Jan 4, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).