Advanced field extractor with multiple positive examples

US9753909B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9753909-B2
Application number  US-201514610668-A
Country             US
Kind code           B2
Filing date         Jan 30, 2015
Priority date       Sep 7, 2012
Publication date    Sep 5, 2017
Grant date          Sep 5, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The technology disclosed relates to formulating and refining field extraction rules that are used at query time on raw data with a late-binding schema. The field extraction rules identify portions of the raw data, as well as their data types and hierarchical relationships. These extraction rules are executed against very large data sets not organized into relational structures that have not been processed by standard extraction or transformation methods. By using sample events, a focus on primary and secondary example events help formulate either a single extraction rule spanning multiple data formats, or multiple rules directed to distinct formats. Selection tools mark up the example events to indicate positive examples for the extraction rules, and to identify negative examples to avoid mistaken value selection. The extraction rules can be saved for query-time use, and can be incorporated into a data model for sets and subsets of event data.
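
To make the abstract concrete, the following is a minimal sketch of query-time field extraction checked against positive and negative examples. It is not the patented method or any Splunk API: the event format, the `RULES` mapping, and the helper names `extract_fields` and `rule_is_consistent` are assumptions made up for illustration, and plain regular expressions stand in for whatever rule representation the patent uses.

```python
import re

# Hypothetical raw events; the log format is invented for this sketch.
RAW_EVENTS = [
    "2015-01-30T10:00:01 host=web01 status=200 bytes=512",
    "2015-01-30T10:00:02 host=web02 status=404 bytes=0",
    "2015-01-30T10:00:03 host=web01 msg=restart",
]

# Hypothetical extraction rules: field name -> regex with one capture group.
RULES = {
    "status": re.compile(r"\bstatus=(\d{3})\b"),
    "host": re.compile(r"\bhost=(\S+)"),
}


def extract_fields(raw, rules):
    """Apply every rule to one raw event at query time.

    The stored event stays untouched; fields exist only in the result,
    which is the late-binding idea the abstract describes.
    """
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(raw)
        if match:
            fields[name] = match.group(1)
    return fields


def rule_is_consistent(pattern, positives, negatives):
    """Check one rule against marked-up example events.

    positives: (event, value that must be extracted) pairs.
    negatives: (event, value that must not be extracted) pairs.
    """
    for raw, expected in positives:
        match = pattern.search(raw)
        if match is None or match.group(1) != expected:
            return False
    for raw, forbidden in negatives:
        match = pattern.search(raw)
        if match is not None and match.group(1) == forbidden:
            return False
    return True
```

A rule that is too loose, such as `=(\d+)`, still satisfies the positive example from the first event, but a negative example on an event where `bytes=` comes first exposes it. That is the role the abstract assigns to negative examples: avoiding mistaken value selection.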

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising: accessing in memory a set of events each event identified by an associated time stamp; wherein each event in the set of events includes a portion of raw data; causing display of a first user interface including a plurality of events; receiving data indicating selection of a first event from among the plurality of events; causing display of a second user interface presenting the first event to be used to define field extraction; receiving data indicating a selection of one or more portions of text within the first event to be extracted as one or more fields; automatically determining at least one field extraction rule that extracts one or more values for the one or more fields from the respective selections of the portions of text within the events when the extraction rule is applied to the events; causing display of a third user interface including an annotated version of the plurality of events, wherein the annotated version indicates the portions of text within the plurality of events extracted by the field extraction rule and presenting a second event to be used to refine field extraction; and receiving further data indicating a selection of at least one portion of text within the second event to be extracted as into at least one of the fields by at least one updated field extraction rule.

2. The method of claim 1, wherein the raw data is from machine data.

3. The method of claim 1, wherein the raw data is from server data.

4. The method of claim 1, further including: transmitting in the second user interface one or more tools that implement user selection of the one or more portions of text within the first event and naming of the one or more fields.

5. The method of claim 1, further including: the second user interface providing tools that implement user selection of a sampling strategy to determine the events in the display; receiving further data indicating a selection of the sampling strategy; and resampling and updating the events to be displayed.

6. The method of claim 1, further including: the second user interface providing tools that implement user selection of a sampling strategy to determine the events in the display; receiving further data indicating a selection of a diverse sampling strategy; and resampling according to the diverse sampling strategy, comprising clustering a set of events into multiple clusters, calculating a size of each cluster, and selecting one or more events from each cluster in a set of larger size clusters; and updating the events to be displayed.

7. The method of claim 1, further including: the second user interface providing tools that implement user selection of a sampling strategy to determine the events in the display; receiving further data indicating a selection of a rare sampling strategy; and resampling according to the rare sampling strategy, comprising clustering a set of events into multiple clusters, calculating a size of each cluster, and selecting one or more events from each cluster in a set of smaller size clusters; and updating the events to be displayed.

8. The method of claim 1, further including: the second user interface providing tools that implement user selection of a sampling strategy to determine the events in the display; receiving further data indicating a selection of a time range sampling strategy; and resampling according to the time range sampling strategy, retrieving at least a sample of events in the selected time range; and updating the events to be displayed.

9. The method of claim 1, further including: the third user interface providing tools to select the one or more portions of text within the second event and to link the selected portions of text to the one or more fields already created.

10. The method of claim 1, further including: the third user interface providing tools that implement user selection of events that are non-matches to the field extraction rule; receiving further data indicating a selection of a match or non-match subset of events; and resampling according to the match or non-match selection; and updating the events to be displayed.

11. The method of claim 1, further including: before transmitting the first user interface: receiving a search specification that identifies events to be selected; causing display of a search response interface in which the events are responsive to the search specification; and the search response interface further including a user option to initiate formulation of a text extraction rule.

12. The method of claim 1, further including: automatically determining at least one updated field extraction rule that extracts as one or more values of the one or more fields from both the first event and the second event.

13. The method of claim 1, further including: automatically determining at least one updated field extraction rule that extracts as one or more values of the one or more fields for both the first event and the second event; and causing display of a fourth user interface including an annotated version of the plurality of events, wherein the annotated version indicates the portions of text within the plurality of events extracted by the updated field extraction rule.

14. The method of claim 1, further including: receiving further data indicating a selection to validate the extraction rule; causing display of a fourth user interface including an annotated version of the plurality of events, wherein the annotated version indicates the portions of text within the plurality of events extracted by the field extraction rule and provides one or more user controls that implement user selection of indicated portions of the text as examples of text that should not be extracted; receiving further data indicating a selection of one or more examples of text that should not be extracted; and automatically determining at least one updated field extraction rule that does not extract the text that should not be extracted.

15. The method of claim 1, further including: the second user interface providing tools that implement user selection of among the fields; receiving further data indicating a selection of a selected field; and transmitting data for a frequency display of values of the selected field extracted from a sample of the events, wherein the frequency display includes a list of values extracted and for each value in the list a frequency and an active filter control, wherein the active filter control filters events to be displayed based on a selected value.

16. The method of claim 1, further including: the second user interface providing tools that implement user selection of a particular field among fields for which extraction rules have been created; receiving further data indicating a selection of a selected field; and transmitting data for a frequency display of values of the selected field extracted from a sample of the events, wherein the frequency display includes a list of values extracted and for each value in the list, frequency information and at least one filter control; receiving further data indicating a selection of a selected value from the list of values extracted and activation of the filter control; and transmitting data for a filtered display of values of the selected field extracted from an event sample filtered by the selected value.

17. The method of claim 1, further including saving the extraction rule and field names in a configuration file for later use.
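
Claims 6 and 7 describe the diverse and rare sampling strategies in the same three steps: cluster the events, calculate each cluster's size, then select events from the larger clusters (diverse) or the smaller ones (rare). The sketch below illustrates those steps under stated assumptions. The claims do not say how events are clustered, so the `signature` function (collapsing alphanumeric runs to compare punctuation layout) is a stand-in chosen for brevity, and the names `cluster_events`, `sample_events`, `num_clusters`, and `per_cluster` are hypothetical.

```python
import re
from collections import defaultdict


def signature(raw):
    """Assumed clustering key: replace each alphanumeric run with 'w'.

    Events with the same punctuation and spacing layout share a key.
    """
    return re.sub(r"[A-Za-z0-9]+", "w", raw)


def cluster_events(events):
    """Group events by signature, preserving first-seen order."""
    clusters = defaultdict(list)
    for raw in events:
        clusters[signature(raw)].append(raw)
    return list(clusters.values())


def sample_events(events, strategy, num_clusters=2, per_cluster=1):
    """Pick events from the largest ('diverse') or smallest ('rare') clusters."""
    if strategy not in ("diverse", "rare"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    clusters = cluster_events(events)
    # Cluster size is the list length; sort largest-first for 'diverse',
    # smallest-first for 'rare'.
    clusters.sort(key=len, reverse=(strategy == "diverse"))
    picked = []
    for cluster in clusters[:num_clusters]:
        picked.extend(cluster[:per_cluster])
    return picked


EVENTS = [
    "GET /a 200",
    "GET /b 404",
    "GET /c 200",
    "user=alice action=login",
    "user=bob action=logout",
    "PANIC: disk full",
]
```

With these sample events, `sample_events(EVENTS, "diverse")` draws one event from each of the two most common layouts, while `sample_events(EVENTS, "rare", num_clusters=1)` surfaces the lone `PANIC` line. Under this reading, the rare strategy brings up unusual formats that an extraction rule might otherwise never be tested against.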

Assignees

Splunk Inc

Classifications

Primary CPC: G06F40/174

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9753909B2 cover?
The technology disclosed relates to formulating and refining field extraction rules that are used at query time on raw data with a late-binding schema. The field extraction rules identify portions of the raw data, as well as their data types and hierarchical relationships. These extraction rules are executed against very large data sets not organized into relational structures that have not bee…
Who is the assignee on this patent?
Splunk Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/174. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 5, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).