Determining an extraction rule from positive and negative examples

US11042697B2 · US · B2

Patent metadata
Publication number: US-11042697-B2
Application number: US-201916589445-A
Country: US
Kind code: B2
Filing date: Oct 1, 2019
Priority date: Sep 7, 2012
Publication date: Jun 22, 2021
Grant date: Jun 22, 2021

Abstract

Official abstract text for this publication.

The technology disclosed relates to formulating and refining field extraction rules that are used at query time on raw data with a late-binding schema. The field extraction rules identify portions of the raw data, as well as their data types and hierarchical relationships. These extraction rules are executed against very large data sets not organized into relational structures that have not been processed by standard extraction or transformation methods. By using sample events, a focus on primary and secondary example events helps formulate either a single extraction rule spanning multiple data formats, or multiple rules directed to distinct formats. Selection tools mark up the example events to indicate positive examples for the extraction rules, and to identify negative examples to avoid mistaken value selection. The extraction rules can be saved for query-time use, and can be incorporated into a data model for sets and subsets of event data.
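To make the idea of a late-binding schema concrete, the following is a minimal sketch, not the patented implementation. It assumes that a saved extraction rule can be modelled as a regular expression with one capture group, and that events are kept as unparsed text until a query runs. The event strings, the field names, and the helpers `extract_fields` and `query` are all hypothetical.

```python
import re

# Hypothetical raw events, stored as unparsed text (no schema at ingest time).
RAW_EVENTS = [
    "2012-09-07 10:01:02 status=200 bytes=512 path=/index.html",
    "2012-09-07 10:01:05 status=404 bytes=0 path=/missing",
    "heartbeat ok",
]

# Assumed form of saved extraction rules: field name -> regex with one group.
EXTRACTION_RULES = {
    "status": re.compile(r"status=(\d{3})"),
    "path": re.compile(r"path=(\S+)"),
}


def extract_fields(raw_event, rules):
    """Apply every rule to one raw event and return the fields that matched."""
    fields = {}
    for name, pattern in rules.items():
        match = pattern.search(raw_event)
        if match:
            fields[name] = match.group(1)
    return fields


def query(events, rules, field, value):
    """Bind the schema at query time: extract fields, then filter on one."""
    return [e for e in events if extract_fields(e, rules).get(field) == value]
```

Because the rules are applied only when `query` runs, a rule can be changed or added later without reprocessing the stored events.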

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising: receiving, via a sampling tool, a user selection of a primary event and a secondary event from a plurality of events concurrently presented via a graphical user interface, wherein each event in the plurality of events includes a portion of raw data; receiving, via a selection tool, a selection of a first portion of text within the raw data of the primary event as a positive example of a first value to extract for a field that indicates a particular location in events; receiving, via the selection tool, a selection of a second portion of text within the raw data of the secondary event as a negative example of a second value not to extract for the field; and automatically determining an extraction rule that extracts the first value, but not the second value, into a set of values of the field.

2. The method of claim 1, further comprising presenting an initial markup of the primary event comprising an indication, on the primary event, (i) of the first portion of text extracted as the first value, and (ii) of the field.

3. The method of claim 1, further comprising: presenting an initial markup of the primary event, and an unmatched secondary event that did not match the extraction rule; receiving, via the selection tool, a selection of a third portion of text within the raw data of the unmatched secondary event as another positive example of a third value to extract for the field; and automatically updating the extraction rule to extract the first value and the third value, but not the second value, into the set of values of the field.

4. The method of claim 1, further comprising: presenting an initial markup of the primary event and an unmatched secondary event that did not match the extraction rule; receiving, via the selection tool, a selection of a third portion of text within the raw data of the unmatched secondary event as another positive example of a third value to extract for the field; and based on receiving the selection of the third portion of text, prompting for a selection of an associated field, from a plurality of fields; based on receiving the field as the selection of the associated field, linking the third portion of text to the field.

5. The method of claim 1, further comprising presenting a plurality of secondary events that match the extraction rule and a markup of the secondary events comprising an indication, on the secondary events, (i) of a corresponding portion of text extracted as values for the field, and (ii) of the field.

6. The method of claim 1, further comprising accepting a selection of a second field, adjacent to the field, as an anchoring field indicating a location in the raw data of the field.

7. The method of claim 1, further comprising accepting a selection of a third portion of text in the raw data, and requiring that the third portion of text is present in a matching event of the plurality of events for the extraction rule to succeed.

8. The method of claim 1, further comprising presenting a plurality of secondary events that match the extraction rule and a markup of the secondary events comprising an indication, on the secondary events, of positive examples extracted as values for the field, wherein each positive example includes an associated control configured to reclassify the positive example as a negative example.

9. The method of claim 1, further comprising: presenting a plurality of secondary events that match the extraction rule and a markup of the secondary events comprising an indication, on the secondary events, of positive examples extracted as values for the field, wherein each positive example includes an associated control configured to reclassify the positive example as a negative example; and automatically updating the extraction rule to extract the positive examples, but not reclassified negative examples, into the set of values of the field.

10. The method of claim 1, further comprising presenting the secondary event and a markup of the secondary event comprising an indication, on the second event, of the second value registered as a negative example, a visual cue indicating registration as a negative example, and an associated control configured to undo registration as a negative example.

11. A system for generating an extraction rule, the system comprising: one or more data processors; and one or more computer-readable storage media containing instructions which when executed on the one or more data processors, cause the one or more processors to perform operations including: receiving, via a sampling tool, a user selection of a primary event and a secondary event from a plurality of events concurrently presented via a graphical user interface, wherein each event in the plurality of events includes a portion of raw data; receiving, via a selection tool, a selection of a first portion of text within the raw data of the primary event as a positive example of a first value to extract for a field that indicates a particular location in events; receiving, via the selection tool, a selection of a second portion of text within the raw data of the secondary event as a negative example of a second value not to extract for the field; and automatically determining an extraction rule that extracts the first value, but not the second value, into a set of values of the field.

12. The system of claim 11, the operations further comprising presenting an initial markup of the primary event comprising an indication, on the primary event, (i) of the first portion of text extracted as the first value, and (ii) of the field.

13. The system of claim 11, the operations further comprising: presenting an initial markup of the primary event, and an unmatched secondary event that did not match the extraction rule; receiving, via the selection tool, a selection of a third portion of text within the raw data of the unmatched secondary event as another positive example of a third value to extract for the field; and automatically updating the extraction rule to extract the first value and the third value, but not the second value, into the set of values of the field.

14. The system of claim 11, the operations further comprising: presenting an initial markup of the primary event and an unmatched secondary event that did not match the extraction rule; receiving, via the selection tool, a selection of a third portion of text within the raw data of the unmatched secondary event as another positive example of a third value to extract for the field; and based on receiving the selection of the third portion of text, prompting for a selection of an associated field, from a plurality of fields; based on receiving the field as the selection of the associated field, linking the third portion of text to the field.

15. The system of claim 11, the operations further comprising presenting a plurality of secondary events that match the extraction rule and a markup of the secondary events comprising an indication, on the secondary events, (i) of a corresponding portion of text extracted as values for the field, and (ii) of the field.

16. The system of claim 11, the operations further comprising accepting a selection of a second field, adjacent to the field, as an anchoring field indicating a location in the raw data of the field.

17. The system of claim 11, the operations further comprising accepting a selection of a third portion of text in the raw data, and requiring that the third portion of text is present in a matching event of the plurality of events for the
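The core step of claim 1, determining a rule that extracts a marked positive value but not a marked negative value, can be sketched as a search over candidate regular expressions. This is an illustrative approach under stated assumptions, not the algorithm the patent discloses: it assumes rules are regexes built from a literal left-context prefix plus a generic value pattern, and it treats a negative example as violated when the rule extracts the same text. The names `VALUE_PATTERNS`, `candidate_rules`, `extract`, and `determine_rule`, and the sample events, are hypothetical.

```python
import re

# Generic value shapes to try, from most to least restrictive (assumption).
VALUE_PATTERNS = [r"\d+", r"[A-Za-z]+", r"\w+", r"\S+"]

# Hypothetical sample events chosen by the user.
PRIMARY = "user=alice port=8080 status=200"
SECONDARY = "pid=4242 user=bob note=port-closed"


def candidate_rules(event, start, end, max_context=12):
    """Yield regexes that capture event[start:end], using growing left context."""
    value = event[start:end]
    for value_pattern in VALUE_PATTERNS:
        if not re.fullmatch(value_pattern, value):
            continue  # this shape cannot describe the selected value
        for width in range(0, max_context + 1):
            if width > start:
                break  # no more text to the left of the selection
            prefix = re.escape(event[start - width:start])
            yield re.compile(prefix + "(" + value_pattern + ")")


def extract(rule, event):
    """Return the first value the rule extracts from an event, or None."""
    match = rule.search(event)
    return match.group(1) if match else None


def determine_rule(positives, negatives):
    """Pick the first candidate that honours every example.

    positives and negatives are lists of (event, start, end) selections;
    the first positive is treated as the primary event.
    """
    primary_event, start, end = positives[0]
    for rule in candidate_rules(primary_event, start, end):
        keeps_positives = all(extract(rule, ev) == ev[s:e] for ev, s, e in positives)
        avoids_negatives = all(extract(rule, ev) != ev[s:e] for ev, s, e in negatives)
        if keeps_positives and avoids_negatives:
            return rule
    return None
```

Selecting `8080` in `PRIMARY` as the positive example and `4242` in `SECONDARY` as the negative example makes the search reject a bare digit pattern, which would also capture `4242`, and settle on a candidate with enough left context to exclude it. Passing further positive selections in `positives` re-runs the same search, which loosely mirrors the rule updating described in claims 3 and 9.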

Assignees

Splunk Inc

Inventors

Classifications

  • Temporal data queries · CPC title

  • Recognition of textual entities · CPC title

  • G06F40/174 · Primary

    Form filling; Merging · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11042697B2 cover?
The technology disclosed relates to formulating and refining field extraction rules that are used at query time on raw data with a late-binding schema. The field extraction rules identify portions of the raw data, as well as their data types and hierarchical relationships. These extraction rules are executed against very large data sets not organized into relational structures that have not bee…
Who is the assignee on this patent?
Splunk Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/2477. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 22, 2021 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).