Detecting, diagnosing, and directing solutions for source type mislabeling of machine data, including machine data that may contain PII, using machine learning

US11501112B1 · US · B1

Patent metadata
Field               Value
Publication number  US-11501112-B1
Application number  US-201815967435-A
Country             US
Kind code           B1
Filing date         Apr 30, 2018
Priority date       Apr 30, 2018
Publication date    Nov 15, 2022
Grant date          Nov 15, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computerized method of diagnosing a mislabeling of a source type of a received event. The method comprising operations of receiving an event by a source type analysis logic with a data index and query system, wherein the event includes a portion of raw machine data and is associated with a specific point in time, obtaining an original source type assigned to the event and one or more predicted source types. The one or more predicted source types are determined by analysis of a data representation of the event in view of training data and the training data includes a plurality of data representations corresponding to known source types. Additionally, the computerized method also includes an operation of, determining whether the event has been mislabeled and in response to determining the event has been mislabeled, diagnosing a source of the mislabeling.
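
The flow in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the punctuation-count representation, the cosine-similarity scoring, the `TRAINING` samples, and the names `Event`, `represent`, `predict`, and `check` are all hypothetical, and the "diagnosis" step is reduced to reporting whether the original source type was missing or incorrect.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import Optional

# Hypothetical training data: raw samples for known source types.
TRAINING = {
    "access_combined": [
        '10.0.0.1 - - [30/Apr/2018:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    ],
    "syslog": [
        "Apr 30 10:00:00 host1 sshd[411]: Accepted password for admin from 10.0.0.2",
    ],
}


@dataclass
class Event:
    raw: str                             # portion of raw machine data
    timestamp: float                     # specific point in time
    original_source_type: Optional[str]  # may be empty or missing


def represent(raw: str) -> Counter:
    """Assumed data representation: counts of punctuation characters."""
    return Counter(ch for ch in raw if not ch.isalnum() and not ch.isspace())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict(event: Event) -> list[tuple[str, float]]:
    """Return (source type, probability) pairs, most likely first."""
    rep = represent(event.raw)
    scores = {
        source_type: max(cosine(rep, represent(sample)) for sample in samples)
        for source_type, samples in TRAINING.items()
    }
    total = sum(scores.values()) or 1.0  # normalize similarities so they sum to 1
    ranked = [(source_type, score / total) for source_type, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def check(event: Event) -> dict:
    """Decide whether the event is mislabeled and give a coarse reason."""
    predicted = predict(event)
    top_source_type = predicted[0][0]
    original = event.original_source_type
    if not original:
        reason = "missing"
    elif original != top_source_type:
        reason = "incorrect"
    else:
        reason = "ok"
    return {
        "mislabeled": reason != "ok",
        "reason": reason,
        "original": original,
        "predicted": predicted,
    }
```

Calling `check` on a web access log line that was labeled `syslog` reports it as mislabeled with `access_combined` as the top prediction. A real system would use a trained classifier over many samples per source type rather than a single nearest sample.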

First claim

Opening claim text (preview).

What is claimed is:

1. A computerized method of diagnosing a labeling of a source type of an event using machine learning techniques, the method comprising: receiving the event by a source type analysis logic with a data index and query system, wherein the event includes a portion of raw machine data and is associated with a specific point in time; obtaining one or more predicted source types of the event, the one or more predicted source types being determined by analyzing a data representation of the event in view of training data, wherein the training data includes a plurality of data representations corresponding to known source types; determining whether the event has been mislabeled by determining whether an original source type of the event is one or more of empty, missing, or incorrect; and responsive to determining the event has been mislabeled based on a discrepancy between the original source type and the predicted source type, diagnosing a source of the mislabeling.

2. The computerized method of claim 1, wherein the original source type is assigned according to one of a configuration file, one or more predefined rules, or a predetermined signature.

3. The computerized method of claim 1, wherein each of the one or more predicted source types includes a probability, wherein a first probability of a first predicted source type indicates a likelihood that the first predicted source type is a correct source type of the event.

4. The computerized method of claim 1, wherein the determining of whether the event has been mislabeled comprises comparing the original source type to a first predefined source type of the one or more predicted source types to determine whether a match exists.

5. The computerized method of claim 1, wherein the determining of whether the event has been mislabeled comprises determining that the original source type was not assigned to the event, and selecting a predicted source type to assign to the event.

6. The computerized method of claim 1, wherein the determining of whether the event has been mislabeled comprises: (i) determining whether a first probability of a first predicted source type of the one or more source types is greater than or equal to a first threshold, and (ii) responsive to determining the first probability is greater than or equal to the first threshold, comparing the first predicted source type with the original source type to determine whether a match exists, wherein the first probability of the first predicted source type indicates a likelihood that the event corresponds to the first predicted source.

7. The computerized method of claim 1, wherein the determining of whether the event has been mislabeled comprises: (i) determining whether a first probability of a first predicted source type of the one or more source types is greater than or equal to a first threshold, and (ii) responsive to determining the first probability is greater than or equal the first threshold, comparing the first predicted source type with the original source type to determine whether a match exists, wherein the first probability of the first predicted source type indicates a likelihood that the event corresponds to the first predicted source, and wherein the first threshold is determined by a source type of at least one of the one or more predicted source types.

8. The computerized method of claim 1, wherein the determining of whether at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to a first threshold; and responsive to determining the at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to the first threshold, generating and providing an alert to an analyst indicating the at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to the first threshold.

9. The computerized method of claim 1, wherein the determining of whether at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to a first threshold; and responsive to determining the at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to the first threshold, determining the event has been mislabeled when the original source type does not match a first source type of the at least two predicted source types having a highest probability.

10. The computerized method of claim 1, wherein the obtaining of the one or more predicted source types of the event comprises: generating the data representation of the event includes, wherein the data representation of the event includes content of the event other than personally identifiable information, and wherein the computerized method further comprises: determining, from the data representation including the content of the event other than the personally identifiable information, that the original source type is an indicator other than a known source type, the indicator representing that the original source type is not one of a plurality of known source types.

11. The computerized method of claim 1, wherein the obtaining of the one or more predicted source types of the event comprises: generating the data representation of the event, wherein the data representation of the event includes content of the event other than personally identifiable information, and wherein the computerized method further comprises: assigning, based on the one or more predicted source types of the event that are determined from the data representation including the content of the event other than the personally identifiable information, a predicted source type of the event when the original source type for the event is blank or missing.

12. The computerized method of claim 1, further comprising: determining the original source type is an indicator other than a known source type, the indicator representing that the original source type is not one of a plurality of known source types; and responsive to determining the original source is the indicator other than the known source type, generating and providing an alert to an analyst indicating that the original source type is not one of a plurality of known source types.

13. The computerized method of claim 1, further comprising: responsive to determining the event has been mislabeled, generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, and (iii) a first predicted source type.

14. The computerized method of claim 1, further comprising: responsive to determining the event has been mislabeled, generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, (iii) the one or more predicted source types, and (iv) probabilities corresponding to each of the one or more predicted source types, wherein a first probability of a first predicted source type indicates a likelihood that the event corresponds to the first predicted source.

15. The computerized method of claim 1, wherein diagnosing the source of the mislabeling includes determining the source of the mislabeling, and generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, and (iii) the one or more predicted source types, and (iv) the source of the mislabeling.
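
The threshold, alerting, and PII-related dependent claims can be illustrated with a short Python sketch. It is one possible reading, not the patented implementation: the threshold values, the per-source-type override table, the two regular expressions standing in for PII detection, the alert dictionary layout, and the names `strip_pii` and `evaluate` are all assumptions made for illustration.

```python
import re
from typing import Optional

DEFAULT_THRESHOLD = 0.6        # assumed value, not taken from the patent
THRESHOLDS = {"syslog": 0.4}   # assumed per-source-type overrides (cf. claim 7)

# Assumed stand-ins for PII detection; real PII handling would be broader.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),     # email addresses
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),  # IPv4 addresses
]


def strip_pii(raw: str) -> str:
    """Return the event content other than (assumed) PII (cf. claims 10-11)."""
    for pattern in PII_PATTERNS:
        raw = pattern.sub("", raw)
    return raw


def evaluate(raw: str, original: Optional[str],
             predicted: list[tuple[str, float]]) -> dict:
    """Apply threshold checks to (source type, probability) predictions."""
    ranked = sorted(predicted, key=lambda pair: pair[1], reverse=True)
    # Keep only predictions whose probability meets that type's threshold.
    confident = [
        (source_type, prob) for source_type, prob in ranked
        if prob >= THRESHOLDS.get(source_type, DEFAULT_THRESHOLD)
    ]
    result = {"mislabeled": False, "assigned": None, "alerts": []}

    # Original source type never assigned: pick a predicted one (cf. claim 5).
    if not original:
        result["mislabeled"] = True
        result["assigned"] = ranked[0][0] if ranked else None
        return result

    # No prediction is confident enough to compare against (cf. claim 6).
    if not confident:
        return result

    # Two or more confident predictions: tell the analyst (cf. claim 8).
    if len(confident) >= 2:
        result["alerts"].append({"kind": "ambiguous", "candidates": confident})

    # Compare against the highest-probability confident prediction (cf. claim 9).
    top_source_type = confident[0][0]
    if original != top_source_type:
        result["mislabeled"] = True
        result["alerts"].append({           # alert contents follow claim 13
            "kind": "mislabeled",
            "event": raw,
            "original": original,
            "predicted": top_source_type,
        })
    return result
```

For example, `evaluate(raw, "syslog", [("access_combined", 0.9), ("syslog", 0.1)])` flags the event as mislabeled and emits one alert, while a prediction list in which nothing reaches its threshold returns no determination at all. Keeping the thresholds in a lookup table is just one way to let the cutoff depend on the predicted source type.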

Assignees

Splunk Inc

Inventors

Classifications

  • G06F16/907 · Primary

    Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually · CPC title

  • Query processing · CPC title

  • characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling · CPC title

  • Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title

  • for evaluating statistical data {, e.g. average values, frequency distributions, probability functions, regression analysis (forecasting specially adapted for a specific administrative, business or logistic context G06Q10/04)} · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11501112B1 cover?
A computerized method of diagnosing a mislabeling of a source type of a received event. The method comprising operations of receiving an event by a source type analysis logic with a data index and query system, wherein the event includes a portion of raw machine data and is associated with a specific point in time, obtaining an original source type assigned to the event and one or more predicte…
Who is the assignee on this patent?
Splunk Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/907. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 15, 2022 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 9 related publications on this page (citations in our corpus or others sharing the same primary CPC).