Method and system for implementing efficient classification and exploration of data

US10127301B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10127301-B2
Application number: US-201514863994-A
Country: US
Kind code: B2
Filing date: Sep 24, 2015
Priority date: Sep 26, 2014
Publication date: Nov 13, 2018
Grant date: Nov 13, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed is a system, method, and computer program product for analyzing sets of data in an efficient manner, such that analytics can be effectively performed over that data. Classification operations can be performed to generate groups of similar log records. This permits classification of the log records in a cohesive and informative manner.

First claim

Opening claim text (preview).

What is claimed is:

1. A method comprising: receiving a plurality of log records from a processing system, the plurality of log records comprising one or more first log records and one or more second log records; comparing the one or more first log records to the one or more second log records to determine how similar the one or more first log records are to the one or more second log records, the one or more first log records compared to the one or more second log records by independently tokenizing the one or more first log records into a first plurality of tokens and the one or more second log records into a second plurality of tokens, where a similarity value is generated that corresponds to a degree of overlap, in terms of both token content and position, between the first plurality of tokens and the second plurality of tokens; classifying the one or more first log records and the one or more second log records into a group based at least in part on the similarity value; storing, for the group, a signature comprising one or more overlapping portions that are shared by both the one or more first log records and the one or more second log records, and one or more variable portions that differ between the one or more first log records and the one or more second log records; detecting a problem on the processing system based at least in part on how the one or more first log records and the one or more second log records were classified; and performing at least one operation responsive to detecting the problem on the processing system.

2. The method of claim 1, further comprising: storing a group identifier for the group; storing a sample log record for the group; and storing at least one of a count of the log records associated with the group, a group signature for the group, or member information for the group.

3. The method of claim 1, wherein the similarity value is calculated as a function of a hamming distance.

4. The method of claim 1, further comprising: counting numbers of tokens in the one or more first log records and the one or more second log records; and identifying subsets using the numbers of tokens that have been counted for the one or more first log records and the one or more second log records, wherein the one or more first log records are compared to the one or more second log records within a given subset.

5. The method of claim 1, wherein the one or more first log records and the one or more second log records are processed as a batched group of records.

6. The method of claim 1, wherein the plurality of log records from the processing system are processed online, the method further comprising: receiving an individual log record; comparing the individual log record to a sample log record from the group to determine a degree of overlap between the individual log record and the sample log record from the group; classifying the individual log record into a new group if the degree of overlap is less than a threshold level; and classifying the individual log record into the group if the degree of overlap is greater than the threshold level.

7. The method of claim 1, further comprising combining multiple groups of log records together, wherein variable parts of signatures associated with the multiple groups of log records are collapsed to identify equivalent group signatures.

8. The method of claim 1, wherein multiple processing entities operate in parallel on the one or more first log records and the one or more second log records.

9. The method of claim 8, wherein the multiple processing entities perform at least one of tokenizing the one or more first and second records in parallel, classifying the one or more first and second records in parallel, or merging multiple groups together in parallel.

10. The method of claim 1, further comprising identification of sequences of groups within the plurality of log records from the processing system.

11. The method of claim 1, wherein the at least one operation comprises at least one of notifying an administrator of an anomaly, providing a suggestion on how to address the problem, or providing a forecast of future usage associated with the problem.

12. A non-transitory computer readable medium having stored thereon a sequence of instructions which, when executed by a processor causes the processor to execute a method comprising: receiving a plurality of log records from a processing system, the plurality of log records comprising one or more first log records and one or more second log records; comparing the one or more first log records to the one or more second log records to determine how similar the one or more first log records are to the one or more second log records, the one or more first log records compared to the one or more second log records by independently tokenizing the one or more first log records into a first plurality of tokens and the one or more second log records into a second plurality of tokens, where a similarity value is generated that corresponds to a degree of overlap, in terms of both token content and position, between the first plurality of tokens and the second plurality of tokens; classifying the one or more first log records and the one or more second log records into a group based at least in part on the similarity value; storing, for the group, a signature comprising one or more overlapping portions that are shared by both the one or more first log records and the one or more second log records, and one or more variable portions that differ between the one or more first log records and the one or more second log records; detecting a problem on the processing system based at least in part on how the one or more first log records and the one or more second log records were classified; and performing at least one operation responsive to detecting the problem on the processing system.

13. The non-transitory computer readable medium of claim 12, wherein the method further comprises: storing a group identifier for the group; storing a sample log record for the group; and storing at least one of a count of the log records associated with the group, a group signature for the group, or member information for the group.

14. The non-transitory computer readable medium of claim 12, wherein the similarity value is calculated as a function of a hamming distance.

15. The non-transitory computer readable medium of claim 12, wherein the method further comprises: counting numbers of tokens in the one or more first log records and the one or more second log records; and identifying subsets using the numbers of tokens that have been counted for the one or more first log records and the one or more second log records, wherein the one or more first log records are compared to the one or more second log records within a given subset.

16. The non-transitory computer readable medium of claim 12, wherein the one or more first log records and the one or more second log records are processed as a batched group of records.

17. The non-transitory computer readable medium of claim 12, wherein the plurality of log records from the processing system are processed online, the method further comprising: receiving an individual log record; comparing the individual log record to a sample log record from the group to determine a degree of overlap between the individual log record and the sample log record from the group; classifying the individual log record into a new group if the degree of overlap is less than a threshold level; and classifying the individual log record into the group
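
To make the comparison recited in claims 1, 3, 4, and 6 more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: whitespace tokenization, the `*` placeholder for variable portions, the 0.6 threshold, the treatment of a similarity exactly equal to the threshold, and all function and field names are hypothetical choices that do not come from the patent text.

```python
from collections import defaultdict

VAR = "*"  # hypothetical marker for a variable portion of a signature


def tokenize(record):
    # Assumption: tokens are whitespace-separated; the claims do not fix a tokenizer.
    return record.split()


def similarity(a, b):
    # Positional overlap: 1 minus the normalized Hamming distance between
    # two token lists of equal length. Unequal lengths score 0.0 here.
    if len(a) != len(b):
        return 0.0
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def merge_signature(signature, tokens):
    # Keep tokens shared at the same position; mark differing positions as variable.
    return [s if s == t else VAR for s, t in zip(signature, tokens)]


def classify(records, threshold=0.6):
    # Records are bucketed by token count so each record is only compared
    # against groups in the same subset (the idea in claim 4).
    buckets = defaultdict(list)
    for record in records:
        tokens = tokenize(record)
        bucket = buckets[len(tokens)]
        for group in bucket:
            # Compare against the group's stored sample record (as in claim 6).
            # Assumption: a score equal to the threshold joins the group.
            if similarity(tokenize(group["sample"]), tokens) >= threshold:
                group["count"] += 1
                group["signature"] = merge_signature(group["signature"], tokens)
                break
        else:
            # No sufficiently similar group exists: start a new one.
            bucket.append({"sample": record, "signature": list(tokens), "count": 1})
    return [group for bucket in buckets.values() for group in bucket]


records = [
    "user alice logged in from 10.0.0.1",
    "user bob logged in from 10.0.0.2",
    "disk sda1 usage at 91 percent",
    "user carol logged in from 10.0.0.3",
]
groups = classify(records)
for group in groups:
    print(group["count"], " ".join(group["signature"]))
```

With these sample records, the three login lines share four of six token positions with the first sample and fall into one group, while the disk line shares none and starts its own. Comparing each incoming record against a single stored sample keeps the per-record cost proportional to the number of groups rather than the number of records, which is one plausible reading of why the claims store a sample log record per group.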

Assignees

Oracle Int Corp

Inventors

Classifications

  • File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06)

  • Error detection or correction of the data by redundancy in operations (error detection or correction of the data by redundancy in hardware G06F11/16)

  • Creation or modification of classes or clusters

  • where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting

  • G06F16/285 (Primary): Clustering or classification

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10127301B2 cover?
Disclosed is a system, method, and computer program product for analyzing sets of data in an efficient manner, such that analytics can be effectively performed over that data. Classification operations can be performed to generate groups of similar log records. This permits classification of the log records in a cohesive and informative manner.
Who is the assignee on this patent?
Oracle Int Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/285. Mapped technology areas include Physics.
When was this patent published?
Publication date Nov 13, 2018 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).