Mining association rules in the map-reduce framework

US10467236B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10467236-B2
Application numberUS-201414500330-A
CountryUS
Kind codeB2
Filing dateSep 29, 2014
Priority dateSep 29, 2014
Publication dateNov 5, 2019
Grant dateNov 5, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In each iteration of the process of mining association rules from transaction data by a cluster of computing systems, each mapper node in the cluster receives a split of the transaction data. Each mapper node scans the split to count an absolute support value of each candidate itemset for current search level(s), and passes the candidate itemsets and their support values to reducer nodes in the cluster. The number of reducer nodes will be determined adaptively based on the number of the candidate itemsets and the number of maximum available resource nodes in the cluster. Each reducer node combines the absolute support value of each candidate itemset, and finds frequent itemsets among them using a minimum support threshold. For each frequent itemset it finds, the reducer node creates association rule(s) satisfying a minimum confidence threshold, and exports all discovered frequent itemsets and association rules to a file system for storage.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer program product for mining association rules from transaction data by a cluster of computing systems, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a processor to cause the processor to perform a method comprising: receiving a split of a plurality of splits of the transaction data by each mapper node of a plurality of mapper nodes in the cluster; for each mapper node, scanning the split to count an absolute support value of each candidate itemset for one or more search levels, wherein the number of candidate itemsets varies at different search levels, wherein each search level is associated with a different rule size of a plurality of rule sizes, wherein each rule size of the plurality of rule sizes is a threshold number of candidate itemsets to be searched at a given search level, and wherein each successive search level of the different search levels is associated with a larger rule size than a previous search level of the different search levels; and passing the candidate itemsets and the absolute support value of each candidate itemset to a plurality of reducer nodes in the cluster; and for each reducer node, grouping the candidate itemsets and the absolute support value of each candidate itemset into a plurality of groups, wherein a number of the groups in the plurality of groups is based on a number of the candidate itemsets and a maximum number of reduce tasks that can be run in parallel in the cluster, wherein the number of reduce tasks varies at the different search levels to balance workloads in the cluster; combining the absolute support value of each candidate itemset received from each mapper node; finding frequent itemsets for each search level of the one or more search levels based on the threshold number of candidate itemsets for each search level, wherein the frequent itemsets comprise the candidate itemsets with the combined absolute support value satisfying a minimum support threshold; for each frequent itemset for the one or more search levels, creating by each reducer note one or more association rules, wherein the creating of the one or more association rules comprises: computing a confidence value for each potential association rule of a plurality of potential association rules, wherein each confidence value is based on at least two candidate itemsets being found together in one or more transactions; and creating the one or more association rules based on those potential rules whose respective confidence values satisfy the minimum confidence threshold; and exporting each frequent itemset for each of the one or more search levels and each created association rule to a file system for storage. 2. The computer program product of claim 1 , wherein for each reducer node: receiving the frequent itemsets and the absolute support values for the frequent itemsets of one or more previous search levels, wherein, for each frequent itemset for the one or more search levels, the creating of the one or more association rules satisfying the minimum confidence threshold comprises: calculating a confidence of the rule using the frequent itemsets and the absolute support values of the one or more previous search levels, and determining whether the calculated confidence of the rule satisfies the minimum confidence threshold. 3. The computer program product of claim 1 , wherein a number of the reducer nodes driven in a map-reduce job is set equal to the number of groups. 4. The computer program product of claim 1 , wherein the combining of the absolute support value of each candidate itemset received from each mapper node comprises: merging the absolute support values of the candidate itemsets in the group to form global support values for the candidate itemsets in the group, wherein the finding of the frequent itemsets comprising the candidate itemsets with the combined absolute support value satisfying the minimum support threshold comprises: pruning infrequent itemsets in the group, the infrequent itemsets comprising the global support value less than the minimum support threshold. 5. The computer program product of claim 1 , wherein the method further comprises: generating the candidate itemsets for a next search level from the frequent itemsets of the one or more search levels; and passing the candidate itemsets for the next search level to the plurality of mapper nodes. 6. The computer program product of claim 5 , wherein the method further comprises: determining whether the candidate itemsets at the next search level could be generated; determining whether a maximum rule size has been reached; and in response to determining that the candidate itemsets at the next search level could not generated, or that the maximum rule size has been reached, ending the mining of the association rules from the transaction data. 7. A system; comprising: a processor; and a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the program instructions executable by a processor to cause the processor to perform a method comprising: receiving a split of a plurality of splits of the transaction data by each mapper node of a plurality of mapper nodes in the cluster; for each mapper node, scanning the split to count an absolute support value of each candidate itemset for one or more search levels, wherein the number of candidate itemsets varies at different search levels, wherein each search level is associated with a different rule size of a plurality of rule sizes, wherein each rule size of the plurality of rule sizes is a threshold number of candidate itemsets to be searched at a given search level, and wherein each successive search level of the different search levels is associated with a larger rule size than a previous search level of the different search levels; and passing the candidate itemsets and the absolute support value of each candidate itemset to a plurality of reducer nodes in the cluster; and for each reducer node, grouping the candidate itemsets and the absolute support value of each candidate itemset into a plurality of groups, wherein a number of the groups in the plurality of groups is based on a number of the candidate itemsets and a maximum number of reduce tasks that can be run in parallel in the cluster, wherein the number of reduce tasks varies at the different search levels to balance workloads in the cluster; combining the absolute support value of each candidate itemset received from each mapper node; finding frequent itemsets for each search level of the one or more search levels based on the threshold number of candidate itemsets for each search level, wherein the frequent itemsets comprise the candidate itemsets with the combined absolute support value satisfying a minimum support threshold; for each frequent itemset for the one or more search levels, creating by each reducer note one or more association rules, wherein the creating of the one or more association rules comprises: computing a confidence value for each potential association rule of a plurality of potential association rules, wherein each confidence value is based on at least two candidate itemsets being found together in one or more transactions; and creating the one or more association rules based on those potential rules whose respective confidence values satisfy the minimum confidence threshold; and exporting each frequent itemset for each of the one or more search levels and each created association rule to

Assignees

Inventors

Classifications

  • File search processing · CPC title

  • Query processing support for facilitating data mining operations in structured databases · CPC title

  • Schema design and management · CPC title

  • Data mining · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10467236B2 cover?
In each iteration of the process of mining association rules from transaction data by a cluster of computing systems, each mapper node in the cluster receives a split of the transaction data. Each mapper node scans the split to count an absolute support value of each candidate itemset for current search level(s), and passes the candidate itemsets and their support values to reducer nodes in the…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/2465. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Nov 05 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).