admin 管理员组

文章数量: 887019


2024年3月12日发(作者:青少年机器人编程培训机构排名)

5.1 实验配置

实验环境配置为:Intel(R) Core(TM) i7-3370 CPU 3.40GHz,16G,64 位。

算法用C++实现,编译器为VS2010。操作系统为win7。Fp-Growth算法的代码

来自/article/195770,同样用C++实现。实验使用了5

个真实数据集,分别是retail,T10I4D100K,T40I10D100K,pumsb,kosarak,数据集来

自/data/,其具体参数见表2.

All the experiments are performed on a 3.40 GHz Intel(R)Core(TM) i7-3370 PC

machine with 16 G main memory, running on Microsoft Windows 7 the

programs are written in Microsoft/Visual studio implementation of

Fp-growth was download from /article/195770. we used

three real datasets and two synthetic datasets to download from

/data/ in our experiments. These datasets were often used in

previous study of frequent itemsets mining. The detailed descriptions of all datasets

are showed as follows.

表2 数据集参数

Database #Items #Trans

(a)T10I4D100K 10 1000 100000

(b)T40I10D100K 40 1000 100000

(c)pumsb 74 7117 49046

(d)retail 10 16470 88162

(e)kosarak

表中表示事务T的平均长度,#Items表示项的编号个数,#Trans表示

数据库DB中事务的总数。其中(a)(b)(d)(e)是稀疏数据,而pumsb是密集型数据。

Table 2 shows the characteristics of the real and synthetic datasets used in our

experiments,which show the average transaction length (denoted by ),the

number of items(denoted by #Items) and the number of transactions(denoted by

#Trans) in each dataset (a)(b)(d)(e) is sparse, while the dataset (c) is

dense.

图 5 total time and run time of (a)T10I4D100K,(b)T40I10D100K,(c)pumsb,(d)retail,

(e)kosarak.

从图5中我们可以看出,对于稀疏数据库来说,Fp-Search的时间复杂度明显优

于Fp-growth,如(a)(b)(d)(e),原因很简单,那就是对于稀疏数据库而言,Fp-tree

中有过多的非频繁模式,对应于Np_Graph和MP-tree中就有较多的频非频繁路

径,而Fp-growth没有能够较好地对这些非频繁模式带来的冗余计算进行剪枝,

而Fp-Search却能够很好的进行剪枝。从(c)可以看出,当Fp-tree比较密集时,

Fp-tree中几乎所有的模式都是频繁的,需要剪枝的模式很少很少,所有模式的

遍历都是必须的,那么此时Fp-Search的剪枝带来的额外时间消耗和剪枝所带来

的性能优化相当,此时两个算法的时间几乎完全一样,随着支持度减小,频繁集

的个数增加,同时Fp-tree的密集程度也不断增加,导致剪枝和构建MP-tree所

造成的时间消耗超过剪枝带来的性能优化,此时Fp-Search的性能略逊于

Fp-growth,多出的时间为构造MP-tree的时间,复杂度为树中节点的个数,因为

扩展时插入的时间复杂度为

O(1)

这个代价是极小的。

Figure 5 shows that when the dataset is sparse like (a)(b)(d)(e),the Fp-Search’s

performance will be significantly higher than is because when the

dataset is sparse, Fp-tree contains many patterns which are not frequent,and it lead to

a lot of redundant computation to construct conditional pattern base. In this case,the

corresponding Np_Graph and MP-tree also contain a lot of paths which are not

frequent. However,Fp-growth can’t prune those redundant computation, while

MP-tree has good pruning performance. From (c) this, we can see that when the

Fp-tree is intensive, almost all pattern in Fp-tree are frequent,there is a little redundant

path need to be pruned,So all traverse is necessary,thus the performance optimization

of pruning operation is not obvious,and it also need some time to prune,So it’s

running is almost equal to Fp-growth’s running time.

下面的这个实验表明了算法的线性程度,我们把tid-code的求交运算作为原

子操作,那么把所有tid-code的求交运算分为两类,一类是Necessary:求交后

判断为频繁路径时的运算;一类是Non-essential:求交后判断为非频繁路径的时

的运算。下面给出了retail和pumsb的实验结果:

图 6 算法冗余度的比较

根据图6,不难发现,对于密集型数据(c),冗余计算是极小的,这从一个侧面

反映了Fp-tree是密集型的。而对于稀疏数据(d)来讲,算法的冗余计算和必要的

计算几乎完全一样,基于篇幅,实际笔者做了大量实验发

=Necessary/Non-essential是极小的,一般维持在(0,4)之间,这说明完全

Np_Graph剪枝策略的性能很好。

6 总结

本文提出了一个新的频繁项集挖掘算法Fp-Search,它利用Np_Graph实现路

径扩展,Tid-tree实现tid-set压缩,MP-tree实现路径剪枝来高效的挖掘所有频繁

项集。它主要有以下几个优点:(1)Np_Graph能够快速的进行路径扩展;(2)

Tid-tree大大压缩tid-set;(3)MP-tree能实现高效的剪枝。实验表明算法有着很

好的性能,但仍然存在不足之处,即构造MP-tree和遍历MP-tree需要消耗一定

的时间,尽管向MP-tree中插入一个节点的时间复杂度为

O(1)

,但是当频繁集的

个数非常大时,这个消耗会成为算法的瓶颈,下一步的工作就是如何寻找一种更

好的剪枝方案来提高算法的效率。

In this paper,we propose a novel algorithm, Fp-Search, which is based on

Np_Graph for the fast path extension, Tid-tree for compressing the tid-set,and

MP-tree for pruning the advantages are shown as follows:(1)Np_Graph can

efficiently extend the path;(2)Tid-tree can greatly compress the tid-set;(3)MP-tree can

implement efficient experimental results show that the Fp-Search

algorithm has a good ,every coin has two sides,it still has

shortcomings that when constructing the MP-tree and traversing the MP-tree,some

extra time is needed,althou-

gh add a node to MP-tree need time

O(1)

,but when the number of frequent

itemsets is very large,the consumption will become the bottleneck of

finding a more efficient pruning method will be the main work in the future study.


本文标签: 剪枝 算法 实验