
Mahout version: 0.7, Hadoop version: 1.0.4, JDK: 1.7.0_25 (64-bit).

This series analyzes the source code of Parallel Frequent Pattern Mining. This post, the first in the series, starts with a hands-on run based on the example on the Mahout website. The main goal here is to compare how long the sequential and mapreduce modes take to process the data. The data set used is retail.dat; its first few records look like this:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
30 31 32 
33 34 35 
36 37 38 39 40 41 42 43 44 45 46 
38 39 47 48 
38 39 48 49 50 51 52 53 54 55 56 57 58
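Each record is one transaction, with item IDs separated by spaces. Note that FPGrowthDriver's default splitter pattern (shown in the usage output further down) only splits on commas and tabs, so for this data a space pattern presumably has to be supplied via -regex, as in the example on the Mahout site. A quick sketch with plain Java regexes, nothing Mahout-specific, illustrates the difference:

import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitterCheck {
    public static void main(String[] args) {
        String transaction = "38 39 47 48";          // one line of retail.dat

        // Default pattern used by FPGrowthDriver: splits on commas/tabs only.
        Pattern defaultSplitter = Pattern.compile("[ ,\t]*[,|\t][ ,\t]*");
        // Pattern that would be passed via -regex for space-separated data.
        Pattern spaceSplitter = Pattern.compile("[\\ ]");

        System.out.println(Arrays.toString(defaultSplitter.split(transaction)));
        // -> [38 39 47 48]    (the whole line stays a single "item")
        System.out.println(Arrays.toString(spaceSplitter.split(transaction)));
        // -> [38, 39, 47, 48] (four separate items, as intended)
    }
}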
The main program of Parallel Frequent Pattern Mining is org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver. Calling the FPGrowthDriver class with no arguments, or with -h, prints the following usage information:

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Job-Specific Options:                                                           
  --input (-i) input                                 Path to job input          
                                                     directory.                 
  --output (-o) output                               The directory pathname for 
                                                     output.                    
  --minSupport (-s) minSupport                       (Optional) The minimum     
                                                     number of times a          
                                                     co-occurrence must be      
                                                     present. Default Value: 3  
  --maxHeapSize (-k) maxHeapSize                     (Optional) Maximum Heap    
                                                     Size k, to denote the      
                                                     requirement to mine top K  
                                                     items. Default value: 50   
  --numGroups (-g) numGroups                         (Optional) Number of       
                                                     groups the features should 
                                                     be divided in the          
                                                     map-reduce version.        
                                                     Doesn't work in sequential 
                                                     version Default Value:1000 
  --splitterPattern (-regex) splitterPattern         Regular Expression pattern 
                                                     used to split given string 
                                                     transaction into itemsets. 
                                                     Default value splits comma 
                                                     separated itemsets.        
                                                     Default Value: "[          
                                                     ,\t]*[,|\t][ ,\t]*"        
  --numTreeCacheEntries (-tc) numTreeCacheEntries    (Optional) Number of       
                                                     entries in the tree cache  
                                                     to prevent duplicate tree  
                                                     building. (Warning) a      
                                                     first level conditional    
                                                     FP-Tree might consume a    
                                                     lot of memory, so keep     
                                                     this value small, but big  
                                                     enough to prevent          
                                                     duplicate tree building.   
                                                     Default Value:5            
                                                     Recommended Values: [5-10] 
  --method (-method) method                          Method of processing:      
                                                     sequential|mapreduce       
  --encoding (-e) encoding                           (Optional) The file        
                                                     encoding.  Default value:  
                                                     UTF-8                      
  --useFPG2 (-2)                                     Use an alternate FPG       
                                                     implementation             
  --help (-h)                                        Print out help             
  --tempDir tempDir                                  Intermediate output        
                                                     directory                  
  --startPhase startPhase                            First phase to run         
  --endPhase endPhase                                Last phase to run
In fact, the only options of real interest here are the input, the output, the minimum support, and the number of groups (when using mapreduce mode); the remaining parameters can be ignored for now.
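Concretely, those options boil down to an argument list like the one in the sketch below. The paths and the support value are placeholders, not values taken from the run described later; the same strings are what one would hand to the driver on the command line:

import org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver;

public class RunFPGrowth {
    public static void main(String[] args) throws Exception {
        // Placeholder paths and values -- adjust to your own setup.
        String[] driverArgs = {
            "-i", "retail.dat",        // input transaction file
            "-o", "patterns",          // output directory
            "-s", "2",                 // minimum support
            "-g", "1000",              // number of groups (mapreduce mode only)
            "-regex", "[\\ ]",         // retail.dat is space separated
            "-method", "mapreduce"     // or "sequential"
        };
        // Roughly the same as running bin/mahout fpg with these options
        // from the shell.
        FPGrowthDriver.main(driverArgs);
    }
}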

Originally I had planned to call the algorithm directly from Java rather than through the command line (modifying the Mahout source and repacking it into hadoop-core-job.jar did not help either). The call ran into problems, though: it complained about missing permissions. Skimming through the algorithm's source, I found that a new Configuration is created internally every time, so the one I pass in is effectively ignored. On top of that, running in sequential mode kept reporting that the input path could not be found, and I could not figure out why. So there was nothing for it but to run from the terminal, and that still failed with the same error: the input path supposedly does not exist, even though it clearly does.
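For reference, here is a minimal sketch of the kind of programmatic call described above (the namenode and jobtracker addresses are made-up placeholders), with comments on why the externally built Configuration never takes effect:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver;

public class RunWithOwnConf {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster addresses, set here in the hope that the
        // driver will pick them up (Hadoop 1.x property names) ...
        conf.set("fs.default.name", "hdfs://namenode:9000");
        conf.set("mapred.job.tracker", "jobtracker:9001");

        // ... but since the algorithm's code path creates a fresh
        // Configuration internally (as observed above), these programmatic
        // settings never reach the submitted jobs.  A fresh Configuration
        // still reads the *-site.xml files found on the classpath, so
        // putting the cluster's configuration files on the classpath is one
        // possible workaround.
        int exitCode = ToolRunner.run(conf, new FPGrowthDriver(), new String[] {
            "-i", "retail.dat", "-o", "patterns", "-s", "2",
            "-regex", "[\\ ]", "-method", "mapreduce"});
        System.exit(exitCode);
    }
}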

But once I changed sequential to mapreduce, the job ran fine.


Going by the job's finish time, the total running time works out to about 206 seconds. The Mahout website says the run finishes in 102 seconds, so the difference is probably down to my relatively small memory settings; the 601 seconds it quotes for sequential mode I cannot verify for the time being. If I find time later, I will look into why sequential mode reports that the input file does not exist, which is probably also related to the Configuration handling; otherwise I will go straight to analyzing the mapreduce mode of this algorithm and skip the sequential version.


PS: I have analyzed this algorithm before; this time I hope to work through it more systematically, so it should go faster.


Share, grow, and be happy.

If you repost this, please credit the original blog: http://blog.csdn.net/fansy1990