One of Mahout's highlights is its recommendation component, which ships with several common algorithms. Let's take a closer look.
Recommendation algorithms built on user behavior data are generally called collaborative filtering. Collaborative filtering comes in several flavors: neighborhood-based methods, latent factor models, and random-walk-on-graph methods. Neighborhood-based methods are the most widely used today, and they split into user-based and item-based collaborative filtering. The following points are adapted from the basic advice about recommenders on the Mahout website.
- Don't start with a distributed, Hadoop-based recommender unless you really need one; begin with a non-distributed recommender, which is simpler and more flexible.
- As a best practice, a dataset on the order of 100M user-item entries fits a modern server with 4GB of memory and can serve recommendations in real time.
- Beyond that scale, consider a distributed setup, but few applications genuinely have 100M entries to process. Much of the data can be simplified; pruning noise and stale data usually has no significant effect on the results.
- Ask whether users and items are genuinely related and whether you have real preference data. If you have user ratings, consider GenericItemBasedRecommender with a PearsonCorrelationSimilarity similarity; if you only have boolean data, consider GenericBooleanPrefItemBasedRecommender with LogLikelihoodSimilarity (a minimal sketch of both pairings follows this list). For content-based item-item similarity, implement your own ItemSimilarity.
- Data in a CSV file can be loaded with FileDataModel; data stored in a database can use MySQLJDBCDataModel (or its PostgreSQL counterpart, etc.) together with ReloadFromJDBCDataModel.
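To make the two pairings concrete, here is a minimal sketch of an item-based recommender for each case. It is only an illustration, assuming the standard Mahout Taste packages (org.apache.mahout.cf.taste.*) on the classpath and two hypothetical input files: ratings.csv with userID,itemID,rating lines and clicks.csv with userID,itemID lines and no value column.
// Minimal sketch, not production code; file names and class name are illustrative.
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemBasedSketch {
    public static void main(String[] args) throws Exception {
        // Case 1: real rating values are available.
        DataModel rated = new FileDataModel(new File("data/ratings.csv"));
        ItemSimilarity pearson = new PearsonCorrelationSimilarity(rated);
        ItemBasedRecommender ratedRec = new GenericItemBasedRecommender(rated, pearson);
        System.out.println(ratedRec.recommend(2, 3)); // 3 items for user 2

        // Case 2: only boolean preferences (views/clicks), no ratings.
        DataModel bool = new FileDataModel(new File("data/clicks.csv"));
        ItemSimilarity loglike = new LogLikelihoodSimilarity(bool);
        ItemBasedRecommender boolRec = new GenericBooleanPrefItemBasedRecommender(bool, loglike);
        System.out.println(boolRec.recommend(2, 3));
    }
}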
1. User-based collaborative filtering
Following the practice recommended on the Mahout site, let's build a simple user-based recommendation engine.
1.1. Prepare the data (dataset.csv)
Each line is userID,itemID,rating. Only part of the file is shown here; the full file has 32 lines across 4 users, as the log output below confirms.
3,12,4.5
3,13,4.0
3,14,3.0
3,15,3.5
3,16,4.5
3,17,4.0
3,18,5.0
4,10,5.0
4,11,5.0
4,12,5.0
4,13,0.0
4,14,2.0
4,15,3.0
4,16,1.0
4,17,4.0
4,18,1.0
1.2. Build the recommendation model
DataModel model = new FileDataModel(new File("data/dataset.csv")); // load the data file into a DataModel
UserSimilarity similarity = new PearsonCorrelationSimilarity(model); // user-user similarity based on Pearson correlation
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model); // neighborhood = all users with similarity above 0.1
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity); // assemble the recommender
List<RecommendedItem> recommendations = recommender.recommend(2, 3); // recommend 3 items for user 2
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation); // print each recommended item and its estimated preference
}
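The 0.1 threshold is only one way to define the neighborhood. If a fixed neighborhood size is preferable, Mahout's NearestNUserNeighborhood (from org.apache.mahout.cf.taste.impl.neighborhood) can be dropped in instead. A minimal sketch, where the neighborhood size of 10 is an arbitrary value chosen for illustration:
// Alternative neighborhood: the 10 most similar users rather than a similarity threshold.
// The value 10 is illustrative only; tune it for your data.
UserNeighborhood nearestN = new NearestNUserNeighborhood(10, similarity, model);
UserBasedRecommender nearestNRecommender = new GenericUserBasedRecommender(model, nearestN, similarity);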
1.3. Run it
Wrap the snippet above in a main method and it will run; a complete class is sketched below, and the output looks like the log that follows.
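A minimal runnable wrapper, assuming the standard Mahout Taste packages are on the classpath; the class name UserCfDemo is my own choice, not from the original post.
// A minimal runnable wrapper for the snippet above -- a sketch, not the original author's exact class.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserCfDemo {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("data/dataset.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
        UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recommendations = recommender.recommend(2, 3); // 3 items for user 2
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}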
INFO - Creating FileDataModel for file data\dataset.csv
INFO - Reading file info...
INFO - Read lines: 32
INFO - Processed 4 users
DEBUG - Recommending items for user ID '2'
DEBUG - Recommendations are: [RecommendedItem[item:12, value:4.8328104], RecommendedItem[item:13, value:4.6656213], RecommendedItem[item:14, value:4.331242]]
RecommendedItem[item:12, value:4.8328104]
RecommendedItem[item:13, value:4.6656213]
RecommendedItem[item:14, value:4.331242]
1.4. Running on tens of millions of records
First generate the data. Here is the code used to produce it:
public static void main(String[] args) {
    int batchSize = 10000;
    int recordsCnt = 30000000; // 30M records
    String fileName = "D:/tmp/recommandtestdata2.csv";
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < recordsCnt; i++) {
        if (sb == null) {
            sb = new StringBuffer();
        }
        sb.append(getRandInt(1000000, 1)); // userId in [1, 999999], roughly 1M distinct users
        sb.append(",");
        sb.append(getRandInt(1000, 1));    // itemId in [1, 999]
        sb.append(",");
        sb.append(getRandInt(5, 1));       // preference value in [1, 4]
        sb.append("\n");
        if (i > 0 && (i % batchSize == 0)) {
            System.out.println(i);
            write2File(sb.toString(), fileName); // append this batch to the file
            sb = null;
        }
    }
    // Note: the last partial batch (9,999 records) is never flushed, which is why the
    // log below reports 29,990,001 lines read rather than 30,000,000.
}
public static int getRandInt(int max, int min) {
    return (int) (Math.random() * (max - min) + min); // uniform random int in [min, max)
}
public static void write2File(String str, String path) {
    // Open the file, seek to the end, and append the batch as UTF-8 bytes.
    RandomAccessFile myFileStream;
    try {
        myFileStream = new RandomAccessFile(path, "rw");
        myFileStream.seek(myFileStream.length());
        myFileStream.write(str.getBytes("UTF-8"));
        myFileStream.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
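Reopening the file with RandomAccessFile for every batch works, but the more conventional way to append is a buffered writer opened in append mode. A minimal alternative sketch (my own variation, not part of the original code):
// Alternative appender; requires java.io.BufferedWriter, java.io.FileWriter, java.io.IOException.
public static void write2File(String str, String path) {
    // FileWriter's second argument 'true' opens the file in append mode;
    // try-with-resources closes the writer even if an exception is thrown.
    try (BufferedWriter out = new BufferedWriter(new FileWriter(path, true))) {
        out.write(str);
    } catch (IOException e) {
        e.printStackTrace();
    }
}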
Running the recommender on this file with the default heap fails, as expected:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$1.apply(GenericUserPreferenceArray.java:251)
After increasing the heap with the JVM option -Xmx2512M (set on the java command line or in the IDE run configuration), it runs:
INFO - Creating FileDataModel for file D:\tmp\recommandtestdata2.csv
INFO - Reading file info...
INFO - Processed 1000000 lines
INFO - Processed 2000000 lines
...... (output omitted)
INFO - Processed 28000000 lines
INFO - Processed 29000000 lines
INFO - Read lines: 29990001
INFO - Processed 10000 users
INFO - Processed 20000 users
...... (output omitted)
INFO - Processed 980000 users
INFO - Processed 990000 users
INFO - Processed 999999 users
DEBUG - Recommending items for user ID '2'
DEBUG - Recommendations are: [RecommendedItem[item:458, value:2.5922961], RecommendedItem[item:842, value:2.5879922], RecommendedItem[item:802, value:2.5861814]]
RecommendedItem[item:458, value:2.5922961]
RecommendedItem[item:842, value:2.5879922]
RecommendedItem[item:802, value:2.5861814]
Note: I ran this on an ordinary 64-bit personal PC, which bears out the Mahout advice quoted at the start: a single machine handles data on the order of 100M user-item entries just fine.