读书笔记：机器学习实战【第7章：利用Adaboost元算法提高分类性能】

利用Adaboost元算法提高分类性能

当做重要决定的时候，大家可能都会考虑吸取多个专家，而非一个人的意见，同样的思想用在机器学习中就是元算法（meta-algorithm）背后的思路，元算法是对其他算法进行组合的一种方式。

7.1 基于数据集多重抽样的分类器

前面已经介绍了五种不同的分类算法，它们各有优缺点，我们自然可以将不同的分类器组合起来，而这种组合结果则被称为集成方法（ensemble method）或者元算法（meta-algorithm）。使用集成方法时会有多种形式：可以是不同算法的集成，也可以是同一算法在不同设置下的集成，还可以是数据集不同部分分配给不同分类器之后的集成。接下来，我们将介绍基于同一种分类器多个不同实例的两种计算方法。在这些方法中，数据集也会不断变化，而后应用于不同的实例分类器上。最后，我们会讨论如何利用机器学习问题的通用框架，来应用AdaBoost算法。

AdaBoost：

优点：泛化错误率低，易编码，可以应用在大部分分类器上，无参数调整。
缺点：对离群点敏感。
适用数据类型：数值型和标称型数据。

7.11 bagging：基于数据随机重抽样的分类器构建方法

自举汇聚法（bootstrap aggregating），也称为bagging方法，是在从原始数据集选择S次后得到S个新数据集的一种技术。新数据集和原数据集的大小相等，每个数据集都是通过在原始数据集中随机选择一个样本进行替换而得到的。这里的替换就意味着可以多次选择同一样本。（其实就是进行和样本个数相同次数的放回抽样，每次抽取一个样本。）这一性质允许新数据集中出现重复的值，而原始数据集的某些值在新集合中就不再出现。

在S个数据集建好之后，将某个学习算法分别作用于每个数据集，就得到了S个分类器。当我们要对新数据进行分类时，就可以应用这S个分类器进行分类。与此同时，选择分类器投票结果中最多的类别作为最后的分类结果。

7.12 boosting

boosting是一种与bagging很类似的技术，不论是在boosting还是在bagging中，所使用的多个分类器的类型都是一致的。但是在boosting中，不同的分类器是通过串行获得：每个新分类器都根据已经训练处的分类器进行训练。boosting是通过集中关注被已有分类器错分的那些数据来获得新的分类器。

由于boosting分类的结果是基于所有分类器的加权求和结果的，因此boosting和bagging不太一样，bagging中的分类器权重是相等的，而boosting中的分类器权重并不相等，每个权重代表的是其对应分类器在上一轮迭代中的成功度。

boosting方法拥有多个版本，本章将只关注其中一个最流行的版本AdaBoost。

AdaBoost的一般流程：
- 1.收集数据：可以使用任意方法
- 2.准备数据：依赖于所使用的弱分类器类型，本章使用的是单层决策树，这种分类器可以处理任何数据类型。当然也可以使用任意分类器作为弱分类器，前面讲过的任一分类器都可以。简单分类器作为弱分类器的效果会更好。
- 3.分析数据：可以使用任意方法
- 4.训练算法：AdaBoost的大部分时间用在训练上，分类器将多次在同一数据集上训练弱分类器。
- 5.测试算法：计算分类的错误率。
- 6.使用算法：同SVM一页，AdaBoost预测两个类别中的一个。如果想把它应用到多个类别，就要对AdaBoost进行修改。

7.2 训练算法：基于错误提升分类器的性能

能否使用弱分类器和多个实例构建一个强分类器？这是一个非常有趣的理论问题，这里的弱意味着分类器的性能比随机猜测要略好，但是不会好太多，就是说，在二分类情况下弱分类器的错误率会高于50%，而“强”分类器的错误率将会低很多，AdaBoost算法就脱胎于上述理论问题。

AdaBoost是adaptive boosting（自适应Boosting）的缩写，其运行过程如下：训练数据中的每个样本，并赋予其一个权重，这些权重构成了向量D。一开始，这些权重都初始化成相等值，首先在训练数据上训练一个弱分类器并计算该分类器的错误率，然后在同一数据集上再次训练弱分类器。在分类器的第二次训练当中，将会重新调整每个样本的权重，其中第一次分对的样本的权重将会降低，而第一次分错的样本的权重将会提高。

为了从所有的弱分类器中得到最终的分类结果，AdaBoost为每个分类器分配了一个权重alpha，这些alpha值是基于每个弱分类器的错误率进行计算的，其中，错误率ε的定义为：ε=（未正确分类的样本数目/所有样本数目），而alpha的计算公式如下：

α = 1 2 ln ( 1 - ε ε )

$\alpha {\rm{ = }}\frac{1}{2}\ln (\frac{{1 - \varepsilon }}{\varepsilon })$

AdaBoost算法的流程如下图所示：

这里写图片描述

计算出alpha值之后，可以对权重向量D进行更新，以使得那些正确分类的样本的权重降低而错分样本的权重升高，D的计算方法如下。

如果某个样本被正确分类，该样本的权重更改为：

D (t + 1) i = D ( t ) i e - α S u m ( D )

$D_{_i}^{(t + 1)} = \frac{{D_i^{(t)}{e^{ - \alpha }}}}{{Sum(D)}}$

如果某个样本被错误分类，该样本的权重则更改为：

D (t + 1) i = D ( t ) i e α S u m ( D )

$D_{_i}^{(t + 1)} = \frac{{D_i^{(t)}{e^{ \alpha }}}}{{Sum(D)}}$

在计算出向量D之后，AdaBoost又开始进入下一轮迭代，算法会不断重复训练和调整权重的过程，直到训练错误率为0或者弱分类器的数目达到用户指定值为止。

接下来，我们将建立完整的AdaBoost算法，在此之前，我们首先必须通过一些代码来建立弱分类器以及保存数据集的权重。

7.3 基于单层决策树构建弱分类器

单层决策树（decision stump，也称决策树桩）是一种简单的决策树，前面我们已经介绍了决策树的工作原理，接下来将构建一个单层的决策树，而它仅基于单个特征来做决策。由于这棵树只有一次分裂过程，因此它实际上就是一个树桩。

首先，加入如下代码生成数据：

import numpy as np
def loadSimData():
    datMat = np.matrix([[1.,2.1],[2.,1.1],[1.3,1.],[1.,1.],[2.,1.]])
    classLabels = [1.0,1.0,-1.0,-1.0,1.0]
    return datMat,classLabels

下图给出了上述数据集的示意图：

这里写图片描述

如果想要试着从某个坐标轴上选择一个值（即选择一条与坐标轴平行的直线），来将所有圆形点和方形点分开，这显然是不可能的。这就是单层决策树无法处理的一个著名的问题，而通过使用单棵决策树，我们就可以构建出一个能对该数据集完全正确分类的分类器。

有了数据，接下来就可以通过构建多个函数来建立单层决策树。

第一个函数将用于测试是否有某个值小于或大于我们正在测试的阈值，第二个函数复杂一些：它会在一个加权数据集中循环，并找到具有最低错误率的单层决策树。

这个程序的伪代码看起来大致如下：

将最小错误率minError设为+∞
对数据集中的每一个特征（第一层循环）：
    对每个步长（第二层循环）：
        对每个不等号（第三层循环）：
            建立一棵单层决策树并利用加权数据集对它进行测试
            如果错误率低于minError，则将当前决策树设为最佳单层决策树
返回最佳单层决策树

代码如下：

datMat,classLabels = loadSimData()

def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#数据集，检查变量维数，阈值，
    retArray = np.ones((np.shape(dataMatrix)[0],1)) #样本数量*1的权重向量
    if threshIneq == 'lt': 
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0 #将该维度的变量中所有小于等于阈值的赋-1.0
    else:
        retArray[dataMatrix[:,dimen] >  threshVal] = -1.0 #将该维度的变量中所有大于阈值的赋-1.0
    return retArray

def buildStump(dataArr,classLabels,D):
    dataMatrix = np.mat(dataArr);labelMat = np.mat(classLabels)
    m,n = np.shape(dataMatrix)
    numSteps = 10.0;bestStump = {} #注意步数设为10步
    minError = np.inf

    for i in range(n):#对于每一维的变量
        rangeMin = dataMatrix[:,i].min();rangeMax = dataMatrix[:,i]  #获得值域的上下限
        stepSize = (rangeMax - rangeMin)/numSteps #极差除以步数就是步长
        for j in range(-1,int(numSteps+1)):
            for inequal in ['lt','gt']:#分为两种情况，lt就是小于阈值的视为错误，gt则是大于阈值视为错误
                threshVal = (rangeMin+float(j)*stepSize) #最小值+步数*步长=阈值
                predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal) #获得的是一个单棵决策树预测结果
                errArr = np.mat(ones((m,1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T*errArr #向量相乘得到的是出错的数量
                print "split: dim %d, thresh %.2f, thresh ineqal:%s, the weighted error is %.3f" % (i,threshVal,inequal,weightedError)
                if weightedError < minError: #保存该树桩
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal

    return bestStump,minError,bestClasEst

为了了解实际运行过程，输入以下命令：

In [152]: D = np.mat(np.ones((5,1)) / 5)

In [153]: buildStump(datMat,classLabels,D)
split: dim 0, thresh 0.90, thresh ineqal:lt, the weighted error is 0.400
split: dim 0, thresh 0.90, thresh ineqal:gt, the weighted error is 0.600
split: dim 0, thresh 1.00, thresh ineqal:lt, the weighted error is 0.400
split: dim 0, thresh 1.00, thresh ineqal:gt, the weighted error is 0.600
split: dim 0, thresh 1.10, thresh ineqal:lt, the weighted error is 0.400
split: dim 0, thresh 1.10, thresh ineqal:gt, the weighted error is 0.600
split: dim 0, thresh 1.20, thresh ineqal:lt, the weighted error is 0.400
split: dim 0, thresh 1.20, thresh ineqal:gt, the weighted error is 0.600
split: dim 0, thresh 1.30, thresh ineqal:lt, the weighted error is 0.200
split: dim 0, thresh 1.30, thresh ineqal:gt, the weighted error is 0.800
split: dim 0, thresh 1.40, thresh ineqal:lt, the weighted error is 0.200
split: dim 0, thresh 1.40, thresh ineqal:gt, the weighted error is 0.800
split: dim 0, thresh 1.50, thresh ineqal:lt, the weighted error is 0.200
split: dim 0, thresh 1.50, thresh ineqal:gt, the weighted error is 0.800
split: dim 0, thresh 1.60, thresh ineqal:lt, the weighted error is 0.200
split: dim 0, thresh 1.60, thresh ineqal:gt, the weighted error is 0.800
split: dim 0, thresh 1.70, thresh ineqal:lt, the weighted error is 0.200
split: dim 0, thresh 1.70, thresh ineqal:gt, the weighted error is 0.800
split: dim 0, thresh 1.80, thresh ineqal:lt, the weighted error is 0.200
split: dim 0, thresh 1.80, thresh ineqal:gt, the weighted error is 0.800
split: dim 0, thresh 1.90, thresh ineqal:lt, the weighted error is 0.200
split: dim 0, thresh 1.90, thresh ineqal:gt, the weighted error is 0.800
split: dim 0, thresh 2.00, thresh ineqal:lt, the weighted error is 0.600
split: dim 0, thresh 2.00, thresh ineqal:gt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.00, thresh ineqal:lt, the weighted error is 0.200
split: dim 1, thresh 1.00, thresh ineqal:gt, the weighted error is 0.800
split: dim 1, thresh 1.11, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.11, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.22, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.22, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.33, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.33, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.44, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.44, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.55, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.55, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.66, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.66, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.77, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.77, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.88, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.88, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 1.99, thresh ineqal:lt, the weighted error is 0.400
split: dim 1, thresh 1.99, thresh ineqal:gt, the weighted error is 0.600
split: dim 1, thresh 2.10, thresh ineqal:lt, the weighted error is 0.600
split: dim 1, thresh 2.10, thresh ineqal:gt, the weighted error is 0.400
Out[153]: 
({'dim': 0, 'ineq': 'lt', 'thresh': 1.3}, matrix([[ 0.2]]), array([[-1.],
        [ 1.],
        [-1.],
        [-1.],
        [ 1.]]))

上述单层决策树的生成函数是决策树的一个简化版本，它就是所谓的弱学习器，即弱分类算法，到现在为止，我们构建了单层决策树并生成了程序，做好了过渡到完整AdaBoost的准备。

7.4 完整AdaBoost算法的实现

在上一节，我们构建了一个基于加权输入值进行决策的分类器，现在，我们拥有了实现一个完整AdaBoost算法所需要的所有信息，我们将利用7.3节构建的单层决策树来实现7.2节中给出提纲的算法。

整个实现的伪代码如下：

对每次迭代：
    利用buildStump()函数找到最佳的单层决策树
    将最佳单层决策树加入到单层决策树数组
    计算alpha
    计算新的权重向量D
    更新累计类别估计值
    如果错误率等于0.0，则退出循环

代码如下：

def adaBoostTrainDS(dataArr,classLabels,numIt=40):
    weakClassArr = []
    m = np.shape(dataArr)[0] #样本个数
    D = np.mat(np.ones((m,1))/m) #每个样本的权重，初始化为1/m
    aggClassEst = np.mat(np.zeros((m,1)))
    for i in range(numIt):#迭代次数
        bestStump,error,classEst = buildStump(dataArr,classLabels,D) #获得一个弱分类器
        print 'D:',D.T #样本权重向量D
        alpha = float(0.5*np.log((1.0-error)/max(error,1e-16)))
        bestStump['alpha'] = alpha #每个树桩（弱分类器）的alpha权重
        weakClassArr.append(bestStump) #把这个弱分类器加入弱分类器集合
        print 'classEst:',classEst.T #树桩向量

        #为下一次迭代计算D：
        expon = np.multiply(-1*alpha*np.mat(classLabels).T,classEst)
        D = np.multiply(D,np.exp(expon))
        D = D/D.sum()

        #错误率累加计算：
        aggClassEst += alpha*classEst
        print 'aggClassEst:',aggClassEst.T
        aggErrors = np.multiply(np.sign(aggClassEst) != np.mat(classLabels).T, np.ones((m,1)))
        errorRate = aggErrors.sum()/m

        print 'total error:',errorRate,'\n'
        if errorRate == 0.0:break
    return weakClassArr

输入运行命令：

In [168]: classifierArray = adaBoostTrainDS(datMat,classLabels,9)
D: [[ 0.2  0.2  0.2  0.2  0.2]]
classEst: [[-1.  1. -1. -1.  1.]]
aggClassEst: [[-0.69314718  0.69314718 -0.69314718 -0.69314718  0.69314718]]
total error: 0.2 

D: [[ 0.5    0.125  0.125  0.125  0.125]]
classEst: [[ 1.  1. -1. -1. -1.]]
aggClassEst: [[ 0.27980789  1.66610226 -1.66610226 -1.66610226 -0.27980789]]
total error: 0.2 

D: [[ 0.28571429  0.07142857  0.07142857  0.07142857  0.5       ]]
classEst: [[ 1.  1.  1.  1.  1.]]
aggClassEst: [[ 1.17568763  2.56198199 -0.77022252 -0.77022252  0.61607184]]
total error: 0.0

观察classifierArray的值：

In [169]: classifierArray
Out[169]: 
[{'alpha': 0.6931471805599453, 'dim': 0, 'ineq': 'lt', 'thresh': 1.3},
 {'alpha': 0.9729550745276565, 'dim': 1, 'ineq': 'lt', 'thresh': 1.0},
 {'alpha': 0.8958797346140273,
  'dim': 0,
  'ineq': 'lt',
  'thresh': 0.90000000000000002}]

该数组包含三部词典（树桩），其中已经包含了分类所需要的所有信息，此时，一个分类器已经构建成功，而且只要我们愿意，随时都可以将错误率降低到0，那么测试错误率如何呢？为了观察测试错误率，我们需要编写分类的一些代码。

7.5 测试算法：基于AdaBoost的分类

一旦拥有多个弱分类器以及对应的alpha值，进行测试就变得相当容易了。现在需要做的只是把弱分类器的训练过程从程序中抽出来，然后应用到某个具体的实例中去。每个弱分类器的结果以其对应的alpha值作为权重。所有这些弱分类器的结果加权求和就得到最后的结果，代码如下：

def adaClassify(datToClass,classifierArr):
    dataMatrix = np.mat(datToClass)
    m = np.shape(dataMatrix)[0]
    aggClassEst = np.mat(np.zeros((m,1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],classifierArr[i]['thresh'],classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha']*classEst
        print aggClassEst
    return np.sign(aggClassEst)

输入运行命令：

In [174]: adaClassify([0,0],classifierArray)
[[-0.69314718]]
[[-1.66610226]]
[[-2.56198199]]
Out[174]: matrix([[-1.]])

可以看到，随着迭代的进行，数据点[0,0]的分类结果越来越强。

7.6 示例：在一个难数据集上应用AdaBoost

本节我们将在第4章给出的马疝病数据集上应用AdaBoost分类器，我们想要知道利用多个单层决策树和AdaBoost能不能预测的更加准确。

首先输入加载数据的代码：

def loadDataSet(fileName):
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = line.strip().split('\t')
        dataMat.append([float(lineArr[0]), float(lineArr[1])])
        labelMat.append(float(lineArr[2]))
    return dataMat,labelMat

然后运行：

In [21]: datArr,labelArr = loadDataSet('horseColicTraining2.txt')

In [22]: classifierArray = adaBoostTrainDS(datArr,labelArr,10)
total error: 0.284280936455 

total error: 0.284280936455 

total error: 0.247491638796 

total error: 0.247491638796 

total error: 0.254180602007 

total error: 0.240802675585 

total error: 0.240802675585 

total error: 0.220735785953 

total error: 0.247491638796 

total error: 0.230769230769 


In [23]: testArr,testLabelArr = loadDataSet('horseColicTest2.txt')

In [24]: prediction10 = adaClassify(testArr,classifierArray)
In [25]: errArr = np.mat(np.ones((67,1)))

In [26]: errArr[prediction10!=np.mat(testLabelArr).T].sum()
Out[26]: 16.0

In [27]: 16.0/67
Out[27]: 0.23880597014925373

最终得到了错误率0.23880597014925373，可以看到，在训练集上分类器表现出和测试集相对一致的水平。