I'm posting my answer even though another answer has already been accepted; the accepted answer relies on a deprecated function. Moreover, that deprecated function is based on Singular Value Decomposition (SVD), which (although perfectly valid) is the more memory- and processor-intensive of the two general techniques for computing PCA. This is particularly relevant here because of the size of the data array in the OP: using covariance-based PCA, the array used in the computation flow is just 144 x 144, rather than 26424 x 144 (the dimensions of the original data array).
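To make the size difference concrete, here is a minimal sketch (the 26424 x 144 shape comes from the OP's description; the data here are just random placeholders):

import numpy as NP

# stand-in for the OP's data: 26424 observations x 144 variables
data = NP.random.rand(26424, 144)

# covariance-based PCA only ever decomposes this 144 x 144 array ...
R = NP.cov(data, rowvar=False)
print(R.shape)    # (144, 144)

# ... whereas SVD-based PCA factors the full 26424 x 144 array directly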

Here's a simple working implementation of PCA using the linalg module from SciPy. Because this implementation first calculates the covariance matrix and then performs all subsequent computations on that array, it uses far less memory than SVD-based PCA.

(Aside from the import statement, which would become from numpy import linalg as LA, the linalg module in NumPy can also be used with the code below without any changes.)
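For instance, switching to NumPy's implementation is just a one-line change (linalg.eigh exists in both libraries with the same calling convention):

from numpy import linalg as LA   # instead of: from scipy import linalg as LA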

The two key steps in this PCA implementation are:

- calculating the covariance matrix
- taking the eigenvectors and eigenvalues of that matrix

In the function below, the parameter dims_rescaled_data refers to the desired number of dimensions in the rescaled data matrix; its default value is just two dimensions, but the code below is not limited to two: it can be any value smaller than the number of columns of the original data array.

import numpy as NP
from scipy import linalg as LA

def PCA(data, dims_rescaled_data=2):
    """
    returns: data transformed in dims_rescaled_data dims/columns, plus the
    eigenvalues and eigenvectors of the covariance matrix
    pass in: data as 2D NumPy array
    """
    m, n = data.shape
    # mean center the data (on a copy, so the caller's array is not mutated)
    data = data - data.mean(axis=0)
    # calculate the covariance matrix
    R = NP.cov(data, rowvar=False)
    # calculate eigenvectors & eigenvalues of the covariance matrix;
    # use 'eigh' rather than 'eig' since R is symmetric,
    # and the performance gain is substantial
    evals, evecs = LA.eigh(R)
    # sort the eigenvalues in decreasing order
    idx = NP.argsort(evals)[::-1]
    # sort the eigenvectors according to the same index
    evecs = evecs[:, idx]
    evals = evals[idx]
    # select the first n eigenvectors (n is the desired dimension
    # of the rescaled data array, i.e. dims_rescaled_data)
    evecs = evecs[:, :dims_rescaled_data]
    # carry out the transformation on the data using the eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return NP.dot(evecs.T, data.T).T, evals, evecs

def test_PCA(data):
    '''
    test by attempting to recover the original data array from
    the eigenvectors of its covariance matrix & comparing that
    'recovered' array with the original data
    '''
    # keep every dimension so the projection is lossless and invertible
    data_resc, evals, evecs = PCA(data, dims_rescaled_data=data.shape[1])
    # rotate the transformed data back onto the original axes,
    # then add back the mean that PCA subtracted
    data_recovered = NP.dot(data_resc, evecs.T)
    data_recovered += data.mean(axis=0)
    assert NP.allclose(data, data_recovered)

def plot_pca(data):
    from matplotlib import pyplot as MPL
    clr1 = '#2026B2'
    fig = MPL.figure()
    ax1 = fig.add_subplot(111)
    # PCA returns a 3-tuple: (rescaled data, eigenvalues, eigenvectors)
    data_resc, evals, evecs = PCA(data)
    # scatter plot of the data projected onto the first two principal axes
    ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)
    MPL.show()

>>> # iris, probably the most widely used reference data set in ML
>>> df = "~/iris.csv"
>>> data = NP.loadtxt(df, delimiter=',')
>>> # remove class labels
>>> data = data[:,:-1]
>>> plot_pca(data)
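If you want to sanity-check the covariance route against the SVD route that the accepted answer relied on, here is a minimal sketch; pca_svd is a helper written just for this comparison, and individual components can come out with flipped signs (normal for eigenvector-based methods), so the magnitudes are what get compared:

def pca_svd(data, dims_rescaled_data=2):
    # SVD-based PCA, for comparison: decomposes the full mean-centered
    # data array rather than its (much smaller) covariance matrix
    centered = data - data.mean(axis=0)
    U, s, Vt = NP.linalg.svd(centered, full_matrices=False)
    # rows of Vt are the principal axes; project onto the first few
    return NP.dot(centered, Vt[:dims_rescaled_data].T)

# per-component signs can differ between the two methods,
# so compare magnitudes rather than raw values
assert NP.allclose(NP.abs(PCA(data)[0]), NP.abs(pca_svd(data)))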

The plot below is a visual representation of this PCA function on the iris data. As you can see, the 2D transformation cleanly separates class I from classes II and III (though not class II from class III, which in fact requires another dimension).
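To put a number on that "requires another dimension" remark, the eigenvalues PCA already returns can be turned into a cumulative explained-variance ratio (a small addition, not part of the recipe above):

>>> _, evals, _ = PCA(data)
>>> # evals are sorted in decreasing order, so the running sum shows how
>>> # much total variance the first one, two, three, ... axes capture
>>> (evals / evals.sum()).cumsum()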