Although another answer has already been accepted, I'm posting mine anyway, because the accepted answer relies on a deprecated function. What's more, that deprecated function is based on singular value decomposition (SVD), which (though perfectly valid) is the more memory- and processor-intensive of the two general techniques for computing PCA. This is particularly relevant here because of the size of the data array in the OP: with covariance-based PCA, the array used in the computation flow is just 144 x 144, rather than 26424 x 144 (the dimensions of the original data array).
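To see why that matters: the covariance matrix of a tall, skinny array depends only on its column count, so all of the heavy computation happens on a small square array. A minimal sketch (the random array is just a stand-in for the OP's data):

import numpy as NP

X = NP.random.rand(26424, 144)    # stand-in for the OP's data array
R = NP.cov(X, rowvar=False)       # treat each column as a variable
print(R.shape)                    # (144, 144) -- all later work happens at this size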
Here is a simple working implementation of PCA using the linalg module from SciPy. Because this implementation first computes the covariance matrix and then performs all of the subsequent computations on that array, it uses far less memory than SVD-based PCA.
(The linalg module in NumPy can also be used with no changes to the code below, aside from the import statement, which would be from numpy import linalg as LA.)
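For instance, here is a trivial check, on a small random covariance matrix, that the two eigh implementations are interchangeable here:

import numpy, scipy.linalg

R = numpy.cov(numpy.random.rand(10, 3), rowvar=False)
w1, v1 = numpy.linalg.eigh(R)
w2, v2 = scipy.linalg.eigh(R)
assert numpy.allclose(w1, w2)    # same eigenvalues from either module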
The two key steps in this PCA implementation are:

* calculating the covariance matrix, and
* taking the eigenvectors & eigenvalues of that matrix
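As a quick sketch of just these two steps on a toy 4 x 2 array (the values here are arbitrary):

import numpy as NP
from scipy import linalg as LA

X = NP.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]])
Xc = X - X.mean(axis=0)          # mean-center the columns first
R = NP.cov(Xc, rowvar=False)     # step 1: the 2 x 2 covariance matrix
evals, evecs = LA.eigh(R)        # step 2: its eigenvalues & eigenvectors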
In the function below, the parameter dims_rescaled_data refers to the desired number of dimensions in the rescaled data matrix; it defaults to just two dimensions, but the code below is not limited to two: it accepts any value less than the column count of the original data array.

def PCA(data, dims_rescaled_data=2):
"""
returns: data transformed in 2 dims/columns + regenerated original data
pass in: data as 2D NumPy array
"""
import numpy as NP
from scipy import linalg as LA
m, n = data.shape
# mean center the data
data -= data.mean(axis=0)
# calculate the covariance matrix
R = NP.cov(data, rowvar=False)
# calculate eigenvectors & eigenvalues of the covariance matrix
# use 'eigh' rather than 'eig' since R is symmetric,
# the performance gain is substantial
evals, evecs = LA.eigh(R)
# sort eigenvalue in decreasing order
idx = NP.argsort(evals)[::-1]
evecs = evecs[:,idx]
# sort eigenvectors according to same index
evals = evals[idx]
# select the first n eigenvectors (n is desired dimension
# of rescaled data array, or dims_rescaled_data)
evecs = evecs[:, :dims_rescaled_data]
# carry out the transformation on the data using eigenvectors
# and return the re-scaled data, eigenvalues, and eigenvectors
return NP.dot(evecs.T, data.T).T, evals, evecs
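The sorted eigenvalues returned alongside the transformed data also make it easy to see how much variance the retained dimensions capture. This variance_explained helper is not part of the implementation above, just a hypothetical add-on:

def variance_explained(evals):
    # fraction of the total variance captured by each principal
    # component, in the same (descending) order as the eigenvalues
    return evals / evals.sum()

For the OP's 144-column array, for example, variance_explained(evals)[:2].sum() would report the share of the variance kept by a 2D rescaling.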
def test_PCA(data):
    '''
    test by transforming the data with all of its eigenvectors
    retained, inverting that transformation, & comparing the
    'recovered' array with the original data
    '''
    import numpy as NP
    m, n = data.shape
    data_resc, evals, evecs = PCA(data, dims_rescaled_data=n)
    # invert the transformation: project back onto the original axes,
    # then add the column means back in
    data_recovered = NP.dot(data_resc, evecs.T)
    data_recovered += data.mean(axis=0)
    assert NP.allclose(data, data_recovered)
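A quick way to exercise this check (the random array here is just a stand-in for any numeric 2D array):

import numpy as NP
test_PCA(NP.random.rand(50, 4))    # raises AssertionError if the round trip fails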
def plot_pca(data):
    from matplotlib import pyplot as MPL
    clr1 = '#2026B2'
    fig = MPL.figure()
    ax1 = fig.add_subplot(111)
    # PCA returns three values; only the transformed data is plotted
    data_resc, evals, evecs = PCA(data)
    ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)
    MPL.show()
>>> # iris, probably the most widely used reference data set in ML
>>> import os.path
>>> import numpy as NP
>>> df = os.path.expanduser("~/iris.csv")
>>> data = NP.loadtxt(df, delimiter=',')
>>> # remove the class labels
>>> data = data[:, :-1]
>>> plot_pca(data)
The plot below is a visual representation of this PCA function applied to the iris data. As you can see, the 2D transformation cleanly separates class I from classes II and III (but not class II from class III, which in fact requires another dimension).