Ashing's Blog: 機器學習(7)--利用核主成分分析(Kernel PCA)處理非線性對應

在這之前所探討的輸入數據都是假設為可線性分離的，如果輸入數據不是線性可分離的，實務上的數據往往是這樣，那麼標準PCA，所設計的線性轉換技術就不是最好的選擇，在本節介紹核PCA(kernel PCA)，它與核支援向量機(kernel SVM)類似，利用核技巧來將非線性分離的數據轉換到一個新的特徵子空間，而轉換之後便可以做線性分離。

核函數與核技巧：

類似核SVM的作法，透過核函數與核技巧，將原始數據投影到更高維的新特徵空間，而在新特徵空間，數據類別是可以線性分離的。要轉換d維空間的樣本到更高維的k維空間，我們需要定義一個非線性對應函數Φ：
如下方式，原本數據X有兩個特徵行向量，為2維d=2。利用函數Φ可以對應到3維空間k=3。

整體上來說就是，透過核PCA，我們執行一個非線性對應函數將數據轉換到一個更高維的空間，並在這個高維的空間，使用標準PCA，再將數據投影回到一個較低維度的子空間，在這個低維度的子空間中，樣本即可使用線性分類器來分離。
這個方法的缺點是非常耗費計算資源，因此我們可以使用核技巧，使用核技巧，我們可以計算兩個由原始的特徵空間所產生的高維特徵向量之間的相似度。
套用之前標準PCA的推導(此略)，我們可以定義一核矩陣(相似度矩陣)K：

核函數k則可以替代原本需使用的非線性對應函數Φ：

這樣我們就不用去計算原本需由非線性對應函數Φ去處理的樣本中兩兩數據點的內積(公式右側)，因此核函數也可以理解為，一個計算兩個向量內積的函數，也就是計算相似度的函數。而不像標準PCA，會建立一個轉換矩陣，經過核PCA處理之後，所得到的是已經投影完成的樣本。

在此我們使用的核函數為RBF(Radial Basis Function Kernel)：

實作一個「RBF核」(Radial Basis Function Kernel)的PCA，我們可以定義如下的三個步驟：

使「RBF核」的PCA來做降維的一個缺點是，我們必須先指定參數γ，

然而這並沒有固定的值，決定這值需要靠經驗。

底下利用scikit-learn 產生半月形資料及同心圓資料，分別利用標準PCA及RBF核PCA來轉換原始數據資料。

如下圖為一半月形數據，可以看出是不可線性分離的資料

<圖一>半月形數據

下圖二左側是使用標準PCA算法來轉換原始的半月形數據，而右側是使用RBF核PCA來轉換數據，可以看出左側下使用標準PCA轉換數據沒辦法做線性分離，而右側的RBF核PCA轉換後則可以線性分離。

<圖二>

<圖三>為一同心圓數據，不可線性分離

下圖四左側是使用標準PCA算法來轉換原始的同心圓數據，而右測是使用RBF核PCA來轉換數據，可以看出左側下使用標準PCA轉換數據沒辦法做線性分離，而右側的RBF核PCA轉換後則可以線性分離。

<圖四>

<RBF核PCA範例程式：>

from scipy.spatial.distance import pdist, squareform
from scipy import exp
from scipy.linalg import eigh
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from numpy import *
def rbf_kernel_pca(X, gamma, n_components):
 """
 RBF kernel PCA implementation.
 Parameters
 ------------
 X: {NumPy ndarray}, shape = [n_samples, n_features]
 gamma: float
   Tuning parameter of the RBF kernel
 n_components: int
   Number of principal components to return
 Returns
 ------------
  X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
    Projected dataset   
 """
 # Calculate pairwise squared Euclidean distances
 # in the MxN dimensional dataset.
 sq_dists = pdist(X, 'sqeuclidean')
 modX=sqrt(sum(power((X[0]-X[1]),2)))
 # Convert pairwise distances into a square matrix.
 mat_sq_dists = squareform(sq_dists)
 # Compute the symmetric kernel matrix.
 K = exp(-gamma * mat_sq_dists)
 # Center the kernel matrix.
 N = K.shape[0]
 one_n = np.ones((N, N)) / N
 K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)
 # Obtaining eigenpairs from the centered kernel matrix
 # numpy.eigh returns them in sorted order
 eigvals, eigvecs = eigh(K)
 # Collect the top k eigenvectors (projected samples)
 X_pc = np.column_stack((eigvecs[:, -i]
       for i in range(1, n_components + 1)))
 return X_pc
 
#產生半月形資料 
X, y = make_moons(n_samples=100, random_state=123)
plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue', marker='o', alpha=0.5)
plt.tight_layout()
plt.show() 
#使用標準PCA 處裡資料  無法線性分割
scikit_pca = PCA(n_components=2)
X_spca = scikit_pca.fit_transform(X)
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(7, 3))
#X軸PC1，Y軸PC2
ax[0][0].scatter(X_spca[y == 0, 0], X_spca[y == 0, 1],
     color='red', marker='^', alpha=0.5)
ax[0][0].scatter(X_spca[y == 1, 0], X_spca[y == 1, 1],
     color='blue', marker='o', alpha=0.5)
#X軸PC1，Y軸為0+offset
ax[1][0].scatter(X_spca[y == 0, 0], np.zeros((50, 1)) + 0.02,  #Y 軸 :50個0+offset 0.2
     color='red', marker='^', alpha=0.5)
ax[1][0].scatter(X_spca[y == 1, 0], np.zeros((50, 1)) - 0.02,  #Y 軸 :50個0-offset 0.2
     color='blue', marker='o', alpha=0.5)
ax[0][0].set_xlabel('PC1')
ax[0][0].set_ylabel('PC2')
ax[1][0].set_ylim([-1, 1])
ax[1][0].set_yticks([])
ax[1][0].set_xlabel('PC1')
#使用RBF PCA 處裡資料可線性分割
from matplotlib.ticker import FormatStrFormatter
X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)
#fig, ax = plt.subplots(nrows=2,ncols=2, figsize=(7,3))
ax[0][1].scatter(X_kpca[y==0, 0], X_kpca[y==0, 1], 
   color='red', marker='^', alpha=0.5)
ax[0][1].scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
   color='blue', marker='o', alpha=0.5)

ax[1][1].scatter(X_kpca[y==0, 0], np.zeros((50,1))+0.02, 
   color='red', marker='^', alpha=0.5)
ax[1][1].scatter(X_kpca[y==1, 0], np.zeros((50,1))-0.02,
   color='blue', marker='o', alpha=0.5)

ax[0][1].set_xlabel('PC1')
ax[0][1].set_ylabel('PC2')
ax[1][1].set_ylim([-1, 1])
ax[1][1].set_yticks([])
ax[1][1].set_xlabel('PC1')
ax[0][1].xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))
ax[1][1].xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))

plt.tight_layout()
plt.show()

#產生同心圓資料 
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)
plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue', marker='o', alpha=0.5)
plt.tight_layout()
# plt.savefig('./figures/circles_1.png', dpi=300)
plt.show()
#使用標準PCA 處裡資料無法線性分割
scikit_pca = PCA(n_components=2)
X_spca = scikit_pca.fit_transform(X)
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(7, 3))
ax[0][0].scatter(X_spca[y == 0, 0], X_spca[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0][0].scatter(X_spca[y == 1, 0], X_spca[y == 1, 1],
              color='blue', marker='o', alpha=0.5)

ax[1][0].scatter(X_spca[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1][0].scatter(X_spca[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)
ax[0][0].set_xlabel('PC1')
ax[0][0].set_ylabel('PC2')
ax[1][0].set_ylim([-1, 1])
ax[1][0].set_yticks([])
ax[1][0].set_xlabel('PC1')
#使用RBF PCA 處裡資料可線性分割
X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)
#fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
ax[0][1].scatter(X_kpca[y == 0, 0], X_kpca[y == 0, 1],
              color='red', marker='^', alpha=0.5)
ax[0][1].scatter(X_kpca[y == 1, 0], X_kpca[y == 1, 1],
              color='blue', marker='o', alpha=0.5)

ax[1][1].scatter(X_kpca[y == 0, 0], np.zeros((500, 1)) + 0.02,
              color='red', marker='^', alpha=0.5)
ax[1][1].scatter(X_kpca[y == 1, 0], np.zeros((500, 1)) - 0.02,
              color='blue', marker='o', alpha=0.5)
ax[0][1].set_xlabel('PC1')
ax[0][1].set_ylabel('PC2')
ax[1][1].set_ylim([-1, 1])
ax[1][1].set_yticks([])
ax[1][1].set_xlabel('PC1')
plt.tight_layout()
plt.show()

本文略過了核技巧及核矩陣(相似度矩陣)K的來源推導，有興趣請自行翻閱原書。

<參考資料>書名：Python機器學習，作者：Sebastian Raschka

https://github.com/rasbt/python-machine-learning-book

加入阿布拉機的3D列印與機器人的FB專頁
https://www.facebook.com/arbu00/

演算法(2)--使用Numpy.bincount來實作簡單的桶子排序法

機器學習(3)--適應線性神經元與梯度下降法(Adaline neuron and Gradient descent)
機器學習(4)--資料標準常態化與隨機梯度下降法( standardization & Stochastic Gradient descent)
機器學習(5)--邏輯斯迴歸，過度適合與正規化( Logistic regression，overfitting and regularization)
機器學習(6)--主成分分析(Principal component analysis，PCA)

2017年2月25日 星期六

機器學習(7)--利用核主成分分析(Kernel PCA)處理非線性對應

2017年2月25日星期六