
# Principal component analysis on a matrix using Python

Machine learning algorithms can be time consuming when working with large datasets. To overcome this, dimensionality reduction techniques are used. When the input dimension is high, principal component analysis (PCA) can be applied to speed up our models: it projects the data into a lower-dimensional space while retaining the characteristics of the original data.

In this article, we will cover the basics of principal component analysis (PCA) on matrices, with a Python implementation. In addition, we apply the technique together with one of the classification algorithms.

### Dataset

The dataset can be downloaded from the following link. It gives details of breast cancer patients and has 569 rows with 32 columns.
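If the CSV file is not at hand, the same Wisconsin breast cancer data also ships with scikit-learn. A minimal sketch of loading it that way (note scikit-learn encodes the target as 0 = malignant, 1 = benign, the reverse of the `M`/`B` mapping used below):

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the Wisconsin breast cancer data bundled with scikit-learn
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["diagnosis"] = cancer.target  # 0 = malignant, 1 = benign in scikit-learn's encoding

print(df.shape)  # (569, 31): 569 rows, 30 features plus the diagnosis column
```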

Let’s get started. Import all the libraries required for this project.

```
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
```

```
dataset = pd.read_csv('cancerdataset.csv')
dataset["diagnosis"] = dataset["diagnosis"].map({'M': 1, 'B': 0})
data = dataset.iloc[:, 0:-1]
```

We need to store the independent and dependent variables using the iloc method.

```
X = data.iloc[:, 2:].values
y = data.iloc[:, 1].values
```

Split the data into training and test sets in an 80:20 ratio.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

### PCA standardization

PCA can only be applied to numerical data, so it is important to convert all features to a numerical format first. We also need to standardize the data so that features measured in different units are put on the same scale.

```
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

### Covariance matrix

From the standardized data, we build the covariance matrix. It gives the covariance between each pair of features in our original dataset. A negative value in the result below means that the corresponding features are inversely related.

```
mean_vec = np.mean(X_train, axis=0)
cov_mat = (X_train - mean_vec).T.dot(X_train - mean_vec) / (X_train.shape[0] - 1)
mean_vect = np.mean(X_test, axis=0)
cov_matt = (X_test - mean_vect).T.dot(X_test - mean_vect) / (X_test.shape[0] - 1)
print(cov_mat)
```

### Eigendecomposition of the covariance matrix

Each eigenvector has an associated eigenvalue, and the sum of the eigenvalues represents the total variance in the dataset. The eigenvector with the largest eigenvalue points in the direction of maximum variance, while eigenvectors with low eigenvalues capture the least variation in the dataset. Those components should be dropped.

```
cov_mat = np.cov(X_train.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
cov_matt = np.cov(X_test.T)
eig_valst, eig_vecst = np.linalg.eig(cov_matt)
print(eig_vals)
print(eig_vecs)
```
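The eigenvalues come back unsorted, so before dropping anything we should rank the eigenvectors by the variance they capture. A self-contained sketch of that step, using synthetic standardized data as a stand-in for `X_train` (note `np.linalg.eigh` is used here, which is the recommended routine for symmetric matrices such as a covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))  # stand-in for the standardized training data

cov = np.cov(X.T)
# eigh is specialized for symmetric matrices; eigenvalues come back in ascending order
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Reorder so the eigenvector capturing the most variance comes first
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Fraction of the total variance explained by each component
explained = eig_vals / eig_vals.sum()
print(explained)
```

The components at the tail of `explained` contribute the least variance and are the ones to discard.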

We need to specify how many components we want to keep. Here we reduce the original feature space down to 2 components; the first and second principal components capture the most variance in the original dataset.

```
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
X_train.shape
```
`pca.components_`

In this matrix, each column corresponds to a feature of the original data and each row to a principal component.
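To check how much of the original variance the two kept components retain, `PCA` also exposes `explained_variance_ratio_`. A self-contained sketch on synthetic data standing in for the standardized training set:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))  # stand-in for the standardized features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance share of PC1 and PC2
```

If the two ratios sum to a value that is too low for the task, more components can be kept by raising `n_components`.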

### Fitting a decision tree classifier to the training set

Since we are solving a classification problem, we can use a decision tree classifier for model prediction.

```
from sklearn.tree import DecisionTreeClassifier
# Create the decision tree classifier object
clf = DecisionTreeClassifier()
# Train the decision tree classifier
clf = clf.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = clf.predict(X_test)
```

### Algorithm evaluation

For classification tasks, we will use a confusion matrix to verify the correctness of our machine learning model.

```
confusion = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion
```
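The same predictions can also be summarized with scikit-learn's metrics. A sketch using short hypothetical label arrays standing in for `y_test` and `y_pred`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels (1 = malignant, 0 = benign)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat  = np.array([1, 0, 0, 1, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_hat)
print(cm)                          # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_hat))  # 0.75: 6 of 8 labels match
```

The diagonal of the confusion matrix counts correct predictions; accuracy is that diagonal sum divided by the total number of samples.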

### Plotting the training set

```
from matplotlib.colors import ListedColormap
X1, y1 = X_train, y_train
a, b = np.meshgrid(np.arange(start=X1[:, 0].min() - 1,
                             stop=X1[:, 0].max() + 1, step=0.01),
                   np.arange(start=X1[:, 1].min() - 1,
                             stop=X1[:, 1].max() + 1, step=0.01))
plt.contourf(a, b, clf.predict(np.array([a.ravel(),
                                         b.ravel()]).T).reshape(a.shape),
             alpha=0.75, cmap=ListedColormap(('white',)))
plt.xlim(a.min(), a.max())
plt.ylim(b.min(), b.max())
for i, j in enumerate(np.unique(y1)):
    plt.scatter(X1[y1 == j, 0], X1[y1 == j, 1],
                c=ListedColormap(('red', 'blue'))(i), label=j)
plt.title('Decision Tree')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
```

### Final thoughts

In the article above, we explained how PCA is used for dimensionality reduction on a large dataset. In addition, we explored concepts such as the covariance matrix and eigendecomposition to compute the principal components. Hope this article is helpful to you.
