19. Support Vector Machine for Portrait Classification#
19.1. Introduction#
The Support Vector Machine is a powerful and widely used classification algorithm. In this challenge, we will use the Support Vector Machine implementation provided by scikit-learn to complete a face image classification task.
19.2. Key Points#
Image data preprocessing
Support Vector Machine classification
First, we download the face dataset through the fetch_lfw_people function provided by scikit-learn. The dataset originally comes from the Labeled Faces in the Wild project.
from sklearn.datasets import fetch_lfw_people
# Load the dataset
faces = fetch_lfw_people(min_faces_per_person=60)
faces.target_names, faces.images.shape
(array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
'Gerhard Schroeder', 'Hugo Chavez', 'Junichiro Koizumi',
'Tony Blair'], dtype='<U17'),
(1348, 62, 47))
As you can see, we only use the portrait data of 8 famous people, for a total of 1,348 samples, where each portrait photo is 62 × 47 pixels. A summary of the attributes of the faces object is as follows:
Attribute | Description
---|---
images | 62x47 matrices recording the pixel values of the face images
data | The 62x47 matrix of each image flattened into a 1D array of length 2914
target_names | Names of the 8 people
target | Integer labels corresponding to the 8 people
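If you want to check these attributes yourself, here is a quick inspection sketch (assuming the faces object loaded above):
print(faces.images.shape)   # (1348, 62, 47): one 62x47 pixel matrix per sample
print(faces.data.shape)     # (1348, 2914): each image flattened into 2914 values
print(faces.target_names)   # names of the 8 people
print(faces.target[:10])    # integer labels indexing into target_names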
Next, let's use Matplotlib to preview this data.
Exercise 19.1
Challenge: Preview the first 5 portrait images in the dataset and present them as a subplot with 1 row and 5 columns.
Requirement: Display the name of the portrait corresponding to each image on the horizontal axis of each image.
from matplotlib import pyplot as plt
%matplotlib inline
### Code start ### (≈4 lines of code)
### Code end ###
Solution to Exercise 19.1
from matplotlib import pyplot as plt
%matplotlib inline
### Code start ### (≈4 lines of code)
fig, axes = plt.subplots(1, 5, figsize=(12, 6))
for i, image in enumerate(faces.images[:5]):
    axes[i].imshow(image)  # draw the portrait
    axes[i].set_xlabel(faces.target_names[faces.target[i]])  # show the person's name
### Code end ###
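Note that imshow() renders each image with Matplotlib's default colormap. If you prefer grayscale portraits, you can optionally pass a colormap (this tweak is not required by the exercise):
axes[i].imshow(image, cmap='gray')  # optional: render the portrait in grayscale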
Expected output
Since the images themselves are 2D arrays, they need to be processed before they can be used to train a model. Therefore, below we use faces.data, in which the 2D array of each portrait has been flattened into a 1D array.
faces.data.shape
As you can see, the shape of faces.data is \((1348, 2914)\), which means there are 1348 samples and each sample has 2914 features. These 2914 features are the flattened portrait images, since \(62*47 = 2914\).
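If you want to convince yourself of this relationship, a minimal check (assuming the faces object from above) is to compare a flattened image with the corresponding row of faces.data:
import numpy as np

# Each row of faces.data should equal the corresponding 62x47 image flattened row-wise
print(np.array_equal(faces.images[0].ravel(), faces.data[0]))  # expected: True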
Next, as usual, we need to split the dataset into a training set and a test set. Note, however, that there are only 1348 samples while each sample has 2914 features. In machine learning modeling, we should avoid situations where the number of features is much larger than the number of samples, because models trained on such data generally perform very poorly.
Therefore, we need to perform “dimensionality reduction” on the data features, which simply means reducing the number of features. Here we use the PCA dimensionality reduction method, which will be introduced in detail in a later experiment and will not be explained here.
from sklearn.decomposition import PCA
# Run directly to reduce the data features to 150
pca = PCA(n_components=150, whiten=True, random_state=42)
pca_data = pca.fit_transform(faces.data)
pca_data.shape
It can be seen that the shape of the data has changed from the previous \((1348, 2914)\) to \((1348, 150)\).
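As a quick sanity check on how much information the reduction keeps, you can sum the explained variance ratios of the fitted components (the exact value depends on the data):
# Fraction of the original variance captured by the 150 principal components
print(pca.explained_variance_ratio_.sum())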
Next, the training set and the test set can be split based on the dimension-reduced data.
Exercise 19.2
Challenge: Use train_test_split() to split the dataset into two parts: 80% (training set) and 20% (test set).
Specification: Name the training set features, test set features, training set target, and test set target X_train, X_test, y_train, and y_test respectively, and set the random seed to 42.
from sklearn.model_selection import train_test_split
### Code start ### (≈1 line of code)
### Code end ###
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Solution to Exercise 19.2
from sklearn.model_selection import train_test_split
### Code starts ### (≈1 line of code)
X_train, X_test, y_train, y_test = train_test_split(
pca_data, faces.target, test_size=0.2, random_state=42)
### Code ends ###
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Expected output
((1078, 150), (270, 150), (1078,), (270,))
Next, we use the SVM algorithm to build a model, then train and test it with the data split above.
Exercise 19.3
Challenge: Use the support vector machine classification method provided by scikit-learn to complete the modeling and obtain the accuracy result of the model on the test set.
Requirement: Set the parameters of the support vector machine classifier to C=10 and gamma=0.001, and leave the rest at their default values.
### Code start ### (≈4 lines of code)
model = None
### Code end ###
Solution to Exercise 19.3
### Code starts ### (≈4 lines of code)
from sklearn.svm import SVC
model = SVC(C=10, gamma=0.001)  # support vector classifier with the given parameters
model.fit(X_train, y_train)     # train on the training set
model.score(X_test, y_test)     # accuracy on the test set
### Code ends ###
Expected output
Any final accuracy greater than 0.8 is acceptable.
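In this challenge, the parameters C=10 and gamma=0.001 were given directly. In practice, such values are usually found with a hyperparameter search. Below is a minimal sketch using GridSearchCV; the parameter grid is an illustrative assumption, not part of the original challenge:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search a small grid of C and gamma values with 5-fold cross-validation
param_grid = {'C': [1, 5, 10, 50], 'gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)            # best parameter combination found
print(grid.score(X_test, y_test))   # test accuracy of the refitted best model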