
19. Support Vector Machine for Portrait Classification

19.1. Introduction

The Support Vector Machine is a powerful and widely used classification algorithm. In this challenge, we will use the support vector machine implementation provided by scikit-learn to complete a face image classification task.

19.2. Key Points

  • Image data preprocessing

  • Support Vector Machine classification

First, we download the face dataset through fetch_lfw_people provided by scikit-learn. The data originally comes from the Labeled Faces in the Wild project.

from sklearn.datasets import fetch_lfw_people

# Load the dataset
faces = fetch_lfw_people(min_faces_per_person=60)
faces.target_names, faces.images.shape
(array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
        'Gerhard Schroeder', 'Hugo Chavez', 'Junichiro Koizumi',
        'Tony Blair'], dtype='<U17'),
 (1348, 62, 47))

As can be seen, we use the portraits of only 8 public figures, 1,348 samples in total, and each portrait photo is 62 × 47 pixels. The attributes of faces are summarized as follows:

| Attribute | Description |
| :-- | :-- |
| faces.images | 62×47 matrices recording the pixel values of the face images |
| faces.data | Each 62×47 image matrix flattened into a row vector |
| faces.target_names | Names of the 8 people |
| faces.target | Integer label of each portrait, indexing the 8 names |
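As a quick sanity check (a minimal sketch, assuming faces was loaded as above), each row of faces.data can be reshaped back into the matching 62×47 image in faces.images:

import numpy as np

# Assumption: 'faces' was loaded by fetch_lfw_people as shown earlier.
# Each row of faces.data is the corresponding 62x47 image, flattened.
assert np.allclose(faces.data[0].reshape(62, 47), faces.images[0])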

Next, we use Matplotlib to preview the data.

Exercise 19.1

Challenge: Preview the first 5 portrait images in the dataset and present them as a subplot with 1 row and 5 columns.

Requirement: Display the name of the portrait corresponding to each image on the horizontal axis of each image.

from matplotlib import pyplot as plt

%matplotlib inline

## START CODE ### (≈ 4 lines of code)

## END CODE ###

Expected output

(A 1 × 5 grid of the first five portraits, each labeled with the corresponding name.)
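One possible solution sketch (not necessarily the official answer; it assumes faces was loaded as above):

fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for i, ax in enumerate(axes):
    ax.imshow(faces.images[i], cmap="gray")             # draw the i-th 62x47 portrait
    ax.set_xlabel(faces.target_names[faces.target[i]])  # name on the horizontal axis
    ax.set_xticks([])                                   # hide tick marks for a cleaner preview
    ax.set_yticks([])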

Since the images themselves are 2D arrays, they need to be processed before they can be used to train the model. Therefore, below we use faces.data, in which the 2D array for each portrait has already been flattened into a 1D array.

faces.data.shape

It can be seen that the shape of faces.data is \((1348, 2914)\): there are 1,348 samples, each with 2,914 features. These 2,914 features are simply the \(62 \times 47 = 2914\) pixel values of each portrait image flattened into a vector.

Next, as usual, we need to split the dataset into a training set and a test set. Note, however, that there are only 1,348 samples while each sample has 2,914 features. In machine learning modeling, we should avoid situations where the number of features far exceeds the number of samples, since models trained on such data generally perform very poorly.

Therefore, we need to perform "dimensionality reduction" on the data features, that is, reduce the number of features. Here we use the PCA method, which will be introduced in detail in subsequent experiments and is not explained further here.

from sklearn.decomposition import PCA

# Run directly: reduce the data to 150 features
pca = PCA(n_components=150, whiten=True, random_state=42)
pca_data = pca.fit_transform(faces.data)
pca_data.shape

It can be seen that the shape of the data has changed from the previous \((1348, 2914)\) to \((1348, 150)\).

Next, the training set and the test set can be split based on the dimension-reduced data.

Exercise 19.2

Challenge: Use train_test_split() to split the dataset into two parts: 80% (training set) and 20% (test set).

Specification: The training set features, test set features, training set target, and test set target are X_train, X_test, y_train, and y_test respectively, and the random seed is set to 42.

from sklearn.model_selection import train_test_split

## START CODE ### (≈ 1 line of code)

## END CODE ###

X_train.shape, X_test.shape, y_train.shape, y_test.shape

Expected output

((1078, 150), (270, 150), (1078,), (270,))
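One way to satisfy the specification (a sketch, assuming pca_data and faces.target from the steps above):

X_train, X_test, y_train, y_test = train_test_split(
    pca_data, faces.target, test_size=0.2, random_state=42)  # 80/20 split, seed 42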

Next, we use the SVM algorithm to build a model, then train and evaluate it on the data split above.

Exercise 19.3

Challenge: Use the support vector machine classification method provided by scikit-learn to complete the modeling and obtain the accuracy result of the model on the test set.

Requirement: The parameters of the support vector machine classifier are C = 10 and gamma = 0.001, and the rest are default parameters.

## START CODE ### (≈ 4 lines of code)
model = None
## END CODE ###

Expected output

Any final accuracy greater than 0.8 is acceptable.
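One possible solution sketch (assuming the split above; scikit-learn's SVC uses the RBF kernel by default):

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

model = SVC(C=10, gamma=0.001)                 # RBF kernel by default
model.fit(X_train, y_train)                    # train on the PCA-reduced features
accuracy_score(y_test, model.predict(X_test))  # accuracy on the test set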
