31. Density Clustering to Mark Abnormal Shared Bicycles

31.1. Introduction

This challenge will examine the application of density clustering. We will attempt to use density clustering to visualize the location distribution of shared bicycles and, at the same time, use different parameters to mark the locations of abnormal shared bicycles.

31.2. Key Points

Determination of DBSCAN parameters
HDBSCAN clustering

Today, shared bicycles can be found on almost every street and alley, and they make short trips much easier for city residents. If you were the operator of a shared bicycle company, though, you might ask: where have all the bicycles that the company released into the city ended up?

The point of this question is not to satisfy curiosity. Tracking the distribution of shared bicycles lets the operator adjust its strategy in time. For example, if the density of bicycles in some locations is too high, some of them should be moved to areas where density is low but demand exists.

Therefore, in today’s challenge, the density clustering method will be used to track the distribution of shared bicycles.

We obtained a GPS scatter dataset of shared bicycles in a certain area of Beijing, and the name of this dataset is challenge-9-bike.csv. First, load and preview this dataset.

wget -nc https://cdn.aibydoing.com/aibydoing/files/challenge-9-bike.csv

                        

import pandas as pd
import numpy as np

df = pd.read_csv("challenge-9-bike.csv")
df.describe()

	lat	lon
count	3000.000000	3000.000000
mean	39.908308	116.474630
std	0.007702	0.018098
min	39.893939	116.434264
25%	39.902769	116.461276
50%	39.907888	116.477683
75%	39.914482	116.490274
max	39.923023	116.501467

Here, lat is short for latitude and lon is short for longitude. We can use Matplotlib to plot the distribution of shared bicycles in this area.

from matplotlib import pyplot as plt

%matplotlib inline

plt.figure(figsize=(15, 8))
plt.scatter(df["lat"], df["lon"], alpha=0.6)

<matplotlib.collections.PathCollection at 0x121cb4ac0>

../_images/7f7edd47cecfbc4f4aa3636e76df50a9dc6d9ea9f3e28abcd32386ef5a0848e2.png

Next, we try to use the DBSCAN density clustering algorithm to cluster the shared bicycles and see the distribution of high-density areas of shared bicycles. (It may not work, but it has no impact on the challenge)

As the experiment in the previous section showed, the two key parameters of the DBSCAN algorithm are the neighborhood radius eps and the density threshold MinPts (called min_samples in scikit-learn). So, what are appropriate values for these two parameters?
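As a reminder of what these two parameters control, the following minimal sketch counts how many points lie within eps of a given point and compares that count with MinPts. The helper names `count_neighbors` and `is_core_point` and the toy coordinates are made up for illustration. The count includes the point itself, which is how scikit-learn documents `min_samples`.

```python
import math


def count_neighbors(points, index, eps):
    """Count points within eps of points[index], including the point itself."""
    center = points[index]
    return sum(1 for p in points if math.dist(center, p) <= eps)


def is_core_point(points, index, eps, min_pts):
    """A point is a core point if its eps-neighborhood holds at least min_pts points."""
    return count_neighbors(points, index, eps) >= min_pts


# Toy 2D points: three close together and one far away (illustrative values)
toy_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (10.0, 10.0)]
core_flags = [
    is_core_point(toy_points, i, eps=1.0, min_pts=3)
    for i in range(len(toy_points))
]
print(core_flags)  # [True, True, True, False]
```

A larger eps or a smaller MinPts makes it easier for a point to qualify as a core point, so fewer points end up labeled as noise.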

Exercise 31.1

Challenge: Use the DBSCAN algorithm to complete the density clustering of the GPS scatter data of shared bicycles, and determine the eps and min_samples parameters.

Requirement: Treat an area with 10 bicycles within a radius of 100 meters as a high-density area.

Hint: Use latitude as the reference. As a rough estimate, a 1-degree change in latitude corresponds to a ground distance of about 100 km in this area.

from sklearn.cluster import DBSCAN

### Code start ### (≈ 2 lines of code)

### Code end ###
dbscan_c  # Output the clustering labels

Solution to Exercise 31.1

from sklearn.cluster import DBSCAN

### Code start ### (≈ 2 lines of code)
dbscan_m = DBSCAN(eps=0.001, min_samples=10)
dbscan_c = dbscan_m.fit_predict(df)
### Code end ###
dbscan_c  # Output the clustering labels

                        

Run the test

np.mean(dbscan_c)

                        

Expected output

6.977333333333333
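The value eps=0.001 used in the solution follows from the hint: if 1 degree of latitude is about 100 km, then 100 meters is about 0.001 degrees. The sketch below reproduces that arithmetic. The constant and function names are hypothetical, and the 100 km figure is the rough estimate from the hint (a degree of latitude is closer to 111 km). It also shows one simplification of clustering raw degrees: near the mean latitude of this dataset, a degree of longitude covers only about cos(latitude) of the distance of a degree of latitude, so the same eps spans fewer meters in the east-west direction.

```python
import math

KM_PER_DEGREE_LAT = 100.0  # rough estimate taken from the hint


def meters_to_degrees(radius_m, km_per_degree=KM_PER_DEGREE_LAT):
    """Convert a ground distance in meters to degrees of latitude."""
    return radius_m / (km_per_degree * 1000.0)


eps = meters_to_degrees(100)
print(eps)  # 0.001

# Mean latitude reported by df.describe() above
mean_lat = 39.908308
# Ratio between the length of 1 degree of longitude and 1 degree of latitude
lon_scale = math.cos(math.radians(mean_lat))
# Approximate east-west distance covered by eps, in meters (under the hint's scale)
east_west_m = 100 * lon_scale
print(round(lon_scale, 2), round(east_west_m))  # 0.77 77
```

For this challenge the difference does not matter much, since the goal is a quick initial estimate of the parameters.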

Exercise 31.2

Challenge: For the data after clustering above, redraw the scatter plot as required.

Requirement: The unclustered outliers are presented as blue data points with alpha=0.1, and the clustered data is colored by cluster with cmap='viridis'.

### Code start ### (≈ 4 - 8 lines of code)
plt.figure(figsize=(15, 8))
### Code end ###

                        

Solution to Exercise 31.2

### Code start ### (≈ 4 - 8 lines of code)
plt.figure(figsize=(15, 8))

# Attach the cluster label of each bicycle as a new column
df_c = pd.concat([df, pd.DataFrame(dbscan_c, columns=['clusters'])], axis=1)

# DBSCAN labels noise points (outliers) as -1
df_n = df_c[df_c['clusters'] != -1]  # clustered bicycles
df_o = df_c[df_c['clusters'] == -1]  # outliers

plt.scatter(df_n['lat'], df_n['lon'], c=df_n['clusters'], cmap='viridis')
plt.scatter(df_o['lat'], df_o['lon'], alpha=.1, c='b')
### Code end ###

                        

Expected output

https://cdn.aibydoing.com/aibydoing/images/document-uid214893labid6102timestamp1531806489365.png

The figure above shows how bicycle density is distributed across different areas.
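A short numeric summary of the labels can complement the scatter plot. The sketch below uses a hypothetical helper, `summarize_labels`, and toy labels rather than the real `dbscan_c`. It counts the clusters and the noise points, relying on the convention that DBSCAN in scikit-learn assigns the label -1 to noise.

```python
from collections import Counter


def summarize_labels(labels):
    """Return the number of clusters, the number of noise points, and the largest cluster."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)  # label -1 marks noise points
    largest = counts.most_common(1)[0] if counts else None
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "largest_cluster": largest,  # (label, size) or None
    }


# Toy labels for illustration only
toy_labels = [0, 0, 1, -1, 1, 1, -1]
summary = summarize_labels(toy_labels)
print(summary)
# {'n_clusters': 2, 'n_noise': 2, 'largest_cluster': (1, 3)}
```

Because the helper converts each label with `int`, it should accept the NumPy array returned by `fit_predict` as well, for example `summarize_labels(dbscan_c)`.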

Density clustering algorithms such as DBSCAN and HDBSCAN often do more than just cluster. Because they label points in low-density regions as noise, they are also commonly used to identify outliers. In this experiment, we can identify shared bicycles with abnormal locations by adjusting the parameters.

Exercise 31.3

Challenge: Cluster the data again and plot the outliers (bicycles that do not meet the condition of 2 shared bicycles within a radius of 100 meters) on a scatter plot.

Requirement: The unclustered outliers are presented as red data points, and the clustered data is colored by cluster with alpha=0.1 and cmap='viridis'.

### Code start ### (≈ 6 - 10 lines of code)
plt.figure(figsize=(15, 8))
### Code end ###

                        

Solution to Exercise 31.3

### Code start ### (≈ 6 - 10 lines of code)
plt.figure(figsize=(15, 8))

# Lower the density threshold: 2 bicycles within eps are enough to form a cluster
dbscan_m = DBSCAN(eps=0.001, min_samples=2)
dbscan_c = dbscan_m.fit_predict(df)

df_c = pd.concat([df, pd.DataFrame(dbscan_c, columns=['clusters'])], axis=1)

df_n = df_c[df_c['clusters'] != -1]  # clustered bicycles
df_o = df_c[df_c['clusters'] == -1]  # outliers (label -1)

plt.scatter(df_n['lat'], df_n['lon'], c=df_n['clusters'], cmap='viridis', alpha=.1)
plt.scatter(df_o['lat'], df_o['lon'], c='r')
### Code end ###

                        

Expected output

https://cdn.aibydoing.com/aibydoing/images/document-uid214893labid6102timestamp1531806489629.png
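To see why lowering min_samples leaves only the most isolated bicycles marked as outliers, the following sketch runs DBSCAN twice on a small set of made-up coordinates. The points are illustrative and not taken from challenge-9-bike.csv, and the helper `count_noise` is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy coordinates in degrees (made up for illustration)
dense_block = [
    (39.9000 + 0.0001 * i, 116.4700 + 0.0001 * j)
    for i in range(4)
    for j in range(3)
]  # 12 bicycles packed closely together
pair = [(39.9200, 116.4800), (39.9203, 116.4800)]  # 2 bicycles 0.0003 degrees apart
stranded = [(39.9500, 116.5000)]  # 1 isolated bicycle
points = np.array(dense_block + pair + stranded)


def count_noise(min_samples, eps=0.001):
    """Cluster the toy points and return (number of noise points, labels)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return int(np.sum(labels == -1)), labels


noise_strict, labels_strict = count_noise(min_samples=10)
noise_loose, labels_loose = count_noise(min_samples=2)
print(noise_strict, noise_loose)  # 3 1
```

With min_samples=10, the pair and the isolated bicycle are all labeled as noise. With min_samples=2, the pair forms its own small cluster and only the isolated bicycle keeps the label -1.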

This challenge focuses on how to quickly determine initial parameters for DBSCAN and how to use the algorithm to mark outliers. If you are interested, you can also try clustering with HDBSCAN yourself and compare the results of the two algorithms. Before doing so, you need to install the hdbscan module using the method described in the experiment.
