3 Clustering Analysis
Building on the earlier analysis of park-level and state-level tourism data, I cluster national parks on four key indicators. The results identify four distinct groups of national parks across the United States. The analysis has limitations: the data visualization could be improved, and additional indicators, such as park type and geographic region, could be incorporated.
3.1 Data Pre-processing
I load the processed data from the previous analysis and merge it with the centroid geometry of each national park.
 | Total outdoor recreation value added (thousands of dollars) | Total outdoor recreation employment | Total outdoor recreation compensation (thousands of dollars) | Accommodation and food services value added | Accommodation and food services employment | Accommodation and food services compensation | Vehicle Trips | state_name | state_abbv | Recreation Visits | PARKNAME | UNIT_TYPE | geometry | STATE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 81495632.0 | 545448.0 | 38141606.0 | 11538652.0 | 118110.0 | 6071013.0 | 73650366.0 | California | CA | 14953882 | Golden Gate | National Recreation Area | POINT (-122.68760 37.94684) | CA |
1 | 16173790.0 | 145433.0 | 7721857.0 | 2111931.0 | 30082.0 | 1072286.0 | 178444612.0 | North Carolina | NC | 13297647 | Great Smoky Mountains | National Park | POINT (-83.49810 35.62216) | NC |
2 | 14504598.0 | 122798.0 | 7349333.0 | 1462897.0 | 17556.0 | 781365.0 | 424373204.0 | New Jersey | NJ | 8705329 | Gateway | National Recreation Area | POINT (-73.85713 40.59855) | NJ |
3 | 57803194.0 | 469357.0 | 28734838.0 | 10766934.0 | 111819.0 | 5153873.0 | 139466247.0 | Florida | FL | 8277857 | Gulf Islands | National Seashore | POINT (-87.01818 30.35539) | FL |
4 | 1489475.0 | 12470.0 | 901538.0 | 578651.0 | 4798.0 | 291929.0 | 236241049.0 | District of Columbia | DC | 8099148 | Lincoln | National Memorial | POINT (-77.05021 38.88928) | DC |
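The merge step can be sketched roughly as follows. This is a minimal illustration, not the code behind the table above: the join key (`PARKNAME`), the frame names `processed` and `centroids`, and the use of plain WKT strings in place of real geometry objects are all assumptions. The few values shown are copied from the first two rows of the table.

```python
import pandas as pd

# Hypothetical "processed" table: one row per park with state context and visits.
processed = pd.DataFrame({
    "PARKNAME": ["Golden Gate", "Great Smoky Mountains"],
    "state_abbv": ["CA", "NC"],
    "Recreation Visits": [14953882, 13297647],
})

# Hypothetical centroid table; geometry is kept as WKT text for simplicity.
centroids = pd.DataFrame({
    "PARKNAME": ["Golden Gate", "Great Smoky Mountains"],
    "UNIT_TYPE": ["National Recreation Area", "National Park"],
    "geometry": ["POINT (-122.68760 37.94684)", "POINT (-83.49810 35.62216)"],
    "STATE": ["CA", "NC"],
})

# Join on the assumed key; validate raises if a park name is duplicated
# on either side, which would silently multiply rows otherwise.
merged = processed.merge(centroids, on="PARKNAME", how="inner", validate="one_to_one")
print(merged.head())
```

Using `validate` is a cheap guard here, since a duplicated park name would otherwise inflate the row count without any warning.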
3.2 Determining the Appropriate Clusters
Interstate vehicle-trip data indicate where visitors come from, which often correlates with a park's proximity to population centers or major transportation hubs. Clustering can capture these regional tourism dynamics by grouping parks with similar visitor flows. State-level outdoor recreation data, particularly for accommodation and food services, add a broader view of the trends, preferences, and activity levels that influence park visitation. States with a strong outdoor recreation culture may contribute significantly to national park visitation, so these data are important inputs. Combining them with recreation visits for each national park gives the four indicators used for clustering: total outdoor recreation value added, accommodation and food services value added, vehicle trips, and recreation visits.
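Because the four indicators differ by orders of magnitude, they would normally be put on a common scale before clustering. The array printed later in this subsection looks like standardized values, so the sketch below assumes scikit-learn's `StandardScaler` was used; that choice is an inference, not something the text states. The three input rows are taken from the table in 3.1, so the output will not match the printed array, which was scaled over all parks.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: outdoor recreation value added, accommodation and food services
# value added, vehicle trips, recreation visits (first three parks only).
X = np.array([
    [81495632.0, 11538652.0, 73650366.0, 14953882.0],
    [16173790.0, 2111931.0, 178444612.0, 13297647.0],
    [14504598.0, 1462897.0, 424373204.0, 8705329.0],
])

# Subtract each column's mean and divide by its standard deviation,
# so that no single indicator dominates the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```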
The Elbow Method, together with the kneed package, locates the “knee” point quantitatively and indicates four as the appropriate number of clusters.
array([[ 2.7357308 , 2.60922961, -0.66826553, 8.42932276],
[-0.18186611, -0.24451187, 0.1986735 , 7.44326458],
[-0.25642049, -0.44099326, 2.23318497, 4.70917628],
[ 1.67750921, 2.37560822, -0.12378567, 4.454676 ],
[-0.83774033, -0.70868014, 0.67681032, 4.34827958]])
4
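The elbow search can be illustrated with a small sketch. It runs on synthetic data rather than the park indicators, and it replaces kneed with a simplified heuristic (the point farthest below the straight line joining the ends of the normalized inertia curve), so it is a stand-in for the idea and not the package's actual algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption, not the park data): four well-separated groups.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

# Fit K-Means for k = 1..10 and record the within-cluster sum of squares.
ks = np.arange(1, 11)
inertias = np.array([
    KMeans(n_clusters=int(k), n_init=10, random_state=0).fit(X).inertia_
    for k in ks
])

# Rescale both axes to [0, 1] so the two are comparable.
x_norm = (ks - ks.min()) / (ks.max() - ks.min())
y_norm = (inertias - inertias.min()) / (inertias.max() - inertias.min())

# Assuming inertia falls as k grows, the curve runs from (0, 1) to (1, 0);
# the knee is taken as the point with the largest gap below that chord.
gap = 1.0 - x_norm - y_norm
knee = int(ks[np.argmax(gap)])
print(knee)
```

In practice, the `KneeLocator` class from kneed handles noisy curves more carefully than this heuristic, which is why a dedicated package is a sensible choice for real data.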
3.3 Perform the K-Means Fit
 | Total outdoor recreation value added (thousands of dollars) | Total outdoor recreation employment | Total outdoor recreation compensation (thousands of dollars) | Accommodation and food services value added | Accommodation and food services employment | Accommodation and food services compensation | Vehicle Trips | state_name | state_abbv | Recreation Visits | PARKNAME | UNIT_TYPE | geometry | STATE | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 81495632.0 | 545448.0 | 38141606.0 | 11538652.0 | 118110.0 | 6071013.0 | 73650366.0 | California | CA | 14953882 | Golden Gate | National Recreation Area | POINT (-122.68760 37.94684) | CA | 3 |
1 | 16173790.0 | 145433.0 | 7721857.0 | 2111931.0 | 30082.0 | 1072286.0 | 178444612.0 | North Carolina | NC | 13297647 | Great Smoky Mountains | National Park | POINT (-83.49810 35.62216) | NC | 3 |
2 | 14504598.0 | 122798.0 | 7349333.0 | 1462897.0 | 17556.0 | 781365.0 | 424373204.0 | New Jersey | NJ | 8705329 | Gateway | National Recreation Area | POINT (-73.85713 40.59855) | NJ | 3 |
3 | 57803194.0 | 469357.0 | 28734838.0 | 10766934.0 | 111819.0 | 5153873.0 | 139466247.0 | Florida | FL | 8277857 | Gulf Islands | National Seashore | POINT (-87.01818 30.35539) | FL | 3 |
4 | 1489475.0 | 12470.0 | 901538.0 | 578651.0 | 4798.0 | 291929.0 | 236241049.0 | District of Columbia | DC | 8099148 | Lincoln | National Memorial | POINT (-77.05021 38.88928) | DC | 3 |
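A minimal sketch of the fit itself, assuming scikit-learn's `KMeans` on standardized features with four clusters. The data are random placeholders, and the short column names and the `random_state` value are hypothetical, chosen only to keep the example reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder data: 40 hypothetical parks with four skewed indicators.
rng = np.random.default_rng(0)
features = [
    "rec_value_added",
    "accom_food_value_added",
    "vehicle_trips",
    "recreation_visits",
]
parks = pd.DataFrame(
    rng.lognormal(mean=10.0, sigma=1.0, size=(40, 4)), columns=features
)

# Standardize, fit K-Means with the chosen k, and store each park's cluster.
X_scaled = StandardScaler().fit_transform(parks[features])
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
parks["label"] = kmeans.fit_predict(X_scaled)
print(parks["label"].value_counts())
```

Cluster numbers are arbitrary identifiers: rerunning with a different seed can permute them, so they should be interpreted through the per-cluster averages rather than their numeric order.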
3.4 Calculate Average Features per Cluster
 | label | size |
---|---|---|
0 | 0 | 47 |
1 | 1 | 163 |
2 | 2 | 106 |
3 | 3 | 16 |
 | label | Total outdoor recreation value added (thousands of dollars) | Accommodation and food services value added | Vehicle Trips | Recreation Visits |
---|---|---|---|---|---|
1 | 1 | 8.444411e+06 | 1.234140e+06 | 7.365371e+07 | 3.818086e+05 |
2 | 2 | 1.712150e+07 | 2.608128e+06 | 3.025375e+08 | 5.782773e+05 |
0 | 0 | 6.935583e+07 | 9.527813e+06 | 1.063060e+08 | 7.368459e+05 |
3 | 3 | 1.690561e+07 | 2.742586e+06 | 1.374757e+08 | 6.622410e+06 |
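The two summary tables above are the kind of output a pandas group-by produces. The following sketch shows one way to compute cluster sizes and per-cluster means; the tiny frame and its values are made up for illustration, and sorting by `Recreation Visits` is an assumption based on the row order of the second table.

```python
import pandas as pd

# Made-up labelled data: five parks in two clusters.
df = pd.DataFrame({
    "label": [0, 0, 1, 1, 1],
    "Vehicle Trips": [10.0, 30.0, 1.0, 2.0, 3.0],
    "Recreation Visits": [100.0, 300.0, 10.0, 20.0, 30.0],
})

# Number of parks in each cluster.
sizes = df.groupby("label").size().reset_index(name="size")

# Mean of each indicator per cluster, ordered by average visitation.
means = (
    df.groupby("label")[["Vehicle Trips", "Recreation Visits"]]
    .mean()
    .reset_index()
    .sort_values("Recreation Visits")
)
print(sizes)
print(means)
```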
3.5 Coloring National Parks by their Cluster Label
The four clusters of national parks are shown on the map. However, limitations remain: the visualization could be improved, and additional indicators, such as park type and geographic region, could be incorporated to enhance the analysis.
 | Total outdoor recreation value added (thousands of dollars) | Total outdoor recreation employment | Total outdoor recreation compensation (thousands of dollars) | Accommodation and food services value added | Accommodation and food services employment | Accommodation and food services compensation | Vehicle Trips | state_name | state_abbv | Recreation Visits | PARKNAME | UNIT_TYPE | geometry | STATE | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 81495632.0 | 545448.0 | 38141606.0 | 11538652.0 | 118110.0 | 6071013.0 | 73650366.0 | California | CA | 14953882 | Golden Gate | National Recreation Area | POINT (-1962854.536 -513963.852) | CA | 3 |
1 | 16173790.0 | 145433.0 | 7721857.0 | 2111931.0 | 30082.0 | 1072286.0 | 178444612.0 | North Carolina | NC | 13297647 | Great Smoky Mountains | National Park | POINT (1484845.295 -895614.898) | NC | 3 |
2 | 14504598.0 | 122798.0 | 7349333.0 | 1462897.0 | 17556.0 | 781365.0 | 424373204.0 | New Jersey | NJ | 8705329 | Gateway | National Recreation Area | POINT (2162945.286 -141062.262) | NJ | 3 |
3 | 57803194.0 | 469357.0 | 28734838.0 | 10766934.0 | 111819.0 | 5153873.0 | 139466247.0 | Florida | FL | 8277857 | Gulf Islands | National Seashore | POINT (1250105.164 -1529879.132) | FL | 3 |
4 | 1489475.0 | 12470.0 | 901538.0 | 578651.0 | 4798.0 | 291929.0 | 236241049.0 | District of Columbia | DC | 8099148 | Lincoln | National Memorial | POINT (1957863.380 -405668.953) | DC | 3 |