knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = FALSE, fig.show = "asis",
    fig.align = "center") 
options(scipen=999)

# Load some libraries
library(tidyverse)
library(tidycensus)
library(stringr)
library(dplyr)
library(lubridate)
library(sf)
library(stargazer)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(corrr)      
library(kableExtra)
library(jtools)     
library(ggstance) 
library(ggpubr)    
library(broom.mixed) 
library(glue)
library(classInt)
library(viridis)
library(patchwork)
library(corrplot)


source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- colorRampPalette(c("#6D9EC1", "#E46726"))(5)

1 Introduction

An improved prediction model for home prices helps stakeholders and government agencies in making better housing policy and marketing decisions. For Zillow, one of the leading real estate companies in United States, developing an accurate, generalizable model for home prices is crucial to providing effective services to customers and enhancing the company’s competitiveness in the industry. However, this is a challenging task as numerous factors affect home prices, and no single model could cover all of them. This study seeks to create a more comprehensive home price model for Philadelphia, aiming to increase both accuracy and generalizability to benefit homebuyers, investors, the company and policymakers.

To achieve this, we applied the Hedonic Pricing Model, which accounts for internal characteristics, amenities/public services and spatial structure to determine property prices. Ordinary Least Square (OLS) regression forms the core of the model, following data cleaning, filtering and reclassification. Tests are conducted to evaluate the accuracy and generalizability of the model.

Overall, our model takes a step forward by including local characteristics like citywide spatial structures and neighborhood contexts, effectively predicts 76% of the variation in housing prices, achieves high accuracy for sales priced below $1,000,000, and controls mean absolute percentage error (MAPE) to about 36% without introducing more complex algorithms. However, the model shows less consistency and slight spatial autocorrelation, suggesting unaccounted factors. At the urban scale, North and West Philadelphia demonstrate poor predictions, indicating less generalizability towards low-priced neighborhoods and low-income tracts. Our model could be applied for predicting mid-range housing prices. Additional socioeconomic factors and non-linear regression techniques could be explored to develop the model.

2 Data

Before modeling, we processed the data through data gathering from open portals, exploratory data analysis, and feature engineering.

2.1 Data Collection

We reviewed previous studies to identify possible explanatory variables. Following the Hedonic Pricing Model, we sourced internal characteristics from the original dataset, gathered data on amenities (access to various spaces and services) through Open Data Philly portal, and accessed census data to grasp local contexts. Both continuous and categorial data are incorporated, and we chose the average distance to the 5 nearest crimes as an indicator of crime rate after comparing various metrics.(Zaki et al. 2022; Lin et al. 2019; Li et al. 2021; Aziz et al. 2023)

Data Sources: Philly neighborhoods, housing sale prices (studentData), crime incidents, hydrology, food retail access, zoning, census tables

# philly$crimes.Buffer <- philly %>% 
#     st_buffer(660) %>% 
#     aggregate(mutate(phillyCrimes, counter = 1),., sum) %>%
#     pull(counter)
philly <-
  philly %>% 
    mutate(
      # crime_nn1 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 1),
      # 
      # crime_nn2 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 2), 
      # 
      # crime_nn3 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 3), 
      # 
      # crime_nn4 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 4), 
      
      crime_nn5 = nn_function(st_coordinates(philly), 
                              st_coordinates(phillyCrimes), k = 5))

# city_hall <- metro[metro$Station == 'City Hall', 6]
# school_elem <- school %>%
#   filter(str_detect(GRADE_LEVEL, "ELEMENTARY"))
# 
# census <- census %>%
#   mutate(across(3:8, ~ as.numeric(as.character(.)))) %>%
#   mutate(NAMELSAD = str_extract(NAME, "^[^;]+")) %>%
#   select(NAMELSAD, 3:8)
# 
# calculate_nearest_distance <- function(set_points, other_layer) {
#   nearest_idx <- st_nearest_feature(set_points, other_layer)
#   st_distance(set_points, other_layer[nearest_idx, ], by_element = TRUE) %>% as.numeric()
# }
# set2 <- studentData[,74:76] %>%
#   mutate(distance_to_city_hall = st_distance(., city_hall) %>% as.numeric()) %>%
#   mutate(
#     distance_to_nearest_metro = calculate_nearest_distance(geometry, metro),
#     distance_to_nearest_hospital = calculate_nearest_distance(geometry, hospital),
#     distance_to_nearest_school = calculate_nearest_distance(geometry, school_elem),
#     distance_to_nearest_park = calculate_nearest_distance(geometry, park),
#     distance_to_nearest_water = calculate_nearest_distance(geometry, water)
#   )
# set2 <- set %>%
#   st_join(census_tract["NAMELSAD"]) %>%
#   left_join(census, by = "NAMELSAD") %>% 
#   st_join(retail[, c("LPSS_PER1000", "HPSS_PER1000")])

set <- read.csv("C:/Data/Fall 2024/CPLN 5920/Code/PPA/Predicting Housing Prices/data/set.csv")
philly_set <- merge(philly, set, by = "musaID", all.x = FALSE, all.y = FALSE, sort = FALSE) %>%
  dplyr::select(-toPredict.y) %>%
  rename(toPredict = toPredict.x) %>%
  mutate(dist1 = ifelse(distance_to_city_hall <= 20000, distance_to_city_hall, 20000))
philly <- philly_set %>% select(-distance_to_nearest_metro, -distance_to_nearest_hospital, -distance_to_nearest_school, -distance_to_nearest_park, -NAMELSAD, -Percent_Bachelor.s.degree.or.higher, -Percent_Management..business..science..and.arts.occupations, -Percent_Different.state, -Percent_With.private.health.insurance, -HPSS_PER1000)

2.2 Exploratory Data Analysis

model_philly <- philly %>%
  filter(toPredict == "MODELLING")

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey") +
  geom_sf(data = model_philly, aes(colour = q5(sale_price)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels = qBr(model_philly, "sale_price"),
                   name="Quintile Breaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  labs(title="Sale Price",
       subtitle = "Philadelphia, 2022",
       caption = "Figure 2.1") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

Sale price of homes is the dependent variable to be predicted, and Figure 2.1 showed its spatial distribution across Philadelphia. Higher prices cluster in center city and the far northeast and northwest suburbs, while lower prices concentrate in west Philadelphia and north Philadelphia. This concentric pattern reflects how the city has developed, where the wealthier residents moved to the suburbs, leaving the middle ring less developed.

stat_philly <- philly %>%
  dplyr::select(-musaID, -sale_price)

stat_philly_data <- st_drop_geometry(stat_philly)

descriptions$N <- sapply(descriptions$Field.Name, function(field) {
  sum(!is.na(stat_philly_data[[field]]))
})

selected_vars <- descriptions$Field.Name[descriptions$Variable.Type %in% c("Categorical*", "Continuous")]

descriptions <- descriptions %>% 
  rowwise() %>%
  mutate(
    Mean = ifelse(Field.Name %in% selected_vars, round(mean(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
    StDev = ifelse(Field.Name %in% selected_vars, round(sd(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
    Min = ifelse(Field.Name %in% selected_vars, round(min(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
    Max = ifelse(Field.Name %in% selected_vars, round(max(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA)
  ) %>%
  ungroup()

descriptions %>%
  knitr::kable(caption = "<b>Table2.1 Summary Statistics</b>", align = "ccclccccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover")) %>%
  scroll_box(height = "300px", width = "800px")
Table2.1 Summary Statistics
Category Field.Name Variable.Type Description N Mean StDev Min Max
Internal building_code_description_new Categorical Building style. 23881 NA NA NA NA
Internal central_air Categorical “Y” indicates there is central air. 23881 NA NA NA NA
Internal general_construction Categorical General construction. 23881 NA NA NA NA
Internal quality_grade Categorical Quality grade related to building workmanship and materials. 23881 NA NA NA NA
Internal separate_utilities Categorical Whether heater, hot water tank, electrical service, and gas service are central, part separate, or separate. 23881 NA NA NA NA
Internal topography Categorical Above street level, below street level, flood plain, rocky, or other. 23881 NA NA NA NA
Internal zoning Categorical A code identifying the legal uses permitted on the property. 23881 NA NA NA NA
Internal exterior_condition Categorical* Indicates the observed exterior condition of the property. 23880 3.764 0.885 1.000 7.000
Internal fireplaces Categorical* Indicates the number of fireplaces. 23673 0.048 0.291 0.000 5.000
Internal garage_spaces Categorical* Indicates the number of garage spaces. 23457 0.357 0.705 0.000 72.000
Internal interior_condition Categorical* Represents the overall condition of the interior. 23873 3.647 0.890 0.000 7.000
Internal number_of_bathrooms Categorical* Number of bathrooms. 23622 1.072 0.676 0.000 8.000
Internal number_of_bedrooms Categorical* Number of bedrooms. 23865 2.583 1.266 0.000 31.000
Internal frontage Continuous The width of the lot where it abuts the principal street. 23751 19.653 19.419 0.000 1307.000
Internal total_livable_area Continuous Total livable area. 23881 1333.633 548.834 0.000 15246.000
Amenities crime_nn5 Continuous Average distance to the 5 nearest crimes (in ft) from the site. 23881 142.530 73.134 18.440 1772.141
Amenities dist1 Continuous Distance to city hall (in ft) from the site if less than 20000 ft, otherwise 20000 ft. 23881 16519.527 5196.091 1400.739 20000.000
Amenities distance_to_city_hall Continuous Distance to city hall (in ft) from the site. 23881 27504.777 17484.769 1400.739 79657.714
Amenities distance_to_nearest_water Continuous Distance to the nearest water body (in ft) from the site. 23881 3436.259 2039.120 19.850 9589.470
Amenities LPSS_PER1000 Continuous Number of low-produce supply stores per 1000 people. 23879 28.196 24.076 0.000 224.615
Spatial NAME Categorical Name of the neighborhood. 23881 NA NA NA NA
Spatial Estimate_Mean.family.income..dollars. Continuous Mean family income (in dollars) in the census tract where the site is located. 23881 103220.610 61176.718 25294.000 427084.000
Spatial Percent_White.alone Continuous Percentage of white (non-Hispanic) population in the census tract where the site is located. 23881 37.359 30.115 0.000 95.400

*Note: Transformed from continuous into categorical in the regression model.

The Hedonic Pricing Model divides the total cost of property into separate costs for each characteristic according to different subgroups.(Mayor et al. 2009; Scotchmer 1985) In this study, twenty-two variables are selected for the model, as summarized in Table 2.1. These include internal features like total livable area or exterior condition, amenities like crime and distance to nearest water body, as well as spatial contexts like mean family income in the census tract. The dataset’s average home has 1,333 square feet of livable space, 2.58 bedrooms and 1.07 bathrooms, with its cost at about $270,000.

numericVars <- 
  select_if(st_drop_geometry(model_philly), is.numeric) %>% na.omit() %>%
  dplyr::select(-musaID)

cor_matrix <- cor(numericVars, use = "complete.obs")

corrplot(
  cor_matrix,
  method = "color",                             
  col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200),  
  type = "lower",                               
  tl.pos = "l",                                 
  tl.col = "black",                      
  tl.cex = 0.7,                           
  mar = c(0, 0, 3, 0)                           
)


title(main = "Correlation across numeric variables", cex.main = 1.5, font.main = 2)
mtext("Figure 2.2", side = 1, line = 4, adj = 0.2, cex = .8)

Correlation of sale price and other numeric variables are presented in Figure 2.2. Mean family income shows the strongest correlation with sale price (coefficient = 0.645), followed by total livable area. Exterior and interior conditions have the highest negative correlations. Some variables exhibit high autocorrelation, which may affect interpretability of the coefficients in the final model.

# ggplot(gather(numericVars), aes(value)) +
#   geom_histogram(bins = 50) +
#   facet_wrap(~key,nrow=4, scales = 'free_x') +
#   theme_minimal() +
#   theme(axis.text = element_blank(), axis.ticks = element_blank())

# all numeric variables
# model_philly_numeric <- st_drop_geometry(model_philly) %>% 
#   dplyr::select(sale_price, crime_nn5, 
#                 frontage, total_livable_area, 
#                 distance_to_city_hall, distance_to_nearest_water, 
#                 Estimate_Mean.family.income..dollars.,Percent_White.alone,LPSS_PER1000
#                 ) %>%
#   filter(sale_price <= 1000000) %>%
#   gather(Variable, Value, -sale_price)
# 
# ggplot(model_philly_numeric, aes(Value, sale_price)) +
#   geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#E46726") +
#   facet_wrap(~Variable, ncol = 4, scales = "free") +
#   labs(title = "Price as a function of continuous variables") +
#   theme_minimal() +
#   theme(axis.text = element_blank(), axis.ticks = element_blank())

# all factor variables
# model_philly_factor <- st_drop_geometry(model_philly) %>% 
#   dplyr::select(sale_price, 
#                 central_air, exterior_condition, fireplaces, garage_spaces, general_construction,
#                 interior_condition, number_of_bathrooms, number_of_bedrooms, quality_grade, separate_utilities,
#                 topography, zoning, building_code_description_new) %>%
#   mutate(exterior_condition = as.factor(exterior_condition),
#          fireplaces = as.factor(fireplaces),
#          garage_spaces = as.factor(garage_spaces),
#          interior_condition = as.factor(interior_condition),
#          number_of_bathrooms = as.factor(number_of_bathrooms),
#          number_of_bedrooms = as.factor(number_of_bedrooms)) %>%
#   filter(sale_price <= 1000000) %>%
#   gather(Variable, Value, -sale_price)
# 
# ggplot(model_philly_factor, aes(Value, sale_price)) +
#   geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
#   facet_wrap(~Variable, ncol = 4, scales = "free") +
#   labs(title = "Price as a function of categorical variables", y = "Mean Price") +
#   theme_minimal() +
#   theme(axis.text = element_blank(), axis.ticks = element_blank())  

## Crime cor
# model_philly %>%
#   st_drop_geometry() %>%
#   dplyr::select(sale_price, starts_with("crime")) %>%
#   filter(sale_price <= 1000000) %>%
#   gather(Variable, Value, -sale_price) %>% 
#    ggplot(aes(Value, sale_price)) +
#      geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#E46726") +
#      facet_wrap(~Variable, nrow = 1, scales = "free") +
#      labs(title = "Price as a function of crimes") +
#      theme_minimal()

2.3 Feature Engineering

Transformations of variables can sometimes improve model performance. Three types of transformations are applied in this study: logarithmic transformation, reclassification, and piecewise splitting, with examples shown below.

model_philly_numeric <- st_drop_geometry(model_philly) %>% 
  mutate(total_livable_area_log = log(total_livable_area + 1)) %>%
  dplyr::select(total_livable_area, total_livable_area_log) %>%
  filter(total_livable_area <= 2000000)

plot1 <- ggplot(model_philly_numeric, aes(x = total_livable_area)) +
  geom_histogram(bins = 30, fill = "#6D9EC1", color = "black", alpha = 0.5) +
  labs(title = "Original Total Livable Area") +
  xlim(0, 7500) + 
  coord_cartesian(ylim = c(0, 8000)) + 
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, hjust = 0.5),
    axis.title.x = element_blank()
  )


plot2 <- ggplot(model_philly_numeric, aes(x = total_livable_area_log)) +
  geom_histogram(bins = 30, fill = "#6D9EC1", color = "black", alpha = 0.5) +
  labs(title = "Log-transformed Total Livable Area") +
  xlim(5, 9) + 
  coord_cartesian(ylim = c(0, 8000)) +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank() 
  )


combined_plot <- plot1 + plot2 + 
  plot_layout(widths = c(1, 1)) + 
  plot_annotation(
    title = "Distribution of Total Livable Area",
       caption = "Figure 2.3",
    theme = theme(plot.title = element_text(size = 18, face = "bold"),
    plot.caption = element_text(hjust = 0))
  )

combined_plot

Figure 2.3 shows a typical use of log-transformation to the variable total livable area, making a right-skewed distribution more normal, fitting the assumptions of the OLS framework better.

philly_f <- philly %>% 
  mutate(
    exterior_condition = case_when(
      is.na(exterior_condition) | exterior_condition == 1 ~ "1",
      exterior_condition %in% c(2, 3) ~ "2",
      exterior_condition == 4 ~ "3",
      TRUE ~ "4"
    ),
    exterior_condition = factor(exterior_condition)
  )

count_data <- philly %>%
  group_by(exterior_condition) %>%
  summarise(count = n())

mean_price_data <- philly %>%
  group_by(exterior_condition) %>%
  summarise(mean_sale_price = mean(sale_price, na.rm = TRUE))

count_data_f <- philly_f %>%
  group_by(exterior_condition) %>%
  summarise(count = n())

mean_price_data_f <- philly_f %>%
  group_by(exterior_condition) %>%
  summarise(mean_sale_price = mean(sale_price, na.rm = TRUE))


plot1 <- ggplot(count_data, aes(x = exterior_condition, y = count)) +
  geom_bar(stat = "identity", fill = "#6D9EC1", alpha = 0.8, width = 0.6) +
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )

plot2 <- ggplot(mean_price_data, aes(x = exterior_condition, y = mean_sale_price)) +
  geom_bar(stat = "identity", fill = "#E46726", alpha = 0.8, width = 0.6) +
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )

plot3 <- ggplot(count_data_f, aes(x = exterior_condition, y = count)) +
  geom_bar(stat = "identity", fill = "#6D9EC1", alpha = 0.8, width = 0.6) +  
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )

plot4 <- ggplot(mean_price_data_f, aes(x = exterior_condition, y = mean_sale_price)) +
  geom_bar(stat = "identity", fill = "#E46726", alpha = 0.8, width = 0.6) +  
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )


combined_plot <- (plot1 | plot2) / (plot3 | plot4) + 
  plot_annotation(
    title = "Exterior Condition and Sale Price",
    subtitle = "Before (Left) and After (Right) Reclassification",
    caption = "Figure 2.4",
    theme = theme(plot.title = element_text(size = 18, face = "bold"),
    plot.caption = element_text(hjust = 0))
  )

combined_plot

Reclassification includes converting a numeric variable into categorial one, and recoding it into fewer categories, as Figure 2.4. This applies to variables that have too few observations in some categories or have mere differences across categories, making the model more robust.

model_philly_numeric <- st_drop_geometry(model_philly) %>% 
  dplyr::select(sale_price, distance_to_city_hall) %>%
  filter(sale_price <= 1000000) %>%
  mutate(DistanceCategory = ifelse(distance_to_city_hall <= 20000, "<= 20000", "> 20000"))

plot1 <- ggplot(subset(model_philly_numeric, DistanceCategory == "<= 20000"),
                aes(distance_to_city_hall, sale_price)) +
  geom_point(size = 0.5, color = "#6D9EC1", alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#E46726") +
  ylab("Sale Price") +
  ggtitle("<= 20000") +  # 添加小标题
  scale_x_continuous(breaks = seq(0, 20000, by = 10000)) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(size = 12, hjust = 0.5)) 

plot2 <- ggplot(subset(model_philly_numeric, DistanceCategory == "> 20000"),
                aes(distance_to_city_hall, sale_price)) +
  geom_point(size = 0.5, color = "#6D9EC1", alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#E46726") +
  ggtitle("> 20000") +  # 添加小标题
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(size = 12, hjust = 0.5)) 

combined_plot <- plot1 + plot2 + 
  plot_layout(widths = c(1, 4)) + 
  plot_annotation(
    title = "Price as a function of Distance to City Hall",
    caption = "Figure 2.5",
    theme = theme(plot.title = element_text(size = 18, face = 'bold'),
                  plot.caption = element_text(hjust = 0))
  )

combined_plot

Cutting continuous variables into pieces is a way of dealing with non-linear relationships, as is shown in Figure 2.5. The sale price sharply decreases and slightly goes up as distance to city hall increases, reflecting the concentric development structure discussed in Figure 2.5. In this case, we added a “dist1” variable, assigning actual values if its distance is less than 20000 ft, or 20000 ft otherwise. This transformation will help achieve the effect of piecewise function in linear regressions.

philly_fac <- philly %>%
  mutate(
    # basements = case_when(
    #   basements %in% c("0", "1", "4", "A") ~ "B",   
    #   TRUE ~ "A"                               
    # ),
    # basements = factor(basements),
    
    central_air = case_when(
      central_air %in% c("1", "Y") ~ "Y",  
      TRUE ~ "N"                          
    ),
    central_air = factor(central_air),
    
    exterior_condition = case_when(
      is.na(exterior_condition) | exterior_condition == 1 ~ "1",    
      exterior_condition %in% c(2, 3) ~ "2",                        
      exterior_condition == 4 ~ "3",                                
      TRUE ~ "4"                      
    ),
    exterior_condition = factor(exterior_condition),
    
    fireplaces = case_when(
      is.na(fireplaces) | fireplaces == 0 | fireplaces == 1 ~ "0",                    
      TRUE ~ "1"                                
    ),
    fireplaces = factor(fireplaces),
    
    # fuel = case_when(
    #   is.na(fuel) | fuel == "" ~ "X",  
    #   TRUE ~ "A"                       
    # ),
    # fuel = factor(fuel),
    
    garage_spaces = case_when(
      is.na(garage_spaces) | garage_spaces == 0 ~ "0",    
      garage_spaces == 1 ~ "1",      
      TRUE ~ "2"                     
    ),
    garage_spaces = factor(garage_spaces),
    
    general_construction = case_when(       
      general_construction %in% c("A ", "B ", "D ") ~ "A",  
      general_construction == "3 " ~ "C",        
      TRUE ~ "B"                                 
    ),
    general_construction = factor(general_construction),
    
    interior_condition = case_when(
      interior_condition %in% c(0, 1, 2, 3) ~ "1",         
      interior_condition %in% c(4, NA) ~ "2",        
      TRUE ~ "3"       
    ),
    interior_condition = factor(interior_condition),
    
    number_of_bathrooms = case_when(
      number_of_bathrooms == 0 ~ "0",                   
      number_of_bathrooms == 1 ~ "1",                   
      number_of_bathrooms %in% c(2, NA) ~ "2",          
      number_of_bathrooms == 3 ~ "3",                   
      TRUE ~ "4"                                        
    ),
    number_of_bathrooms = factor(number_of_bathrooms),
    
    number_of_bedrooms = case_when(
      number_of_bedrooms %in% c(0, 31) | is.na(number_of_bedrooms) ~ "0",  
      number_of_bedrooms %in% c(1, 2) ~ "1",    
      number_of_bedrooms == 3 ~ "2",            
      number_of_bedrooms == 4 ~ "3",            
      number_of_bedrooms %in% c(5, 6) ~ "4",    
      number_of_bedrooms %in% c(7, 8, 9) ~ "5" 
    ),
    number_of_bedrooms = factor(number_of_bedrooms),
    
    # parcel_shape = case_when(
    #   parcel_shape %in% c("C", "E") ~ "A",  
    #   parcel_shape == "A" ~ "B",            
    #   TRUE ~ "C"                            
    # ),
    # parcel_shape = factor(parcel_shape),
    
    quality_grade = case_when(
      quality_grade %in% c("A+", "A", "A-", "X", "X-") ~ "1",           
      quality_grade %in% c("B+", "B", "B-", "3", "S+") ~ "2",
      TRUE ~ "3"                                             
    ),
    quality_grade = factor(quality_grade),
    
    separate_utilities = case_when(
      separate_utilities %in% c("A", "C") ~ "B",  
      TRUE ~ "A"                                 
    ),
    separate_utilities = factor(separate_utilities),
    
    # street_direction = case_when(
    #   is.na(street_direction) | street_direction == "" | street_direction == "S" ~ "A",
    #   TRUE ~ "B"                           
    # ),
    # street_direction = factor(street_direction),
    
    topography = case_when(
      topography == "B" ~ "B",
      TRUE ~ "A"
    ),
    topography = factor(topography),
    
    # type_heater = case_when(
    #   type_heater == "D" ~ "B",            
    #   TRUE ~ "A"                           
    # ),
    # type_heater = factor(type_heater),
    
    building_code_description_new = case_when(
      building_code_description_new %in% c("COLONIAL", "OLD STYLE", "ROW MODERN", "ROW OLD STYLE", "TUDOR", "TWIN BUNGALOW") ~ "1",
      building_code_description_new %in% c("ROW POST WAR", "ROW TYPICAL", "TWIN CONVENTIONAL") ~ "3",
      building_code_description_new %in% c("OTHER", "ROW PORCH FRONT") ~ "4",
      TRUE ~ "2"
    ),
    building_code_description_new = factor(building_code_description_new),
    
    NAME = case_when(
      NAME %in% c("CHINATOWN", "CRESTMONT_FARMS", "EAST_PARK", "MECHANICSVILLE", 
                  "PENNYPACK_PARK", "UNIVERSITY_CITY", "WISSAHICKON_PARK", "WOODLAND_TERRACE") ~ "OTHERS",
      TRUE ~ NAME 
    ),
    NAME = factor(NAME)
  )
philly <- philly_fac

model_philly <- philly %>%
  filter(toPredict == "MODELLING")

st_drop_geometry(model_philly) %>% 
  dplyr::select(sale_price, total_livable_area, crime_nn5, distance_to_city_hall, Estimate_Mean.family.income..dollars.) %>%
  filter(sale_price <= 2000000 & crime_nn5 <= 1000 & total_livable_area <= 7500) %>%
  gather(Variable, Value, -sale_price) %>%
  ggplot(aes(Value, sale_price)) +
  geom_point(size = 0.5, colour = "#6D9EC1", alpha = 0.5) +  
  geom_smooth(method = "lm", se = FALSE, colour = "#E46726") + 
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a function of continuous variables",
       caption = "Figure 2.6") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.caption = element_text(hjust = 0)
  )

Figure 2.6 includes scatterplots for sale price against four selected continuous variables. Besides “distance to city hall” we have addressed, positive relationships between sale price and the other three variables are explicit. The points in the plots are dispersed rather than being around the line, showing the sense of Hedonic Pricing Model’s aim to value all services to define a total price, reducing the variance from any single variable.

ggplot() +
  geom_sf(data = nhoods, , fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly %>% filter(!is.na(LPSS_PER1000)), aes(colour = q5(LPSS_PER1000)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(philly,"LPSS_PER1000"),
                   name="Quintile\nBreaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  labs(title="Low-Produce Supply Stores",
       subtitle = "per 1000 People",
       caption = "Figure 2.7") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

Low-produce supply stores (LPSS) include dollar stores, pharmacies and convenience stores. It is a measure of accessibility to affordable stores, having high density in areas with lower income in north and west Philadelphia, as shown in Figure 2.7. When aligned with other variables, LPSS distinguishes these neighborhoods well and enhances model accuracy.

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly %>% filter(!is.na(exterior_condition)), aes(colour = exterior_condition), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = colorRampPalette(c("#E46726", "#6D9EC1"))(4)
                   ) +
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  labs(title="Exterior Condition",
       caption = "Figure 2.8") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

The spatial distribution of exterior conditions is as Figure 2.8 (1 = best, 4 = worst). Homes with the best exterior conditions are close to the center city, and west Philadelphia shows a mix of both well- and ill-conditioned homes, instead of corresponding only with the wealthiness of the neighborhoods.

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = TRUE) +
  geom_sf(data = philly %>% filter(!is.na(zoning)), aes(colour = zoning), 
          show.legend = "point", size = 0.75) +
  scale_colour_manual(values = c(
    "RM1" = "#FCB951", "RM2" = "#FCB951", "RM3" = "#FCB951", "RM4" = "#FCB951",
    "RMX1" = "#E58425", "RMX2" = "#E58425", "RMX3" = "#E58425",
    "RSA1" = "#F8EE68", "RSA2" = "#F8EE68", "RSA3" = "#F8EE68", "RSA4" = "#F8EE68", 
    "RSA5" = "#F8EE68", "RSA6" = "#F8EE68",
    "RSD1" = "#FFF5C4", "RSD2" = "#FFF5C4", "RSD3" = "#FFF5C4",
    "RTA1" = "#CDB650"
  )) +
  guides(color = guide_legend(override.aes = list(size = 3), ncol = 2)) + 
  labs(title = "Zoning", subtitle = "Philadelphia, 2022",
       caption = "Figure 2.9") +
  mapTheme() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12),
    panel.border = element_rect(colour = "black", fill = NA, size = 1)
  )

Figure 2.9 visualizes the zoning status, colored in the official colors. The most common type of zoning is RSA (Residential Single-Family Attached). RM (Residential Multi-Family), with higher density, are clustered along Board St, near center city and a couple of areas in west Philadelphia, while RSD (Residential Single-Family Detached), with lower density, are spread across far northwest and northeast suburbs.

# cor_philly <- 
#   st_drop_geometry(model_philly) %>%
#   filter(sale_price <= 1000000,
#          total_livable_area < 10000) %>%
#   dplyr::select(sale_price, total_livable_area)
# 
# cor.test(cor_philly$total_livable_area,
#          cor_philly$sale_price, 
#          method = "pearson")
# 
# 
# # Calculate predicted prices
# cor_philly$predicted_price <- predict(lm(sale_price ~ total_livable_area, data = cor_philly), newdata = cor_philly)
# ggscatter(cor_philly,
#           x = "total_livable_area",
#           y = "sale_price",
#           color = "#6D9EC1", size = 1, alpha = 0.6) +
#   geom_point(aes(y = predicted_price), color = "#E46726", size = 1, alpha = 0.6) +
#   stat_cor(label.x = 4000, label.y = 250000, hjust = 0) +
#   labs(title = "Price as a function of living area", 
#        subtitle = "With predicted prices; Sale prices <= $1 mil.",
#        x = "Total Livable Area",
#        y = "Sale Price") +
#   theme_minimal() +
#   theme(plot.title = element_text(size = 18, face = "bold"),
#         plot.subtitle = element_text(size = 12)) 

3 Methods

The core of our algorithm is Ordinary Least Square (OLS) regression model. We used 9 continuous variables and 13 reclassified categorical variables as independent variables and sale price as the dependent variable.

To build and evaluate our model, we split the dataset into two parts: 60% as the training set to fit the model, and 40% as the test set to evaluate its accuracy. The model was tested using Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and prediction v.s. actual value scatter plot.

To ensure the accuracy of the model, we applied a 100-fold cross validation process. This divides the data into 100 subsets, repeatedly testing the model to see how well it performs across different samples. Finally, we examined the generalizability of the model across different neighborhood contexts by observing the distribution of spatial lag, calculating Moran’s I (a measure of spatial clustering in model errors), and testing the model in different neighborhood settings.

Analyses were conducted using R version 4.4.1.

4 Results

Using the features selected above, our model explains 76% of the variation in price and produces average errors of 36% relative to housing prices. In terms of cross-validation, our model was well-trained on the available variables and shows high accuracy for low-priced houses. However, it demonstrates less consistency and a slight positive spatial autocorrelation, suggesting the influence of unaccounted factors. At the urban scale, our model is less effective at predicting prices in low-income neighborhoods and reveals a significant discrepancy between high- and low-income areas, indicating a further development in generalizability.

4.1 Our Regression Model: explaining 76% of the prices

To improve the model’s accuracy and generalizability, we randomly split the dataset into 60% for training the model and 40% for testing its goodness of fit. Using the selected features, OLS regression was performed on the training set.

set.seed(34)
inTrain <- createDataPartition(
              y = paste(model_philly$general_construction, 
                       model_philly$topography,
                       model_philly$quality_grade,
                        model_philly$NAME),
                        
              p = .60, list = FALSE)
# [row,column] select all columns
philly.training <- model_philly[inTrain,] 
philly.test <- model_philly[-inTrain,]

reg.training <- lm(sale_price ~ log(total_livable_area+1) + total_livable_area + crime_nn5 + LPSS_PER1000 +
              log(dist1+1) +log(distance_to_city_hall+1) + log(distance_to_nearest_water+1) +
              Estimate_Mean.family.income..dollars. + Percent_White.alone + log(frontage+1) +  
              
              central_air + exterior_condition + fireplaces + garage_spaces + general_construction + 
              interior_condition + number_of_bathrooms + number_of_bedrooms + quality_grade + 
              separate_utilities + topography + building_code_description_new + zoning + 
              NAME,
    data = st_drop_geometry(philly.training))

kable(summary(reg.training)$coefficients, digits = 3,format = "html",caption = "<b>Table 4.1 OLS Regression Results</b>", align = "ccccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover")) %>%
  scroll_box(height = "300px")
Table 4.1 OLS Regression Results
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2602553.344 159428.947 16.324 0.000
log(total_livable_area + 1) -53340.123 4227.568 -12.617 0.000
total_livable_area 203.405 3.891 52.272 0.000
crime_nn5 99.448 21.919 4.537 0.000
LPSS_PER1000 -355.719 62.060 -5.732 0.000
log(dist1 + 1) -311331.292 28622.655 -10.877 0.000
log(distance_to_city_hall + 1) 84882.204 23567.098 3.602 0.000
log(distance_to_nearest_water + 1) -4915.598 2243.416 -2.191 0.028
Estimate_Mean.family.income..dollars. 0.315 0.047 6.723 0.000
Percent_White.alone 310.545 112.103 2.770 0.006
log(frontage + 1) 32602.610 2292.147 14.224 0.000
central_airY 18501.777 2988.092 6.192 0.000
exterior_condition2 -53547.193 7303.736 -7.331 0.000
exterior_condition3 -68580.559 7492.304 -9.153 0.000
exterior_condition4 -91899.003 8674.360 -10.594 0.000
fireplaces1 67068.917 13085.724 5.125 0.000
garage_spaces1 15842.471 2864.147 5.531 0.000
garage_spaces2 78115.398 8498.819 9.191 0.000
general_constructionB -11270.490 3656.738 -3.082 0.002
general_constructionC 189225.510 20068.304 9.429 0.000
interior_condition2 -42242.750 3470.206 -12.173 0.000
interior_condition3 -56030.547 6584.426 -8.510 0.000
number_of_bathrooms1 -18625.215 6977.063 -2.669 0.008
number_of_bathrooms2 4351.026 7280.922 0.598 0.550
number_of_bathrooms3 55836.812 9239.940 6.043 0.000
number_of_bathrooms4 326526.524 16623.536 19.642 0.000
number_of_bedrooms1 22870.061 7307.230 3.130 0.002
number_of_bedrooms2 39637.760 6887.849 5.755 0.000
number_of_bedrooms3 15978.172 7511.094 2.127 0.033
number_of_bedrooms4 -14766.514 10373.662 -1.423 0.155
number_of_bedrooms5 -290198.487 27328.456 -10.619 0.000
quality_grade2 -64882.420 19616.440 -3.308 0.001
quality_grade3 -74504.377 19450.009 -3.831 0.000
separate_utilitiesB -38831.626 8023.355 -4.840 0.000
topographyB 239582.260 34833.986 6.878 0.000
building_code_description_new2 -15340.877 5805.798 -2.642 0.008
building_code_description_new3 -16182.590 5557.419 -2.912 0.004
building_code_description_new4 -19220.374 6137.199 -3.132 0.002
zoningRM2 7746.895 14385.569 0.539 0.590
zoningRM3 -9737.974 23710.849 -0.411 0.681
zoningRM4 -32379.920 16618.804 -1.948 0.051
zoningRMX1 63005.260 24701.659 2.551 0.011
zoningRMX2 72013.347 41015.004 1.756 0.079
zoningRMX3 -166525.907 13033.349 -12.777 0.000
zoningRSA1 -66667.854 24014.839 -2.776 0.006
zoningRSA2 -27635.835 9956.102 -2.776 0.006
zoningRSA3 -24865.513 6233.913 -3.989 0.000
zoningRSA4 -29916.407 15535.375 -1.926 0.054
zoningRSA5 2523.187 3963.322 0.637 0.524
zoningRSA6 30188.602 34770.908 0.868 0.385
zoningRSD1 -57179.970 16086.302 -3.555 0.000
zoningRSD2 -76399.095 24976.463 -3.059 0.002
zoningRSD3 -55811.352 10447.264 -5.342 0.000
zoningRTA1 -17908.382 17482.315 -1.024 0.306
NAMEALLEGHENY_WEST -29308.738 33195.930 -0.883 0.377
NAMEANDORRA -11064.666 30574.876 -0.362 0.717
NAMEASTON_WOODBRIDGE -4960.254 27215.358 -0.182 0.855
NAMEBARTRAM_VILLAGE 2026.783 44167.698 0.046 0.963
NAMEBELLA_VISTA -9689.404 39012.675 -0.248 0.804
NAMEBELMONT -96596.983 39534.258 -2.443 0.015
NAMEBLUE_BELL_HILL 23855.225 52093.575 0.458 0.647
NAMEBREWERYTOWN -66421.604 34047.828 -1.951 0.051
NAMEBRIDESBURG -33288.625 28495.489 -1.168 0.243
NAMEBURHOLME 22678.195 33311.026 0.681 0.496
NAMEBUSTLETON 9083.709 21455.471 0.423 0.672
NAMEBYBERRY -152303.892 50853.624 -2.995 0.003
NAMECALLOWHILL -57479.036 67376.749 -0.853 0.394
NAMECARROLL_PARK -13639.025 32370.249 -0.421 0.674
NAMECEDAR_PARK 91848.072 36103.192 2.544 0.011
NAMECEDARBROOK -29504.447 25464.674 -1.159 0.247
NAMECHESTNUT_HILL 192433.090 26600.252 7.234 0.000
NAMECLEARVIEW -2487.014 39341.471 -0.063 0.950
NAMECOBBS_CREEK 16924.000 31550.192 0.536 0.592
NAMECRESCENTVILLE -1060.511 57603.588 -0.018 0.985
NAMEDEARNLEY_PARK -57664.481 34975.202 -1.649 0.099
NAMEDICKINSON_NARROWS -61449.378 35922.057 -1.711 0.087
NAMEDUNLAP -54856.694 41086.849 -1.335 0.182
NAMEEAST_FALLS 99498.823 31849.632 3.124 0.002
NAMEEAST_KENSINGTON 50476.992 33709.002 1.497 0.134
NAMEEAST_OAK_LANE -75957.043 29019.226 -2.617 0.009
NAMEEAST_PARKSIDE -96268.424 39336.281 -2.447 0.014
NAMEEAST_PASSYUNK 33483.894 34657.539 0.966 0.334
NAMEEAST_POPLAR -106961.230 48732.852 -2.195 0.028
NAMEEASTWICK -622.977 34845.180 -0.018 0.986
NAMEELMWOOD 4793.280 31604.040 0.152 0.879
NAMEFAIRHILL -26062.385 38193.657 -0.682 0.495
NAMEFAIRMOUNT 45949.918 35408.313 1.298 0.194
NAMEFELTONVILLE -15557.392 28430.643 -0.547 0.584
NAMEFERN_ROCK -33460.333 31686.198 -1.056 0.291
NAMEFISHTOWN 55347.942 32925.212 1.681 0.093
NAMEFITLER_SQUARE 354300.464 43313.640 8.180 0.000
NAMEFOX_CHASE 18939.843 22049.062 0.859 0.390
NAMEFRANCISVILLE -93953.052 37651.349 -2.495 0.013
NAMEFRANKFORD -55434.931 25885.719 -2.142 0.032
NAMEFRANKLINVILLE -20779.315 36170.837 -0.574 0.566
NAMEGARDEN_COURT 137900.477 42495.015 3.245 0.001
NAMEGERMANTOWN_EAST -42393.000 26355.213 -1.609 0.108
NAMEGERMANTOWN_MORTON -30698.933 29280.182 -1.048 0.294
NAMEGERMANTOWN_PENN_KNOX 9668.388 43071.006 0.224 0.822
NAMEGERMANTOWN_SOUTHWEST -32441.731 31351.090 -1.035 0.301
NAMEGERMANTOWN_WEST_CENT -23829.366 32415.715 -0.735 0.462
NAMEGERMANTOWN_WESTSIDE -15048.286 38166.730 -0.394 0.693
NAMEGERMANY_HILL 29758.118 29312.835 1.015 0.310
NAMEGIRARD_ESTATES 19701.931 33865.308 0.582 0.561
NAMEGLENWOOD -50566.283 38874.899 -1.301 0.193
NAMEGRADUATE_HOSPITAL -21343.972 37725.365 -0.566 0.572
NAMEGRAYS_FERRY -86619.740 33827.883 -2.561 0.010
NAMEGREENWICH -19475.542 37230.024 -0.523 0.601
NAMEHADDINGTON -9678.776 31976.804 -0.303 0.762
NAMEHARROWGATE -17226.857 30729.207 -0.561 0.575
NAMEHARTRANFT -76415.644 33764.438 -2.263 0.024
NAMEHAVERFORD_NORTH -119916.837 44240.581 -2.711 0.007
NAMEHAWTHORNE -65396.685 42714.996 -1.531 0.126
NAMEHOLMESBURG -36366.901 21898.495 -1.661 0.097
NAMEHUNTING_PARK -15837.410 30663.227 -0.516 0.606
NAMEJUNIATA_PARK -9587.595 28742.008 -0.334 0.739
NAMEKINGSESSING -16713.237 32894.986 -0.508 0.611
NAMELAWNDALE -23342.252 24375.594 -0.958 0.338
NAMELEXINGTON_PARK 47580.634 29014.632 1.640 0.101
NAMELOGAN -50539.356 28168.190 -1.794 0.073
NAMELOGAN_SQUARE 96881.443 41583.433 2.330 0.020
NAMELOWER_MOYAMENSING -14796.671 33166.742 -0.446 0.656
NAMELUDLOW -126672.567 44551.269 -2.843 0.004
NAMEMANAYUNK 23543.315 26340.398 0.894 0.371
NAMEMANTUA -100955.282 37763.526 -2.673 0.008
NAMEMAYFAIR -3113.029 22504.502 -0.138 0.890
NAMEMCGUIRE -20470.241 40807.442 -0.502 0.616
NAMEMELROSE_PARK_GARDENS -49188.376 32788.425 -1.500 0.134
NAMEMILL_CREEK -28154.092 34665.280 -0.812 0.417
NAMEMILLBROOK -32022.950 28539.903 -1.122 0.262
NAMEMODENA -25566.749 24704.705 -1.035 0.301
NAMEMORRELL_PARK -20171.705 26015.772 -0.775 0.438
NAMEMOUNT_AIRY_EAST -13797.497 24467.224 -0.564 0.573
NAMEMOUNT_AIRY_WEST -3215.829 26921.057 -0.119 0.905
NAMENEWBOLD -33112.641 35184.325 -0.941 0.347
NAMENICETOWN -4117.793 35047.196 -0.117 0.906
NAMENORMANDY_VILLAGE 10148.206 43143.818 0.235 0.814
NAMENORTH_CENTRAL -110199.533 34929.846 -3.155 0.002
NAMENORTHERN_LIBERTIES -54988.249 36717.795 -1.498 0.134
NAMENORTHWOOD -39679.709 27897.136 -1.422 0.155
NAMEOGONTZ -37362.801 26169.313 -1.428 0.153
NAMEOLD_KENSINGTON -21715.145 35840.606 -0.606 0.545
NAMEOLNEY -35809.804 25780.602 -1.389 0.165
NAMEOTHERS 159496.761 47821.591 3.335 0.001
NAMEOVERBROOK -7237.417 28863.711 -0.251 0.802
NAMEOXFORD_CIRCLE -11048.623 22848.657 -0.484 0.629
NAMEPACKER_PARK 36516.927 38251.130 0.955 0.340
NAMEPARKWOOD_MANOR -29901.403 26079.668 -1.147 0.252
NAMEPASCHALL -17043.822 31813.161 -0.536 0.592
NAMEPASSYUNK_SQUARE -24210.496 36061.728 -0.671 0.502
NAMEPENNSPORT -27408.722 35062.664 -0.782 0.434
NAMEPENNYPACK 16373.339 23556.723 0.695 0.487
NAMEPENNYPACK_WOODS -32622.449 31699.974 -1.029 0.303
NAMEPENROSE 4735.753 34266.872 0.138 0.890
NAMEPOINT_BREEZE -92744.583 34518.047 -2.687 0.007
NAMEPOWELTON 70861.367 68260.900 1.038 0.299
NAMEQUEEN_VILLAGE 42061.360 37537.584 1.121 0.263
NAMERHAWNHURST 31930.280 21888.038 1.459 0.145
NAMERICHMOND 41300.018 31596.752 1.307 0.191
NAMERITTENHOUSE 225261.447 42233.350 5.334 0.000
NAMERIVERFRONT -18817.788 59635.268 -0.316 0.752
NAMEROXBOROUGH 37129.366 26384.156 1.407 0.159
NAMEROXBOROUGH_PARK 23579.964 38195.037 0.617 0.537
NAMESHARSWOOD -76375.672 45785.346 -1.668 0.095
NAMESOCIETY_HILL 157827.536 39694.403 3.976 0.000
NAMESOMERTON 2882.903 21870.934 0.132 0.895
NAMESOUTHWEST_SCHUYLKILL -15319.075 34234.156 -0.447 0.655
NAMESPRING_GARDEN -22094.949 39599.016 -0.558 0.577
NAMESPRUCE_HILL 219723.879 40576.763 5.415 0.000
NAMESTADIUM_DISTRICT 47099.103 35656.531 1.321 0.187
NAMESTANTON -105770.026 33704.611 -3.138 0.002
NAMESTRAWBERRY_MANSION -82565.409 33359.539 -2.475 0.013
NAMESUMMERDALE -21538.879 27240.615 -0.791 0.429
NAMETACONY -38871.002 22737.482 -1.710 0.087
NAMETIOGA -44156.582 32954.713 -1.340 0.180
NAMETORRESDALE -12148.653 23285.266 -0.522 0.602
NAMEUPPER_KENSINGTON -21619.929 31866.999 -0.678 0.498
NAMEUPPER_ROXBOROUGH -255.324 24484.536 -0.010 0.992
NAMEWALNUT_HILL 66325.715 39412.958 1.683 0.092
NAMEWASHINGTON_SQUARE -38814.789 44051.355 -0.881 0.378
NAMEWEST_KENSINGTON -21448.120 33916.230 -0.632 0.527
NAMEWEST_OAK_LANE -38806.419 23723.676 -1.636 0.102
NAMEWEST_PARKSIDE -22817.113 57916.022 -0.394 0.694
NAMEWEST_PASSYUNK -57094.888 34254.726 -1.667 0.096
NAMEWEST_POPLAR -149198.102 49603.709 -3.008 0.003
NAMEWEST_POWELTON -72672.246 40909.125 -1.776 0.076
NAMEWHITMAN -26867.258 34173.513 -0.786 0.432
NAMEWINCHESTER_PARK -10616.927 33056.006 -0.321 0.748
NAMEWISSAHICKON 32862.285 30044.667 1.094 0.274
NAMEWISSAHICKON_HILLS 40770.009 35614.181 1.145 0.252
NAMEWISSINOMING -27769.979 23659.387 -1.174 0.241
NAMEWISTER -48071.736 31639.190 -1.519 0.129
NAMEWYNNEFIELD -31312.245 32489.813 -0.964 0.335
NAMEWYNNEFIELD_HEIGHTS 87649.238 39433.444 2.223 0.026
NAMEYORKTOWN -120781.881 47932.988 -2.520 0.012

In our model, the features explain approximately 76% of the variation in price (Adjusted R2 = 75.73%) and most of their coefficients are statistically significant (p<0.05). Several features, such as general construction and architectural styles, are converted to categorical variables, allowing their coefficients to be estimated and linked to housing prices. The remaining insignificant coefficients are of neighborhood names or zoning codes, which are less likely to regroup due to spatial disparities.

4.2 Accuracy: better for sales priced below $1,000,000

To dictate how useful the model is for decision making, we used testing set to analyse goodness of fit indicators.

philly.test <-
  philly.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, philly.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 5000000)

MAE <- mean(philly.test$sale_price.AbsError, na.rm = T)
MAPE <- mean(philly.test$sale_price.APE, na.rm = T)
acc <- data.frame(MAE, MAPE)

acc %>%
  knitr::kable(caption = "<b>Table 4.2 MAE and MAPE of the testing dataset</b>", align = "cc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))
Table 4.2 MAE and MAPE of the testing dataset
MAE MAPE
65671.32 0.3592045

The Mean Absolute Error (MAE) is $65671.32, representing the average difference between predicted and observed prices. The relatively high value reflects the diversity in housing prices, suggesting a limitation for linear regression. The Mean Absolute Percent Error (MAPE) indicates that the prediction error amounts to 35.92% of the housing prices on average.

ggplot(
  philly.test, aes(sale_price.Predict, sale_price)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "#6D9EC1") +
  geom_abline(intercept = 0, slope = 1, color="#E46726",size=1) +
  theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = ggtext::element_markdown(size = 12),
        plot.caption = element_text(hjust = 0)) +
  labs(title = 'Predicted sale price as a function of observed price', 
       subtitle = glue("<span style='color:#E46726;'>Perfect prediction</span> vs. <span style='color:#6D9EC1;'>Average prediction</span>"),caption = "Figure 4.1",
       x = "Predicted Sale Price",
       y = "Observed Sale Price")

Given the high MAE and MAPE, data visualization was used to compare predicted and observed prices. The red and blue lines nearly overlap, indicating the model performs well overall. However, the slight deviation suggests that, on average, the model’s predictions are a little higher than the observed prices. Examining the data points further reveals that prediction accuracy decreases as prices rise, confirming our concerns about the variability in the higher price ranges of the dataset. This suggests that our model performs particularly well for lower-valued housing prices.

4.3 Generalizability: cross-validation shows less consistency

After making predictions on a single hold-out test set, we partitioned the dataset into 100 equal-sized subsets, training on the remaining data and testing on each subset. This process is known as the K-fold cross-validation method.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ log(total_livable_area+1) + total_livable_area + crime_nn5 + LPSS_PER1000 +
              log(dist1+1) +log(distance_to_city_hall+1) + log(distance_to_nearest_water+1) +
              Estimate_Mean.family.income..dollars. + Percent_White.alone + log(frontage+1) +  
              
              central_air + exterior_condition + fireplaces + garage_spaces + general_construction + 
              interior_condition + number_of_bathrooms + number_of_bedrooms + quality_grade + 
              separate_utilities + topography + building_code_description_new + zoning + 
              NAME,
    
    data = st_drop_geometry(model_philly),
    method = "lm", trControl = fitControl, na.action = na.pass)

kable(reg.cv$results, digits = 3,caption = "<b>Table 4.3 Cross-Validation Results</b>", align = "ccccccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))
Table 4.3 Cross-Validation Results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 115492.5 0.765 65851.93 37687.33 0.07 6719.758
Mean <- mean(reg.cv$resample[,3])
SD <- sd(reg.cv$resample[,3])
stat_MAE <- data.frame(Mean, SD)

stat_MAE %>%
  knitr::kable(caption = "<b>Table 4.4 Mean and standard deviation of the cross-validation MAE</b>", align = "cc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))
Table 4.4 Mean and standard deviation of the cross-validation MAE
Mean SD
65851.93 6719.758

The standard deviation of MAE (6719.76) reflects significant variation across the 100 folds. Additionally, the mean of MAE in cross-validation (65851.93) is quite similar to the MAE in our test set (65671.32), indicating that the model was well-trained on the existing variables and therefore shows an average error.

ggplot(data = data.frame(mae = reg.cv$resample[,3]),aes(x = mae)) +
  geom_histogram(color = "white", fill = "#6D9EC1") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 50),
                     labels = function(x) ifelse(x %% 5000 == 0, x, "")) +
  labs(title = "Distribution of MAE", 
       subtitle = "K fold cross validation; k=100.",
       caption = "Figure 4.2",
       x = "Mean Absolute Error", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(hjust = 0)) +
  geom_vline(aes(xintercept=mean(mae, na.rm=T)),
               color="#E46726", linetype="dashed", size=1) +
  geom_vline(aes(xintercept= MAE),
               color="#E46726", linetype="dashed", size=.8, alpha = .8) +
  annotate(geom = "text", label = paste("Mean of MAE:", round(Mean, 2)), # Insert correlation here
    x = 67000, y = 10, 
    hjust = 0, size = 4, color = "#E46726") +
  annotate(geom = "text", label = paste("MAE in the test set:\n", round(MAE, 2)), # Insert correlation here
    x = 61000, y = 9, 
    hjust = 1, size = 4, color = "#E46726", alpha = .8) 

Visualized in a histogram, the distribution of MAE in cross-validation peaks around 63000 and exhibits a long right tail. This suggests that the model predicts inconsistently and may be unreliable for predicting houses that have not been sold recently.

4.4 Generalizability: positive spatial autocorrelation exists

philly.test_predict <-
  philly.test[which(philly.test$sale_price.Error != 0),]

coords.test <-  st_coordinates(philly.test_predict) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
philly.test_predict$lagPriceError <- lag.listw(spatialWeights.test, philly.test_predict$sale_price.Error)

philly.test <- philly.test_predict

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly.test_predict, aes(colour = q5(sale_price.Error), na.rm = T),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(philly.test_predict,"sale_price.Error"),
                   name="Quintile Breaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(title="Sale price errors on the test set",
       caption = "Figure 4.3") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

What might be causing the high MAE? Although our model incorporates internal characteristics, public services/amenities, and some spatial features, other factors influencing the spatial distribution of home prices remain unexplored. As the map illustrates, sale price errors in Philadelphia are generally randomly distributed, but with some clustering both within and across neighborhoods. How much do our model errors cluster? Calculating the average weighted model error of its five nearest neighbors as the “spatial lag”, we delved deeper into the spatial autocorrelation test.

cor_error <- cor(philly.test_predict$lagPriceError, philly.test_predict$sale_price.Error, method = "pearson")
ggplot(philly.test_predict, aes(x = lagPriceError, y = sale_price.Error))+
     geom_point(size = .5, color = "#6D9EC1") + geom_smooth(method = "lm", se=F, colour = "#E46726") +
     labs(title = "Error as a function of the spatial lag of price",
          x = "Spatial lag of errors (Mean error of 5 nearest neighbors)", y = "Sale price errors",
       caption = "Figure 4.4") +
  annotate(geom = "text", label = paste("Pearson Correlation:", round(cor_error, 3)), # Insert correlation here
    x = -700000, y = -1000000, 
    hjust = 0, vjust = 2, 
    size = 4) +
     theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.caption = element_text(hjust = 0))

Comparing the spatial lag with the actual values helps estimate the degree of spatial autocorrelation. The relationship visualized above shows that as home price errors increase, nearby home price errors tend to rise as well. The correlation is 0.225, which is marginal but significant.

philly.test_nonzero <- philly.test %>%
  filter(sale_price.Error != 0)
moranTest <- moran.mc(philly.test_nonzero$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#E46726",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="<span style='color:#E46726;'>Observed</span> and <span style='color:#7F7F7F;'>permuted</span> Moran's I",
       x="Moran's I",
       y="Count",
       caption = "Figure 4.5") +
  annotate(geom = "text", label = paste("Statistic:", round(moranTest$statistic,3),"\np-value:", moranTest$p.value), # Insert correlation here
    x = 0.2, y = 300, 
    hjust = 0, 
    size = 4) +
  theme_minimal() +
  theme(plot.title = ggtext::element_markdown(size = 18, face = "bold"),
        plot.caption = element_text(hjust = 0))

Another approach to measure spatial autocorrelation is Moran’s I, where a positive value close to one indicates strong positive spatial autocorrelation. The histogram above shows 999 randomly permuted I values, with the observed I marked by the orange line. The observed I, higher than all random permutations, confirms spatial autocorrelation. An I of 0.117 seems marginal, but a p-value of 0.001 still indicates statistically significant clustering.

Both the spatial lag and Moran’s I test show that our model errors exhibit a slight positive spatial autocorrelation, suggesting the presence of unaccounted factors.

4.5 Generalizability: less effective for low-priced neighborhoods and low-income tracts

How does the spatial autocorrelation manifest in our model? From the perspective of geospatial features, including neighborhoods and median income, we further analysed the spatial process of sale price predictions and tested the generalizability across urban contexts.

nhoods_model <- st_drop_geometry(philly.test) %>%
  group_by(NAME) %>%
  summarise(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>%
  left_join(nhoods) %>%
  st_sf()

ggplot() +
  geom_sf(data = nhoods, fill = "transparent", color = "grey") +
  geom_sf(data = nhoods_model, aes(fill = mean.MAPE), color = "grey") +
  scale_fill_gradient(low = "#6D9EC1", high = "#E46726", name = "MAPE") +
  labs(title = "Mean test set MAPE by neighborhood",
       caption = "Figure 4.6") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

The MAPE by neighborhood reveals significantly less accurate predictions in North and West Philadelphia (highlighted in red). The vacant areas have no housing sales data, including northern Center City, University City, Northeast Philadelphia Airport and its surroundings, southern South Philadelphia, and parks. In North Philadelphia, the red areas are predominantly home to Hispanic and Black residents, while the surrounding population is more diverse, contributing to the high MAPE. In West Philadelphia, the impact of the University of Pennsylvania’s role in gentrification in the east and rising crime in the west have led to deteriorating areas with high MAPE. Both red areas also have lower sale prices, confirming the spatial pattern of model generalizability.

nhoods_model <- left_join(nhoods_model,
  st_drop_geometry(philly.test) %>%
    group_by(NAME) %>%
    summarise(meanPrice = mean(sale_price, na.rm = T))
) 
ggplot(data = st_drop_geometry(nhoods_model)) +
  geom_point(aes(x = meanPrice, y = mean.MAPE), size = 2, color = "#E46726", alpha = 0.7) +
  geom_vline(aes(xintercept= 375000),
               color="#6D9EC1", linetype="dashed", size=.8, alpha = .8) +
  geom_hline(aes(yintercept= .4),
               color="#6D9EC1", linetype="dashed", size=.8, alpha = .8) +
  labs(title = "MAPE as a function of mean price by neighborhood",
       x = "Mean Price by Neighborhood",
       y = "MAPE by Neighborhood",
       caption = "Figure 4.7") +
  theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.caption = element_text(hjust = 0)) +
  annotate(geom = "text", label = paste("(275000,0.4)"),
    x = 275000, y = 0.4, 
    hjust = -.8, vjust = -1, size = 4, color = "#6D9EC1", alpha = .8)

Upon further examining the relationship between price and MAPE across neighborhoods, it is evident that MAPE decreases as the mean price increases up to $375,000, but starts to rise when the mean price exceeds this threshold. In our model, mean prices between $125,000 and $500,000 are associated with relatively lower MAPE. The spatial pattern of MAPE (see Figure 4.6) also shows that lower-priced, declining areas tend to have higher MAPE. These two graphs highlight that our model is more effective in neighborhoods with mid-range prices and that the socioeconomic context of neighborhoods impacts pricing.

ggplot() + 
  geom_sf(data = tracts22, fill = "transparent", color = "grey") +
  geom_sf(data = na.omit(tracts22), aes(fill = incomeContext), color = "grey90") +
  scale_fill_manual(values = c("#E46726", "#6D9EC1"), name="Income Context") +
  labs(title = "Tracts grouped by income",
       subtitle = "Philadelphia, 2022",
       caption = "Figure 4.8") +
  mapTheme() + 
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

income_table <- st_join(philly.test, tracts22) %>%
  group_by(incomeContext) %>%
  summarise(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  dplyr::select(-"<NA>") %>%
  mutate('No Context' = scales::percent(MAPE))

income_table %>%
  knitr::kable(caption = "<b>Table 4.5 Test set MAPE by tract income context</b>", align = "ccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))
Table 4.5 Test set MAPE by tract income context
High Income Low Income No Context
24% 43% 36%

At the urban scale, we compared MAPE across different income contexts to assess the generalizability of our model. Tracts with median incomes above the citywide mean were designated as ‘High Income’, effectively dividing Philadelphia into two categories. The nearly 1:2 ratio in MAPE highlights a significant difference in the model’s goodness of fit, indicating higher accuracy in predicting housing prices in high-income areas. The neighborhoods with high MAPE (see Figure 4.6) fall predominantly within the low-income group, suggesting a connection between race, crime, income, and housing prices.

4.6 Challenge time

Using our model, the predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE” are as follows.

philly_finish <- philly %>%
   mutate(sale_price.Predict = predict(reg.training, philly))

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly_finish %>% filter(!is.na(sale_price.Predict)), aes(colour = q5(sale_price.Predict), na.rm = T),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(philly_finish,"sale_price.Predict"),
                   name="Quintile Breaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(title="Predicted Values of 'Modelling' and 'Challenge'",
       subtitle = "Philadelphia, 2022",
       caption = "Figure 4.9") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

5 Discussion

Our model effectively predicts 76% of the variation in housing prices, particularly with high accuracy for sales priced below $1,000,000. Our logarithmic transformation and piecewise splitting of distance to city hall is both interesting and essential to the model. Other key features such as distance to the city center, neighborhood names, and zoning numbers significantly reduce prediction errors by approximately 20% when included in training. The Mean Absolute Error is $65671.32 and the Mean Absolute Percent Error is 35.92%. Prediction errors remain high due to the dataset’s wide price variation and unaccounted factors influencing spatial dynamics. From the perspective of urban context, the model performs poorly in North Philadelphia, where clustering of Hispanic and Black residents presents challenges, and in West Philadelphia, where rising crime and gentrification add complexity. These decaying factors contribute to lower housing prices and a more intricate spatial process. Nonetheless, the model performs well in neighborhoods with mid-range housing prices, where the larger dataset provides stronger training and testing opportunities.

6 Conclusion

I would recommend our model to Zillow for customers looking to predict mid-range housing prices in Philadelphia. The model is well-trained and accounts for internal characteristics, public services/amenities, and spatial features, demonstrating a strong fit within the context of Philadelphia. While there is some spatial autocorrelation present and OLS regression has its limitations, we plan to explore additional socioeconomic factors that influence the spatial process of housing prices, and apply non-linear regression and other machine learning techniques to further enhance the model.

References

Aziz, Asad, Muhammad Mushahid Anwar, Hazem Ghassan Abdo, Hussein Almohamad, Ahmed Abdullah Al Dughairi, and Motrih Al-Mutiry. 2023. “Proximity to Neighborhood Services and Property Values in Urban Area: An Evaluation Through the Hedonic Pricing Model.” Land 12 (4): 859.
Li, Sheng, Yi Jiang, Shuisong Ke, Ke Nie, and Chao Wu. 2021. “Understanding the Effects of Influential Factors on Housing Prices by Combining Extreme Gradient Boosting and a Hedonic Price Model (XGBoost-HPM).” Land 10 (5): 533.
Lin, Yun-Wei, Yi-Bing Lin, Chun-You Liu, Jiun-Yi Lin, and Yu-Lin Shih. 2019. “Implementing AI as Cyber IoT Devices: The House Valuation Example.” IEEE Transactions on Industrial Informatics 16 (4): 2612–20.
Mayor, Karen, Seán Lyons, David Duffy, and Richard SJ Tol. 2009. “A Hedonic Analysis of the Value of Parks and Green Spaces in the Dublin Area.” ESRI working paper.
Scotchmer, Suzanne. 1985. “Hedonic Prices and Cost/Benefit Analysis.” Journal of Economic Theory 37 (1): 55–75.
Zaki, John, Anand Nayyar, Surjeet Dalal, and Zainab H Ali. 2022. “House Price Prediction Using Hedonic Pricing Model and Machine Learning Techniques.” Concurrency and Computation: Practice and Experience 34 (27): e7342.