knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = FALSE, fig.show = "asis",
    fig.align = "center") 
options(scipen=999)

# Load some libraries
library(tidyverse)
library(tidycensus)
library(stringr)
library(dplyr)
library(lubridate)
library(sf)
library(stargazer)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(corrr)      
library(kableExtra)
library(jtools)     
library(ggstance) 
library(ggpubr)    
library(broom.mixed) 
library(glue)
library(classInt)
library(viridis)
library(patchwork)
library(corrplot)


source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- colorRampPalette(c("#6D9EC1", "#E46726"))(5)

1 Introduction

An improved prediction model for home prices helps stakeholders and government agencies in making better housing policy and marketing decisions. For Zillow, one of the leading real estate companies in United States, developing an accurate, generalizable model for home prices is crucial to providing effective services to customers and enhancing the company’s competitiveness in the industry. However, this is a challenging task as numerous factors affect home prices, and no single model could cover all of them. This study seeks to create a more comprehensive home price model for Philadelphia, aiming to increase both accuracy and generalizability to benefit homebuyers, investors, the company and policymakers.

To achieve this, we applied the Hedonic Pricing Model, which accounts for internal characteristics, amenities/public services and spatial structure to determine property prices. Ordinary Least Square (OLS) regression forms the core of the model, following data cleaning, filtering and reclassification. Tests are conducted to evaluate the accuracy and generalizability of the model.

Overall, our model takes a step forward by including local characteristics like citywide spatial structures and neighborhood contexts, effectively predicts 76% of the variation in housing prices, achieves high accuracy for sales priced below $1,000,000, and controls mean absolute percentage error (MAPE) to about 36% without introducing more complex algorithms. However, the model shows less consistency and slight spatial autocorrelation, suggesting unaccounted factors. At the urban scale, North and West Philadelphia demonstrate poor predictions, indicating less generalizability towards low-priced neighborhoods and low-income tracts. Our model could be applied for predicting mid-range housing prices. Additional socioeconomic factors and non-linear regression techniques could be explored to develop the model.

2 Data

Before modeling, we processed the data through data gathering from open portals, exploratory data analysis, and feature engineering.

2.1 Data Collection

We reviewed previous studies to identify possible explanatory variables. Following the Hedonic Pricing Model, we sourced internal characteristics from the original dataset, gathered data on amenities (access to various spaces and services) through Open Data Philly portal, and accessed census data to grasp local contexts. Both continuous and categorial data are incorporated, and we chose the average distance to the 5 nearest crimes as an indicator of crime rate after comparing various metrics.(Zaki et al. 2022; Lin et al. 2019; Li et al. 2021; Aziz et al. 2023)

Data Sources: Philly neighborhoods, housing sale prices (studentData), crime incidents, hydrology, food retail access, zoning, census tables

# philly$crimes.Buffer <- philly %>% 
#     st_buffer(660) %>% 
#     aggregate(mutate(phillyCrimes, counter = 1),., sum) %>%
#     pull(counter)
philly <-
  philly %>% 
    mutate(
      # crime_nn1 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 1),
      # 
      # crime_nn2 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 2), 
      # 
      # crime_nn3 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 3), 
      # 
      # crime_nn4 = nn_function(st_coordinates(philly), 
      #                         st_coordinates(phillyCrimes), k = 4), 
      
      crime_nn5 = nn_function(st_coordinates(philly), 
                              st_coordinates(phillyCrimes), k = 5))

# city_hall <- metro[metro$Station == 'City Hall', 6]
# school_elem <- school %>%
#   filter(str_detect(GRADE_LEVEL, "ELEMENTARY"))
# 
# census <- census %>%
#   mutate(across(3:8, ~ as.numeric(as.character(.)))) %>%
#   mutate(NAMELSAD = str_extract(NAME, "^[^;]+")) %>%
#   select(NAMELSAD, 3:8)
# 
# calculate_nearest_distance <- function(set_points, other_layer) {
#   nearest_idx <- st_nearest_feature(set_points, other_layer)
#   st_distance(set_points, other_layer[nearest_idx, ], by_element = TRUE) %>% as.numeric()
# }
# set2 <- studentData[,74:76] %>%
#   mutate(distance_to_city_hall = st_distance(., city_hall) %>% as.numeric()) %>%
#   mutate(
#     distance_to_nearest_metro = calculate_nearest_distance(geometry, metro),
#     distance_to_nearest_hospital = calculate_nearest_distance(geometry, hospital),
#     distance_to_nearest_school = calculate_nearest_distance(geometry, school_elem),
#     distance_to_nearest_park = calculate_nearest_distance(geometry, park),
#     distance_to_nearest_water = calculate_nearest_distance(geometry, water)
#   )
# set2 <- set %>%
#   st_join(census_tract["NAMELSAD"]) %>%
#   left_join(census, by = "NAMELSAD") %>% 
#   st_join(retail[, c("LPSS_PER1000", "HPSS_PER1000")])

set <- read.csv("C:/Data/Fall 2024/CPLN 5920/Code/PPA/Predicting Housing Prices/data/set.csv")
philly_set <- merge(philly, set, by = "musaID", all.x = FALSE, all.y = FALSE, sort = FALSE) %>%
  dplyr::select(-toPredict.y) %>%
  rename(toPredict = toPredict.x) %>%
  mutate(dist1 = ifelse(distance_to_city_hall <= 20000, distance_to_city_hall, 20000))
philly <- philly_set %>% select(-distance_to_nearest_metro, -distance_to_nearest_hospital, -distance_to_nearest_school, -distance_to_nearest_park, -NAMELSAD, -Percent_Bachelor.s.degree.or.higher, -Percent_Management..business..science..and.arts.occupations, -Percent_Different.state, -Percent_With.private.health.insurance, -HPSS_PER1000)

2.2 Exploratory Data Analysis

model_philly <- philly %>%
  filter(toPredict == "MODELLING")

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey") +
  geom_sf(data = model_philly, aes(colour = q5(sale_price)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels = qBr(model_philly, "sale_price"),
                   name="Quintile Breaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  labs(title="Sale Price",
       subtitle = "Philadelphia, 2022",
       caption = "Figure 2.1") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

Sale price of homes is the dependent variable to be predicted, and Figure 2.1 showed its spatial distribution across Philadelphia. Higher prices cluster in center city and the far northeast and northwest suburbs, while lower prices concentrate in west Philadelphia and north Philadelphia. This concentric pattern reflects how the city has developed, where the wealthier residents moved to the suburbs, leaving the middle ring less developed.

stat_philly <- philly %>%
  dplyr::select(-musaID, -sale_price)

stat_philly_data <- st_drop_geometry(stat_philly)

descriptions$N <- sapply(descriptions$Field.Name, function(field) {
  sum(!is.na(stat_philly_data[[field]]))
})

selected_vars <- descriptions$Field.Name[descriptions$Variable.Type %in% c("Categorical*", "Continuous")]

descriptions <- descriptions %>% 
  rowwise() %>%
  mutate(
    Mean = ifelse(Field.Name %in% selected_vars, round(mean(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
    StDev = ifelse(Field.Name %in% selected_vars, round(sd(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
    Min = ifelse(Field.Name %in% selected_vars, round(min(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
    Max = ifelse(Field.Name %in% selected_vars, round(max(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA)
  ) %>%
  ungroup()

descriptions %>%
  knitr::kable(caption = "<b>Table2.1 Summary Statistics</b>", align = "ccclccccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover")) %>%
  scroll_box(height = "300px", width = "800px")

**Table2.1 Summary Statistics**
Category	Field.Name	Variable.Type	Description	N	Mean	StDev	Min	Max
Internal	building_code_description_new	Categorical	Building style.	23881	NA	NA	NA	NA
Internal	central_air	Categorical	“Y” indicates there is central air.	23881	NA	NA	NA	NA
Internal	general_construction	Categorical	General construction.	23881	NA	NA	NA	NA
Internal	quality_grade	Categorical	Quality grade related to building workmanship and materials.	23881	NA	NA	NA	NA
Internal	separate_utilities	Categorical	Whether heater, hot water tank, electrical service, and gas service are central, part separate, or separate.	23881	NA	NA	NA	NA
Internal	topography	Categorical	Above street level, below street level, flood plain, rocky, or other.	23881	NA	NA	NA	NA
Internal	zoning	Categorical	A code identifying the legal uses permitted on the property.	23881	NA	NA	NA	NA
Internal	exterior_condition	Categorical*	Indicates the observed exterior condition of the property.	23880	3.764	0.885	1.000	7.000
Internal	fireplaces	Categorical*	Indicates the number of fireplaces.	23673	0.048	0.291	0.000	5.000
Internal	garage_spaces	Categorical*	Indicates the number of garage spaces.	23457	0.357	0.705	0.000	72.000
Internal	interior_condition	Categorical*	Represents the overall condition of the interior.	23873	3.647	0.890	0.000	7.000
Internal	number_of_bathrooms	Categorical*	Number of bathrooms.	23622	1.072	0.676	0.000	8.000
Internal	number_of_bedrooms	Categorical*	Number of bedrooms.	23865	2.583	1.266	0.000	31.000
Internal	frontage	Continuous	The width of the lot where it abuts the principal street.	23751	19.653	19.419	0.000	1307.000
Internal	total_livable_area	Continuous	Total livable area.	23881	1333.633	548.834	0.000	15246.000
Amenities	crime_nn5	Continuous	Average distance to the 5 nearest crimes (in ft) from the site.	23881	142.530	73.134	18.440	1772.141
Amenities	dist1	Continuous	Distance to city hall (in ft) from the site if less than 20000 ft, otherwise 20000 ft.	23881	16519.527	5196.091	1400.739	20000.000
Amenities	distance_to_city_hall	Continuous	Distance to city hall (in ft) from the site.	23881	27504.777	17484.769	1400.739	79657.714
Amenities	distance_to_nearest_water	Continuous	Distance to the nearest water body (in ft) from the site.	23881	3436.259	2039.120	19.850	9589.470
Amenities	LPSS_PER1000	Continuous	Number of low-produce supply stores per 1000 people.	23879	28.196	24.076	0.000	224.615
Spatial	NAME	Categorical	Name of the neighborhood.	23881	NA	NA	NA	NA
Spatial	Estimate_Mean.family.income..dollars.	Continuous	Mean family income (in dollars) in the census tract where the site is located.	23881	103220.610	61176.718	25294.000	427084.000
Spatial	Percent_White.alone	Continuous	Percentage of white (non-Hispanic) population in the census tract where the site is located.	23881	37.359	30.115	0.000	95.400

*Note: Transformed from continuous into categorical in the regression model.

The Hedonic Pricing Model divides the total cost of property into separate costs for each characteristic according to different subgroups.(Mayor et al. 2009; Scotchmer 1985) In this study, twenty-two variables are selected for the model, as summarized in Table 2.1. These include internal features like total livable area or exterior condition, amenities like crime and distance to nearest water body, as well as spatial contexts like mean family income in the census tract. The dataset’s average home has 1,333 square feet of livable space, 2.58 bedrooms and 1.07 bathrooms, with its cost at about $270,000.

numericVars <- 
  select_if(st_drop_geometry(model_philly), is.numeric) %>% na.omit() %>%
  dplyr::select(-musaID)

cor_matrix <- cor(numericVars, use = "complete.obs")

corrplot(
  cor_matrix,
  method = "color",                             
  col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200),  
  type = "lower",                               
  tl.pos = "l",                                 
  tl.col = "black",                      
  tl.cex = 0.7,                           
  mar = c(0, 0, 3, 0)                           
)


title(main = "Correlation across numeric variables", cex.main = 1.5, font.main = 2)
mtext("Figure 2.2", side = 1, line = 4, adj = 0.2, cex = .8)

Correlation of sale price and other numeric variables are presented in Figure 2.2. Mean family income shows the strongest correlation with sale price (coefficient = 0.645), followed by total livable area. Exterior and interior conditions have the highest negative correlations. Some variables exhibit high autocorrelation, which may affect interpretability of the coefficients in the final model.

# ggplot(gather(numericVars), aes(value)) +
#   geom_histogram(bins = 50) +
#   facet_wrap(~key,nrow=4, scales = 'free_x') +
#   theme_minimal() +
#   theme(axis.text = element_blank(), axis.ticks = element_blank())

# all numeric variables
# model_philly_numeric <- st_drop_geometry(model_philly) %>% 
#   dplyr::select(sale_price, crime_nn5, 
#                 frontage, total_livable_area, 
#                 distance_to_city_hall, distance_to_nearest_water, 
#                 Estimate_Mean.family.income..dollars.,Percent_White.alone,LPSS_PER1000
#                 ) %>%
#   filter(sale_price <= 1000000) %>%
#   gather(Variable, Value, -sale_price)
# 
# ggplot(model_philly_numeric, aes(Value, sale_price)) +
#   geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#E46726") +
#   facet_wrap(~Variable, ncol = 4, scales = "free") +
#   labs(title = "Price as a function of continuous variables") +
#   theme_minimal() +
#   theme(axis.text = element_blank(), axis.ticks = element_blank())

# all factor variables
# model_philly_factor <- st_drop_geometry(model_philly) %>% 
#   dplyr::select(sale_price, 
#                 central_air, exterior_condition, fireplaces, garage_spaces, general_construction,
#                 interior_condition, number_of_bathrooms, number_of_bedrooms, quality_grade, separate_utilities,
#                 topography, zoning, building_code_description_new) %>%
#   mutate(exterior_condition = as.factor(exterior_condition),
#          fireplaces = as.factor(fireplaces),
#          garage_spaces = as.factor(garage_spaces),
#          interior_condition = as.factor(interior_condition),
#          number_of_bathrooms = as.factor(number_of_bathrooms),
#          number_of_bedrooms = as.factor(number_of_bedrooms)) %>%
#   filter(sale_price <= 1000000) %>%
#   gather(Variable, Value, -sale_price)
# 
# ggplot(model_philly_factor, aes(Value, sale_price)) +
#   geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
#   facet_wrap(~Variable, ncol = 4, scales = "free") +
#   labs(title = "Price as a function of categorical variables", y = "Mean Price") +
#   theme_minimal() +
#   theme(axis.text = element_blank(), axis.ticks = element_blank())  

## Crime cor
# model_philly %>%
#   st_drop_geometry() %>%
#   dplyr::select(sale_price, starts_with("crime")) %>%
#   filter(sale_price <= 1000000) %>%
#   gather(Variable, Value, -sale_price) %>% 
#    ggplot(aes(Value, sale_price)) +
#      geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#E46726") +
#      facet_wrap(~Variable, nrow = 1, scales = "free") +
#      labs(title = "Price as a function of crimes") +
#      theme_minimal()

2.3 Feature Engineering

Transformations of variables can sometimes improve model performance. Three types of transformations are applied in this study: logarithmic transformation, reclassification, and piecewise splitting, with examples shown below.

model_philly_numeric <- st_drop_geometry(model_philly) %>% 
  mutate(total_livable_area_log = log(total_livable_area + 1)) %>%
  dplyr::select(total_livable_area, total_livable_area_log) %>%
  filter(total_livable_area <= 2000000)

plot1 <- ggplot(model_philly_numeric, aes(x = total_livable_area)) +
  geom_histogram(bins = 30, fill = "#6D9EC1", color = "black", alpha = 0.5) +
  labs(title = "Original Total Livable Area") +
  xlim(0, 7500) + 
  coord_cartesian(ylim = c(0, 8000)) + 
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, hjust = 0.5),
    axis.title.x = element_blank()
  )


plot2 <- ggplot(model_philly_numeric, aes(x = total_livable_area_log)) +
  geom_histogram(bins = 30, fill = "#6D9EC1", color = "black", alpha = 0.5) +
  labs(title = "Log-transformed Total Livable Area") +
  xlim(5, 9) + 
  coord_cartesian(ylim = c(0, 8000)) +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, hjust = 0.5),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank() 
  )


combined_plot <- plot1 + plot2 + 
  plot_layout(widths = c(1, 1)) + 
  plot_annotation(
    title = "Distribution of Total Livable Area",
       caption = "Figure 2.3",
    theme = theme(plot.title = element_text(size = 18, face = "bold"),
    plot.caption = element_text(hjust = 0))
  )

combined_plot

Figure 2.3 shows a typical use of log-transformation to the variable total livable area, making a right-skewed distribution more normal, fitting the assumptions of the OLS framework better.

philly_f <- philly %>% 
  mutate(
    exterior_condition = case_when(
      is.na(exterior_condition) | exterior_condition == 1 ~ "1",
      exterior_condition %in% c(2, 3) ~ "2",
      exterior_condition == 4 ~ "3",
      TRUE ~ "4"
    ),
    exterior_condition = factor(exterior_condition)
  )

count_data <- philly %>%
  group_by(exterior_condition) %>%
  summarise(count = n())

mean_price_data <- philly %>%
  group_by(exterior_condition) %>%
  summarise(mean_sale_price = mean(sale_price, na.rm = TRUE))

count_data_f <- philly_f %>%
  group_by(exterior_condition) %>%
  summarise(count = n())

mean_price_data_f <- philly_f %>%
  group_by(exterior_condition) %>%
  summarise(mean_sale_price = mean(sale_price, na.rm = TRUE))


plot1 <- ggplot(count_data, aes(x = exterior_condition, y = count)) +
  geom_bar(stat = "identity", fill = "#6D9EC1", alpha = 0.8, width = 0.6) +
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )

plot2 <- ggplot(mean_price_data, aes(x = exterior_condition, y = mean_sale_price)) +
  geom_bar(stat = "identity", fill = "#E46726", alpha = 0.8, width = 0.6) +
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )

plot3 <- ggplot(count_data_f, aes(x = exterior_condition, y = count)) +
  geom_bar(stat = "identity", fill = "#6D9EC1", alpha = 0.8, width = 0.6) +  
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )

plot4 <- ggplot(mean_price_data_f, aes(x = exterior_condition, y = mean_sale_price)) +
  geom_bar(stat = "identity", fill = "#E46726", alpha = 0.8, width = 0.6) +  
  theme_minimal() +
  theme(
    legend.position = "none", 
    plot.title = element_text(size = 12),
    axis.title.x = element_blank()
  )


combined_plot <- (plot1 | plot2) / (plot3 | plot4) + 
  plot_annotation(
    title = "Exterior Condition and Sale Price",
    subtitle = "Before (Left) and After (Right) Reclassification",
    caption = "Figure 2.4",
    theme = theme(plot.title = element_text(size = 18, face = "bold"),
    plot.caption = element_text(hjust = 0))
  )

combined_plot

Reclassification includes converting a numeric variable into categorial one, and recoding it into fewer categories, as Figure 2.4. This applies to variables that have too few observations in some categories or have mere differences across categories, making the model more robust.

model_philly_numeric <- st_drop_geometry(model_philly) %>% 
  dplyr::select(sale_price, distance_to_city_hall) %>%
  filter(sale_price <= 1000000) %>%
  mutate(DistanceCategory = ifelse(distance_to_city_hall <= 20000, "<= 20000", "> 20000"))

plot1 <- ggplot(subset(model_philly_numeric, DistanceCategory == "<= 20000"),
                aes(distance_to_city_hall, sale_price)) +
  geom_point(size = 0.5, color = "#6D9EC1", alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#E46726") +
  ylab("Sale Price") +
  ggtitle("<= 20000") +  # 添加小标题
  scale_x_continuous(breaks = seq(0, 20000, by = 10000)) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(size = 12, hjust = 0.5)) 

plot2 <- ggplot(subset(model_philly_numeric, DistanceCategory == "> 20000"),
                aes(distance_to_city_hall, sale_price)) +
  geom_point(size = 0.5, color = "#6D9EC1", alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#E46726") +
  ggtitle("> 20000") +  # 添加小标题
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(size = 12, hjust = 0.5)) 

combined_plot <- plot1 + plot2 + 
  plot_layout(widths = c(1, 4)) + 
  plot_annotation(
    title = "Price as a function of Distance to City Hall",
    caption = "Figure 2.5",
    theme = theme(plot.title = element_text(size = 18, face = 'bold'),
                  plot.caption = element_text(hjust = 0))
  )

combined_plot

Cutting continuous variables into pieces is a way of dealing with non-linear relationships, as is shown in Figure 2.5. The sale price sharply decreases and slightly goes up as distance to city hall increases, reflecting the concentric development structure discussed in Figure 2.5. In this case, we added a “dist1” variable, assigning actual values if its distance is less than 20000 ft, or 20000 ft otherwise. This transformation will help achieve the effect of piecewise function in linear regressions.

philly_fac <- philly %>%
  mutate(
    # basements = case_when(
    #   basements %in% c("0", "1", "4", "A") ~ "B",   
    #   TRUE ~ "A"                               
    # ),
    # basements = factor(basements),
    
    central_air = case_when(
      central_air %in% c("1", "Y") ~ "Y",  
      TRUE ~ "N"                          
    ),
    central_air = factor(central_air),
    
    exterior_condition = case_when(
      is.na(exterior_condition) | exterior_condition == 1 ~ "1",    
      exterior_condition %in% c(2, 3) ~ "2",                        
      exterior_condition == 4 ~ "3",                                
      TRUE ~ "4"                      
    ),
    exterior_condition = factor(exterior_condition),
    
    fireplaces = case_when(
      is.na(fireplaces) | fireplaces == 0 | fireplaces == 1 ~ "0",                    
      TRUE ~ "1"                                
    ),
    fireplaces = factor(fireplaces),
    
    # fuel = case_when(
    #   is.na(fuel) | fuel == "" ~ "X",  
    #   TRUE ~ "A"                       
    # ),
    # fuel = factor(fuel),
    
    garage_spaces = case_when(
      is.na(garage_spaces) | garage_spaces == 0 ~ "0",    
      garage_spaces == 1 ~ "1",      
      TRUE ~ "2"                     
    ),
    garage_spaces = factor(garage_spaces),
    
    general_construction = case_when(       
      general_construction %in% c("A ", "B ", "D ") ~ "A",  
      general_construction == "3 " ~ "C",        
      TRUE ~ "B"                                 
    ),
    general_construction = factor(general_construction),
    
    interior_condition = case_when(
      interior_condition %in% c(0, 1, 2, 3) ~ "1",         
      interior_condition %in% c(4, NA) ~ "2",        
      TRUE ~ "3"       
    ),
    interior_condition = factor(interior_condition),
    
    number_of_bathrooms = case_when(
      number_of_bathrooms == 0 ~ "0",                   
      number_of_bathrooms == 1 ~ "1",                   
      number_of_bathrooms %in% c(2, NA) ~ "2",          
      number_of_bathrooms == 3 ~ "3",                   
      TRUE ~ "4"                                        
    ),
    number_of_bathrooms = factor(number_of_bathrooms),
    
    number_of_bedrooms = case_when(
      number_of_bedrooms %in% c(0, 31) | is.na(number_of_bedrooms) ~ "0",  
      number_of_bedrooms %in% c(1, 2) ~ "1",    
      number_of_bedrooms == 3 ~ "2",            
      number_of_bedrooms == 4 ~ "3",            
      number_of_bedrooms %in% c(5, 6) ~ "4",    
      number_of_bedrooms %in% c(7, 8, 9) ~ "5" 
    ),
    number_of_bedrooms = factor(number_of_bedrooms),
    
    # parcel_shape = case_when(
    #   parcel_shape %in% c("C", "E") ~ "A",  
    #   parcel_shape == "A" ~ "B",            
    #   TRUE ~ "C"                            
    # ),
    # parcel_shape = factor(parcel_shape),
    
    quality_grade = case_when(
      quality_grade %in% c("A+", "A", "A-", "X", "X-") ~ "1",           
      quality_grade %in% c("B+", "B", "B-", "3", "S+") ~ "2",
      TRUE ~ "3"                                             
    ),
    quality_grade = factor(quality_grade),
    
    separate_utilities = case_when(
      separate_utilities %in% c("A", "C") ~ "B",  
      TRUE ~ "A"                                 
    ),
    separate_utilities = factor(separate_utilities),
    
    # street_direction = case_when(
    #   is.na(street_direction) | street_direction == "" | street_direction == "S" ~ "A",
    #   TRUE ~ "B"                           
    # ),
    # street_direction = factor(street_direction),
    
    topography = case_when(
      topography == "B" ~ "B",
      TRUE ~ "A"
    ),
    topography = factor(topography),
    
    # type_heater = case_when(
    #   type_heater == "D" ~ "B",            
    #   TRUE ~ "A"                           
    # ),
    # type_heater = factor(type_heater),
    
    building_code_description_new = case_when(
      building_code_description_new %in% c("COLONIAL", "OLD STYLE", "ROW MODERN", "ROW OLD STYLE", "TUDOR", "TWIN BUNGALOW") ~ "1",
      building_code_description_new %in% c("ROW POST WAR", "ROW TYPICAL", "TWIN CONVENTIONAL") ~ "3",
      building_code_description_new %in% c("OTHER", "ROW PORCH FRONT") ~ "4",
      TRUE ~ "2"
    ),
    building_code_description_new = factor(building_code_description_new),
    
    NAME = case_when(
      NAME %in% c("CHINATOWN", "CRESTMONT_FARMS", "EAST_PARK", "MECHANICSVILLE", 
                  "PENNYPACK_PARK", "UNIVERSITY_CITY", "WISSAHICKON_PARK", "WOODLAND_TERRACE") ~ "OTHERS",
      TRUE ~ NAME 
    ),
    NAME = factor(NAME)
  )
philly <- philly_fac

model_philly <- philly %>%
  filter(toPredict == "MODELLING")

st_drop_geometry(model_philly) %>% 
  dplyr::select(sale_price, total_livable_area, crime_nn5, distance_to_city_hall, Estimate_Mean.family.income..dollars.) %>%
  filter(sale_price <= 2000000 & crime_nn5 <= 1000 & total_livable_area <= 7500) %>%
  gather(Variable, Value, -sale_price) %>%
  ggplot(aes(Value, sale_price)) +
  geom_point(size = 0.5, colour = "#6D9EC1", alpha = 0.5) +  
  geom_smooth(method = "lm", se = FALSE, colour = "#E46726") + 
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a function of continuous variables",
       caption = "Figure 2.6") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.caption = element_text(hjust = 0)
  )

Figure 2.6 includes scatterplots for sale price against four selected continuous variables. Besides “distance to city hall” we have addressed, positive relationships between sale price and the other three variables are explicit. The points in the plots are dispersed rather than being around the line, showing the sense of Hedonic Pricing Model’s aim to value all services to define a total price, reducing the variance from any single variable.

ggplot() +
  geom_sf(data = nhoods, , fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly %>% filter(!is.na(LPSS_PER1000)), aes(colour = q5(LPSS_PER1000)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(philly,"LPSS_PER1000"),
                   name="Quintile\nBreaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  labs(title="Low-Produce Supply Stores",
       subtitle = "per 1000 People",
       caption = "Figure 2.7") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

Low-produce supply stores (LPSS) include dollar stores, pharmacies and convenience stores. It is a measure of accessibility to affordable stores, having high density in areas with lower income in north and west Philadelphia, as shown in Figure 2.7. When aligned with other variables, LPSS distinguishes these neighborhoods well and enhances model accuracy.

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly %>% filter(!is.na(exterior_condition)), aes(colour = exterior_condition), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = colorRampPalette(c("#E46726", "#6D9EC1"))(4)
                   ) +
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  labs(title="Exterior Condition",
       caption = "Figure 2.8") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

The spatial distribution of exterior conditions is as Figure 2.8 (1 = best, 4 = worst). Homes with the best exterior conditions are close to the center city, and west Philadelphia shows a mix of both well- and ill-conditioned homes, instead of corresponding only with the wealthiness of the neighborhoods.

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = TRUE) +
  geom_sf(data = philly %>% filter(!is.na(zoning)), aes(colour = zoning), 
          show.legend = "point", size = 0.75) +
  scale_colour_manual(values = c(
    "RM1" = "#FCB951", "RM2" = "#FCB951", "RM3" = "#FCB951", "RM4" = "#FCB951",
    "RMX1" = "#E58425", "RMX2" = "#E58425", "RMX3" = "#E58425",
    "RSA1" = "#F8EE68", "RSA2" = "#F8EE68", "RSA3" = "#F8EE68", "RSA4" = "#F8EE68", 
    "RSA5" = "#F8EE68", "RSA6" = "#F8EE68",
    "RSD1" = "#FFF5C4", "RSD2" = "#FFF5C4", "RSD3" = "#FFF5C4",
    "RTA1" = "#CDB650"
  )) +
  guides(color = guide_legend(override.aes = list(size = 3), ncol = 2)) + 
  labs(title = "Zoning", subtitle = "Philadelphia, 2022",
       caption = "Figure 2.9") +
  mapTheme() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 12),
    panel.border = element_rect(colour = "black", fill = NA, size = 1)
  )

Figure 2.9 visualizes the zoning status, colored in the official colors. The most common type of zoning is RSA (Residential Single-Family Attached). RM (Residential Multi-Family), with higher density, are clustered along Board St, near center city and a couple of areas in west Philadelphia, while RSD (Residential Single-Family Detached), with lower density, are spread across far northwest and northeast suburbs.

# cor_philly <- 
#   st_drop_geometry(model_philly) %>%
#   filter(sale_price <= 1000000,
#          total_livable_area < 10000) %>%
#   dplyr::select(sale_price, total_livable_area)
# 
# cor.test(cor_philly$total_livable_area,
#          cor_philly$sale_price, 
#          method = "pearson")
# 
# 
# # Calculate predicted prices
# cor_philly$predicted_price <- predict(lm(sale_price ~ total_livable_area, data = cor_philly), newdata = cor_philly)
# ggscatter(cor_philly,
#           x = "total_livable_area",
#           y = "sale_price",
#           color = "#6D9EC1", size = 1, alpha = 0.6) +
#   geom_point(aes(y = predicted_price), color = "#E46726", size = 1, alpha = 0.6) +
#   stat_cor(label.x = 4000, label.y = 250000, hjust = 0) +
#   labs(title = "Price as a function of living area", 
#        subtitle = "With predicted prices; Sale prices <= $1 mil.",
#        x = "Total Livable Area",
#        y = "Sale Price") +
#   theme_minimal() +
#   theme(plot.title = element_text(size = 18, face = "bold"),
#         plot.subtitle = element_text(size = 12))

3 Methods

The core of our algorithm is Ordinary Least Square (OLS) regression model. We used 9 continuous variables and 13 reclassified categorical variables as independent variables and sale price as the dependent variable.

To build and evaluate our model, we split the dataset into two parts: 60% as the training set to fit the model, and 40% as the test set to evaluate its accuracy. The model was tested using Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and prediction v.s. actual value scatter plot.

To ensure the accuracy of the model, we applied a 100-fold cross validation process. This divides the data into 100 subsets, repeatedly testing the model to see how well it performs across different samples. Finally, we examined the generalizability of the model across different neighborhood contexts by observing the distribution of spatial lag, calculating Moran’s I (a measure of spatial clustering in model errors), and testing the model in different neighborhood settings.

Analyses were conducted using R version 4.4.1.

4 Results

Using the features selected above, our model explains 76% of the variation in price and produces average errors of 36% relative to housing prices. In terms of cross-validation, our model was well-trained on the available variables and shows high accuracy for low-priced houses. However, it demonstrates less consistency and a slight positive spatial autocorrelation, suggesting the influence of unaccounted factors. At the urban scale, our model is less effective at predicting prices in low-income neighborhoods and reveals a significant discrepancy between high- and low-income areas, indicating a further development in generalizability.

4.1 Our Regression Model: explaining 76% of the prices

To improve the model’s accuracy and generalizability, we randomly split the dataset into 60% for training the model and 40% for testing its goodness of fit. Using the selected features, OLS regression was performed on the training set.

set.seed(34)
inTrain <- createDataPartition(
              y = paste(model_philly$general_construction, 
                       model_philly$topography,
                       model_philly$quality_grade,
                        model_philly$NAME),
                        
              p = .60, list = FALSE)
# [row,column] select all columns
philly.training <- model_philly[inTrain,] 
philly.test <- model_philly[-inTrain,]

reg.training <- lm(sale_price ~ log(total_livable_area+1) + total_livable_area + crime_nn5 + LPSS_PER1000 +
              log(dist1+1) +log(distance_to_city_hall+1) + log(distance_to_nearest_water+1) +
              Estimate_Mean.family.income..dollars. + Percent_White.alone + log(frontage+1) +  
              
              central_air + exterior_condition + fireplaces + garage_spaces + general_construction + 
              interior_condition + number_of_bathrooms + number_of_bedrooms + quality_grade + 
              separate_utilities + topography + building_code_description_new + zoning + 
              NAME,
    data = st_drop_geometry(philly.training))

kable(summary(reg.training)$coefficients, digits = 3,format = "html",caption = "<b>Table 4.1 OLS Regression Results</b>", align = "ccccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover")) %>%
  scroll_box(height = "300px")

**Table 4.1 OLS Regression Results**
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	2602553.344	159428.947	16.324	0.000
log(total_livable_area + 1)	-53340.123	4227.568	-12.617	0.000
total_livable_area	203.405	3.891	52.272	0.000
crime_nn5	99.448	21.919	4.537	0.000
LPSS_PER1000	-355.719	62.060	-5.732	0.000
log(dist1 + 1)	-311331.292	28622.655	-10.877	0.000
log(distance_to_city_hall + 1)	84882.204	23567.098	3.602	0.000
log(distance_to_nearest_water + 1)	-4915.598	2243.416	-2.191	0.028
Estimate_Mean.family.income..dollars.	0.315	0.047	6.723	0.000
Percent_White.alone	310.545	112.103	2.770	0.006
log(frontage + 1)	32602.610	2292.147	14.224	0.000
central_airY	18501.777	2988.092	6.192	0.000
exterior_condition2	-53547.193	7303.736	-7.331	0.000
exterior_condition3	-68580.559	7492.304	-9.153	0.000
exterior_condition4	-91899.003	8674.360	-10.594	0.000
fireplaces1	67068.917	13085.724	5.125	0.000
garage_spaces1	15842.471	2864.147	5.531	0.000
garage_spaces2	78115.398	8498.819	9.191	0.000
general_constructionB	-11270.490	3656.738	-3.082	0.002
general_constructionC	189225.510	20068.304	9.429	0.000
interior_condition2	-42242.750	3470.206	-12.173	0.000
interior_condition3	-56030.547	6584.426	-8.510	0.000
number_of_bathrooms1	-18625.215	6977.063	-2.669	0.008
number_of_bathrooms2	4351.026	7280.922	0.598	0.550
number_of_bathrooms3	55836.812	9239.940	6.043	0.000
number_of_bathrooms4	326526.524	16623.536	19.642	0.000
number_of_bedrooms1	22870.061	7307.230	3.130	0.002
number_of_bedrooms2	39637.760	6887.849	5.755	0.000
number_of_bedrooms3	15978.172	7511.094	2.127	0.033
number_of_bedrooms4	-14766.514	10373.662	-1.423	0.155
number_of_bedrooms5	-290198.487	27328.456	-10.619	0.000
quality_grade2	-64882.420	19616.440	-3.308	0.001
quality_grade3	-74504.377	19450.009	-3.831	0.000
separate_utilitiesB	-38831.626	8023.355	-4.840	0.000
topographyB	239582.260	34833.986	6.878	0.000
building_code_description_new2	-15340.877	5805.798	-2.642	0.008
building_code_description_new3	-16182.590	5557.419	-2.912	0.004
building_code_description_new4	-19220.374	6137.199	-3.132	0.002
zoningRM2	7746.895	14385.569	0.539	0.590
zoningRM3	-9737.974	23710.849	-0.411	0.681
zoningRM4	-32379.920	16618.804	-1.948	0.051
zoningRMX1	63005.260	24701.659	2.551	0.011
zoningRMX2	72013.347	41015.004	1.756	0.079
zoningRMX3	-166525.907	13033.349	-12.777	0.000
zoningRSA1	-66667.854	24014.839	-2.776	0.006
zoningRSA2	-27635.835	9956.102	-2.776	0.006
zoningRSA3	-24865.513	6233.913	-3.989	0.000
zoningRSA4	-29916.407	15535.375	-1.926	0.054
zoningRSA5	2523.187	3963.322	0.637	0.524
zoningRSA6	30188.602	34770.908	0.868	0.385
zoningRSD1	-57179.970	16086.302	-3.555	0.000
zoningRSD2	-76399.095	24976.463	-3.059	0.002
zoningRSD3	-55811.352	10447.264	-5.342	0.000
zoningRTA1	-17908.382	17482.315	-1.024	0.306
NAMEALLEGHENY_WEST	-29308.738	33195.930	-0.883	0.377
NAMEANDORRA	-11064.666	30574.876	-0.362	0.717
NAMEASTON_WOODBRIDGE	-4960.254	27215.358	-0.182	0.855
NAMEBARTRAM_VILLAGE	2026.783	44167.698	0.046	0.963
NAMEBELLA_VISTA	-9689.404	39012.675	-0.248	0.804
NAMEBELMONT	-96596.983	39534.258	-2.443	0.015
NAMEBLUE_BELL_HILL	23855.225	52093.575	0.458	0.647
NAMEBREWERYTOWN	-66421.604	34047.828	-1.951	0.051
NAMEBRIDESBURG	-33288.625	28495.489	-1.168	0.243
NAMEBURHOLME	22678.195	33311.026	0.681	0.496
NAMEBUSTLETON	9083.709	21455.471	0.423	0.672
NAMEBYBERRY	-152303.892	50853.624	-2.995	0.003
NAMECALLOWHILL	-57479.036	67376.749	-0.853	0.394
NAMECARROLL_PARK	-13639.025	32370.249	-0.421	0.674
NAMECEDAR_PARK	91848.072	36103.192	2.544	0.011
NAMECEDARBROOK	-29504.447	25464.674	-1.159	0.247
NAMECHESTNUT_HILL	192433.090	26600.252	7.234	0.000
NAMECLEARVIEW	-2487.014	39341.471	-0.063	0.950
NAMECOBBS_CREEK	16924.000	31550.192	0.536	0.592
NAMECRESCENTVILLE	-1060.511	57603.588	-0.018	0.985
NAMEDEARNLEY_PARK	-57664.481	34975.202	-1.649	0.099
NAMEDICKINSON_NARROWS	-61449.378	35922.057	-1.711	0.087
NAMEDUNLAP	-54856.694	41086.849	-1.335	0.182
NAMEEAST_FALLS	99498.823	31849.632	3.124	0.002
NAMEEAST_KENSINGTON	50476.992	33709.002	1.497	0.134
NAMEEAST_OAK_LANE	-75957.043	29019.226	-2.617	0.009
NAMEEAST_PARKSIDE	-96268.424	39336.281	-2.447	0.014
NAMEEAST_PASSYUNK	33483.894	34657.539	0.966	0.334
NAMEEAST_POPLAR	-106961.230	48732.852	-2.195	0.028
NAMEEASTWICK	-622.977	34845.180	-0.018	0.986
NAMEELMWOOD	4793.280	31604.040	0.152	0.879
NAMEFAIRHILL	-26062.385	38193.657	-0.682	0.495
NAMEFAIRMOUNT	45949.918	35408.313	1.298	0.194
NAMEFELTONVILLE	-15557.392	28430.643	-0.547	0.584
NAMEFERN_ROCK	-33460.333	31686.198	-1.056	0.291
NAMEFISHTOWN	55347.942	32925.212	1.681	0.093
NAMEFITLER_SQUARE	354300.464	43313.640	8.180	0.000
NAMEFOX_CHASE	18939.843	22049.062	0.859	0.390
NAMEFRANCISVILLE	-93953.052	37651.349	-2.495	0.013
NAMEFRANKFORD	-55434.931	25885.719	-2.142	0.032
NAMEFRANKLINVILLE	-20779.315	36170.837	-0.574	0.566
NAMEGARDEN_COURT	137900.477	42495.015	3.245	0.001
NAMEGERMANTOWN_EAST	-42393.000	26355.213	-1.609	0.108
NAMEGERMANTOWN_MORTON	-30698.933	29280.182	-1.048	0.294
NAMEGERMANTOWN_PENN_KNOX	9668.388	43071.006	0.224	0.822
NAMEGERMANTOWN_SOUTHWEST	-32441.731	31351.090	-1.035	0.301
NAMEGERMANTOWN_WEST_CENT	-23829.366	32415.715	-0.735	0.462
NAMEGERMANTOWN_WESTSIDE	-15048.286	38166.730	-0.394	0.693
NAMEGERMANY_HILL	29758.118	29312.835	1.015	0.310
NAMEGIRARD_ESTATES	19701.931	33865.308	0.582	0.561
NAMEGLENWOOD	-50566.283	38874.899	-1.301	0.193
NAMEGRADUATE_HOSPITAL	-21343.972	37725.365	-0.566	0.572
NAMEGRAYS_FERRY	-86619.740	33827.883	-2.561	0.010
NAMEGREENWICH	-19475.542	37230.024	-0.523	0.601
NAMEHADDINGTON	-9678.776	31976.804	-0.303	0.762
NAMEHARROWGATE	-17226.857	30729.207	-0.561	0.575
NAMEHARTRANFT	-76415.644	33764.438	-2.263	0.024
NAMEHAVERFORD_NORTH	-119916.837	44240.581	-2.711	0.007
NAMEHAWTHORNE	-65396.685	42714.996	-1.531	0.126
NAMEHOLMESBURG	-36366.901	21898.495	-1.661	0.097
NAMEHUNTING_PARK	-15837.410	30663.227	-0.516	0.606
NAMEJUNIATA_PARK	-9587.595	28742.008	-0.334	0.739
NAMEKINGSESSING	-16713.237	32894.986	-0.508	0.611
NAMELAWNDALE	-23342.252	24375.594	-0.958	0.338
NAMELEXINGTON_PARK	47580.634	29014.632	1.640	0.101
NAMELOGAN	-50539.356	28168.190	-1.794	0.073
NAMELOGAN_SQUARE	96881.443	41583.433	2.330	0.020
NAMELOWER_MOYAMENSING	-14796.671	33166.742	-0.446	0.656
NAMELUDLOW	-126672.567	44551.269	-2.843	0.004
NAMEMANAYUNK	23543.315	26340.398	0.894	0.371
NAMEMANTUA	-100955.282	37763.526	-2.673	0.008
NAMEMAYFAIR	-3113.029	22504.502	-0.138	0.890
NAMEMCGUIRE	-20470.241	40807.442	-0.502	0.616
NAMEMELROSE_PARK_GARDENS	-49188.376	32788.425	-1.500	0.134
NAMEMILL_CREEK	-28154.092	34665.280	-0.812	0.417
NAMEMILLBROOK	-32022.950	28539.903	-1.122	0.262
NAMEMODENA	-25566.749	24704.705	-1.035	0.301
NAMEMORRELL_PARK	-20171.705	26015.772	-0.775	0.438
NAMEMOUNT_AIRY_EAST	-13797.497	24467.224	-0.564	0.573
NAMEMOUNT_AIRY_WEST	-3215.829	26921.057	-0.119	0.905
NAMENEWBOLD	-33112.641	35184.325	-0.941	0.347
NAMENICETOWN	-4117.793	35047.196	-0.117	0.906
NAMENORMANDY_VILLAGE	10148.206	43143.818	0.235	0.814
NAMENORTH_CENTRAL	-110199.533	34929.846	-3.155	0.002
NAMENORTHERN_LIBERTIES	-54988.249	36717.795	-1.498	0.134
NAMENORTHWOOD	-39679.709	27897.136	-1.422	0.155
NAMEOGONTZ	-37362.801	26169.313	-1.428	0.153
NAMEOLD_KENSINGTON	-21715.145	35840.606	-0.606	0.545
NAMEOLNEY	-35809.804	25780.602	-1.389	0.165
NAMEOTHERS	159496.761	47821.591	3.335	0.001
NAMEOVERBROOK	-7237.417	28863.711	-0.251	0.802
NAMEOXFORD_CIRCLE	-11048.623	22848.657	-0.484	0.629
NAMEPACKER_PARK	36516.927	38251.130	0.955	0.340
NAMEPARKWOOD_MANOR	-29901.403	26079.668	-1.147	0.252
NAMEPASCHALL	-17043.822	31813.161	-0.536	0.592
NAMEPASSYUNK_SQUARE	-24210.496	36061.728	-0.671	0.502
NAMEPENNSPORT	-27408.722	35062.664	-0.782	0.434
NAMEPENNYPACK	16373.339	23556.723	0.695	0.487
NAMEPENNYPACK_WOODS	-32622.449	31699.974	-1.029	0.303
NAMEPENROSE	4735.753	34266.872	0.138	0.890
NAMEPOINT_BREEZE	-92744.583	34518.047	-2.687	0.007
NAMEPOWELTON	70861.367	68260.900	1.038	0.299
NAMEQUEEN_VILLAGE	42061.360	37537.584	1.121	0.263
NAMERHAWNHURST	31930.280	21888.038	1.459	0.145
NAMERICHMOND	41300.018	31596.752	1.307	0.191
NAMERITTENHOUSE	225261.447	42233.350	5.334	0.000
NAMERIVERFRONT	-18817.788	59635.268	-0.316	0.752
NAMEROXBOROUGH	37129.366	26384.156	1.407	0.159
NAMEROXBOROUGH_PARK	23579.964	38195.037	0.617	0.537
NAMESHARSWOOD	-76375.672	45785.346	-1.668	0.095
NAMESOCIETY_HILL	157827.536	39694.403	3.976	0.000
NAMESOMERTON	2882.903	21870.934	0.132	0.895
NAMESOUTHWEST_SCHUYLKILL	-15319.075	34234.156	-0.447	0.655
NAMESPRING_GARDEN	-22094.949	39599.016	-0.558	0.577
NAMESPRUCE_HILL	219723.879	40576.763	5.415	0.000
NAMESTADIUM_DISTRICT	47099.103	35656.531	1.321	0.187
NAMESTANTON	-105770.026	33704.611	-3.138	0.002
NAMESTRAWBERRY_MANSION	-82565.409	33359.539	-2.475	0.013
NAMESUMMERDALE	-21538.879	27240.615	-0.791	0.429
NAMETACONY	-38871.002	22737.482	-1.710	0.087
NAMETIOGA	-44156.582	32954.713	-1.340	0.180
NAMETORRESDALE	-12148.653	23285.266	-0.522	0.602
NAMEUPPER_KENSINGTON	-21619.929	31866.999	-0.678	0.498
NAMEUPPER_ROXBOROUGH	-255.324	24484.536	-0.010	0.992
NAMEWALNUT_HILL	66325.715	39412.958	1.683	0.092
NAMEWASHINGTON_SQUARE	-38814.789	44051.355	-0.881	0.378
NAMEWEST_KENSINGTON	-21448.120	33916.230	-0.632	0.527
NAMEWEST_OAK_LANE	-38806.419	23723.676	-1.636	0.102
NAMEWEST_PARKSIDE	-22817.113	57916.022	-0.394	0.694
NAMEWEST_PASSYUNK	-57094.888	34254.726	-1.667	0.096
NAMEWEST_POPLAR	-149198.102	49603.709	-3.008	0.003
NAMEWEST_POWELTON	-72672.246	40909.125	-1.776	0.076
NAMEWHITMAN	-26867.258	34173.513	-0.786	0.432
NAMEWINCHESTER_PARK	-10616.927	33056.006	-0.321	0.748
NAMEWISSAHICKON	32862.285	30044.667	1.094	0.274
NAMEWISSAHICKON_HILLS	40770.009	35614.181	1.145	0.252
NAMEWISSINOMING	-27769.979	23659.387	-1.174	0.241
NAMEWISTER	-48071.736	31639.190	-1.519	0.129
NAMEWYNNEFIELD	-31312.245	32489.813	-0.964	0.335
NAMEWYNNEFIELD_HEIGHTS	87649.238	39433.444	2.223	0.026
NAMEYORKTOWN	-120781.881	47932.988	-2.520	0.012

In our model, the features explain approximately 76% of the variation in price (Adjusted R² = 75.73%) and most of their coefficients are statistically significant (p<0.05). Several features, such as general construction and architectural styles, are converted to categorical variables, allowing their coefficients to be estimated and linked to housing prices. The remaining insignificant coefficients are of neighborhood names or zoning codes, which are less likely to regroup due to spatial disparities.

4.2 Accuracy: better for sales priced below $1,000,000

To dictate how useful the model is for decision making, we used testing set to analyse goodness of fit indicators.

philly.test <-
  philly.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(reg.training, philly.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 5000000)

MAE <- mean(philly.test$sale_price.AbsError, na.rm = T)
MAPE <- mean(philly.test$sale_price.APE, na.rm = T)
acc <- data.frame(MAE, MAPE)

acc %>%
  knitr::kable(caption = "<b>Table 4.2 MAE and MAPE of the testing dataset</b>", align = "cc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))

**Table 4.2 MAE and MAPE of the testing dataset**
MAE	MAPE
65671.32	0.3592045

The Mean Absolute Error (MAE) is $65671.32, representing the average difference between predicted and observed prices. The relatively high value reflects the diversity in housing prices, suggesting a limitation for linear regression. The Mean Absolute Percent Error (MAPE) indicates that the prediction error amounts to 35.92% of the housing prices on average.

ggplot(
  philly.test, aes(sale_price.Predict, sale_price)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "#6D9EC1") +
  geom_abline(intercept = 0, slope = 1, color="#E46726",size=1) +
  theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = ggtext::element_markdown(size = 12),
        plot.caption = element_text(hjust = 0)) +
  labs(title = 'Predicted sale price as a function of observed price', 
       subtitle = glue("<span style='color:#E46726;'>Perfect prediction</span> vs. <span style='color:#6D9EC1;'>Average prediction</span>"),caption = "Figure 4.1",
       x = "Predicted Sale Price",
       y = "Observed Sale Price")

Given the high MAE and MAPE, data visualization was used to compare predicted and observed prices. The red and blue lines nearly overlap, indicating the model performs well overall. However, the slight deviation suggests that, on average, the model’s predictions are a little higher than the observed prices. Examining the data points further reveals that prediction accuracy decreases as prices rise, confirming our concerns about the variability in the higher price ranges of the dataset. This suggests that our model performs particularly well for lower-valued housing prices.

4.3 Generalizability: cross-validation shows less consistency

After making predictions on a single hold-out test set, we partitioned the dataset into 100 equal-sized subsets, training on the remaining data and testing on each subset. This process is known as the K-fold cross-validation method.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv <- 
  train(sale_price ~ log(total_livable_area+1) + total_livable_area + crime_nn5 + LPSS_PER1000 +
              log(dist1+1) +log(distance_to_city_hall+1) + log(distance_to_nearest_water+1) +
              Estimate_Mean.family.income..dollars. + Percent_White.alone + log(frontage+1) +  
              
              central_air + exterior_condition + fireplaces + garage_spaces + general_construction + 
              interior_condition + number_of_bathrooms + number_of_bedrooms + quality_grade + 
              separate_utilities + topography + building_code_description_new + zoning + 
              NAME,
    
    data = st_drop_geometry(model_philly),
    method = "lm", trControl = fitControl, na.action = na.pass)

kable(reg.cv$results, digits = 3,caption = "<b>Table 4.3 Cross-Validation Results</b>", align = "ccccccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))

**Table 4.3 Cross-Validation Results**
intercept	RMSE	Rsquared	MAE	RMSESD	RsquaredSD	MAESD
TRUE	115492.5	0.765	65851.93	37687.33	0.07	6719.758

Mean <- mean(reg.cv$resample[,3])
SD <- sd(reg.cv$resample[,3])
stat_MAE <- data.frame(Mean, SD)

stat_MAE %>%
  knitr::kable(caption = "<b>Table 4.4 Mean and standard deviation of the cross-validation MAE</b>", align = "cc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))

**Table 4.4 Mean and standard deviation of the cross-validation MAE**
Mean	SD
65851.93	6719.758

The standard deviation of MAE (6719.76) reflects significant variation across the 100 folds. Additionally, the mean of MAE in cross-validation (65851.93) is quite similar to the MAE in our test set (65671.32), indicating that the model was well-trained on the existing variables and therefore shows an average error.

ggplot(data = data.frame(mae = reg.cv$resample[,3]),aes(x = mae)) +
  geom_histogram(color = "white", fill = "#6D9EC1") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 50),
                     labels = function(x) ifelse(x %% 5000 == 0, x, "")) +
  labs(title = "Distribution of MAE", 
       subtitle = "K fold cross validation; k=100.",
       caption = "Figure 4.2",
       x = "Mean Absolute Error", y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(hjust = 0)) +
  geom_vline(aes(xintercept=mean(mae, na.rm=T)),
               color="#E46726", linetype="dashed", size=1) +
  geom_vline(aes(xintercept= MAE),
               color="#E46726", linetype="dashed", size=.8, alpha = .8) +
  annotate(geom = "text", label = paste("Mean of MAE:", round(Mean, 2)), # Insert correlation here
    x = 67000, y = 10, 
    hjust = 0, size = 4, color = "#E46726") +
  annotate(geom = "text", label = paste("MAE in the test set:\n", round(MAE, 2)), # Insert correlation here
    x = 61000, y = 9, 
    hjust = 1, size = 4, color = "#E46726", alpha = .8)

Visualized in a histogram, the distribution of MAE in cross-validation peaks around 63000 and exhibits a long right tail. This suggests that the model predicts inconsistently and may be unreliable for predicting houses that have not been sold recently.

4.4 Generalizability: positive spatial autocorrelation exists

philly.test_predict <-
  philly.test[which(philly.test$sale_price.Error != 0),]

coords.test <-  st_coordinates(philly.test_predict) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")
philly.test_predict$lagPriceError <- lag.listw(spatialWeights.test, philly.test_predict$sale_price.Error)

philly.test <- philly.test_predict

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly.test_predict, aes(colour = q5(sale_price.Error), na.rm = T),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(philly.test_predict,"sale_price.Error"),
                   name="Quintile Breaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(title="Sale price errors on the test set",
       caption = "Figure 4.3") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

What might be causing the high MAE? Although our model incorporates internal characteristics, public services/amenities, and some spatial features, other factors influencing the spatial distribution of home prices remain unexplored. As the map illustrates, sale price errors in Philadelphia are generally randomly distributed, but with some clustering both within and across neighborhoods. How much do our model errors cluster? Calculating the average weighted model error of its five nearest neighbors as the “spatial lag”, we delved deeper into the spatial autocorrelation test.

cor_error <- cor(philly.test_predict$lagPriceError, philly.test_predict$sale_price.Error, method = "pearson")
ggplot(philly.test_predict, aes(x = lagPriceError, y = sale_price.Error))+
     geom_point(size = .5, color = "#6D9EC1") + geom_smooth(method = "lm", se=F, colour = "#E46726") +
     labs(title = "Error as a function of the spatial lag of price",
          x = "Spatial lag of errors (Mean error of 5 nearest neighbors)", y = "Sale price errors",
       caption = "Figure 4.4") +
  annotate(geom = "text", label = paste("Pearson Correlation:", round(cor_error, 3)), # Insert correlation here
    x = -700000, y = -1000000, 
    hjust = 0, vjust = 2, 
    size = 4) +
     theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.caption = element_text(hjust = 0))

Comparing the spatial lag with the actual values helps estimate the degree of spatial autocorrelation. The relationship visualized above shows that as home price errors increase, nearby home price errors tend to rise as well. The correlation is 0.225, which is marginal but significant.

philly.test_nonzero <- philly.test %>%
  filter(sale_price.Error != 0)
moranTest <- moran.mc(philly.test_nonzero$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#E46726",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="<span style='color:#E46726;'>Observed</span> and <span style='color:#7F7F7F;'>permuted</span> Moran's I",
       x="Moran's I",
       y="Count",
       caption = "Figure 4.5") +
  annotate(geom = "text", label = paste("Statistic:", round(moranTest$statistic,3),"\np-value:", moranTest$p.value), # Insert correlation here
    x = 0.2, y = 300, 
    hjust = 0, 
    size = 4) +
  theme_minimal() +
  theme(plot.title = ggtext::element_markdown(size = 18, face = "bold"),
        plot.caption = element_text(hjust = 0))

Another approach to measure spatial autocorrelation is Moran’s I, where a positive value close to one indicates strong positive spatial autocorrelation. The histogram above shows 999 randomly permuted I values, with the observed I marked by the orange line. The observed I, higher than all random permutations, confirms spatial autocorrelation. An I of 0.117 seems marginal, but a p-value of 0.001 still indicates statistically significant clustering.

Both the spatial lag and Moran’s I test show that our model errors exhibit a slight positive spatial autocorrelation, suggesting the presence of unaccounted factors.

4.5 Generalizability: less effective for low-priced neighborhoods and low-income tracts

How does the spatial autocorrelation manifest in our model? From the perspective of geospatial features, including neighborhoods and median income, we further analysed the spatial process of sale price predictions and tested the generalizability across urban contexts.

nhoods_model <- st_drop_geometry(philly.test) %>%
  group_by(NAME) %>%
  summarise(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>%
  left_join(nhoods) %>%
  st_sf()

ggplot() +
  geom_sf(data = nhoods, fill = "transparent", color = "grey") +
  geom_sf(data = nhoods_model, aes(fill = mean.MAPE), color = "grey") +
  scale_fill_gradient(low = "#6D9EC1", high = "#E46726", name = "MAPE") +
  labs(title = "Mean test set MAPE by neighborhood",
       caption = "Figure 4.6") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

The MAPE by neighborhood reveals significantly less accurate predictions in North and West Philadelphia (highlighted in red). The vacant areas have no housing sales data, including northern Center City, University City, Northeast Philadelphia Airport and its surroundings, southern South Philadelphia, and parks. In North Philadelphia, the red areas are predominantly home to Hispanic and Black residents, while the surrounding population is more diverse, contributing to the high MAPE. In West Philadelphia, the impact of the University of Pennsylvania’s role in gentrification in the east and rising crime in the west have led to deteriorating areas with high MAPE. Both red areas also have lower sale prices, confirming the spatial pattern of model generalizability.

nhoods_model <- left_join(nhoods_model,
  st_drop_geometry(philly.test) %>%
    group_by(NAME) %>%
    summarise(meanPrice = mean(sale_price, na.rm = T))
) 
ggplot(data = st_drop_geometry(nhoods_model)) +
  geom_point(aes(x = meanPrice, y = mean.MAPE), size = 2, color = "#E46726", alpha = 0.7) +
  geom_vline(aes(xintercept= 375000),
               color="#6D9EC1", linetype="dashed", size=.8, alpha = .8) +
  geom_hline(aes(yintercept= .4),
               color="#6D9EC1", linetype="dashed", size=.8, alpha = .8) +
  labs(title = "MAPE as a function of mean price by neighborhood",
       x = "Mean Price by Neighborhood",
       y = "MAPE by Neighborhood",
       caption = "Figure 4.7") +
  theme_minimal() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.caption = element_text(hjust = 0)) +
  annotate(geom = "text", label = paste("(275000,0.4)"),
    x = 275000, y = 0.4, 
    hjust = -.8, vjust = -1, size = 4, color = "#6D9EC1", alpha = .8)

Upon further examining the relationship between price and MAPE across neighborhoods, it is evident that MAPE decreases as the mean price increases up to $375,000, but starts to rise when the mean price exceeds this threshold. In our model, mean prices between $125,000 and $500,000 are associated with relatively lower MAPE. The spatial pattern of MAPE (see Figure 4.6) also shows that lower-priced, declining areas tend to have higher MAPE. These two graphs highlight that our model is more effective in neighborhoods with mid-range prices and that the socioeconomic context of neighborhoods impacts pricing.

ggplot() + 
  geom_sf(data = tracts22, fill = "transparent", color = "grey") +
  geom_sf(data = na.omit(tracts22), aes(fill = incomeContext), color = "grey90") +
  scale_fill_manual(values = c("#E46726", "#6D9EC1"), name="Income Context") +
  labs(title = "Tracts grouped by income",
       subtitle = "Philadelphia, 2022",
       caption = "Figure 4.8") +
  mapTheme() + 
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

income_table <- st_join(philly.test, tracts22) %>%
  group_by(incomeContext) %>%
  summarise(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  dplyr::select(-"<NA>") %>%
  mutate('No Context' = scales::percent(MAPE))

income_table %>%
  knitr::kable(caption = "<b>Table 4.5 Test set MAPE by tract income context</b>", align = "ccc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))

**Table 4.5 Test set MAPE by tract income context**
High Income	Low Income	No Context
24%	43%	36%

At the urban scale, we compared MAPE across different income contexts to assess the generalizability of our model. Tracts with median incomes above the citywide mean were designated as ‘High Income’, effectively dividing Philadelphia into two categories. The nearly 1:2 ratio in MAPE highlights a significant difference in the model’s goodness of fit, indicating higher accuracy in predicting housing prices in high-income areas. The neighborhoods with high MAPE (see Figure 4.6) fall predominantly within the low-income group, suggesting a connection between race, crime, income, and housing prices.

4.6 Challenge time

Using our model, the predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE” are as follows.

philly_finish <- philly %>%
   mutate(sale_price.Predict = predict(reg.training, philly))

ggplot() +
  geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
  geom_sf(data = philly_finish %>% filter(!is.na(sale_price.Predict)), aes(colour = q5(sale_price.Predict), na.rm = T),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(philly_finish,"sale_price.Predict"),
                   name="Quintile Breaks") +
  guides(color = guide_legend(override.aes = list(size = 3))) +
  labs(title="Predicted Values of 'Modelling' and 'Challenge'",
       subtitle = "Philadelphia, 2022",
       caption = "Figure 4.9") +
  mapTheme() +
  theme(plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12),
        panel.border = element_rect(colour = "black", fill=NA, size=1))

5 Discussion

Our model effectively predicts 76% of the variation in housing prices, particularly with high accuracy for sales priced below $1,000,000. Our logarithmic transformation and piecewise splitting of distance to city hall is both interesting and essential to the model. Other key features such as distance to the city center, neighborhood names, and zoning numbers significantly reduce prediction errors by approximately 20% when included in training. The Mean Absolute Error is $65671.32 and the Mean Absolute Percent Error is 35.92%. Prediction errors remain high due to the dataset’s wide price variation and unaccounted factors influencing spatial dynamics. From the perspective of urban context, the model performs poorly in North Philadelphia, where clustering of Hispanic and Black residents presents challenges, and in West Philadelphia, where rising crime and gentrification add complexity. These decaying factors contribute to lower housing prices and a more intricate spatial process. Nonetheless, the model performs well in neighborhoods with mid-range housing prices, where the larger dataset provides stronger training and testing opportunities.

6 Conclusion

I would recommend our model to Zillow for customers looking to predict mid-range housing prices in Philadelphia. The model is well-trained and accounts for internal characteristics, public services/amenities, and spatial features, demonstrating a strong fit within the context of Philadelphia. While there is some spatial autocorrelation present and OLS regression has its limitations, we plan to explore additional socioeconomic factors that influence the spatial process of housing prices, and apply non-linear regression and other machine learning techniques to further enhance the model.

References

Aziz, Asad, Muhammad Mushahid Anwar, Hazem Ghassan Abdo, Hussein Almohamad, Ahmed Abdullah Al Dughairi, and Motrih Al-Mutiry. 2023. “Proximity to Neighborhood Services and Property Values in Urban Area: An Evaluation Through the Hedonic Pricing Model.” Land 12 (4): 859.

Li, Sheng, Yi Jiang, Shuisong Ke, Ke Nie, and Chao Wu. 2021. “Understanding the Effects of Influential Factors on Housing Prices by Combining Extreme Gradient Boosting and a Hedonic Price Model (XGBoost-HPM).” Land 10 (5): 533.

Lin, Yun-Wei, Yi-Bing Lin, Chun-You Liu, Jiun-Yi Lin, and Yu-Lin Shih. 2019. “Implementing AI as Cyber IoT Devices: The House Valuation Example.” IEEE Transactions on Industrial Informatics 16 (4): 2612–20.

Mayor, Karen, Seán Lyons, David Duffy, and Richard SJ Tol. 2009. “A Hedonic Analysis of the Value of Parks and Green Spaces in the Dublin Area.” ESRI working paper.

Scotchmer, Suzanne. 1985. “Hedonic Prices and Cost/Benefit Analysis.” Journal of Economic Theory 37 (1): 55–75.

Zaki, John, Anand Nayyar, Surjeet Dalal, and Zainab H Ali. 2022. “House Price Prediction Using Hedonic Pricing Model and Machine Learning Techniques.” Concurrency and Computation: Practice and Experience 34 (27): e7342.

Predicting Housing Prices

Philadelphia, 2022

Guangze Sun & Luming Xu

2024-10-11