knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, cache = FALSE, fig.show = "asis",
fig.align = "center")
options(scipen=999)
# Load some libraries
library(tidyverse)
library(tidycensus)
library(stringr)
library(dplyr)
library(lubridate)
library(sf)
library(stargazer)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(corrr)
library(kableExtra)
library(jtools)
library(ggstance)
library(ggpubr)
library(broom.mixed)
library(glue)
library(classInt)
library(viridis)
library(patchwork)
library(corrplot)
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
palette5 <- colorRampPalette(c("#6D9EC1", "#E46726"))(5)
An improved prediction model for home prices helps stakeholders and government agencies in making better housing policy and marketing decisions. For Zillow, one of the leading real estate companies in United States, developing an accurate, generalizable model for home prices is crucial to providing effective services to customers and enhancing the company’s competitiveness in the industry. However, this is a challenging task as numerous factors affect home prices, and no single model could cover all of them. This study seeks to create a more comprehensive home price model for Philadelphia, aiming to increase both accuracy and generalizability to benefit homebuyers, investors, the company and policymakers.
To achieve this, we applied the Hedonic Pricing Model, which accounts for internal characteristics, amenities/public services and spatial structure to determine property prices. Ordinary Least Square (OLS) regression forms the core of the model, following data cleaning, filtering and reclassification. Tests are conducted to evaluate the accuracy and generalizability of the model.
Overall, our model takes a step forward by including local characteristics like citywide spatial structures and neighborhood contexts, effectively predicts 76% of the variation in housing prices, achieves high accuracy for sales priced below $1,000,000, and controls mean absolute percentage error (MAPE) to about 36% without introducing more complex algorithms. However, the model shows less consistency and slight spatial autocorrelation, suggesting unaccounted factors. At the urban scale, North and West Philadelphia demonstrate poor predictions, indicating less generalizability towards low-priced neighborhoods and low-income tracts. Our model could be applied for predicting mid-range housing prices. Additional socioeconomic factors and non-linear regression techniques could be explored to develop the model.
Before modeling, we processed the data through data gathering from open portals, exploratory data analysis, and feature engineering.
We reviewed previous studies to identify possible explanatory variables. Following the Hedonic Pricing Model, we sourced internal characteristics from the original dataset, gathered data on amenities (access to various spaces and services) through Open Data Philly portal, and accessed census data to grasp local contexts. Both continuous and categorial data are incorporated, and we chose the average distance to the 5 nearest crimes as an indicator of crime rate after comparing various metrics.(Zaki et al. 2022; Lin et al. 2019; Li et al. 2021; Aziz et al. 2023)
Data Sources: Philly neighborhoods, housing sale prices (studentData), crime incidents, hydrology, food retail access, zoning, census tables
# philly$crimes.Buffer <- philly %>%
# st_buffer(660) %>%
# aggregate(mutate(phillyCrimes, counter = 1),., sum) %>%
# pull(counter)
philly <-
philly %>%
mutate(
# crime_nn1 = nn_function(st_coordinates(philly),
# st_coordinates(phillyCrimes), k = 1),
#
# crime_nn2 = nn_function(st_coordinates(philly),
# st_coordinates(phillyCrimes), k = 2),
#
# crime_nn3 = nn_function(st_coordinates(philly),
# st_coordinates(phillyCrimes), k = 3),
#
# crime_nn4 = nn_function(st_coordinates(philly),
# st_coordinates(phillyCrimes), k = 4),
crime_nn5 = nn_function(st_coordinates(philly),
st_coordinates(phillyCrimes), k = 5))
# city_hall <- metro[metro$Station == 'City Hall', 6]
# school_elem <- school %>%
# filter(str_detect(GRADE_LEVEL, "ELEMENTARY"))
#
# census <- census %>%
# mutate(across(3:8, ~ as.numeric(as.character(.)))) %>%
# mutate(NAMELSAD = str_extract(NAME, "^[^;]+")) %>%
# select(NAMELSAD, 3:8)
#
# calculate_nearest_distance <- function(set_points, other_layer) {
# nearest_idx <- st_nearest_feature(set_points, other_layer)
# st_distance(set_points, other_layer[nearest_idx, ], by_element = TRUE) %>% as.numeric()
# }
# set2 <- studentData[,74:76] %>%
# mutate(distance_to_city_hall = st_distance(., city_hall) %>% as.numeric()) %>%
# mutate(
# distance_to_nearest_metro = calculate_nearest_distance(geometry, metro),
# distance_to_nearest_hospital = calculate_nearest_distance(geometry, hospital),
# distance_to_nearest_school = calculate_nearest_distance(geometry, school_elem),
# distance_to_nearest_park = calculate_nearest_distance(geometry, park),
# distance_to_nearest_water = calculate_nearest_distance(geometry, water)
# )
# set2 <- set %>%
# st_join(census_tract["NAMELSAD"]) %>%
# left_join(census, by = "NAMELSAD") %>%
# st_join(retail[, c("LPSS_PER1000", "HPSS_PER1000")])
set <- read.csv("C:/Data/Fall 2024/CPLN 5920/Code/PPA/Predicting Housing Prices/data/set.csv")
philly_set <- merge(philly, set, by = "musaID", all.x = FALSE, all.y = FALSE, sort = FALSE) %>%
dplyr::select(-toPredict.y) %>%
rename(toPredict = toPredict.x) %>%
mutate(dist1 = ifelse(distance_to_city_hall <= 20000, distance_to_city_hall, 20000))
philly <- philly_set %>% select(-distance_to_nearest_metro, -distance_to_nearest_hospital, -distance_to_nearest_school, -distance_to_nearest_park, -NAMELSAD, -Percent_Bachelor.s.degree.or.higher, -Percent_Management..business..science..and.arts.occupations, -Percent_Different.state, -Percent_With.private.health.insurance, -HPSS_PER1000)
model_philly <- philly %>%
filter(toPredict == "MODELLING")
ggplot() +
geom_sf(data = nhoods, fill = "grey40", color = "grey") +
geom_sf(data = model_philly, aes(colour = q5(sale_price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels = qBr(model_philly, "sale_price"),
name="Quintile Breaks") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(title="Sale Price",
subtitle = "Philadelphia, 2022",
caption = "Figure 2.1") +
mapTheme() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill=NA, size=1))
Sale price of homes is the dependent variable to be predicted, and Figure 2.1 showed its spatial distribution across Philadelphia. Higher prices cluster in center city and the far northeast and northwest suburbs, while lower prices concentrate in west Philadelphia and north Philadelphia. This concentric pattern reflects how the city has developed, where the wealthier residents moved to the suburbs, leaving the middle ring less developed.
stat_philly <- philly %>%
dplyr::select(-musaID, -sale_price)
stat_philly_data <- st_drop_geometry(stat_philly)
descriptions$N <- sapply(descriptions$Field.Name, function(field) {
sum(!is.na(stat_philly_data[[field]]))
})
selected_vars <- descriptions$Field.Name[descriptions$Variable.Type %in% c("Categorical*", "Continuous")]
descriptions <- descriptions %>%
rowwise() %>%
mutate(
Mean = ifelse(Field.Name %in% selected_vars, round(mean(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
StDev = ifelse(Field.Name %in% selected_vars, round(sd(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
Min = ifelse(Field.Name %in% selected_vars, round(min(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA),
Max = ifelse(Field.Name %in% selected_vars, round(max(stat_philly_data[[Field.Name]], na.rm = TRUE), 3), NA)
) %>%
ungroup()
descriptions %>%
knitr::kable(caption = "<b>Table2.1 Summary Statistics</b>", align = "ccclccccc") %>%
kableExtra::kable_material(lightable_options = c("striped", "hover")) %>%
scroll_box(height = "300px", width = "800px")
Category | Field.Name | Variable.Type | Description | N | Mean | StDev | Min | Max |
---|---|---|---|---|---|---|---|---|
Internal | building_code_description_new | Categorical | Building style. | 23881 | NA | NA | NA | NA |
Internal | central_air | Categorical | “Y” indicates there is central air. | 23881 | NA | NA | NA | NA |
Internal | general_construction | Categorical | General construction. | 23881 | NA | NA | NA | NA |
Internal | quality_grade | Categorical | Quality grade related to building workmanship and materials. | 23881 | NA | NA | NA | NA |
Internal | separate_utilities | Categorical | Whether heater, hot water tank, electrical service, and gas service are central, part separate, or separate. | 23881 | NA | NA | NA | NA |
Internal | topography | Categorical | Above street level, below street level, flood plain, rocky, or other. | 23881 | NA | NA | NA | NA |
Internal | zoning | Categorical | A code identifying the legal uses permitted on the property. | 23881 | NA | NA | NA | NA |
Internal | exterior_condition | Categorical* | Indicates the observed exterior condition of the property. | 23880 | 3.764 | 0.885 | 1.000 | 7.000 |
Internal | fireplaces | Categorical* | Indicates the number of fireplaces. | 23673 | 0.048 | 0.291 | 0.000 | 5.000 |
Internal | garage_spaces | Categorical* | Indicates the number of garage spaces. | 23457 | 0.357 | 0.705 | 0.000 | 72.000 |
Internal | interior_condition | Categorical* | Represents the overall condition of the interior. | 23873 | 3.647 | 0.890 | 0.000 | 7.000 |
Internal | number_of_bathrooms | Categorical* | Number of bathrooms. | 23622 | 1.072 | 0.676 | 0.000 | 8.000 |
Internal | number_of_bedrooms | Categorical* | Number of bedrooms. | 23865 | 2.583 | 1.266 | 0.000 | 31.000 |
Internal | frontage | Continuous | The width of the lot where it abuts the principal street. | 23751 | 19.653 | 19.419 | 0.000 | 1307.000 |
Internal | total_livable_area | Continuous | Total livable area. | 23881 | 1333.633 | 548.834 | 0.000 | 15246.000 |
Amenities | crime_nn5 | Continuous | Average distance to the 5 nearest crimes (in ft) from the site. | 23881 | 142.530 | 73.134 | 18.440 | 1772.141 |
Amenities | dist1 | Continuous | Distance to city hall (in ft) from the site if less than 20000 ft, otherwise 20000 ft. | 23881 | 16519.527 | 5196.091 | 1400.739 | 20000.000 |
Amenities | distance_to_city_hall | Continuous | Distance to city hall (in ft) from the site. | 23881 | 27504.777 | 17484.769 | 1400.739 | 79657.714 |
Amenities | distance_to_nearest_water | Continuous | Distance to the nearest water body (in ft) from the site. | 23881 | 3436.259 | 2039.120 | 19.850 | 9589.470 |
Amenities | LPSS_PER1000 | Continuous | Number of low-produce supply stores per 1000 people. | 23879 | 28.196 | 24.076 | 0.000 | 224.615 |
Spatial | NAME | Categorical | Name of the neighborhood. | 23881 | NA | NA | NA | NA |
Spatial | Estimate_Mean.family.income..dollars. | Continuous | Mean family income (in dollars) in the census tract where the site is located. | 23881 | 103220.610 | 61176.718 | 25294.000 | 427084.000 |
Spatial | Percent_White.alone | Continuous | Percentage of white (non-Hispanic) population in the census tract where the site is located. | 23881 | 37.359 | 30.115 | 0.000 | 95.400 |
*Note: Transformed from continuous into categorical in the regression model.
The Hedonic Pricing Model divides the total cost of property into separate costs for each characteristic according to different subgroups.(Mayor et al. 2009; Scotchmer 1985) In this study, twenty-two variables are selected for the model, as summarized in Table 2.1. These include internal features like total livable area or exterior condition, amenities like crime and distance to nearest water body, as well as spatial contexts like mean family income in the census tract. The dataset’s average home has 1,333 square feet of livable space, 2.58 bedrooms and 1.07 bathrooms, with its cost at about $270,000.
numericVars <-
select_if(st_drop_geometry(model_philly), is.numeric) %>% na.omit() %>%
dplyr::select(-musaID)
cor_matrix <- cor(numericVars, use = "complete.obs")
corrplot(
cor_matrix,
method = "color",
col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200),
type = "lower",
tl.pos = "l",
tl.col = "black",
tl.cex = 0.7,
mar = c(0, 0, 3, 0)
)
title(main = "Correlation across numeric variables", cex.main = 1.5, font.main = 2)
mtext("Figure 2.2", side = 1, line = 4, adj = 0.2, cex = .8)
Correlation of sale price and other numeric variables are presented in Figure 2.2. Mean family income shows the strongest correlation with sale price (coefficient = 0.645), followed by total livable area. Exterior and interior conditions have the highest negative correlations. Some variables exhibit high autocorrelation, which may affect interpretability of the coefficients in the final model.
# ggplot(gather(numericVars), aes(value)) +
# geom_histogram(bins = 50) +
# facet_wrap(~key,nrow=4, scales = 'free_x') +
# theme_minimal() +
# theme(axis.text = element_blank(), axis.ticks = element_blank())
# all numeric variables
# model_philly_numeric <- st_drop_geometry(model_philly) %>%
# dplyr::select(sale_price, crime_nn5,
# frontage, total_livable_area,
# distance_to_city_hall, distance_to_nearest_water,
# Estimate_Mean.family.income..dollars.,Percent_White.alone,LPSS_PER1000
# ) %>%
# filter(sale_price <= 1000000) %>%
# gather(Variable, Value, -sale_price)
#
# ggplot(model_philly_numeric, aes(Value, sale_price)) +
# geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#E46726") +
# facet_wrap(~Variable, ncol = 4, scales = "free") +
# labs(title = "Price as a function of continuous variables") +
# theme_minimal() +
# theme(axis.text = element_blank(), axis.ticks = element_blank())
# all factor variables
# model_philly_factor <- st_drop_geometry(model_philly) %>%
# dplyr::select(sale_price,
# central_air, exterior_condition, fireplaces, garage_spaces, general_construction,
# interior_condition, number_of_bathrooms, number_of_bedrooms, quality_grade, separate_utilities,
# topography, zoning, building_code_description_new) %>%
# mutate(exterior_condition = as.factor(exterior_condition),
# fireplaces = as.factor(fireplaces),
# garage_spaces = as.factor(garage_spaces),
# interior_condition = as.factor(interior_condition),
# number_of_bathrooms = as.factor(number_of_bathrooms),
# number_of_bedrooms = as.factor(number_of_bedrooms)) %>%
# filter(sale_price <= 1000000) %>%
# gather(Variable, Value, -sale_price)
#
# ggplot(model_philly_factor, aes(Value, sale_price)) +
# geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
# facet_wrap(~Variable, ncol = 4, scales = "free") +
# labs(title = "Price as a function of categorical variables", y = "Mean Price") +
# theme_minimal() +
# theme(axis.text = element_blank(), axis.ticks = element_blank())
## Crime cor
# model_philly %>%
# st_drop_geometry() %>%
# dplyr::select(sale_price, starts_with("crime")) %>%
# filter(sale_price <= 1000000) %>%
# gather(Variable, Value, -sale_price) %>%
# ggplot(aes(Value, sale_price)) +
# geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#E46726") +
# facet_wrap(~Variable, nrow = 1, scales = "free") +
# labs(title = "Price as a function of crimes") +
# theme_minimal()
Transformations of variables can sometimes improve model performance. Three types of transformations are applied in this study: logarithmic transformation, reclassification, and piecewise splitting, with examples shown below.
model_philly_numeric <- st_drop_geometry(model_philly) %>%
mutate(total_livable_area_log = log(total_livable_area + 1)) %>%
dplyr::select(total_livable_area, total_livable_area_log) %>%
filter(total_livable_area <= 2000000)
plot1 <- ggplot(model_philly_numeric, aes(x = total_livable_area)) +
geom_histogram(bins = 30, fill = "#6D9EC1", color = "black", alpha = 0.5) +
labs(title = "Original Total Livable Area") +
xlim(0, 7500) +
coord_cartesian(ylim = c(0, 8000)) +
theme_minimal() +
theme(
plot.title = element_text(size = 12, hjust = 0.5),
axis.title.x = element_blank()
)
plot2 <- ggplot(model_philly_numeric, aes(x = total_livable_area_log)) +
geom_histogram(bins = 30, fill = "#6D9EC1", color = "black", alpha = 0.5) +
labs(title = "Log-transformed Total Livable Area") +
xlim(5, 9) +
coord_cartesian(ylim = c(0, 8000)) +
theme_minimal() +
theme(
plot.title = element_text(size = 12, hjust = 0.5),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)
combined_plot <- plot1 + plot2 +
plot_layout(widths = c(1, 1)) +
plot_annotation(
title = "Distribution of Total Livable Area",
caption = "Figure 2.3",
theme = theme(plot.title = element_text(size = 18, face = "bold"),
plot.caption = element_text(hjust = 0))
)
combined_plot
Figure 2.3 shows a typical use of log-transformation to the variable total livable area, making a right-skewed distribution more normal, fitting the assumptions of the OLS framework better.
philly_f <- philly %>%
mutate(
exterior_condition = case_when(
is.na(exterior_condition) | exterior_condition == 1 ~ "1",
exterior_condition %in% c(2, 3) ~ "2",
exterior_condition == 4 ~ "3",
TRUE ~ "4"
),
exterior_condition = factor(exterior_condition)
)
count_data <- philly %>%
group_by(exterior_condition) %>%
summarise(count = n())
mean_price_data <- philly %>%
group_by(exterior_condition) %>%
summarise(mean_sale_price = mean(sale_price, na.rm = TRUE))
count_data_f <- philly_f %>%
group_by(exterior_condition) %>%
summarise(count = n())
mean_price_data_f <- philly_f %>%
group_by(exterior_condition) %>%
summarise(mean_sale_price = mean(sale_price, na.rm = TRUE))
plot1 <- ggplot(count_data, aes(x = exterior_condition, y = count)) +
geom_bar(stat = "identity", fill = "#6D9EC1", alpha = 0.8, width = 0.6) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 12),
axis.title.x = element_blank()
)
plot2 <- ggplot(mean_price_data, aes(x = exterior_condition, y = mean_sale_price)) +
geom_bar(stat = "identity", fill = "#E46726", alpha = 0.8, width = 0.6) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 12),
axis.title.x = element_blank()
)
plot3 <- ggplot(count_data_f, aes(x = exterior_condition, y = count)) +
geom_bar(stat = "identity", fill = "#6D9EC1", alpha = 0.8, width = 0.6) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 12),
axis.title.x = element_blank()
)
plot4 <- ggplot(mean_price_data_f, aes(x = exterior_condition, y = mean_sale_price)) +
geom_bar(stat = "identity", fill = "#E46726", alpha = 0.8, width = 0.6) +
theme_minimal() +
theme(
legend.position = "none",
plot.title = element_text(size = 12),
axis.title.x = element_blank()
)
combined_plot <- (plot1 | plot2) / (plot3 | plot4) +
plot_annotation(
title = "Exterior Condition and Sale Price",
subtitle = "Before (Left) and After (Right) Reclassification",
caption = "Figure 2.4",
theme = theme(plot.title = element_text(size = 18, face = "bold"),
plot.caption = element_text(hjust = 0))
)
combined_plot
Reclassification includes converting a numeric variable into categorial one, and recoding it into fewer categories, as Figure 2.4. This applies to variables that have too few observations in some categories or have mere differences across categories, making the model more robust.
model_philly_numeric <- st_drop_geometry(model_philly) %>%
dplyr::select(sale_price, distance_to_city_hall) %>%
filter(sale_price <= 1000000) %>%
mutate(DistanceCategory = ifelse(distance_to_city_hall <= 20000, "<= 20000", "> 20000"))
plot1 <- ggplot(subset(model_philly_numeric, DistanceCategory == "<= 20000"),
aes(distance_to_city_hall, sale_price)) +
geom_point(size = 0.5, color = "#6D9EC1", alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "#E46726") +
ylab("Sale Price") +
ggtitle("<= 20000") + # 添加小标题
scale_x_continuous(breaks = seq(0, 20000, by = 10000)) +
theme_minimal() +
theme(axis.title.x = element_blank(),
plot.title = element_text(size = 12, hjust = 0.5))
plot2 <- ggplot(subset(model_philly_numeric, DistanceCategory == "> 20000"),
aes(distance_to_city_hall, sale_price)) +
geom_point(size = 0.5, color = "#6D9EC1", alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "#E46726") +
ggtitle("> 20000") + # 添加小标题
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title = element_text(size = 12, hjust = 0.5))
combined_plot <- plot1 + plot2 +
plot_layout(widths = c(1, 4)) +
plot_annotation(
title = "Price as a function of Distance to City Hall",
caption = "Figure 2.5",
theme = theme(plot.title = element_text(size = 18, face = 'bold'),
plot.caption = element_text(hjust = 0))
)
combined_plot
Cutting continuous variables into pieces is a way of dealing with non-linear relationships, as is shown in Figure 2.5. The sale price sharply decreases and slightly goes up as distance to city hall increases, reflecting the concentric development structure discussed in Figure 2.5. In this case, we added a “dist1” variable, assigning actual values if its distance is less than 20000 ft, or 20000 ft otherwise. This transformation will help achieve the effect of piecewise function in linear regressions.
philly_fac <- philly %>%
mutate(
# basements = case_when(
# basements %in% c("0", "1", "4", "A") ~ "B",
# TRUE ~ "A"
# ),
# basements = factor(basements),
central_air = case_when(
central_air %in% c("1", "Y") ~ "Y",
TRUE ~ "N"
),
central_air = factor(central_air),
exterior_condition = case_when(
is.na(exterior_condition) | exterior_condition == 1 ~ "1",
exterior_condition %in% c(2, 3) ~ "2",
exterior_condition == 4 ~ "3",
TRUE ~ "4"
),
exterior_condition = factor(exterior_condition),
fireplaces = case_when(
is.na(fireplaces) | fireplaces == 0 | fireplaces == 1 ~ "0",
TRUE ~ "1"
),
fireplaces = factor(fireplaces),
# fuel = case_when(
# is.na(fuel) | fuel == "" ~ "X",
# TRUE ~ "A"
# ),
# fuel = factor(fuel),
garage_spaces = case_when(
is.na(garage_spaces) | garage_spaces == 0 ~ "0",
garage_spaces == 1 ~ "1",
TRUE ~ "2"
),
garage_spaces = factor(garage_spaces),
general_construction = case_when(
general_construction %in% c("A ", "B ", "D ") ~ "A",
general_construction == "3 " ~ "C",
TRUE ~ "B"
),
general_construction = factor(general_construction),
interior_condition = case_when(
interior_condition %in% c(0, 1, 2, 3) ~ "1",
interior_condition %in% c(4, NA) ~ "2",
TRUE ~ "3"
),
interior_condition = factor(interior_condition),
number_of_bathrooms = case_when(
number_of_bathrooms == 0 ~ "0",
number_of_bathrooms == 1 ~ "1",
number_of_bathrooms %in% c(2, NA) ~ "2",
number_of_bathrooms == 3 ~ "3",
TRUE ~ "4"
),
number_of_bathrooms = factor(number_of_bathrooms),
number_of_bedrooms = case_when(
number_of_bedrooms %in% c(0, 31) | is.na(number_of_bedrooms) ~ "0",
number_of_bedrooms %in% c(1, 2) ~ "1",
number_of_bedrooms == 3 ~ "2",
number_of_bedrooms == 4 ~ "3",
number_of_bedrooms %in% c(5, 6) ~ "4",
number_of_bedrooms %in% c(7, 8, 9) ~ "5"
),
number_of_bedrooms = factor(number_of_bedrooms),
# parcel_shape = case_when(
# parcel_shape %in% c("C", "E") ~ "A",
# parcel_shape == "A" ~ "B",
# TRUE ~ "C"
# ),
# parcel_shape = factor(parcel_shape),
quality_grade = case_when(
quality_grade %in% c("A+", "A", "A-", "X", "X-") ~ "1",
quality_grade %in% c("B+", "B", "B-", "3", "S+") ~ "2",
TRUE ~ "3"
),
quality_grade = factor(quality_grade),
separate_utilities = case_when(
separate_utilities %in% c("A", "C") ~ "B",
TRUE ~ "A"
),
separate_utilities = factor(separate_utilities),
# street_direction = case_when(
# is.na(street_direction) | street_direction == "" | street_direction == "S" ~ "A",
# TRUE ~ "B"
# ),
# street_direction = factor(street_direction),
topography = case_when(
topography == "B" ~ "B",
TRUE ~ "A"
),
topography = factor(topography),
# type_heater = case_when(
# type_heater == "D" ~ "B",
# TRUE ~ "A"
# ),
# type_heater = factor(type_heater),
building_code_description_new = case_when(
building_code_description_new %in% c("COLONIAL", "OLD STYLE", "ROW MODERN", "ROW OLD STYLE", "TUDOR", "TWIN BUNGALOW") ~ "1",
building_code_description_new %in% c("ROW POST WAR", "ROW TYPICAL", "TWIN CONVENTIONAL") ~ "3",
building_code_description_new %in% c("OTHER", "ROW PORCH FRONT") ~ "4",
TRUE ~ "2"
),
building_code_description_new = factor(building_code_description_new),
NAME = case_when(
NAME %in% c("CHINATOWN", "CRESTMONT_FARMS", "EAST_PARK", "MECHANICSVILLE",
"PENNYPACK_PARK", "UNIVERSITY_CITY", "WISSAHICKON_PARK", "WOODLAND_TERRACE") ~ "OTHERS",
TRUE ~ NAME
),
NAME = factor(NAME)
)
philly <- philly_fac
model_philly <- philly %>%
filter(toPredict == "MODELLING")
st_drop_geometry(model_philly) %>%
dplyr::select(sale_price, total_livable_area, crime_nn5, distance_to_city_hall, Estimate_Mean.family.income..dollars.) %>%
filter(sale_price <= 2000000 & crime_nn5 <= 1000 & total_livable_area <= 7500) %>%
gather(Variable, Value, -sale_price) %>%
ggplot(aes(Value, sale_price)) +
geom_point(size = 0.5, colour = "#6D9EC1", alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, colour = "#E46726") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of continuous variables",
caption = "Figure 2.6") +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold"),
plot.caption = element_text(hjust = 0)
)
Figure 2.6 includes scatterplots for sale price against four selected continuous variables. Besides “distance to city hall” we have addressed, positive relationships between sale price and the other three variables are explicit. The points in the plots are dispersed rather than being around the line, showing the sense of Hedonic Pricing Model’s aim to value all services to define a total price, reducing the variance from any single variable.
ggplot() +
geom_sf(data = nhoods, , fill = "grey40", color = "grey", na.rm = T) +
geom_sf(data = philly %>% filter(!is.na(LPSS_PER1000)), aes(colour = q5(LPSS_PER1000)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(philly,"LPSS_PER1000"),
name="Quintile\nBreaks") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(title="Low-Produce Supply Stores",
subtitle = "per 1000 People",
caption = "Figure 2.7") +
mapTheme() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill=NA, size=1))
Low-produce supply stores (LPSS) include dollar stores, pharmacies and convenience stores. It is a measure of accessibility to affordable stores, having high density in areas with lower income in north and west Philadelphia, as shown in Figure 2.7. When aligned with other variables, LPSS distinguishes these neighborhoods well and enhances model accuracy.
ggplot() +
geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
geom_sf(data = philly %>% filter(!is.na(exterior_condition)), aes(colour = exterior_condition),
show.legend = "point", size = .75) +
scale_colour_manual(values = colorRampPalette(c("#E46726", "#6D9EC1"))(4)
) +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(title="Exterior Condition",
caption = "Figure 2.8") +
mapTheme() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill=NA, size=1))
The spatial distribution of exterior conditions is as Figure 2.8 (1 = best, 4 = worst). Homes with the best exterior conditions are close to the center city, and west Philadelphia shows a mix of both well- and ill-conditioned homes, instead of corresponding only with the wealthiness of the neighborhoods.
ggplot() +
geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = TRUE) +
geom_sf(data = philly %>% filter(!is.na(zoning)), aes(colour = zoning),
show.legend = "point", size = 0.75) +
scale_colour_manual(values = c(
"RM1" = "#FCB951", "RM2" = "#FCB951", "RM3" = "#FCB951", "RM4" = "#FCB951",
"RMX1" = "#E58425", "RMX2" = "#E58425", "RMX3" = "#E58425",
"RSA1" = "#F8EE68", "RSA2" = "#F8EE68", "RSA3" = "#F8EE68", "RSA4" = "#F8EE68",
"RSA5" = "#F8EE68", "RSA6" = "#F8EE68",
"RSD1" = "#FFF5C4", "RSD2" = "#FFF5C4", "RSD3" = "#FFF5C4",
"RTA1" = "#CDB650"
)) +
guides(color = guide_legend(override.aes = list(size = 3), ncol = 2)) +
labs(title = "Zoning", subtitle = "Philadelphia, 2022",
caption = "Figure 2.9") +
mapTheme() +
theme(
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill = NA, size = 1)
)
Figure 2.9 visualizes the zoning status, colored in the official colors. The most common type of zoning is RSA (Residential Single-Family Attached). RM (Residential Multi-Family), with higher density, are clustered along Board St, near center city and a couple of areas in west Philadelphia, while RSD (Residential Single-Family Detached), with lower density, are spread across far northwest and northeast suburbs.
# cor_philly <-
# st_drop_geometry(model_philly) %>%
# filter(sale_price <= 1000000,
# total_livable_area < 10000) %>%
# dplyr::select(sale_price, total_livable_area)
#
# cor.test(cor_philly$total_livable_area,
# cor_philly$sale_price,
# method = "pearson")
#
#
# # Calculate predicted prices
# cor_philly$predicted_price <- predict(lm(sale_price ~ total_livable_area, data = cor_philly), newdata = cor_philly)
# ggscatter(cor_philly,
# x = "total_livable_area",
# y = "sale_price",
# color = "#6D9EC1", size = 1, alpha = 0.6) +
# geom_point(aes(y = predicted_price), color = "#E46726", size = 1, alpha = 0.6) +
# stat_cor(label.x = 4000, label.y = 250000, hjust = 0) +
# labs(title = "Price as a function of living area",
# subtitle = "With predicted prices; Sale prices <= $1 mil.",
# x = "Total Livable Area",
# y = "Sale Price") +
# theme_minimal() +
# theme(plot.title = element_text(size = 18, face = "bold"),
# plot.subtitle = element_text(size = 12))
The core of our algorithm is Ordinary Least Square (OLS) regression model. We used 9 continuous variables and 13 reclassified categorical variables as independent variables and sale price as the dependent variable.
To build and evaluate our model, we split the dataset into two parts: 60% as the training set to fit the model, and 40% as the test set to evaluate its accuracy. The model was tested using Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and prediction v.s. actual value scatter plot.
To ensure the accuracy of the model, we applied a 100-fold cross validation process. This divides the data into 100 subsets, repeatedly testing the model to see how well it performs across different samples. Finally, we examined the generalizability of the model across different neighborhood contexts by observing the distribution of spatial lag, calculating Moran’s I (a measure of spatial clustering in model errors), and testing the model in different neighborhood settings.
Analyses were conducted using R version 4.4.1.
Using the features selected above, our model explains 76% of the variation in price and produces average errors of 36% relative to housing prices. In terms of cross-validation, our model was well-trained on the available variables and shows high accuracy for low-priced houses. However, it demonstrates less consistency and a slight positive spatial autocorrelation, suggesting the influence of unaccounted factors. At the urban scale, our model is less effective at predicting prices in low-income neighborhoods and reveals a significant discrepancy between high- and low-income areas, indicating a further development in generalizability.
To improve the model’s accuracy and generalizability, we randomly split the dataset into 60% for training the model and 40% for testing its goodness of fit. Using the selected features, OLS regression was performed on the training set.
set.seed(34)
inTrain <- createDataPartition(
y = paste(model_philly$general_construction,
model_philly$topography,
model_philly$quality_grade,
model_philly$NAME),
p = .60, list = FALSE)
# [row,column] select all columns
philly.training <- model_philly[inTrain,]
philly.test <- model_philly[-inTrain,]
reg.training <- lm(sale_price ~ log(total_livable_area+1) + total_livable_area + crime_nn5 + LPSS_PER1000 +
log(dist1+1) +log(distance_to_city_hall+1) + log(distance_to_nearest_water+1) +
Estimate_Mean.family.income..dollars. + Percent_White.alone + log(frontage+1) +
central_air + exterior_condition + fireplaces + garage_spaces + general_construction +
interior_condition + number_of_bathrooms + number_of_bedrooms + quality_grade +
separate_utilities + topography + building_code_description_new + zoning +
NAME,
data = st_drop_geometry(philly.training))
kable(summary(reg.training)$coefficients, digits = 3,format = "html",caption = "<b>Table 4.1 OLS Regression Results</b>", align = "ccccc") %>%
kableExtra::kable_material(lightable_options = c("striped", "hover")) %>%
scroll_box(height = "300px")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 2602553.344 | 159428.947 | 16.324 | 0.000 |
log(total_livable_area + 1) | -53340.123 | 4227.568 | -12.617 | 0.000 |
total_livable_area | 203.405 | 3.891 | 52.272 | 0.000 |
crime_nn5 | 99.448 | 21.919 | 4.537 | 0.000 |
LPSS_PER1000 | -355.719 | 62.060 | -5.732 | 0.000 |
log(dist1 + 1) | -311331.292 | 28622.655 | -10.877 | 0.000 |
log(distance_to_city_hall + 1) | 84882.204 | 23567.098 | 3.602 | 0.000 |
log(distance_to_nearest_water + 1) | -4915.598 | 2243.416 | -2.191 | 0.028 |
Estimate_Mean.family.income..dollars. | 0.315 | 0.047 | 6.723 | 0.000 |
Percent_White.alone | 310.545 | 112.103 | 2.770 | 0.006 |
log(frontage + 1) | 32602.610 | 2292.147 | 14.224 | 0.000 |
central_airY | 18501.777 | 2988.092 | 6.192 | 0.000 |
exterior_condition2 | -53547.193 | 7303.736 | -7.331 | 0.000 |
exterior_condition3 | -68580.559 | 7492.304 | -9.153 | 0.000 |
exterior_condition4 | -91899.003 | 8674.360 | -10.594 | 0.000 |
fireplaces1 | 67068.917 | 13085.724 | 5.125 | 0.000 |
garage_spaces1 | 15842.471 | 2864.147 | 5.531 | 0.000 |
garage_spaces2 | 78115.398 | 8498.819 | 9.191 | 0.000 |
general_constructionB | -11270.490 | 3656.738 | -3.082 | 0.002 |
general_constructionC | 189225.510 | 20068.304 | 9.429 | 0.000 |
interior_condition2 | -42242.750 | 3470.206 | -12.173 | 0.000 |
interior_condition3 | -56030.547 | 6584.426 | -8.510 | 0.000 |
number_of_bathrooms1 | -18625.215 | 6977.063 | -2.669 | 0.008 |
number_of_bathrooms2 | 4351.026 | 7280.922 | 0.598 | 0.550 |
number_of_bathrooms3 | 55836.812 | 9239.940 | 6.043 | 0.000 |
number_of_bathrooms4 | 326526.524 | 16623.536 | 19.642 | 0.000 |
number_of_bedrooms1 | 22870.061 | 7307.230 | 3.130 | 0.002 |
number_of_bedrooms2 | 39637.760 | 6887.849 | 5.755 | 0.000 |
number_of_bedrooms3 | 15978.172 | 7511.094 | 2.127 | 0.033 |
number_of_bedrooms4 | -14766.514 | 10373.662 | -1.423 | 0.155 |
number_of_bedrooms5 | -290198.487 | 27328.456 | -10.619 | 0.000 |
quality_grade2 | -64882.420 | 19616.440 | -3.308 | 0.001 |
quality_grade3 | -74504.377 | 19450.009 | -3.831 | 0.000 |
separate_utilitiesB | -38831.626 | 8023.355 | -4.840 | 0.000 |
topographyB | 239582.260 | 34833.986 | 6.878 | 0.000 |
building_code_description_new2 | -15340.877 | 5805.798 | -2.642 | 0.008 |
building_code_description_new3 | -16182.590 | 5557.419 | -2.912 | 0.004 |
building_code_description_new4 | -19220.374 | 6137.199 | -3.132 | 0.002 |
zoningRM2 | 7746.895 | 14385.569 | 0.539 | 0.590 |
zoningRM3 | -9737.974 | 23710.849 | -0.411 | 0.681 |
zoningRM4 | -32379.920 | 16618.804 | -1.948 | 0.051 |
zoningRMX1 | 63005.260 | 24701.659 | 2.551 | 0.011 |
zoningRMX2 | 72013.347 | 41015.004 | 1.756 | 0.079 |
zoningRMX3 | -166525.907 | 13033.349 | -12.777 | 0.000 |
zoningRSA1 | -66667.854 | 24014.839 | -2.776 | 0.006 |
zoningRSA2 | -27635.835 | 9956.102 | -2.776 | 0.006 |
zoningRSA3 | -24865.513 | 6233.913 | -3.989 | 0.000 |
zoningRSA4 | -29916.407 | 15535.375 | -1.926 | 0.054 |
zoningRSA5 | 2523.187 | 3963.322 | 0.637 | 0.524 |
zoningRSA6 | 30188.602 | 34770.908 | 0.868 | 0.385 |
zoningRSD1 | -57179.970 | 16086.302 | -3.555 | 0.000 |
zoningRSD2 | -76399.095 | 24976.463 | -3.059 | 0.002 |
zoningRSD3 | -55811.352 | 10447.264 | -5.342 | 0.000 |
zoningRTA1 | -17908.382 | 17482.315 | -1.024 | 0.306 |
NAMEALLEGHENY_WEST | -29308.738 | 33195.930 | -0.883 | 0.377 |
NAMEANDORRA | -11064.666 | 30574.876 | -0.362 | 0.717 |
NAMEASTON_WOODBRIDGE | -4960.254 | 27215.358 | -0.182 | 0.855 |
NAMEBARTRAM_VILLAGE | 2026.783 | 44167.698 | 0.046 | 0.963 |
NAMEBELLA_VISTA | -9689.404 | 39012.675 | -0.248 | 0.804 |
NAMEBELMONT | -96596.983 | 39534.258 | -2.443 | 0.015 |
NAMEBLUE_BELL_HILL | 23855.225 | 52093.575 | 0.458 | 0.647 |
NAMEBREWERYTOWN | -66421.604 | 34047.828 | -1.951 | 0.051 |
NAMEBRIDESBURG | -33288.625 | 28495.489 | -1.168 | 0.243 |
NAMEBURHOLME | 22678.195 | 33311.026 | 0.681 | 0.496 |
NAMEBUSTLETON | 9083.709 | 21455.471 | 0.423 | 0.672 |
NAMEBYBERRY | -152303.892 | 50853.624 | -2.995 | 0.003 |
NAMECALLOWHILL | -57479.036 | 67376.749 | -0.853 | 0.394 |
NAMECARROLL_PARK | -13639.025 | 32370.249 | -0.421 | 0.674 |
NAMECEDAR_PARK | 91848.072 | 36103.192 | 2.544 | 0.011 |
NAMECEDARBROOK | -29504.447 | 25464.674 | -1.159 | 0.247 |
NAMECHESTNUT_HILL | 192433.090 | 26600.252 | 7.234 | 0.000 |
NAMECLEARVIEW | -2487.014 | 39341.471 | -0.063 | 0.950 |
NAMECOBBS_CREEK | 16924.000 | 31550.192 | 0.536 | 0.592 |
NAMECRESCENTVILLE | -1060.511 | 57603.588 | -0.018 | 0.985 |
NAMEDEARNLEY_PARK | -57664.481 | 34975.202 | -1.649 | 0.099 |
NAMEDICKINSON_NARROWS | -61449.378 | 35922.057 | -1.711 | 0.087 |
NAMEDUNLAP | -54856.694 | 41086.849 | -1.335 | 0.182 |
NAMEEAST_FALLS | 99498.823 | 31849.632 | 3.124 | 0.002 |
NAMEEAST_KENSINGTON | 50476.992 | 33709.002 | 1.497 | 0.134 |
NAMEEAST_OAK_LANE | -75957.043 | 29019.226 | -2.617 | 0.009 |
NAMEEAST_PARKSIDE | -96268.424 | 39336.281 | -2.447 | 0.014 |
NAMEEAST_PASSYUNK | 33483.894 | 34657.539 | 0.966 | 0.334 |
NAMEEAST_POPLAR | -106961.230 | 48732.852 | -2.195 | 0.028 |
NAMEEASTWICK | -622.977 | 34845.180 | -0.018 | 0.986 |
NAMEELMWOOD | 4793.280 | 31604.040 | 0.152 | 0.879 |
NAMEFAIRHILL | -26062.385 | 38193.657 | -0.682 | 0.495 |
NAMEFAIRMOUNT | 45949.918 | 35408.313 | 1.298 | 0.194 |
NAMEFELTONVILLE | -15557.392 | 28430.643 | -0.547 | 0.584 |
NAMEFERN_ROCK | -33460.333 | 31686.198 | -1.056 | 0.291 |
NAMEFISHTOWN | 55347.942 | 32925.212 | 1.681 | 0.093 |
NAMEFITLER_SQUARE | 354300.464 | 43313.640 | 8.180 | 0.000 |
NAMEFOX_CHASE | 18939.843 | 22049.062 | 0.859 | 0.390 |
NAMEFRANCISVILLE | -93953.052 | 37651.349 | -2.495 | 0.013 |
NAMEFRANKFORD | -55434.931 | 25885.719 | -2.142 | 0.032 |
NAMEFRANKLINVILLE | -20779.315 | 36170.837 | -0.574 | 0.566 |
NAMEGARDEN_COURT | 137900.477 | 42495.015 | 3.245 | 0.001 |
NAMEGERMANTOWN_EAST | -42393.000 | 26355.213 | -1.609 | 0.108 |
NAMEGERMANTOWN_MORTON | -30698.933 | 29280.182 | -1.048 | 0.294 |
NAMEGERMANTOWN_PENN_KNOX | 9668.388 | 43071.006 | 0.224 | 0.822 |
NAMEGERMANTOWN_SOUTHWEST | -32441.731 | 31351.090 | -1.035 | 0.301 |
NAMEGERMANTOWN_WEST_CENT | -23829.366 | 32415.715 | -0.735 | 0.462 |
NAMEGERMANTOWN_WESTSIDE | -15048.286 | 38166.730 | -0.394 | 0.693 |
NAMEGERMANY_HILL | 29758.118 | 29312.835 | 1.015 | 0.310 |
NAMEGIRARD_ESTATES | 19701.931 | 33865.308 | 0.582 | 0.561 |
NAMEGLENWOOD | -50566.283 | 38874.899 | -1.301 | 0.193 |
NAMEGRADUATE_HOSPITAL | -21343.972 | 37725.365 | -0.566 | 0.572 |
NAMEGRAYS_FERRY | -86619.740 | 33827.883 | -2.561 | 0.010 |
NAMEGREENWICH | -19475.542 | 37230.024 | -0.523 | 0.601 |
NAMEHADDINGTON | -9678.776 | 31976.804 | -0.303 | 0.762 |
NAMEHARROWGATE | -17226.857 | 30729.207 | -0.561 | 0.575 |
NAMEHARTRANFT | -76415.644 | 33764.438 | -2.263 | 0.024 |
NAMEHAVERFORD_NORTH | -119916.837 | 44240.581 | -2.711 | 0.007 |
NAMEHAWTHORNE | -65396.685 | 42714.996 | -1.531 | 0.126 |
NAMEHOLMESBURG | -36366.901 | 21898.495 | -1.661 | 0.097 |
NAMEHUNTING_PARK | -15837.410 | 30663.227 | -0.516 | 0.606 |
NAMEJUNIATA_PARK | -9587.595 | 28742.008 | -0.334 | 0.739 |
NAMEKINGSESSING | -16713.237 | 32894.986 | -0.508 | 0.611 |
NAMELAWNDALE | -23342.252 | 24375.594 | -0.958 | 0.338 |
NAMELEXINGTON_PARK | 47580.634 | 29014.632 | 1.640 | 0.101 |
NAMELOGAN | -50539.356 | 28168.190 | -1.794 | 0.073 |
NAMELOGAN_SQUARE | 96881.443 | 41583.433 | 2.330 | 0.020 |
NAMELOWER_MOYAMENSING | -14796.671 | 33166.742 | -0.446 | 0.656 |
NAMELUDLOW | -126672.567 | 44551.269 | -2.843 | 0.004 |
NAMEMANAYUNK | 23543.315 | 26340.398 | 0.894 | 0.371 |
NAMEMANTUA | -100955.282 | 37763.526 | -2.673 | 0.008 |
NAMEMAYFAIR | -3113.029 | 22504.502 | -0.138 | 0.890 |
NAMEMCGUIRE | -20470.241 | 40807.442 | -0.502 | 0.616 |
NAMEMELROSE_PARK_GARDENS | -49188.376 | 32788.425 | -1.500 | 0.134 |
NAMEMILL_CREEK | -28154.092 | 34665.280 | -0.812 | 0.417 |
NAMEMILLBROOK | -32022.950 | 28539.903 | -1.122 | 0.262 |
NAMEMODENA | -25566.749 | 24704.705 | -1.035 | 0.301 |
NAMEMORRELL_PARK | -20171.705 | 26015.772 | -0.775 | 0.438 |
NAMEMOUNT_AIRY_EAST | -13797.497 | 24467.224 | -0.564 | 0.573 |
NAMEMOUNT_AIRY_WEST | -3215.829 | 26921.057 | -0.119 | 0.905 |
NAMENEWBOLD | -33112.641 | 35184.325 | -0.941 | 0.347 |
NAMENICETOWN | -4117.793 | 35047.196 | -0.117 | 0.906 |
NAMENORMANDY_VILLAGE | 10148.206 | 43143.818 | 0.235 | 0.814 |
NAMENORTH_CENTRAL | -110199.533 | 34929.846 | -3.155 | 0.002 |
NAMENORTHERN_LIBERTIES | -54988.249 | 36717.795 | -1.498 | 0.134 |
NAMENORTHWOOD | -39679.709 | 27897.136 | -1.422 | 0.155 |
NAMEOGONTZ | -37362.801 | 26169.313 | -1.428 | 0.153 |
NAMEOLD_KENSINGTON | -21715.145 | 35840.606 | -0.606 | 0.545 |
NAMEOLNEY | -35809.804 | 25780.602 | -1.389 | 0.165 |
NAMEOTHERS | 159496.761 | 47821.591 | 3.335 | 0.001 |
NAMEOVERBROOK | -7237.417 | 28863.711 | -0.251 | 0.802 |
NAMEOXFORD_CIRCLE | -11048.623 | 22848.657 | -0.484 | 0.629 |
NAMEPACKER_PARK | 36516.927 | 38251.130 | 0.955 | 0.340 |
NAMEPARKWOOD_MANOR | -29901.403 | 26079.668 | -1.147 | 0.252 |
NAMEPASCHALL | -17043.822 | 31813.161 | -0.536 | 0.592 |
NAMEPASSYUNK_SQUARE | -24210.496 | 36061.728 | -0.671 | 0.502 |
NAMEPENNSPORT | -27408.722 | 35062.664 | -0.782 | 0.434 |
NAMEPENNYPACK | 16373.339 | 23556.723 | 0.695 | 0.487 |
NAMEPENNYPACK_WOODS | -32622.449 | 31699.974 | -1.029 | 0.303 |
NAMEPENROSE | 4735.753 | 34266.872 | 0.138 | 0.890 |
NAMEPOINT_BREEZE | -92744.583 | 34518.047 | -2.687 | 0.007 |
NAMEPOWELTON | 70861.367 | 68260.900 | 1.038 | 0.299 |
NAMEQUEEN_VILLAGE | 42061.360 | 37537.584 | 1.121 | 0.263 |
NAMERHAWNHURST | 31930.280 | 21888.038 | 1.459 | 0.145 |
NAMERICHMOND | 41300.018 | 31596.752 | 1.307 | 0.191 |
NAMERITTENHOUSE | 225261.447 | 42233.350 | 5.334 | 0.000 |
NAMERIVERFRONT | -18817.788 | 59635.268 | -0.316 | 0.752 |
NAMEROXBOROUGH | 37129.366 | 26384.156 | 1.407 | 0.159 |
NAMEROXBOROUGH_PARK | 23579.964 | 38195.037 | 0.617 | 0.537 |
NAMESHARSWOOD | -76375.672 | 45785.346 | -1.668 | 0.095 |
NAMESOCIETY_HILL | 157827.536 | 39694.403 | 3.976 | 0.000 |
NAMESOMERTON | 2882.903 | 21870.934 | 0.132 | 0.895 |
NAMESOUTHWEST_SCHUYLKILL | -15319.075 | 34234.156 | -0.447 | 0.655 |
NAMESPRING_GARDEN | -22094.949 | 39599.016 | -0.558 | 0.577 |
NAMESPRUCE_HILL | 219723.879 | 40576.763 | 5.415 | 0.000 |
NAMESTADIUM_DISTRICT | 47099.103 | 35656.531 | 1.321 | 0.187 |
NAMESTANTON | -105770.026 | 33704.611 | -3.138 | 0.002 |
NAMESTRAWBERRY_MANSION | -82565.409 | 33359.539 | -2.475 | 0.013 |
NAMESUMMERDALE | -21538.879 | 27240.615 | -0.791 | 0.429 |
NAMETACONY | -38871.002 | 22737.482 | -1.710 | 0.087 |
NAMETIOGA | -44156.582 | 32954.713 | -1.340 | 0.180 |
NAMETORRESDALE | -12148.653 | 23285.266 | -0.522 | 0.602 |
NAMEUPPER_KENSINGTON | -21619.929 | 31866.999 | -0.678 | 0.498 |
NAMEUPPER_ROXBOROUGH | -255.324 | 24484.536 | -0.010 | 0.992 |
NAMEWALNUT_HILL | 66325.715 | 39412.958 | 1.683 | 0.092 |
NAMEWASHINGTON_SQUARE | -38814.789 | 44051.355 | -0.881 | 0.378 |
NAMEWEST_KENSINGTON | -21448.120 | 33916.230 | -0.632 | 0.527 |
NAMEWEST_OAK_LANE | -38806.419 | 23723.676 | -1.636 | 0.102 |
NAMEWEST_PARKSIDE | -22817.113 | 57916.022 | -0.394 | 0.694 |
NAMEWEST_PASSYUNK | -57094.888 | 34254.726 | -1.667 | 0.096 |
NAMEWEST_POPLAR | -149198.102 | 49603.709 | -3.008 | 0.003 |
NAMEWEST_POWELTON | -72672.246 | 40909.125 | -1.776 | 0.076 |
NAMEWHITMAN | -26867.258 | 34173.513 | -0.786 | 0.432 |
NAMEWINCHESTER_PARK | -10616.927 | 33056.006 | -0.321 | 0.748 |
NAMEWISSAHICKON | 32862.285 | 30044.667 | 1.094 | 0.274 |
NAMEWISSAHICKON_HILLS | 40770.009 | 35614.181 | 1.145 | 0.252 |
NAMEWISSINOMING | -27769.979 | 23659.387 | -1.174 | 0.241 |
NAMEWISTER | -48071.736 | 31639.190 | -1.519 | 0.129 |
NAMEWYNNEFIELD | -31312.245 | 32489.813 | -0.964 | 0.335 |
NAMEWYNNEFIELD_HEIGHTS | 87649.238 | 39433.444 | 2.223 | 0.026 |
NAMEYORKTOWN | -120781.881 | 47932.988 | -2.520 | 0.012 |
In our model, the features explain approximately 76% of the variation in price (Adjusted R2 = 75.73%) and most of their coefficients are statistically significant (p<0.05). Several features, such as general construction and architectural styles, are converted to categorical variables, allowing their coefficients to be estimated and linked to housing prices. The remaining insignificant coefficients are of neighborhood names or zoning codes, which are less likely to regroup due to spatial disparities.
To dictate how useful the model is for decision making, we used testing set to analyse goodness of fit indicators.
philly.test <-
philly.test %>%
mutate(Regression = "Baseline Regression",
sale_price.Predict = predict(reg.training, philly.test),
sale_price.Error = sale_price.Predict - sale_price,
sale_price.AbsError = abs(sale_price.Predict - sale_price),
sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
filter(sale_price < 5000000)
MAE <- mean(philly.test$sale_price.AbsError, na.rm = T)
MAPE <- mean(philly.test$sale_price.APE, na.rm = T)
acc <- data.frame(MAE, MAPE)
acc %>%
knitr::kable(caption = "<b>Table 4.2 MAE and MAPE of the testing dataset</b>", align = "cc") %>%
kableExtra::kable_material(lightable_options = c("striped", "hover"))
MAE | MAPE |
---|---|
65671.32 | 0.3592045 |
The Mean Absolute Error (MAE) is $65671.32, representing the average difference between predicted and observed prices. The relatively high value reflects the diversity in housing prices, suggesting a limitation for linear regression. The Mean Absolute Percent Error (MAPE) indicates that the prediction error amounts to 35.92% of the housing prices on average.
ggplot(
philly.test, aes(sale_price.Predict, sale_price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "#6D9EC1") +
geom_abline(intercept = 0, slope = 1, color="#E46726",size=1) +
theme_minimal() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = ggtext::element_markdown(size = 12),
plot.caption = element_text(hjust = 0)) +
labs(title = 'Predicted sale price as a function of observed price',
subtitle = glue("<span style='color:#E46726;'>Perfect prediction</span> vs. <span style='color:#6D9EC1;'>Average prediction</span>"),caption = "Figure 4.1",
x = "Predicted Sale Price",
y = "Observed Sale Price")
Given the high MAE and MAPE, data visualization was used to compare predicted and observed prices. The red and blue lines nearly overlap, indicating the model performs well overall. However, the slight deviation suggests that, on average, the model’s predictions are a little higher than the observed prices. Examining the data points further reveals that prediction accuracy decreases as prices rise, confirming our concerns about the variability in the higher price ranges of the dataset. This suggests that our model performs particularly well for lower-valued housing prices.
After making predictions on a single hold-out test set, we partitioned the dataset into 100 equal-sized subsets, training on the remaining data and testing on each subset. This process is known as the K-fold cross-validation method.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(sale_price ~ log(total_livable_area+1) + total_livable_area + crime_nn5 + LPSS_PER1000 +
log(dist1+1) +log(distance_to_city_hall+1) + log(distance_to_nearest_water+1) +
Estimate_Mean.family.income..dollars. + Percent_White.alone + log(frontage+1) +
central_air + exterior_condition + fireplaces + garage_spaces + general_construction +
interior_condition + number_of_bathrooms + number_of_bedrooms + quality_grade +
separate_utilities + topography + building_code_description_new + zoning +
NAME,
data = st_drop_geometry(model_philly),
method = "lm", trControl = fitControl, na.action = na.pass)
kable(reg.cv$results, digits = 3,caption = "<b>Table 4.3 Cross-Validation Results</b>", align = "ccccccc") %>%
kableExtra::kable_material(lightable_options = c("striped", "hover"))
intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
TRUE | 115492.5 | 0.765 | 65851.93 | 37687.33 | 0.07 | 6719.758 |
Mean <- mean(reg.cv$resample[,3])
SD <- sd(reg.cv$resample[,3])
stat_MAE <- data.frame(Mean, SD)
stat_MAE %>%
knitr::kable(caption = "<b>Table 4.4 Mean and standard deviation of the cross-validation MAE</b>", align = "cc") %>%
kableExtra::kable_material(lightable_options = c("striped", "hover"))
Mean | SD |
---|---|
65851.93 | 6719.758 |
The standard deviation of MAE (6719.76) reflects significant variation across the 100 folds. Additionally, the mean of MAE in cross-validation (65851.93) is quite similar to the MAE in our test set (65671.32), indicating that the model was well-trained on the existing variables and therefore shows an average error.
ggplot(data = data.frame(mae = reg.cv$resample[,3]),aes(x = mae)) +
geom_histogram(color = "white", fill = "#6D9EC1") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 50),
labels = function(x) ifelse(x %% 5000 == 0, x, "")) +
labs(title = "Distribution of MAE",
subtitle = "K fold cross validation; k=100.",
caption = "Figure 4.2",
x = "Mean Absolute Error", y = "Count") +
theme_minimal() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
plot.caption = element_text(hjust = 0)) +
geom_vline(aes(xintercept=mean(mae, na.rm=T)),
color="#E46726", linetype="dashed", size=1) +
geom_vline(aes(xintercept= MAE),
color="#E46726", linetype="dashed", size=.8, alpha = .8) +
annotate(geom = "text", label = paste("Mean of MAE:", round(Mean, 2)), # Insert correlation here
x = 67000, y = 10,
hjust = 0, size = 4, color = "#E46726") +
annotate(geom = "text", label = paste("MAE in the test set:\n", round(MAE, 2)), # Insert correlation here
x = 61000, y = 9,
hjust = 1, size = 4, color = "#E46726", alpha = .8)
Visualized in a histogram, the distribution of MAE in cross-validation peaks around 63000 and exhibits a long right tail. This suggests that the model predicts inconsistently and may be unreliable for predicting houses that have not been sold recently.
philly.test_predict <-
philly.test[which(philly.test$sale_price.Error != 0),]
coords.test <- st_coordinates(philly.test_predict)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
philly.test_predict$lagPriceError <- lag.listw(spatialWeights.test, philly.test_predict$sale_price.Error)
philly.test <- philly.test_predict
ggplot() +
geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
geom_sf(data = philly.test_predict, aes(colour = q5(sale_price.Error), na.rm = T),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(philly.test_predict,"sale_price.Error"),
name="Quintile Breaks") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(title="Sale price errors on the test set",
caption = "Figure 4.3") +
mapTheme() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill=NA, size=1))
What might be causing the high MAE? Although our model incorporates internal characteristics, public services/amenities, and some spatial features, other factors influencing the spatial distribution of home prices remain unexplored. As the map illustrates, sale price errors in Philadelphia are generally randomly distributed, but with some clustering both within and across neighborhoods. How much do our model errors cluster? Calculating the average weighted model error of its five nearest neighbors as the “spatial lag”, we delved deeper into the spatial autocorrelation test.
cor_error <- cor(philly.test_predict$lagPriceError, philly.test_predict$sale_price.Error, method = "pearson")
ggplot(philly.test_predict, aes(x = lagPriceError, y = sale_price.Error))+
geom_point(size = .5, color = "#6D9EC1") + geom_smooth(method = "lm", se=F, colour = "#E46726") +
labs(title = "Error as a function of the spatial lag of price",
x = "Spatial lag of errors (Mean error of 5 nearest neighbors)", y = "Sale price errors",
caption = "Figure 4.4") +
annotate(geom = "text", label = paste("Pearson Correlation:", round(cor_error, 3)), # Insert correlation here
x = -700000, y = -1000000,
hjust = 0, vjust = 2,
size = 4) +
theme_minimal() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.caption = element_text(hjust = 0))
Comparing the spatial lag with the actual values helps estimate the degree of spatial autocorrelation. The relationship visualized above shows that as home price errors increase, nearby home price errors tend to rise as well. The correlation is 0.225, which is marginal but significant.
philly.test_nonzero <- philly.test %>%
filter(sale_price.Error != 0)
moranTest <- moran.mc(philly.test_nonzero$sale_price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#E46726",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="<span style='color:#E46726;'>Observed</span> and <span style='color:#7F7F7F;'>permuted</span> Moran's I",
x="Moran's I",
y="Count",
caption = "Figure 4.5") +
annotate(geom = "text", label = paste("Statistic:", round(moranTest$statistic,3),"\np-value:", moranTest$p.value), # Insert correlation here
x = 0.2, y = 300,
hjust = 0,
size = 4) +
theme_minimal() +
theme(plot.title = ggtext::element_markdown(size = 18, face = "bold"),
plot.caption = element_text(hjust = 0))
Another approach to measure spatial autocorrelation is Moran’s I, where a positive value close to one indicates strong positive spatial autocorrelation. The histogram above shows 999 randomly permuted I values, with the observed I marked by the orange line. The observed I, higher than all random permutations, confirms spatial autocorrelation. An I of 0.117 seems marginal, but a p-value of 0.001 still indicates statistically significant clustering.
Both the spatial lag and Moran’s I test show that our model errors exhibit a slight positive spatial autocorrelation, suggesting the presence of unaccounted factors.
How does the spatial autocorrelation manifest in our model? From the perspective of geospatial features, including neighborhoods and median income, we further analysed the spatial process of sale price predictions and tested the generalizability across urban contexts.
nhoods_model <- st_drop_geometry(philly.test) %>%
group_by(NAME) %>%
summarise(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
ungroup() %>%
left_join(nhoods) %>%
st_sf()
ggplot() +
geom_sf(data = nhoods, fill = "transparent", color = "grey") +
geom_sf(data = nhoods_model, aes(fill = mean.MAPE), color = "grey") +
scale_fill_gradient(low = "#6D9EC1", high = "#E46726", name = "MAPE") +
labs(title = "Mean test set MAPE by neighborhood",
caption = "Figure 4.6") +
mapTheme() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill=NA, size=1))
The MAPE by neighborhood reveals significantly less accurate predictions in North and West Philadelphia (highlighted in red). The vacant areas have no housing sales data, including northern Center City, University City, Northeast Philadelphia Airport and its surroundings, southern South Philadelphia, and parks. In North Philadelphia, the red areas are predominantly home to Hispanic and Black residents, while the surrounding population is more diverse, contributing to the high MAPE. In West Philadelphia, the impact of the University of Pennsylvania’s role in gentrification in the east and rising crime in the west have led to deteriorating areas with high MAPE. Both red areas also have lower sale prices, confirming the spatial pattern of model generalizability.
nhoods_model <- left_join(nhoods_model,
st_drop_geometry(philly.test) %>%
group_by(NAME) %>%
summarise(meanPrice = mean(sale_price, na.rm = T))
)
ggplot(data = st_drop_geometry(nhoods_model)) +
geom_point(aes(x = meanPrice, y = mean.MAPE), size = 2, color = "#E46726", alpha = 0.7) +
geom_vline(aes(xintercept= 375000),
color="#6D9EC1", linetype="dashed", size=.8, alpha = .8) +
geom_hline(aes(yintercept= .4),
color="#6D9EC1", linetype="dashed", size=.8, alpha = .8) +
labs(title = "MAPE as a function of mean price by neighborhood",
x = "Mean Price by Neighborhood",
y = "MAPE by Neighborhood",
caption = "Figure 4.7") +
theme_minimal() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.caption = element_text(hjust = 0)) +
annotate(geom = "text", label = paste("(275000,0.4)"),
x = 275000, y = 0.4,
hjust = -.8, vjust = -1, size = 4, color = "#6D9EC1", alpha = .8)
Upon further examining the relationship between price and MAPE across neighborhoods, it is evident that MAPE decreases as the mean price increases up to $375,000, but starts to rise when the mean price exceeds this threshold. In our model, mean prices between $125,000 and $500,000 are associated with relatively lower MAPE. The spatial pattern of MAPE (see Figure 4.6) also shows that lower-priced, declining areas tend to have higher MAPE. These two graphs highlight that our model is more effective in neighborhoods with mid-range prices and that the socioeconomic context of neighborhoods impacts pricing.
ggplot() +
geom_sf(data = tracts22, fill = "transparent", color = "grey") +
geom_sf(data = na.omit(tracts22), aes(fill = incomeContext), color = "grey90") +
scale_fill_manual(values = c("#E46726", "#6D9EC1"), name="Income Context") +
labs(title = "Tracts grouped by income",
subtitle = "Philadelphia, 2022",
caption = "Figure 4.8") +
mapTheme() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill=NA, size=1))
income_table <- st_join(philly.test, tracts22) %>%
group_by(incomeContext) %>%
summarise(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
dplyr::select(-"<NA>") %>%
mutate('No Context' = scales::percent(MAPE))
income_table %>%
knitr::kable(caption = "<b>Table 4.5 Test set MAPE by tract income context</b>", align = "ccc") %>%
kableExtra::kable_material(lightable_options = c("striped", "hover"))
High Income | Low Income | No Context |
---|---|---|
24% | 43% | 36% |
At the urban scale, we compared MAPE across different income contexts to assess the generalizability of our model. Tracts with median incomes above the citywide mean were designated as ‘High Income’, effectively dividing Philadelphia into two categories. The nearly 1:2 ratio in MAPE highlights a significant difference in the model’s goodness of fit, indicating higher accuracy in predicting housing prices in high-income areas. The neighborhoods with high MAPE (see Figure 4.6) fall predominantly within the low-income group, suggesting a connection between race, crime, income, and housing prices.
Using our model, the predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE” are as follows.
philly_finish <- philly %>%
mutate(sale_price.Predict = predict(reg.training, philly))
ggplot() +
geom_sf(data = nhoods, fill = "grey40", color = "grey", na.rm = T) +
geom_sf(data = philly_finish %>% filter(!is.na(sale_price.Predict)), aes(colour = q5(sale_price.Predict), na.rm = T),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(philly_finish,"sale_price.Predict"),
name="Quintile Breaks") +
guides(color = guide_legend(override.aes = list(size = 3))) +
labs(title="Predicted Values of 'Modelling' and 'Challenge'",
subtitle = "Philadelphia, 2022",
caption = "Figure 4.9") +
mapTheme() +
theme(plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.border = element_rect(colour = "black", fill=NA, size=1))
Our model effectively predicts 76% of the variation in housing prices, particularly with high accuracy for sales priced below $1,000,000. Our logarithmic transformation and piecewise splitting of distance to city hall is both interesting and essential to the model. Other key features such as distance to the city center, neighborhood names, and zoning numbers significantly reduce prediction errors by approximately 20% when included in training. The Mean Absolute Error is $65671.32 and the Mean Absolute Percent Error is 35.92%. Prediction errors remain high due to the dataset’s wide price variation and unaccounted factors influencing spatial dynamics. From the perspective of urban context, the model performs poorly in North Philadelphia, where clustering of Hispanic and Black residents presents challenges, and in West Philadelphia, where rising crime and gentrification add complexity. These decaying factors contribute to lower housing prices and a more intricate spatial process. Nonetheless, the model performs well in neighborhoods with mid-range housing prices, where the larger dataset provides stronger training and testing opportunities.
I would recommend our model to Zillow for customers looking to predict mid-range housing prices in Philadelphia. The model is well-trained and accounts for internal characteristics, public services/amenities, and spatial features, demonstrating a strong fit within the context of Philadelphia. While there is some spatial autocorrelation present and OLS regression has its limitations, we plan to explore additional socioeconomic factors that influence the spatial process of housing prices, and apply non-linear regression and other machine learning techniques to further enhance the model.