1 Introduction

A bike share program is a form of shared transportation service where bicycles are made available for low-cost use by the public. Typically, these programs employ docking systems, allowing users to rent a bike from one dock and return it to another within the network. Consequently, re-balancing bicycles across docking stations is essential to ensure bikes are available for pick-up and that open docking spaces are accessible for returns.

Various strategies can be employed to operate a bike share system. In Philadelphia, bike share trips peak in the city center during the PM rush hours on weekdays and around midday on weekends. To address this demand pattern, my re-balancing plan proposes dispatching trucks for bike redistribution overnight on weekdays and during the AM rush hours on weekends. This approach is efficient for moving bikes to understocked docks and allows operators to assess bike conditions conveniently.

In this report, I utilized the Indego’s bike-share data to build a predictive model for shared bike usage to help guide rebalancing. The model is designed to forecast bike demand at similar times on weekdays and weekends. Several regression models were tested, emphasizing the one with lagged time variables but without holiday lags, achieving a low mean absolute error (MAE) of 0.715. This model performs particularly well in predicting overnight demand on weekdays and shows a high correlation with socio-economic variables. The predictions during PM rush hours could be further improved.

2 Data Wrangling

2.1 Import Bike Share Data

The 5-week bike share data, spanning from May 21st to June 24th, 2022 (weeks 20–25 of the year), was loaded for analysis. To facilitate further analysis, I standardized the data into 15-minute and 60-minute intervals, and extracted information on the week and day of the week for each observation.

dat <- read.csv("data/indego-trips-2022-q2.csv")
ride <- dat %>%
   mutate(start_time = as.POSIXct(start_time, format = "%m/%d/%Y %H:%M")) %>%
  filter(start_time >= as.POSIXct("2022-05-21 00:00"), 
         start_time <= as.POSIXct("2022-06-24 23:59")) %>%
  mutate(interval60 = floor_date(start_time, unit = "hour"),
         interval15 = floor_date(start_time, unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

2.2 Import Census Data

Because census geography and variables are often collinear with the fixed effects of station locations, I only use them to test generalizability. Three features are derived from the 2022 ACS 5-year estimates: the proportion of the white population, the median household income, and the proportion of individuals using public transportation.

census2022 <- 
  get_acs(geography = "tract",
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year=2022, state= 42, geometry=TRUE, output = "wide", county = 101) %>% 
  st_transform('EPSG:2272') %>%
  rename(Total_Pop =  B01003_001E,
         `Median Household Income` =  B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, `Median Household Income`, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,GEOID, geometry) %>%
  mutate(`% of White Population` = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         `% of Public Transit` = Total_Public_Trans / Means_of_Transport)

tracts <- 
  census2022 %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf %>%
  st_transform(crs=4326)

ride_census <- st_join(ride %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lon) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        tracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., tracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

2.3 Import Weather Data

Since people rarely travel during bad weather, such as stormy days or high-temperature days, incorporating weather data is essential for accurately predicting bike-share usage. Philadelphia’s weather data was obtained using the riem_measures function, with KPHL as the designated weather station.

weather.Panel <- 
  riem_measures(station = "KPHL", date_start = "2022-05-21", date_end = "2022-06-25") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = floor_date(valid, unit = "hour")) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))%>%
  mutate(interval60 = as.POSIXct(interval60, format = "%Y-%m-%d %H:%M:%S"))

2.4 Create Space-Time Panel

The ride.panel dataset is constructed with each row representing a specific time period in the study. Instances with no trips are assigned a value of 0. All variables required for subsequent predictions are then joined to this panel.

study.panel <- 
  expand.grid(interval60=unique(ride_census$interval60), 
              start_station = unique(ride_census$start_station)) %>%
  left_join(., ride_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat )%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

ride.panel <- 
  ride_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  ungroup()
####### mitigate ########
ride_weather <- ride.panel %>%
  mutate(interval60 = as.factor(interval60)) %>%
  left_join(weather.Panel %>% mutate(interval60 = as.factor(interval60)))
ride_weather_process <- ride_weather
ride_weather_process <- ride_weather %>%
  mutate(
    interval60 = if_else(
      str_detect(interval60, "^\\d{4}-\\d{1,2}-\\d{1,2}$"),
      paste(interval60, "00:00:00"),
      interval60
    ),
    interval60 = ymd_hms(interval60)
  )
ride_weather <- ride_weather_process %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)
####### mitigate ########
ride.panel <- ride_weather
ride.panel <- 
  left_join(ride.panel, census2022 %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

2.5 Create Time Lags

Creating time lag variables adds nuance to the demand patterns by capturing relationships between current demand and demand in subsequent hours and throughout the day. Given the Memorial Day holiday on May 28, I include dummy variables to indicate proximity to this holiday, accounting for the effects of the three-day weekend.

Evaluating the correlations of these lag variables reveals a strong Pearson’s R of 0.90 for lagHour and 0.87 for lag1day. This indicates that current demand is closely related to demand an hour ahead and to demand at the same time on the following day.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 147,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, "0" ))

cor_lag <- 
  as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(Correlation = round(cor(Value, Trip_Count),2))

cor_lag%>%
  knitr::kable(caption = "<b>Table 1 Correlation between Trip Count and Lags</b>", align = "cc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))

**Table 1 Correlation between Trip Count and Lags**
Variable	Correlation
lagHour	0.90
lag2Hours	0.72
lag3Hours	0.52
lag4Hours	0.32
lag12Hours	-0.65
lag1day	0.87

2.6 Split Training and Test

The data is split into a 3-week training set (weeks 21 to 23) and a 2-week test set (weeks 24 and 25). If the temporal and spatial patterns in the training data are generalizable, they can be used to project into the near future.

ride.Train <- filter(ride.panel, week <= 23)
ride.Test <- filter(ride.panel, week > 23)

3 Exploratory Analysis

In this section, the bike share data is analyzed for temporal, spatial, weather, and demographic relationships. The results reveal a strong temporal correlation in usage patterns within weekdays and weekends, along with a consistent spatial concentration in Philadelphia’s city center.

3.1 Time Serial Autocorrelation

(1) Bike-share Trips by Week

An analysis of the bike-share trips over the five-week period reveals a clear daily periodicity, with noticeable lulls on weekends. There is little difference between the training and test sets.

# reference lines
mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0)
# plot
st_drop_geometry(rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing"))) %>%
    group_by(Legend, interval60) %>% 
      summarize(Trip_Count = sum(Trip_Count)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        geom_vline(data = mondays, aes(xintercept = monday)) +
        labs(title="Bide-share Trips by Week: May 21st - June 24th",
             subtitle = "Week 21 - Week 25; Dotted Lines for Each Monday",
             caption = "Figure 3.1",
                       x="Day", y="Trip Count") +
        plotTheme + theme(panel.grid.major = element_blank(),legend.position = "bottom",aspect.ratio=3/9)

(2) Bike Share Trips on Weekend vs Weekday

Comparing bike-share trips across days of the week reveals distinct patterns between weekdays and weekends. From Monday to Friday, there is a modest peak around 7–8 a.m., followed by a more pronounced peak around 4–5 p.m., with trips then decreasing until midnight. On Saturdays and Sundays, however, bike-share usage rises above 2,000 trips between 12–6 p.m. Among weekdays, Tuesday and Wednesday show the highest usage, while Sunday experiences a peak around 3–4 p.m.

grid.arrange(
  ggplot(ride_census %>% mutate(hour=hour(interval60)))+
    geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
    scale_x_continuous(limits = c(0, 23)) +
    scale_colour_manual(values = palette7) +
    labs(title="Bike Share Trips by Day of the Week",
         subtitle = "Philadelphia, May 21st - June 24th, 2022",
         caption = "Figure 3.2",
       x="Hour", 
       y="Trip Counts")+
     plotTheme,

ggplot(ride_census %>% 
         mutate(hour = hour(interval15),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  scale_x_continuous(limits = c(0, 23)) +
  scale_colour_manual(values = palette2) +
  labs(title="Bike Share Trips by Weekend vs Weekday",
       subtitle = "Philadelphia, May 21st - June 24th, 2022",
       x="Hour", 
       y="Trip Counts")+
     plotTheme, ncol=2)

(3) Hourly Trips per Station

Most stations in Philadelphia average no more than three trips per hour, although some stations still see more than 10 trips per hour. Figure 3.4 illustrates the average number of bike-share trips per station across different time periods. Consistent with the earlier findings, peak trip activity occurs during rush hours. Specifically, the morning rush (7:00 a.m. to 10:00 a.m.), evening rush (3:00 p.m. to 6:00 p.m.), and midday (10:00 a.m. to 3:00 p.m.) periods see higher activity, while overnight hours (6:00 p.m. to midnight) typically experience zero to one trip per station.

ggplot(ride_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 1,fill="#fca50a")+
  labs(title="Bike Share Trips per Hour by Station",
       subtitle = "Philadelphia, May 21st - June 24th, 2022",
       caption = "Figure 3.3",
       x="Trip Counts", 
       y="Number of Stations")+
    scale_fill_manual(values = palette5) +
  plotTheme

ride_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 0.3,fill="#fca50a")+
  labs(title="Mean Number of Hourly Trips per Station",
       subtitle = "Philadelphia, May 21st - June 24th, 2022",
       caption = "Figure 3.4",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

(4) Time Lag Exploration

After calculating the mean values for each time interval, the correlations between trip counts and time lags per hour are plotted below. The strongest correlation is observed between trip counts and a 1-day lag (R = 0.45), followed closely by the correlation with a 1-hour lag (R = 0.44). As the lag increases, the correlation decreases, but the predictive power resurges with the 1-day lag. These features are likely to be strong predictors in the model.

plotData.lag <-
  filter(as.data.frame(ride.panel), week == 23) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                          "lag4Hours","lag12Hours","lag1day"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2))

ggplot(plotData.lag, aes(Value, Trip_Count)) +
  geom_point(size = 0.1, color = "#fca50a") +
  geom_text(data = correlation.lag, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "#932667") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Bike-share Trip Count as a Function of\nTime Lags per Hour",
       caption = "Figure 3.5") +
  plotTheme

3.2 Spatial Autocorrelation

The bike-share trip maps are organized by four time periods and by weekday/weekend. Hourly trip counts on weekdays are significantly higher than those on weekends, with weekday morning rush hour trips also surpassing weekend morning trip counts. However, even during holidays, the spatial pattern remains consistent, with trips concentrated in Philadelphia’s city center, suggesting a strong spatial autocorrelation.

ggplot()+
  geom_sf(data = tracts %>%
          st_transform(crs=4326),fill="grey30")+
  geom_point(data = ride_census %>% 
            mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.6, size = 1.3)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(ride_census$start_lat), max(ride_census$start_lat))+
  xlim(min(ride_census$start_lon), max(ride_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike Share Trips per Hour by Station",
       subtitle = "Philadelphia, May 21st - June 24th, 2022",
       caption = "Figure 3.6")+
  mapTheme

3.3 Space/Time Correlation

Bike share in Philadelphia demonstrates strong spatial and temporal dependencies, which are visualized in the animated map below. Using a single Monday from week 23 as a sample, the map shows the number of bike-share trips per 15-minute interval at each station. The map highlights that trip counts are highest around the city center.

week23 <-
  filter(ride , week == 23 & dotw == "Mon")

week23.panel <-
  expand.grid(
    interval15 = unique(week23$interval15),
    Origin.Tract = unique(ride_census$Origin.Tract))

ride.animation.data <-
  mutate(week23, Trip_Counter = 1) %>%
    right_join(week23.panel) %>% 
    group_by(interval15, start_lat, start_lon,Origin.Tract) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    left_join(tracts, by=c("Origin.Tract" = "GEOID")) %>%
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 1 ~ "0-1 trips",
                             Trip_Count > 1 & Trip_Count <= 3~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 5 ~ "3-5 trips",
                             Trip_Count > 5 ~ "5+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","0-1 trips","1-3 trips",
                                       "3-5 trips","5+ trips"))

rideshare_animation <-
  ggplot() +geom_sf(data = tracts %>%
          st_transform(crs=4326),fill="#000000")+
    geom_point(data = ride.animation.data, aes(x=start_lon, y = start_lat, color = Trips),fill = "transparent", alpha = 0.8, size = 2) +
    scale_color_manual(values = palette5) +
    labs(title = "Bike-share Trips for One Day in June 2022",
         subtitle = "15 minute intervals: {current_frame}",
         caption = "Figure 3.7") +
    transition_manual(interval15) +
    mapTheme

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

3.4 Weather Correlation

(1) Precipitation

The binary bar plot shows that snowy or rainy days are associated with a decreased likelihood of bike-share trips.

weather.plot <- st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(isPercip = ifelse(Precipitation > 0, "Rain/Snow", "None")) %>%
  group_by(isPercip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count))

ggplot(data = weather.plot, aes(isPercip, Mean_Trip_Count)) + 
  geom_bar(stat = "identity",fill="#fca50a", width = .6) +
  labs(title="Bike Share Trip Count by Percipitation",x="Percipitation", y="Mean Trip Count",
       caption = "Figure 3.8") +
  plotTheme

(2) Temperature

The relationship between temperature and bike-share trips varies across different weeks, suggesting that combining time period features with temperature data will be valuable for model development.

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point(color = "#fca50a") + geom_smooth(method = "lm", se= FALSE, colour = "#932667") +
    facet_wrap(~week, ncol=3) + 
    labs(title="Trip Count as a Fuction of Temperature by Week",
         x="Temperature", y="Mean Trip Count",
         caption = "Figure 3.9") +
    plotTheme

(3) Wind Speed

Wind speed appears to have a consistent effect on the number of bike-share trips across time, so I have excluded it from the model.

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Wind_Speed = first(Wind_Speed)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind_Speed, Trip_Count)) + 
    geom_point(color = "#fca50a") + geom_smooth(method = "lm", se= FALSE, colour = "#932667") +
    facet_wrap(~week, ncol=3) + 
    labs(title="Trip Count As a Fuction of Wind Speed by Week",
         caption = "Figure 3.10",
         x="Wind Speed", y="Mean Trip Count") +
    plotTheme

4 Modeling and Validation

This analysis involves training models on three weeks of data (weeks 21–23) and testing them on the following three weeks (weeks 24–25). Ordinary Least Squares (OLS) regression is used to build the model.

4.1 Estimate a Ride Share Forecast

This sections designs four different linear regressions which are estimated on ride.Train, each with different fixed effects:

reg1 focuses on just time, including hour fixed effects, day of the week, and Temperature.
reg2 focuses on just space effects with the station (the amenity feature) and other critical fixed effects.
reg3 includes both time and space fixed effects.
reg4 adds the time lag features, including laghour, lag2hours, lag3hours, and lag1day.
reg5 adds the holiday effects, including holiday and holidaylag.

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

4.2 Validate Test Set by Time

In this section, the Mean Absolute Error (MAE) is calculated for each model using the test set. The MAE for the time effects model (reg1) in week 24 is 0.851, which is comparable to the mean observed bike-share trips. Introducing time lags significantly improves the model’s predictive power, reducing the MAE of model D in week 24 to 0.715. However, incorporating holiday time lags has minimal impact on model performance, with a slight increase in the MAE.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  mutate(week = as.factor(week)) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors\nby Model Specification and Week",
         caption = "Figure 4.1") +
  plotTheme

For each model, the predicted and observed trip counts are extracted from the spatial context, and their means are plotted in time series form below. Models A and C exhibit the same time trend due to the time fixed effects. In contrast, Models D and E show greater sophistication in predicting peak periods, highlighting the critical role of time lags.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
  scale_colour_manual(values = palette2) +
      labs(title = "Predicted/Observed Bike Share Time Series", subtitle = "Philadelphia; A test set of 2 weeks",
           x = "Hour", y= "Station Trips",
           caption = "Figure 4.2") +
      plotTheme

4.3 Validate Test Set by Space

Since Model D exhibits the best goodness of fit among the four regressions, I map its MAE to validate the model across space. The highest errors are found in the city center, where trip volumes are higher, with the largest error occurring along the bank of the Schuylkill River.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = tracts, color = "grey80", fill = "white")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.6, size=3)+
  scale_colour_viridis(direction = -1, discrete = FALSE, option = "B")+
  ylim(min(ride_census$start_lat), max(ride_census$start_lat))+
  xlim(min(ride_census$start_lon), max(ride_census$start_lon))+
  labs(title="Mean Abs Error of Test Set, Model D",
       caption = "Figure 4.3")+
  mapTheme

4.4 Space-Time Error Evaluation

For Model D, predicted and observed trip counts are sorted by four time periods and by weekday/weekend. The scatter plots reveal significant underprediction across all time periods, with the best performance during overnight hours on weekdays and the worst during the AM rush on weekends.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression,dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction), color = "#fca50a")+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#932667")+
    geom_abline(slope = 1, intercept = 0, linetype = "dashed")+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted Bike Share Trips by Time",
       x="Observed trips", 
       y="Predicted trips",
       caption = "Figure 4.4")+
  plotTheme

In terms of spatial patterns, I evaluate the MAE of Model D across four time periods and by weekday/weekend. The prediction for the PM rush on weekdays shows a higher MAE than the other time periods, likely due to the concentration of peak trip counts in the city center during this time.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression,dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = tracts, color = "grey80", fill = "white")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.6,size=2)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(ride_census$start_lat), max(ride_census$start_lat))+
  xlim(min(ride_census$start_lon), max(ride_census$start_lon))+
  labs(title="Mean Abs Error of Test Set, Model D",
       caption = "Figure 4.5")+
  facet_grid(weekend~time_of_day)+
  mapTheme

The MAE of Model D shows a strong correlation with socio-economic variables, indicating that higher prediction errors tend to occur in areas with higher median household incomes, better public transit access, and a larger white population.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),dotw = map(data, pull, dotw),
           `% of Public Transit` = map(data, pull, `% of Public Transit`),
           `Median Household Income` = map(data, pull, `Median Household Income`),
           `% of White Population` = map(data, pull, `% of White Population`)) %>%
     select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression,dotw, `% of Public Transit`, `Median Household Income`, `% of White Population`) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
   mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
   group_by(start_station, `% of Public Transit`, `Median Household Income`, `% of White Population`) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), color = "#fca50a", alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "#932667")+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a Function of Socio-economic Variables",
       caption = "Figure 4.6",
       y="Mean Absolute Error (Trips)")+
  plotTheme

4.5 Cross-validation

I perform cross-validation using a “4-week training, 1-week testing” method with the 5-week bike-share data. The results show a mean MAE of 0.70 and a mean RMSE of 1.09, indicating a good fit of the model.

# Split the data into five separate weeks
weeks <- split(ride.panel, ride.panel$week)

# Initialize a list to store the results of each fold
results <- list()

# Perform 5-fold cross-validation
for (i in 1:5) {
  # Test set is the current week
  test_data <- weeks[[i]]
  
  # Train set is the other four weeks combined
  train_data <- do.call(rbind, weeks[-i])
  
  # Fit the model on the training set
  reg4.cv <- lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + 
                        Precipitation + lagHour + lag2Hours + lag3Hours + lag1day,
             data = train_data)
  
  # Make predictions on the test set
  test_data$predictions <- predict(reg4.cv, newdata = test_data)
  
  # Store the test results (actual vs. predicted)
  results[[i]] <- data.frame(Actual = test_data$Trip_Count, Predicted = test_data$predictions)
}

# Optionally, calculate performance metrics like MAE, RMSE for each fold
library(Metrics)
mae_results <- sapply(results, function(res) mae(na.omit(res$Actual), na.omit(res$Predicted)))
rmse_results <- sapply(results, function(res) rmse(na.omit(res$Actual), na.omit(res$Predicted)))
# Combine results into a data frame
results_df <- data.frame(
  Fold = 1:length(mae_results),
  MAE = mae_results,
  RMSE = rmse_results
)
# Calculate the mean of MAE and RMSE
mean_mae <- round(mean(mae_results), 2)
mean_rmse <- round(mean(rmse_results), 2)

# Add the mean values as a new row
results_df <- rbind(
  results_df,
  data.frame(Fold = "Mean", MAE = mean_mae, RMSE = mean_rmse)
)

results_df %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  knitr::kable(caption = "<b>Table 2 Results of Cross-Validation</b>", align = "cc") %>%
  kableExtra::kable_material(lightable_options = c("striped", "hover"))

**Table 2 Results of Cross-Validation**
Fold	MAE	RMSE
1	0.61	0.92
2	0.71	1.12
3	0.76	1.17
4	0.71	1.10
5	0.73	1.12
Mean	0.70	1.09

5 Conclusion

The time lag variables play a key role in building the predictive model, providing valuable insights for the bike re-balancing plan. By accurately forecasting bike-share trip patterns based on time, weather, and amenity factors, the model helps identify when and where bikes are most needed. The high predictive accuracy, as reflected in the low MAE and RMSE, demonstrates the model’s ability to reliably identify locations with high demand. This enables operators to allocate resources more efficiently, ensuring that bikes are available where and when they are needed, thus enhancing the overall efficiency of the bike-share system. However, the predictions during PM rush hours and midday on weekends show higher MAE compared to other time periods, suggesting an opportunity for further model improvement.

Bike Share Prediction

Philadelphia, 2022

Luming Xu

2024-11-15