I’d been curious about generalised additive (mixed) models for some time, and the opportunity to learn more about them finally presented itself when a new project came my way, as part of my work at The Data Lab. The aim of this project was to understand the pattern of visitors recorded at two historic sites in Edinburgh: Edinburgh and Craigmillar Castles – both of which are managed by Historic Environment Scotland (HES).
By ‘understand‘ the pattern of visitors, I really mean ‘predict‘ it on the basis of several ‘reasonable’ predictor variables (which I will detail a bit later). However, it is perhaps worth starting off with a simpler model, that predicts (or in this case, forecasts) visitor numbers from… visitor numbers in the past. In a sense, this is similar to a classic time series scenario, where we would “use the data to predict itself”, without resorting to “external” predictors.
We will begin our discussion by first having a quick look at the data available. Then we will very briefly pause on some modelling aspects, to then have a more meaningful discussion of our data forecast (created in
R using package
mgcv). Afterwards, we can then discuss the additional sources of data that were used as predictors in a more complex, subsequent model. This is the overall structure we will follow:
- 1. The data
- 2. Types of models
- 3. Forecast model
- 4. Explanatory model
- 5. Going further: Extra materials and Newcastle meetup slides
Our monthly time series presents itself in long format, and is split by visitors’ country of origin as well as the type of ticket they purchased to gain entry to either of the two sites. Thus each row in the dataset details how many visitors were recorded:
- for a given month (starting with March 2012 for Edinburgh, and March 2013 for Craigmillar Castle, until March 2018),
- from a given country (UK, for internal tourism, but also USA or Italy etc.)
- purchasing a given ticket type (Walk up, or Explorer Pass etc.)
- visiting a given site (Edinburgh or Craigmillar).
To build our first, simpler
gamm model, we will collapse visitor numbers across country and ticket type, and only look at variations in the total number of visitors per month for each site. This collapsed data looks like this if plotted in
So what are
gamm models? To get a better idea, let’s have a look at where they fit within a conceptual progression of other models:
Types of models
If trying to predict an outcome
y via multiple linear regression on the basis of two predictor variables
x2, our model would have this general form:
- y = b0 + b1x1 + b2x2 + e
R syntax, a model of this nature could look like:
lm_mod <- lm( Visitors ~ Temperature + Rainfall, data = dat )
As the name suggests, this type of model assumes linear relationships between variables (which, as we’ve seen from the plot above, is not the case here!), as well as independent observations – which, again, is highly unlikely in our case (as visitor numbers from one month will have some relationship to the following month).
Generalised additive models (GAMs)
Under these circumstances, enter
gam models, which have this general form:
- y = b0 + f1(x1) + f2(x2) + e
As you will have noticed, in this case, the single b coefficients have been replaced with entire (smooth) functions or splines. These in turn consist of smaller basis functions. Multiple types of basis functions exist (and are suitable for various data problems), and can be chosen through the mgcv package in R. These smooth functions allow to follow the shape of the data much more closely as they are not constrained by the assumption of linearity (unlike the previous type of model).
R syntax, a
gam could appear as:
library( mgcv ) gam_mod <- gam( Visitors ~ s( Month ) + s( as.numeric( Date ) ) + te( Month, as.numeric( Date ) ) + # But see ?te and ?ti Site, data = dat )
gam models do still assume that data points are independent – which for time series data is not realistic. For this reason, we now turn to
gamm (generalised additive mixed) models – also supported by package
Generalised additive mixed models (GAMMs)
These allow the same flexibility of
gam models (in terms of integrating smooths), as well as correlated data points. This can be achieved by specifying various types of autoregressive correlation structures, via functionality already present in the separate
nlme::lme() function, meant for fitting linear mixed models (LMMs).
Unlike (seasonal) ARIMA models, with
gamms we needn’t concern ourselves with differencing or detrending the time series – we just need to take these elements correctly into account as part of the model itself. One way to do so is to use both the month (values cycling from
12), as well as the overall date as predictors, to capture the seasonality and trend aspects of the data, respectively. If we believe that the amount of seasonality may change over time, we can also add an interaction between the month and date. Finally, we can also specify various autoregressive correlation structures into our
gamm, as follows:
# mgcv::gamm() = nlme::lme() + mgcv::gam() gamm_mod <- gamm( Visitors ~ s( Month, bs = "cc" ) + s( as.numeric( Date ) ) + te( Month, as.numeric ( Date ) ) + Site, data = dat, correlation = corARMA( p = 1, q = 0 ) )
We can now essentially apply a similar
gamm to the one above to our data, the key difference is that we can allow the shape of the smooths to vary by Site (Edinburgh or Craigmillar), by specifying the predictors within the model for instance as:
s( Month, bs = "cc", by = Site ), where the
bs = "cc" argument refers to the choice of basis type – in this case, cyclic cubic regression splines which are useful to ensure that December and January line up. In our model, we can also add in a main effect for
All this will have been done after standardising the data separately within each site, to avoid results being distorted by the huge scale difference in visitor numbers between sites. At the end of this process, the output we get from our
gamm is this:
Family: gaussian Link function: identity Formula: Visitors ~ s(Month, bs = "cc", by = Site) + s(as.numeric(Date), by = Site) + te(Month, as.numeric(Date), by = Site) + Site Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.04312 0.03878 -1.112 0.2687 SiteEdinburgh 0.08738 0.05119 1.707 0.0908 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(Month):SiteCraigmillar 7.274 8.000 22.707 < 0.0000000000000002 *** s(Month):SiteEdinburgh 6.409 8.000 63.818 < 0.0000000000000002 *** s(as.numeric(Date)):SiteCraigmillar 1.001 1.001 38.212 0.00000000931069 *** s(as.numeric(Date)):SiteEdinburgh 1.000 1.000 57.220 0.00000000000612 *** te(Month,as.numeric(Date)):SiteCraigmillar 6.276 6.276 2.935 0.00787 ** te(Month,as.numeric(Date)):SiteEdinburgh 3.212 3.212 3.316 0.02018 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.913 Scale est. = 0.081956 n = 132
The output is split between ‘parametric’ (unsmoothed) coefficients and smooth terms. A key concept here is that of Effective Degrees of Freedom (EDF), which essentially tell you how ‘wiggly’ the fitted line is. For an EDF of 1, the predictor was estimated to have a linear relationship to the outcome. Importantly,
mgcv will penalise overly wiggly lines to avoid overfitting, which suggests you can wrap all your continuous predictors within smoothing functions, and the model will determine whether/to what extent the data supports a wiggly shape.
As a measure of overall fit for the
gamm model, we also get an Adjusted R-squared at the end of the output (other measures such as GCV – or Generalised Cross Validation – are offered for
gam models, but absent for
gamm – details in my slides below). Judging by this, our model is doing a good job of describing our data, so we can move on to generating a forecast as well… After converting the standardised values back to the original scale, this is how our prediction compares to the raw visitor numbers:
In order to check whether other variables may also contribute/relate to the pattern of visitors we have observed above, several types of data were collected from the following sources:
- The Consumer Confidence Index (CCI) developed by the Organisation for Economic Co-operation and Development (OECD) – is an indicator of how confident households across various countries are in their financial situation, over time.
- Google Trends R package
gtrendsR– measuring the number of hits over time and from various countries, for the queries: “Edinburgh Castle” and “Craigmillar Castle”.
- Global Data on Events, Language and Tone (GDELT) – this is a massive dataset measuring the tone and impact of worldwide news events as extracted from news articles. Data related to only Scottish events were selected.
- Internet Movie Database + Open Movie Database – these sources were scoured for data on productions related to Scotland (in terms of their plot or filming locations).
- Local weather data
- For-Sight hotel booking data, including information on the number of nights or rooms booked across four major hotels in Edinburgh
These datasets were merged with the HES visitor data by date, (and where applicable) country and site. It was difficult to find datasets covering large intervals of time – hence it was often the case that with every successive data merge, the interval covered would get narrower… So to reduce the chance of overfitting, only one measure per data source was retained for modelling purposes. Variables were selected via exploratory factor analysis (EFA) and by fitting a single factor per data source in order to identify the variable that loaded onto it the highest.
So can these extra predictors explain anything above and beyond the previous, simpler model? Let’s have a look at the final
gamm model, which had the following form:
gamm( Visitors ~ s( Month, bs = "cc", by = Site ) + s( as.numeric( Date ), by = TicketType, bs = "gp" ) + te( Month, as.numeric( Date ), by = Site ) + s( NumberOfAdults ) + s( Temperature ) + Site + TicketType + s( LaggedCCI ) + s( LaggedNumArticles ) + s( LaggedimdbVotes ) + s( Laggedhits, by = Site ), data = best_lag_solution, control = lmeControl( opt = "optim", msMaxIter = 10000 ), random = list( GroupingFactor = ~ 1 ), # GroupingFactor = Site x TicketType x Country REML = TRUE, correlation = corARMA( p = 1, q = 0 ) )
Based on the output for this model (shown below), we can check in the parametric coefficients section how the various ticket types compare in popularity to Walk Up tickets (used as the baseline category), or how Edinburgh Castle compared to Craigmillar (the latter functioning as the baseline in this case). In the smooth terms section, notably, we have allowed the spline for
Date to vary depending on the ticket type concerned – which was useful given the surprising EDF for Membership tickets…
Family: gaussian Link function: identity Formula: Visitors ~ s(Month, bs = "cc", by = Site) + s(as.numeric(Date), by = TicketType, bs = "gp") + te(Month, as.numeric(Date), by = Site) + s(NumberOfAdults) + s(Temperature) + Site + TicketType + s(LaggedCCI) + s(LaggedNumArticles) + s(LaggedimdbVotes) + s(Laggedhits, by = Site) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.06296 0.06656 0.946 0.344439 SiteEdinburgh 0.22692 0.06350 3.573 0.000372 *** TicketTypeEducation 0.07420 0.14545 0.510 0.610085 TicketTypeExplorer Pass -0.22165 0.06590 -3.363 0.000804 *** TicketTypeMembership -0.43625 0.06684 -6.527 1.14e-10 *** TicketTypePre-Paid -0.93159 0.14608 -6.377 2.91e-10 *** TicketTypeTrade -0.09395 0.06916 -1.358 0.174698 TicketTypeWeb -0.10447 0.06915 -1.511 0.131229 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(Month):SiteCraigmillar 4.954928 8.000 7.941 6.94e-15 *** s(Month):SiteEdinburgh 7.281621 8.000 20.031 < 2e-16 *** s(as.numeric(Date)):TicketTypeWalk Up 1.000001 1.000 2.854 0.09149 . s(as.numeric(Date)):TicketTypeEducation 1.000002 1.000 1.658 0.19821 s(as.numeric(Date)):TicketTypeExplorer Pass 1.000014 1.000 2.690 0.10132 s(as.numeric(Date)):TicketTypeMembership 7.325579 7.326 27.817 < 2e-16 *** s(as.numeric(Date)):TicketTypePre-Paid 1.000004 1.000 0.231 0.63072 s(as.numeric(Date)):TicketTypeTrade 1.000000 1.000 6.328 0.01206 * s(as.numeric(Date)):TicketTypeWeb 1.000000 1.000 22.546 2.39e-06 *** te(Month,as.numeric(Date)):SiteCraigmillar 0.001085 15.000 0.000 0.31354 te(Month,as.numeric(Date)):SiteEdinburgh 5.443975 15.000 2.090 4.26e-07 *** s(NumberOfAdults) 1.000009 1.000 40.617 2.93e-10 *** s(Temperature) 5.675452 5.675 3.058 0.00436 ** s(LaggedCCI) 1.000019 1.000 3.921 0.04798 * s(LaggedNumArticles) 1.985548 1.986 5.661 0.00598 ** s(LaggedimdbVotes) 2.702820 2.703 5.187 0.00188 ** s(Laggedhits):SiteCraigmillar 1.000011 1.000 8.012 0.00475 ** s(Laggedhits):SiteEdinburgh 3.106882 3.107 3.756 0.01092 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.694 Scale est. = 0.3072 n = 932
You can see this pattern below, in a plot created with the related package
But disappointingly, the Adjusted R-squared for this more complex model is lower than for the previous one – which suggests that the added variables (collectively) are not that successful in predicting visitor numbers at the two castles.
An important part of building
gamm models is carrying out assumption checks – a lot of information about this is present if you consult:
Something else that is interesting to consider would be how to validate forecast results. You can check out the idea of evaluating on a rolling forecasting origin or various caveats for time series validation.
Finally, I have only really scratched the surface of
gamm models – and so much more is worth exploring. So here are a few great resources you can check out:
- Mitchell Lyons’ introduction to GAMs, which includes an interesting discussion of basic functions and what you can find within
gamobjects in R
- Gavin Simpson’s blog
- Noam Ross – Nonlinear Models in R: The Wonderful World of mgcv
- Noam Ross’ Intro Course on GAMs
- Josef Fruehwald – Studying Pronunciation Changes with gamms