I’d been curious about **generalised additive (mixed) models** for some time, and the opportunity to learn more about them finally presented itself when a new project came my way, as part of my work at The Data Lab. The aim of this project was to understand the pattern of visitors recorded at two historic sites in Edinburgh: Edinburgh and Craigmillar Castles – both of which are managed by Historic Environment Scotland (HES).

By ‘*understand*’ the pattern of visitors, I really mean ‘*predict*’ it on the basis of several ‘reasonable’ predictor variables (which I will detail a bit later). However, it is perhaps worth starting off with a simpler model that predicts (or in this case, *forecasts*) visitor numbers from… visitor numbers in the past. In a sense, this is similar to a classic time series scenario, where we would “use the data to predict itself”, without resorting to “external” predictors.

We will begin our discussion by first having a quick look at the data available. Then we will very briefly pause on some modelling aspects, before moving on to a more meaningful discussion of our data forecast (created in `R` using package `mgcv`). Afterwards, we can discuss the additional sources of data that were used as predictors in a more complex, subsequent model. This is the overall structure we will follow:

1. The data
2. Types of models
3. Forecast model
4. Explanatory model
5. Going further: extra materials and Newcastle meetup slides

# The data

Our *monthly* time series presents itself in long format, and is split by visitors’ country of origin as well as the type of ticket they purchased to gain entry to either of the two sites. Thus each row in the dataset details how many visitors were recorded:

- for a given month (starting with March 2012 for Edinburgh Castle, and March 2013 for Craigmillar Castle, until March 2018),
- from a given country (the UK, for internal tourism, but also the USA, Italy, etc.),
- purchasing a given ticket type (Walk Up, Explorer Pass, etc.),
- visiting a given site (Edinburgh or Craigmillar).

To build our first, simpler `gamm` model, we will collapse visitor numbers across country and ticket type, and only look at variations in the total number of visitors per month for each site. This collapsed data looks like this if plotted in `R` using `ggplot2`:

So what are `gamm` models? To get a better idea, let’s have a look at where they fit within a conceptual progression of other models:

# Types of models

## Linear regression

If trying to predict an outcome `y` via multiple linear regression on the basis of two predictor variables `x₁` and `x₂`, our model would have this general form:

- y = b₀ + b₁x₁ + b₂x₂ + e

Translated into `R` syntax, a model of this nature could look like:

```r
lm_mod <- lm( Visitors ~ Temperature + Rainfall, data = dat )
```

As the name suggests, this type of model assumes **linear** relationships between variables (which, as we’ve seen from the plot above, is not the case here!), as well as independent observations – which, again, is highly unlikely in our case (as visitor numbers from one month will have some relationship to the following month).

## Generalised additive models (GAMs)

Under these circumstances, enter `gam` models, which have this general form:

- y = b₀ + f₁(x₁) + f₂(x₂) + e

As you will have noticed, the single b coefficients have in this case been replaced with entire (smooth) functions, or splines. These in turn consist of smaller basis functions. Multiple types of basis functions exist (each suitable for different data problems), and they can be chosen through the `mgcv` package in `R`. These smooth functions allow the fit to follow the shape of the data much more closely, as they are not constrained by the assumption of linearity (unlike the previous type of model).

Using `R` syntax, a `gam` could appear as:

```r
library( mgcv )

gam_mod <- gam( Visitors ~ s( Month ) +
                  s( as.numeric( Date ) ) +
                  te( Month, as.numeric( Date ) ) + # But see ?te and ?ti
                  Site,
                data = dat )
```

However, `gam` models do still assume that data points are independent – which for time series data is not realistic. For this reason, we now turn to `gamm` (generalised additive **mixed**) models – also supported by package `mgcv` in `R`.

## Generalised additive mixed models (GAMMs)

These allow the same flexibility as `gam` models (in terms of integrating smooths), while also accommodating correlated data points. This is achieved by specifying various types of autoregressive correlation structures, via functionality already present in the separate `nlme::lme()` function, which is meant for fitting linear mixed models (LMMs).

Unlike (seasonal) ARIMA models, with `gamm`s we needn’t concern ourselves with differencing or detrending the time series – we just need to take these elements correctly into account as part of the model itself. One way to do so is to use both the month (values cycling from `1` to `12`) and the overall date as predictors, to capture the seasonality and trend aspects of the data, respectively. If we believe that the amount of seasonality may change over time, we can also add an interaction between the month and the date. Finally, we can also specify various autoregressive correlation structures in our `gamm`, as follows:

```r
# mgcv::gamm() = nlme::lme() + mgcv::gam()
gamm_mod <- gamm( Visitors ~ s( Month, bs = "cc" ) +
                    s( as.numeric( Date ) ) +
                    te( Month, as.numeric( Date ) ) +
                    Site,
                  data = dat,
                  correlation = corARMA( p = 1, q = 0 ) )
```

# Forecast model

We can now apply a `gamm` similar to the one above to our data. The key difference is that we can allow the shape of the smooths to vary by site (Edinburgh or Craigmillar), by specifying the predictors within the model, for instance, as `s( Month, bs = "cc", by = Site )`. The `bs = "cc"` argument refers to the choice of basis type – in this case, cyclic cubic regression splines, which are useful to ensure that December and January line up. In our model, we can also add a main effect for `Site`.
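Put together, the forecast model just described could be fitted along these lines – a sketch only, assuming the same hypothetical data frame `dat` used in the earlier examples (with columns `Visitors`, `Month`, `Date` and `Site`):

```r
library( mgcv )

# Forecast model: the smooths vary by Site, plus a main effect of Site.
# The AR(1) correlation structure is carried over from the earlier example.
fc_mod <- gamm( Visitors ~ s( Month, bs = "cc", by = Site ) +
                  s( as.numeric( Date ), by = Site ) +
                  te( Month, as.numeric( Date ), by = Site ) +
                  Site,
                data = dat,
                correlation = corARMA( p = 1, q = 0 ) )

summary( fc_mod$gam ) # gamm() returns both a $gam and an $lme component
```

Note that `gamm()` returns a list: the `$gam` part is used for summaries and prediction, while the `$lme` part holds the mixed-model machinery (including the correlation structure).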

All this is done after standardising the data separately within each site, to avoid results being distorted by the huge difference in the scale of visitor numbers between the two sites. At the end of this process, the output we get from our `gamm` is this:

```
Family: gaussian 
Link function: identity 

Formula:
Visitors ~ s(Month, bs = "cc", by = Site) + s(as.numeric(Date), by = Site) + 
    te(Month, as.numeric(Date), by = Site) + Site

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -0.04312    0.03878  -1.112   0.2687  
SiteEdinburgh   0.08738    0.05119   1.707   0.0908 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                             edf Ref.df      F              p-value    
s(Month):SiteCraigmillar                   7.274  8.000 22.707 < 0.0000000000000002 ***
s(Month):SiteEdinburgh                     6.409  8.000 63.818 < 0.0000000000000002 ***
s(as.numeric(Date)):SiteCraigmillar        1.001  1.001 38.212     0.00000000931069 ***
s(as.numeric(Date)):SiteEdinburgh          1.000  1.000 57.220     0.00000000000612 ***
te(Month,as.numeric(Date)):SiteCraigmillar 6.276  6.276  2.935              0.00787 ** 
te(Month,as.numeric(Date)):SiteEdinburgh   3.212  3.212  3.316              0.02018 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.913   
Scale est. = 0.081956  n = 132
```

The output is split between ‘parametric’ (unsmoothed) coefficients and smooth terms. A key concept here is that of Effective Degrees of Freedom (EDF), which essentially tell you how ‘wiggly’ the fitted line is. For an EDF of 1, the predictor was estimated to have a **linear** relationship with the outcome. Importantly, `mgcv` will **penalise** overly wiggly lines to avoid overfitting, which means you can wrap all your continuous predictors within smoothing functions, and the model will determine whether, and to what extent, the data support a wiggly shape.
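You can see this penalisation at work in a small simulated example (a sketch using `mgcv`’s built-in `gamSim()` data, in which the true effect of `x3` is exactly zero):

```r
library( mgcv )

set.seed( 1 )
sim <- gamSim( 1, n = 400, dist = "normal", scale = 2 ) # x3 has no true effect on y

# Even with a generous basis dimension (k = 20), the wiggliness penalty
# should shrink the smooth of x3 towards a straight line (EDF close to 1),
# while x2 (which has a genuinely non-linear effect) keeps a high EDF:
pen_mod <- gam( y ~ s( x2, k = 20 ) + s( x3, k = 20 ), data = sim, method = "REML" )
summary( pen_mod ) # compare the edf column for s(x2) vs s(x3)
```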

As a measure of overall fit for the `gamm` model, we also get an **Adjusted R-squared** at the end of the output (other measures, such as GCV – Generalised Cross-Validation – are offered for `gam` models, but absent for `gamm` – details in my slides below). Judging by this, our model is doing a good job of describing our data, so we can move on to generating a forecast as well… After converting the standardised values back to the original scale, this is how our prediction compares to the raw visitor numbers:
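The back-conversion step might look something like this – a hypothetical sketch, which assumes the per-site means and standard deviations used for standardising were stored in named vectors `site_means` and `site_sds`, and which re-uses the `gamm_mod` object from earlier:

```r
# Predict on the standardised scale, using the $gam component of the model...
dat$Pred_std <- predict( gamm_mod$gam, newdata = dat )

# ...then undo the per-site standardisation to recover raw visitor numbers:
dat$Pred <- dat$Pred_std * site_sds[ as.character( dat$Site ) ] +
            site_means[ as.character( dat$Site ) ]
```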

# Explanatory model

In order to check whether other variables may also contribute/relate to the pattern of visitors we have observed above, several types of data were collected from the following sources:

- The Consumer Confidence Index (CCI), developed by the Organisation for Economic Co-operation and Development (OECD) – an indicator of how confident households across various countries are in their financial situation, over time.
- Google Trends, via the R package `gtrendsR` – measuring the number of hits over time and from various countries for the queries “Edinburgh Castle” and “Craigmillar Castle”.
- Global Data on Events, Language and Tone (GDELT) – a massive dataset measuring the tone and impact of worldwide news events, as extracted from news articles. Only data related to Scottish events were selected.
- Internet Movie Database + Open Movie Database – these sources were scoured for data on productions related to Scotland (in terms of their plot or filming locations).
- Local weather data
- For-Sight hotel booking data, including information on the number of nights or rooms booked across four major hotels in Edinburgh

These datasets were merged with the HES visitor data by date and, where applicable, country and site. It was difficult to find datasets covering large intervals of time – hence it was often the case that, with every successive data merge, the interval covered would get narrower… So, to reduce the chance of overfitting, only one measure per data source was retained for modelling purposes. Variables were selected via exploratory factor analysis (EFA), by fitting a single factor per data source in order to identify the variable that loaded onto it the highest.
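For a single data source, that selection step could be sketched as follows (the variable and data frame names here are hypothetical; this assumes the `psych` package, whose `fa()` function fits the single-factor EFA):

```r
library( psych )

# Candidate variables from one data source (hypothetical names):
source_vars <- merged_dat[ , c( "GdeltTone", "GdeltImpact", "GdeltNumArticles" ) ]

# Fit a single factor, then keep the variable with the highest absolute loading:
efa_fit   <- fa( source_vars, nfactors = 1 )
loadings1 <- efa_fit$loadings[ , 1 ]
best_var  <- names( which.max( abs( loadings1 ) ) )
```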

So can these extra predictors explain anything above and beyond the previous, simpler model? Let’s have a look at the final `gamm` model, which had the following form:

```r
gamm( Visitors ~ s( Month, bs = "cc", by = Site ) +
        s( as.numeric( Date ), by = TicketType, bs = "gp" ) +
        te( Month, as.numeric( Date ), by = Site ) +
        s( NumberOfAdults ) +
        s( Temperature ) +
        Site +
        TicketType +
        s( LaggedCCI ) +
        s( LaggedNumArticles ) +
        s( LaggedimdbVotes ) +
        s( Laggedhits, by = Site ),
      data = best_lag_solution,
      control = lmeControl( opt = "optim", msMaxIter = 10000 ),
      random = list( GroupingFactor = ~ 1 ), # GroupingFactor = Site x TicketType x Country
      method = "REML",
      correlation = corARMA( p = 1, q = 0 ) )
```

Based on the output for this model (shown below), we can check, in the parametric coefficients section, how the various ticket types compare in popularity to Walk Up tickets (used as the baseline category), or how Edinburgh Castle compares to Craigmillar (the latter functioning as the baseline in this case). In the smooth terms section, notably, we have allowed the spline for `Date` to vary depending on the ticket type concerned – which was useful, given the surprising EDF for Membership tickets…

```
Family: gaussian 
Link function: identity 

Formula:
Visitors ~ s(Month, bs = "cc", by = Site) + s(as.numeric(Date), 
    by = TicketType, bs = "gp") + te(Month, as.numeric(Date), 
    by = Site) + s(NumberOfAdults) + s(Temperature) + Site + 
    TicketType + s(LaggedCCI) + s(LaggedNumArticles) + s(LaggedimdbVotes) + 
    s(Laggedhits, by = Site)

Parametric coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               0.06296    0.06656   0.946 0.344439    
SiteEdinburgh             0.22692    0.06350   3.573 0.000372 ***
TicketTypeEducation       0.07420    0.14545   0.510 0.610085    
TicketTypeExplorer Pass  -0.22165    0.06590  -3.363 0.000804 ***
TicketTypeMembership     -0.43625    0.06684  -6.527 1.14e-10 ***
TicketTypePre-Paid       -0.93159    0.14608  -6.377 2.91e-10 ***
TicketTypeTrade          -0.09395    0.06916  -1.358 0.174698    
TicketTypeWeb            -0.10447    0.06915  -1.511 0.131229    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                                 edf Ref.df      F  p-value    
s(Month):SiteCraigmillar                    4.954928  8.000  7.941 6.94e-15 ***
s(Month):SiteEdinburgh                      7.281621  8.000 20.031  < 2e-16 ***
s(as.numeric(Date)):TicketTypeWalk Up       1.000001  1.000  2.854  0.09149 .  
s(as.numeric(Date)):TicketTypeEducation     1.000002  1.000  1.658  0.19821    
s(as.numeric(Date)):TicketTypeExplorer Pass 1.000014  1.000  2.690  0.10132    
s(as.numeric(Date)):TicketTypeMembership    7.325579  7.326 27.817  < 2e-16 ***
s(as.numeric(Date)):TicketTypePre-Paid      1.000004  1.000  0.231  0.63072    
s(as.numeric(Date)):TicketTypeTrade         1.000000  1.000  6.328  0.01206 *  
s(as.numeric(Date)):TicketTypeWeb           1.000000  1.000 22.546 2.39e-06 ***
te(Month,as.numeric(Date)):SiteCraigmillar  0.001085 15.000  0.000  0.31354    
te(Month,as.numeric(Date)):SiteEdinburgh    5.443975 15.000  2.090 4.26e-07 ***
s(NumberOfAdults)                           1.000009  1.000 40.617 2.93e-10 ***
s(Temperature)                              5.675452  5.675  3.058  0.00436 ** 
s(LaggedCCI)                                1.000019  1.000  3.921  0.04798 *  
s(LaggedNumArticles)                        1.985548  1.986  5.661  0.00598 ** 
s(LaggedimdbVotes)                          2.702820  2.703  5.187  0.00188 ** 
s(Laggedhits):SiteCraigmillar               1.000011  1.000  8.012  0.00475 ** 
s(Laggedhits):SiteEdinburgh                 3.106882  3.107  3.756  0.01092 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.694   
Scale est. = 0.3072    n = 932
```

You can see this pattern below, in a plot created with the related package `itsadug`:

But disappointingly, the Adjusted R-squared for this more complex model is *lower* than for the previous one – which suggests that the added variables (collectively) are not that successful in predicting visitor numbers at the two castles.

# Going further

An important part of building `gam` or `gamm` models is carrying out assumption checks – there is a lot of information about this in `?mgcv::gam.check` and `?mgcViz::check.gamViz`.
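As a minimal illustration, the checks can be run on the `$gam` component of the `gamm_mod` object fitted earlier (note that some of `gam.check()`’s diagnostics behave slightly differently for models fitted with `gamm()`):

```r
library( mgcv )

# Residual diagnostics (QQ plot, residuals vs fitted, histogram, observed vs
# fitted), plus a console check that the basis dimension k of each smooth
# was large enough:
par( mfrow = c( 2, 2 ) )
gam.check( gamm_mod$gam )
```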

Something else that is interesting to consider is how to validate forecast results. You can check out the idea of evaluating on a rolling forecasting origin, or the various caveats for time series validation.
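As a rough sketch of what rolling-origin evaluation could look like for this kind of monthly series (hypothetical: it assumes a data frame `dat` for a single site, sorted by date, with the same columns as earlier):

```r
library( mgcv )

# Rolling forecasting origin: fit on the first `origin` months,
# forecast the next month, then slide the origin forward by one.
errors <- numeric( 0 )
for ( origin in 48 : ( nrow( dat ) - 1 ) ) {
  train <- dat[ 1 : origin, ]
  test  <- dat[ origin + 1, , drop = FALSE ]
  m <- gam( Visitors ~ s( Month, bs = "cc" ) + s( as.numeric( Date ) ),
            data = train )
  errors <- c( errors, test$Visitors - predict( m, newdata = test ) )
}
sqrt( mean( errors^2 ) ) # rolling-origin RMSE
```

The choice of `48` starting months is arbitrary; the point is that each validation point is only ever forecast from data that precedes it.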

Finally, I have only really scratched the surface of `gam`/`gamm` models – so much more is worth exploring. Here are a few great resources you can check out:

- Mitchell Lyons’ introduction to GAMs, which includes an interesting discussion of basis functions and of what you can find within `gam` objects in R
- Gavin Simpson’s blog
- Noam Ross – Nonlinear Models in R: The Wonderful World of mgcv
- Noam Ross’ intro course on GAMs
- Josef Fruehwald – Studying Pronunciation Changes with GAMMs

Anonymous says

This is great thanks (wish it was written years ago when I had to do some stuff with GAMs!). Do you think this type of model can outperform the more standard time series models for forecasting? I think its also used as the basis for the prophet package so there are definitely good examples of it being used for forecasting. And do you have some resources for exploratory factor analysis (I’ve never done it)?

datascienceteam@thedatalab.com says

Hi, glad you enjoyed the post! I think it really depends on the data and the research question, unfortunately, and on how you define ‘performance’. Both are well-established techniques, and operate on different principles… So I would hesitate before making any high-level statement about which may be ‘better’. As for `prophet`, I have yet to use this package myself, but yes – there are indeed similarities, although `prophet` relies on a Bayesian framework. As for factor analysis, perhaps a good place to start would be William Revelle’s overview of the R package `psych` here. Hope this helps!

Anonymous says

Many thanks, understand that it’s difficult to say if a statistical method is ‘better’ as its always more a case of the ‘right tool for the right job’. I’ve seen the psych package mentioned a few times recently so looks like it’ll be worth looking into!

Anonymous says

Thanks for sharing your work, it’s great. In this context, if in GAMMs we can’t have the GCV parameter, what is the way to check the model in a cross-validation framework?

datascienceteam@thedatalab.com says

Hi, and thanks for commenting. It’s an interesting question you ask. Caterina, who originally performed the work, has since moved on to a new opportunity. The team has spent a little time digging into this, and we’re seeing the same issue – the package Caterina used does not include a GCV parameter but instead reports an adjusted R-squared. What we’ve not managed to find is either an overview of why this is, or what the recommended process is for GAMMs.

I’ve reached out to Caterina to bring her attention to this comment, and she may be able to add some additional detail here. In the meantime, the best I can suggest is to follow Caterina’s methodology and make use of the adjusted R-squared in your optimisation process.

Joanna

datascienceteam@thedatalab.com says

Hi,

I’ve heard back from Caterina, who has kindly agreed to provide some more depth on this matter. Her response is as follows:

=======================================

Great question, and thanks for stopping by. I think two related issues are involved here – first, model assessment in terms of the output from `summary()`, and second, ways to carry out model validation. In terms of assessing a given model, indeed, GCV is not output for `gamm` models. This is related to the estimation method, which differs between `gam()` and `gamm()` – the former relies on GCV smoothness estimation by default, whereas the latter relies on REML/PQL. Hence GCV is absent from the output for `gamm` models, but present for `gam`. One can force the use of REML in `gam()` models as well, and in those cases the GCV value also vanishes from the model summary. You can test this in this example inspired by the `mgcv` docs:

```r
library(mgcv)

set.seed(2) ## simulate some data…
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

b1 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

summary(b1) # GCV reported
summary(b2) # GCV absent under REML
```

In terms of actual validation, for time series data such as the data in this post, I would avoid shuffling the data (which happens with e.g. k-fold cross-validation). Instead, a better way to go would be to use something like `rsample::sliding_window()`, which will prepare sequential validation sets for you, based on some parameters you supply. You can then test your model on the subsets generated with `sliding_window()`, by extracting them as needed with `analysis()` and `assessment()`.

==================================================


Nicola Burns says

Was a correction for multiple comparisons done? If so, how? and if not, why not?