I’d been curious about generalised additive (mixed) models for some time, and the opportunity to learn more about them finally presented itself when a new project came my way, as part of my work at The Data Lab. The aim of this project was to understand the pattern of visitors recorded at two historic sites in Edinburgh: Edinburgh and Craigmillar Castles – both of which are managed by Historic Environment Scotland (HES).

By ‘understand‘ the pattern of visitors, I really mean ‘predict‘ it on the basis of several ‘reasonable’ predictor variables (which I will detail a bit later). However, it is perhaps worth starting off with a simpler model, that predicts (or in this case, forecasts) visitor numbers from… visitor numbers in the past. In a sense, this is similar to a classic time series scenario, where we would “use the data to predict itself”, without resorting to “external” predictors.

We will begin our discussion by first having a quick look at the data available. Then we will very briefly pause on some modelling aspects, to then have a more meaningful discussion of our data forecast (created in R using package mgcv). Afterwards, we can then discuss the additional sources of data that were used as predictors in a more complex, subsequent model. This is the overall structure we will follow:

1. The data
2. Types of models
3. Forecast model
4. Explanatory model
5. Going further: Extra materials and Newcastle meetup slides

The data

Our monthly time series presents itself in long format, and is split by visitors’ country of origin as well as the type of ticket they purchased to gain entry to either of the two sites. Thus each row in the dataset details how many visitors were recorded:

for a given month (starting with March 2012 for Edinburgh, and March 2013 for Craigmillar Castle, until March 2018),
from a given country (UK, for internal tourism, but also USA or Italy etc.)
purchasing a given ticket type (Walk up, or Explorer Pass etc.)
visiting a given site (Edinburgh or Craigmillar).

To build our first, simpler gamm model, we will collapse visitor numbers across country and ticket type, and only look at variations in the total number of visitors per month for each site. This collapsed data looks like this if plotted in R using ggplot2:

Edinburgh Craigmillar linegraph facets graph — These data exhibit an interesting pattern of seasonality over summer (for both castles) and around Easter (especially for Craigmillar Castle), as well as a general – but modest – upward trend. But will our gamm model pick up on these aspects correctly?

So what are gamm models? To get a better idea, let’s have a look at where they fit within a conceptual progression of other models:

Types of models

Linear regression

If trying to predict an outcome y via multiple linear regression on the basis of two predictor variables x₁ and x₂, our model would have this general form:

y = b₀ + b₁x₁ + b₂x₂ + e

Translated into R syntax, a model of this nature could look like:

lm_mod <- lm( Visitors ~ Temperature + Rainfall, data = dat )

As the name suggests, this type of model assumes linear relationships between variables (which, as we’ve seen from the plot above, is not the case here!), as well as independent observations – which, again, is highly unlikely in our case (as visitor numbers from one month will have some relationship to the following month).

Generalised additive models (GAMs)

Under these circumstances, enter gam models, which have this general form:

y = b₀ + f₁(x₁) + f₂(x₂) + e

As you will have noticed, in this case, the single b coefficients have been replaced with entire (smooth) functions or splines. These in turn consist of smaller basis functions. Multiple types of basis functions exist (and are suitable for various data problems), and can be chosen through the mgcv package in R. These smooth functions allow to follow the shape of the data much more closely as they are not constrained by the assumption of linearity (unlike the previous type of model).

Using R syntax, a gam could appear as:

library( mgcv )
gam_mod <- gam( Visitors ~ s( Month ) + 
                  s( as.numeric( Date ) ) + 
                  te( Month, as.numeric( Date ) ) + # But see ?te and ?ti 
                  Site, 
                data = dat )

However, gam models do still assume that data points are independent – which for time series data is not realistic. For this reason, we now turn to gamm (generalised additive mixed) models – also supported by package mgcv in R.

Generalised additive mixed models (GAMMs)

These allow the same flexibility of gam models (in terms of integrating smooths), as well as correlated data points. This can be achieved by specifying various types of autoregressive correlation structures, via functionality already present in the separate nlme::lme() function, meant for fitting linear mixed models (LMMs).

Unlike (seasonal) ARIMA models, with gamms we needn’t concern ourselves with differencing or detrending the time series – we just need to take these elements correctly into account as part of the model itself. One way to do so is to use both the month (values cycling from 1 to 12), as well as the overall date as predictors, to capture the seasonality and trend aspects of the data, respectively. If we believe that the amount of seasonality may change over time, we can also add an interaction between the month and date. Finally, we can also specify various autoregressive correlation structures into our gamm, as follows:

# mgcv::gamm() = nlme::lme() + mgcv::gam()
gamm_mod <- gamm( Visitors ~
                       s( Month, bs = "cc" ) +
                       s( as.numeric( Date ) ) +
                       te( Month, as.numeric ( Date ) ) +
                       Site,
                     data = dat,
                     correlation = corARMA( p = 1, q = 0 ) )

Forecast model

We can now essentially apply a similar gamm to the one above to our data, the key difference is that we can allow the shape of the smooths to vary by Site (Edinburgh or Craigmillar), by specifying the predictors within the model for instance as: s( Month, bs = "cc", by = Site ), where the bs = "cc" argument refers to the choice of basis type – in this case, cyclic cubic regression splines which are useful to ensure that December and January line up. In our model, we can also add in a main effect for Site.

All this will have been done after standardising the data separately within each site, to avoid results being distorted by the huge scale difference in visitor numbers between sites. At the end of this process, the output we get from our gamm is this:

Family: gaussian 
Link function: identity 

Formula:
Visitors ~ s(Month, bs = "cc", by = Site) + s(as.numeric(Date), 
    by = Site) + te(Month, as.numeric(Date), by = Site) + Site

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -0.04312    0.03878  -1.112   0.2687  
SiteEdinburgh  0.08738    0.05119   1.707   0.0908 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                             edf Ref.df      F              p-value    
s(Month):SiteCraigmillar                   7.274  8.000 22.707 < 0.0000000000000002 ***
s(Month):SiteEdinburgh                     6.409  8.000 63.818 < 0.0000000000000002 ***
s(as.numeric(Date)):SiteCraigmillar        1.001  1.001 38.212     0.00000000931069 ***
s(as.numeric(Date)):SiteEdinburgh          1.000  1.000 57.220     0.00000000000612 ***
te(Month,as.numeric(Date)):SiteCraigmillar 6.276  6.276  2.935              0.00787 ** 
te(Month,as.numeric(Date)):SiteEdinburgh   3.212  3.212  3.316              0.02018 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.913   
  Scale est. = 0.081956  n = 132

The output is split between ‘parametric’ (unsmoothed) coefficients and smooth terms. A key concept here is that of Effective Degrees of Freedom (EDF), which essentially tell you how ‘wiggly’ the fitted line is. For an EDF of 1, the predictor was estimated to have a linear relationship to the outcome. Importantly, mgcv will penalise overly wiggly lines to avoid overfitting, which suggests you can wrap all your continuous predictors within smoothing functions, and the model will determine whether/to what extent the data supports a wiggly shape.

As a measure of overall fit for the gamm model, we also get an Adjusted R-squared at the end of the output (other measures such as GCV – or Generalised Cross Validation – are offered for gam models, but absent for gamm – details in my slides below). Judging by this, our model is doing a good job of describing our data, so we can move on to generating a forecast as well… After converting the standardised values back to the original scale, this is how our prediction compares to the raw visitor numbers:

PredVsObsSimplerGAMM — While the prediction produced follows the original data quite closely, it’s worth noting the confidence intervals are impractically large and (following the conversion back to the original scale), also dip below 0, which for visitor numbers makes little sense. Possible solutions could be using a different family option in gamm() (perhaps Poisson or Quasi-Poisson), or at least truncating the lower bound of the confidence intervals.

Explanatory model

In order to check whether other variables may also contribute/relate to the pattern of visitors we have observed above, several types of data were collected from the following sources:

The Consumer Confidence Index (CCI) developed by the Organisation for Economic Co-operation and Development (OECD) – is an indicator of how confident households across various countries are in their financial situation, over time.
Google Trends R package gtrendsR – measuring the number of hits over time and from various countries, for the queries: “Edinburgh Castle” and “Craigmillar Castle”.
Global Data on Events, Language and Tone (GDELT) – this is a massive dataset measuring the tone and impact of worldwide news events as extracted from news articles. Data related to only Scottish events were selected.
Internet Movie Database + Open Movie Database – these sources were scoured for data on productions related to Scotland (in terms of their plot or filming locations).
Local weather data
For-Sight hotel booking data, including information on the number of nights or rooms booked across four major hotels in Edinburgh

These datasets were merged with the HES visitor data by date, (and where applicable) country and site. It was difficult to find datasets covering large intervals of time – hence it was often the case that with every successive data merge, the interval covered would get narrower… So to reduce the chance of overfitting, only one measure per data source was retained for modelling purposes. Variables were selected via exploratory factor analysis (EFA) and by fitting a single factor per data source in order to identify the variable that loaded onto it the highest.

So can these extra predictors explain anything above and beyond the previous, simpler model? Let’s have a look at the final gamm model, which had the following form:

gamm( Visitors ~ 
      s( Month, bs = "cc", by = Site )  +
      s( as.numeric( Date ), by = TicketType, bs = "gp" ) +
      te( Month, as.numeric( Date ), by = Site ) +
      s( NumberOfAdults ) +
      s( Temperature ) +
      Site +
      TicketType +
      s( LaggedCCI ) + 
      s( LaggedNumArticles ) +
      s( LaggedimdbVotes ) + 
      s( Laggedhits, by = Site ),

    data = best_lag_solution,
    control = lmeControl( opt = "optim", msMaxIter = 10000 ),
    random = list( GroupingFactor = ~ 1 ), # GroupingFactor = Site x TicketType x Country
    REML = TRUE,
    correlation = corARMA( p = 1, q = 0 )
  )

Based on the output for this model (shown below), we can check in the parametric coefficients section how the various ticket types compare in popularity to Walk Up tickets (used as the baseline category), or how Edinburgh Castle compared to Craigmillar (the latter functioning as the baseline in this case). In the smooth terms section, notably, we have allowed the spline for Date to vary depending on the ticket type concerned – which was useful given the surprising EDF for Membership tickets…

Family: gaussian 
Link function: identity 

Formula:
Visitors ~ s(Month, bs = "cc", by = Site) + s(as.numeric(Date), 
    by = TicketType, bs = "gp") + te(Month, as.numeric(Date), 
    by = Site) + s(NumberOfAdults) + s(Temperature) + Site + 
    TicketType + s(LaggedCCI) + s(LaggedNumArticles) + s(LaggedimdbVotes) + 
    s(Laggedhits, by = Site)

Parametric coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.06296    0.06656   0.946 0.344439    
SiteEdinburgh            0.22692    0.06350   3.573 0.000372 ***
TicketTypeEducation      0.07420    0.14545   0.510 0.610085    
TicketTypeExplorer Pass -0.22165    0.06590  -3.363 0.000804 ***
TicketTypeMembership    -0.43625    0.06684  -6.527 1.14e-10 ***
TicketTypePre-Paid      -0.93159    0.14608  -6.377 2.91e-10 ***
TicketTypeTrade         -0.09395    0.06916  -1.358 0.174698    
TicketTypeWeb           -0.10447    0.06915  -1.511 0.131229    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                                 edf Ref.df      F  p-value    
s(Month):SiteCraigmillar                    4.954928  8.000  7.941 6.94e-15 ***
s(Month):SiteEdinburgh                      7.281621  8.000 20.031  < 2e-16 ***
s(as.numeric(Date)):TicketTypeWalk Up       1.000001  1.000  2.854  0.09149 .  
s(as.numeric(Date)):TicketTypeEducation     1.000002  1.000  1.658  0.19821    
s(as.numeric(Date)):TicketTypeExplorer Pass 1.000014  1.000  2.690  0.10132    
s(as.numeric(Date)):TicketTypeMembership    7.325579  7.326 27.817  < 2e-16 ***
s(as.numeric(Date)):TicketTypePre-Paid      1.000004  1.000  0.231  0.63072    
s(as.numeric(Date)):TicketTypeTrade         1.000000  1.000  6.328  0.01206 *  
s(as.numeric(Date)):TicketTypeWeb           1.000000  1.000 22.546 2.39e-06 ***
te(Month,as.numeric(Date)):SiteCraigmillar  0.001085 15.000  0.000  0.31354    
te(Month,as.numeric(Date)):SiteEdinburgh    5.443975 15.000  2.090 4.26e-07 ***
s(NumberOfAdults)                           1.000009  1.000 40.617 2.93e-10 ***
s(Temperature)                              5.675452  5.675  3.058  0.00436 ** 
s(LaggedCCI)                                1.000019  1.000  3.921  0.04798 *  
s(LaggedNumArticles)                        1.985548  1.986  5.661  0.00598 ** 
s(LaggedimdbVotes)                          2.702820  2.703  5.187  0.00188 ** 
s(Laggedhits):SiteCraigmillar               1.000011  1.000  8.012  0.00475 ** 
s(Laggedhits):SiteEdinburgh                 3.106882  3.107  3.756  0.01092 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.694   Scale est. = 0.3072    n = 932

You can see this pattern below, in a plot created with the related package itsadug:

But disappointingly, the Adjusted R-squared for this more complex model is lower than for the previous one – which suggests that the added variables (collectively) are not that successful in predicting visitor numbers at the two castles.

Going further

An important part of building gam or gamm models is carrying out assumption checks – a lot of information about this is present if you consult: ?mgcv::gam.check and ?mgcViz::check.gamViz.

Something else that is interesting to consider would be how to validate forecast results. You can check out the idea of evaluating on a rolling forecasting origin or various caveats for time series validation.

Finally, I have only really scratched the surface of gam / gamm models – and so much more is worth exploring. So here are a few great resources you can check out:

Mitchell Lyons’ introduction to GAMs, which includes an interesting discussion of basic functions and what you can find within gam objects in R
Gavin Simpson’s blog
Noam Ross – Nonlinear Models in R: The Wonderful World of mgcv
Noam Ross’ Intro Course on GAMs
Josef Fruehwald – Studying Pronunciation Changes with gamms

Comments

Anonymous says

18/07/2019 at 10:32

This is great thanks (wish it was written years ago when I had to do some stuff with GAMs!). Do you think this type of model can outperform the more standard time series models for forecasting? I think its also used as the basis for the prophet package so there are definitely good examples of it being used for forecasting. And do you have some resources for exploratory factor analysis (I’ve never done it)?
datascienceteam@thedatalab.com says

23/07/2019 at 16:04

Hi, glad you enjoyed the post! I think it really depends on the data and the research question, unfortunately, and how you define ‘performance’. Both are well-established techniques, and operate on different principles… So I would hesitate before making any high-level statement about which may be ‘better’. As for prophet, I have yet to use this package myself, but yes – there are indeed similarities, although prophet relies on a Bayesian framework. As for factor analysis, perhaps a good place to start would be William Revelle’s overview of R package psych here. Hope this helps!
Anonymous says

30/07/2019 at 10:56

Many thanks, understand that it’s difficult to say if a statistical method is ‘better’ as its always more a case of the ‘right tool for the right job’. I’ve seen the psych package mentioned a few times recently so looks like it’ll be worth looking into!
Anonymous says

27/02/2021 at 04:17

Thanks for sharing your work, it’s great. In this context, if in GAMMs we can’t have the GCV parameter, what is the way to check the model in a framework of cross-validation?
datascienceteam@thedatalab.com says

03/03/2021 at 10:37

Hi, and thanks for commenting. It’s an interesting question you ask. Caterina, who originally performed the work, has since moved on to a new opportunity. The team has spent a little time digging into this, and we’re seeing the same issue — the package Caterina used does not include a GCV parameter but instead uses an adjusted r-squared. What we’ve not managed to find is either an overview of why this is, or what the recommended process is for GAMMS.

I’ve reached out to Caterina to bring her attention to this comment, and she may be able to add some additional detail here. In the meantime the best I can suggest is to follow what Caterina’s methodology here was and make use of the modified r-squared parameter in your optimisation process.

Joanna
datascienceteam@thedatalab.com says

08/03/2021 at 10:02

Hi,
I’ve heard back from Caterina, who has kindly agreed to provide some more depth on this matter. Her response is as follows:

=======================================

Great question and thanks for stopping by. I think two related issues are involved here – first, model assessment in terms of the output from `summary()`, and second, ways to carry out model validation. In terms of assessing a given model, indeed, GCV is not outputted for `gamm` models. This is related to the estimation method which is different between `gam()` and `gamm()` – so that the former relies on GCV smoothness estimation by default, whereas the latter – on REML/PQL. Hence why in the output GCV is absent for `gamm` models, but present for gam. One can force the use of REML in `gam()` models as well, and in those cases, the GCV value also vanishes from the model summary. You can test this in this example inspired by the mgcv docs:

library(mgcv)
set.seed(2) ## simulate some data…
dat <- gamSim(1, n=400, dist="normal", scale = 2) b1 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat) b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML") summary(b1) summary(b2) In terms of actual validation, for time series data such as the data in this post, I would avoid shuffling the data (which happens with e.g., k-fold cross validation). Instead, a better way to go would be to use something like rsample::sliding_window(), which will prepare sequential validation sets for you, based on some parameters you supply. You can then test your model on the subsets generated with `sliding_window()`, by extracting them as needed with `analysis()` and `assessment()`. ==================================================
smart contract hacked says

07/05/2023 at 06:33

Ꭺll is a solution thаt allоws companies to validate tһе authenticity of transactions
Nicola Burns says

14/05/2024 at 15:11

Was a correction for multiple comparisons done? If so, how? and if not, why not?

About Us

For Business

The Data Lab Academy

For Universities and Colleges

Community

Using Generalised Additive Mixed Models (GAMMs) to Predict Visitors to Edinburgh and Craigmillar Castles

The data

Types of models

Linear regression

Generalised additive models (GAMs)

Generalised additive mixed models (GAMMs)

Forecast model

Explanatory model

Going further

Get in touch

Follow us on social

About Us

For Business

The Data Lab Academy

For Universities and Colleges

Community

The data

Types of models

Linear regression

Generalised additive models (GAMs)

Generalised additive mixed models (GAMMs)

Forecast model

Explanatory model

Going further

Reader Interactions

Comments

Leave a Reply

Get in touch

Follow us on social