The purpose of this post is to implement “Nowcasting” in Stan using a state-space model. This post is originally inspired from Jim Savage’s gist located here.

## Data Generating Process

$y_t \sim N(x_t + \epsilon_t, \sigma_{\epsilon})$

$x_t \sim N(x_{t-1}+Z_t\gamma_t + \eta_t, \sigma_{\eta})$

In the case of this model we only observe the values of $$y$$ at time $$t$$.

## Build Fake Data

First we’ll bring in our packages.

library(rstan)
suppressPackageStartupMessages(library(tidyverse))
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
set.seed(336)

Then we will generate the required data. In this case we will generate the entire DPG, then for our simulation will remove those data points that do not coincide with our measurement frequency.

# Set up DGP
n <- 300 # number of data points
freq <- 28
# High frequency helpers
z1 <- rnorm(n, 0, 3);
z2 <- rnorm(n, 0, 3)

# Set up "real" state
x <- rep(NA, n)
x <- 1
for(i in 2:n){
x[i] <- x[i-1] + 0.4*z1[i] + -0.3*z2[i] + rnorm(1, 0, 0.2)
}

# Set up y that is only recorded once for every "freq" values of x (set freq above)
y <- x + rnorm(n, 0, 1)

y[!(1:n%%freq==0)] <- NA

# Have a look at the data to make sure you know what's happening
dat <-data.frame(y, z1, z2) 

Format the data for stan.

# y is now just the observed values of y
y <- dat$y[!is.na(dat$y)]

model_list <- list(N1 = length(y),
N2 = n,
freq = freq,
y = y,
z1 = z1,
z2 = z2)

## Build the Model

writeLines(readLines("stan_nowcast.stan"))
// From https://github.com/khakieconomics/nowcasting_in_stan/blob/master/nowcasting.stan
// Via James Savage

data {
int N1; // length of low frequency series
int N2; // length of high frequency series
int freq; // every freq-th observation of the high frequency series we get an observation of the low frequency one
vector[N1] y;
vector[N2] z1;
vector[N2] z2;
}
parameters {
real<lower = 0> sigma_epsilon;
real<lower = 0> sigma_eta;
vector gamma;
vector[N2] x;
}
model {
int count;
// priors
sigma_epsilon ~ cauchy(0,1);
sigma_eta ~ cauchy(0,1);
gamma ~ normal(0,1);
//increment_log_prob(normal_lpdf(x, 0, 1));

// likelihood
count = 0;
for(i in 2:N2){
target += normal_lpdf(x[i]| x[i-1] + z1[i]*gamma + z2[i]*gamma, sigma_eta);
if(i%freq==0){
count = count + 1;
target += normal_lpdf(y[count]| x[i], sigma_epsilon);
}
}
}

Now we have to compile our model.

model <- stan_model("stan_nowcast.stan")

## Fit the model

fit <- sampling(model, model_list, iter = 2000,
chains = 2, refresh = 0,
max_treedepth = 15))
## Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems

## Model Checking

And as always special thanks to Michael Betancourt for these amazing tools for diagnostics.

util <- new.env()
source('stan_utilities.R', local=util)
util$check_all_diagnostics(fit) ##  "n_eff / iter looks reasonable for all parameters" ##  "Rhat for parameter sigma_eta is 1.1372136448821!" ##  " Rhat above 1.1 indicates that the chains very likely have not mixed" ##  "0 of 2000 iterations ended with a divergence (0%)" ##  "813 of 2000 iterations saturated the maximum tree depth of 10 (40.65%)" ##  " Run again with max_treedepth set to a larger value to avoid saturation" ##  "Chain 1: E-FMI = 0.0218018506545009" ##  "Chain 2: E-FMI = 0.00913922915443346" ##  " E-FMI below 0.2 indicates you may need to reparameterize your model" So it appears that this model needs a little more work! ## Now some inferences So while our model parameterisation isn’t perfect the coefficients in the data generating process have been sussed out of the data by our model. So this isn’t a complete loss! print(fit, pars = c("gamma")) ## Inference for Stan model: stan_nowcast. ## 2 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=2000. ## ## mean se_mean sd 2.5% 25% 50% 75% 98% n_eff Rhat ## gamma 0.46 0 0.06 0.33 0.42 0.46 0.50 0.57 721 1 ## gamma -0.31 0 0.05 -0.41 -0.34 -0.31 -0.28 -0.22 1221 1 ## ## Samples were drawn using NUTS(diag_e) at Thu Jun 20 10:52:22 2019. ## For each parameter, n_eff is a crude measure of effective sample size, ## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1). stan_plot(fit, pars = "gamma") ## ci_level: 0.8 (80% intervals) ## outer_level: 0.95 (95% intervals) ## Make the Graph # Extract the estimates of the state x_mod <- extract(fit, pars = "x", permuted = F) x_mod <- plyr::adply(x_mod, 2) # Summarise the parameters yy <- dat$y
x_summarise <- x_mod %>%
dplyr::select(-chains) %>%
gather(variable, value) %>%
mutate(obs = str_extract(variable, "[0-9]{1,4}") %>% as.numeric) %>%
group_by(obs) %>%
summarise(Median = median(value),
Lower = quantile(value, 0.025),
Upper = quantile(value, 0.975)) %>%
mutate(Actual = x,
Signal = yy)

Now to the graph, but first we can check how often a result occured outside of our confidence interval (technically highest density interval).

mean(x_summarise$Actual<x_summarise$Lower | x_summarise$Actual>x_summarise$Upper) %>%
scales::percent()
##  "0.333%"
x_summarise %>%
ggplot(aes(x = obs)) +
geom_ribbon(aes(ymin = Lower, ymax = Upper), fill = "orange", alpha = 0.5) +
geom_line(aes(y = Median)) +
geom_line(aes(y = Actual), colour = "red") +
geom_point(aes(y = Signal), size = 2, color = "blue") +
labs(title = "Points are low-frequency observations\nRed is actual underlying (hf) series\nblack and orange are our estimate bounds")+
theme_minimal() Research and Methods Resources
me.dewitt.jr@gmail.com

Winston- Salem, NC

Michael DeWitt