Luis Campos (with Luke Miratrix)
4/19/2016
Electricity consumption over time
Storage and transmission of this data start to become an issue
Some requirements:
Summary:
Population quantity of interest, e.g. \( \ \mu(t) = \frac{1}{N}\sum_{i = 1}^N f_i(t) \)
One question: How do we estimate \( \ \mu(t) \) from a sample?
Sampling indicator \( \ S_i \) and selection probability \( \ \pi_i \)
Sample average \( (S) \), Horvitz-Thompson \( (HT) \), Hajek \( (H) \): for \( \ t \) \[ \begin{align} &\widehat\mu_{S}(t) = \frac{1}{n}\sum_{i = 1}^N S_i\ f_i(t),& &\widehat\mu_{HT}(t)= \frac{1}{N}\sum_{i = 1}^N\frac{S_i}{\pi_i} f_i(t)\\ &\widehat\mu_{H}(t) = \frac{1}{\widehat N}\sum_{i = 1}^N\frac{S_i}{\pi_i} f_i(t),& & \widehat N = \sum_{i = 1}^N\frac{S_i}{\pi_i} \end{align} \]
\( \widehat N \) is the sum of the sample weights \( 1/\pi_i \) over the sampled units; it estimates the population size \( N \)
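A minimal sketch of the three estimators on a common time grid, assuming the curves are stored as an \( N \times T \) array. The function name `estimate_means` and the array layout are illustrative assumptions, not part of the original slides. Rows of unsampled units never contribute, since their weight \( S_i/\pi_i \) is zero.

```python
import numpy as np

def estimate_means(f, S, pi):
    """Sample-average, Horvitz-Thompson and Hajek estimates of mu(t).

    f  : (N, T) array with f[i, k] = f_i(t_k); only sampled rows are used
    S  : (N,) array of 0/1 sampling indicators
    pi : (N,) array of selection probabilities
    """
    f = np.asarray(f, dtype=float)
    S = np.asarray(S, dtype=float)
    pi = np.asarray(pi, dtype=float)
    N = f.shape[0]
    n = S.sum()                 # realized sample size
    w = S / pi                  # design weights, zero for unsampled units
    mu_s = (S @ f) / n          # sample average
    mu_ht = (w @ f) / N         # Horvitz-Thompson: divide by the true N
    mu_h = (w @ f) / w.sum()    # Hajek: divide by N-hat = sum of weights
    return mu_s, mu_ht, mu_h
```

The Hajek version only needs the weights, not \( N \), which is one reason it is often preferred when \( \widehat N \) is noisy.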
Cardot, Goga, Lardin [2013a, b, 2014, 2015] - Asymptotics, properties, cool data, and more estimators.
Previous month's average usage \( \ (x_i) \), current usage \( (\ f_i(t)) \). \( i \in \{1, ..., N\} \), \( t \in \{t_1, ..., t_T\} \)
Population: \[ \begin{align} &x_i \sim \mathrm{Gamma}(2)\\ &f_i(t) = 10 + x_i(1 + \sin(x_i)) + \varepsilon_i(t) \\ &[\varepsilon_i(t_1), ..., \varepsilon_i(t_T)] \sim N_T({\bf{0}}, K) \\ &(K)_{kl} = \exp(-(t_k - t_l)^2/2) \end{align} \]
Sampling Mechanism - Poisson Sampling:
\[ \begin{align} \pi_i = 100\frac{x_i}{\sum_{j = 1}^N x_j},\ \ S_i \stackrel{ind}{\sim} \mathrm{Bern}(\pi_i),\ \ n = \sum_{i = 1}^N S_i,\ \ \mathbb{E}(n) = 100 \end{align} \]
Root Mean Squared Error:
\[ Err(\widehat{\mu}_{*}, {\mu}) = \left(\frac{1}{T}\sum_{t = t_1}^{t_T}\left(\widehat{\mu}_{*}(t) - {\mu}(t)\right)^2 \right)^{1/2} \]
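The setup above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not give \( N \), \( T \), the time grid, or the Gamma scale, so \( N = 5000 \), an hourly grid of \( T = 24 \) points, scale 1, and the seed are all made up here, and the printed errors are not meant to reproduce any reported numbers.

```python
import numpy as np

def simulate_population(N, t, rng):
    """Draw x_i ~ Gamma(2) and curves f_i(t) = 10 + x_i (1 + sin x_i) + eps_i(t)."""
    x = rng.gamma(shape=2.0, scale=1.0, size=N)         # scale = 1 is an assumption
    K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2)   # (K)_kl = exp(-(t_k - t_l)^2 / 2)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))   # small jitter for numerical stability
    eps = rng.standard_normal((N, len(t))) @ L.T        # each row ~ N_T(0, K)
    f = 10.0 + (x * (1.0 + np.sin(x)))[:, None] + eps
    return x, f

def poisson_sample(x, expected_n, rng):
    """Poisson sampling with pi_i proportional to x_i and E(n) = expected_n."""
    pi = expected_n * x / x.sum()
    if pi.max() > 1.0:
        raise ValueError("some pi_i exceed 1; increase N or lower expected_n")
    S = rng.random(len(x)) < pi                         # independent Bernoulli(pi_i) draws
    return S, pi

def rmse(mu_hat, mu):
    """Root mean squared error over the time grid."""
    return float(np.sqrt(np.mean((mu_hat - mu) ** 2)))

rng = np.random.default_rng(0)                          # arbitrary seed
t = np.arange(24.0)                                     # assumed hourly grid
x, f = simulate_population(5000, t, rng)
S, pi = poisson_sample(x, 100, rng)

mu = f.mean(axis=0)                                     # population mean curve
w = S / pi                                              # design weights
errors = {
    "Sample average": rmse(f[S].mean(axis=0), mu),
    "H-T": rmse(w @ f / len(x), mu),
    "Hajek": rmse(w @ f / w.sum(), mu),
}
for name, err in errors.items():
    print(f"{name}: {err:.3f}")
```

Wrapping the last block in a loop over fresh draws of `S` gives a repeated-sampling comparison for a fixed population.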
How did the Estimators do?
For this example, the RMSEs are:
Estimate | Error |
---|---|
Simple Estimate | 1.759 |
H-T Estimate | 0.309 |
Hajek Estimate | 0.229 |
Given the population, repeat 1,000 times:
Simulation Results:
Error of Estimates:
Estimate | Mean | SD | min | max |
---|---|---|---|---|
Simple Average | 1.48 | 0.22 | 0.59 | 2.35 |
H-T Estimate | 1.25 | 0.97 | 0.07 | 9.16 |
Hajek Estimate | 0.26 | 0.15 | 0.06 | 1.33 |
Covariance between \( \hat\mu_{HT}(r) \) and \( \hat\mu_{HT}(t) \):
\[ \gamma_{HT}(r, t) = \frac{1}{N^2}\left[\sum_{i\in U}\frac{(1-\pi_i)}{\pi_i} f_i(r)\ f_i(t)\right] \]
An estimate (H-T-ification)
\[ \widehat\gamma_{HT}(r, t) = \frac{1}{N^2}\left[\sum_{i\in U} \color{red}{\frac{S_i}{\pi_i} }\frac{(1-\pi_i)}{\pi_i} f_i(r)\ f_i(t)\right] \]
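A short sketch of the plug-in estimate \( \widehat\gamma_{HT}(r, t) \) for every pair of grid points at once, assuming Poisson sampling as above. The function name `ht_covariance_estimate` and the argument layout are hypothetical; only sampled units are passed in, since the factor \( S_i/\pi_i \) removes everything else.

```python
import numpy as np

def ht_covariance_estimate(f_sample, pi_sample, N):
    """Estimated covariance of the H-T mean curve on the time grid.

    f_sample  : (n, T) curves of the sampled units
    pi_sample : (n,) selection probabilities of the sampled units
    N         : population size
    Returns a (T, T) matrix whose [r, t] entry estimates gamma_HT(t_r, t_t).
    """
    f_sample = np.asarray(f_sample, dtype=float)
    pi_sample = np.asarray(pi_sample, dtype=float)
    # (1 / pi_i) * (1 - pi_i) / pi_i for each sampled unit
    c = (1.0 - pi_sample) / pi_sample ** 2
    # sum_i c_i f_i(r) f_i(t), computed for all (r, t) pairs
    return (f_sample.T * c) @ f_sample / N ** 2
```

The square root of the diagonal gives pointwise standard errors for \( \widehat\mu_{HT}(t) \).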
Simulation based inference: Gelman and Hill (Ch. 7, 8), Cardot 2013a.
Consider the following:
One interpretation for weights:
We can write: \( 1/\pi_i= 1 + (1/\pi_i - 1) \)
Why this interpretation? \( \ \mathbb{E}\left(\sum_{i = 1}^N\frac{S_i}{\pi_i}\right) = N \)
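The expectation follows in one line from \( \mathbb{E}(S_i) = \pi_i \): on average, the weights of the sampled units add up to the whole population.

\[ \mathbb{E}\left(\sum_{i = 1}^N\frac{S_i}{\pi_i}\right) = \sum_{i = 1}^N\frac{\mathbb{E}(S_i)}{\pi_i} = \sum_{i = 1}^N\frac{\pi_i}{\pi_i} = N \]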
Continuing this logic
\[ \begin{align} &\mu(t) = \frac{1}{N} \sum_{i = 1}^N f_i(t) = \frac{1}{N} \left(\sum_{i \in {\bf{S}}} f_i(t) + \color{red}{ \sum_{i \notin {\bf{S}}} f_i(t) }\right)\\ \end{align} \]
\[ \begin{align} \hat\mu_{HT}(t) &= \frac{1}{N} \sum_{i = 1}^N \frac{S_i}{\pi_i} \ f_i(t)\\ &=\frac{1}{N} \left(\sum_{i \in {\bf{S}}} f_i(t) + \color{red}{ \sum_{i \in {\bf{S}}} \left(\frac{1}{\pi_i} - 1\right)\ f_i(t)}\right) \end{align} \]
Why model?
What model? A Gaussian process (GP) is a natural model for functional data.
Model \( \ f_i(t) \) as a function of time and the auxiliary variable \( x_i \)
\[ \begin{align*} f_i(t) = f(t, x_i) &\sim GP(\nu(t, x_i), K)\\ K\left((t, x), (t', x')\right) &= \sigma_1^2 \exp\left(-\frac{(t - t')^2}{l_1^2} -\frac{(x - x')^2}{l_2^2}\right)\\ &+\ \sigma_2^2 \delta_{x = x'} \exp\left(-\frac{(t - t')^2}{l_3^2}\right) + \sigma_3^2 \delta_{x = x',\ t = t'} \end{align*} \]
\[ X_{obs} = \left( \begin{array}{cc} t_1 & x_1 \\ \vdots & \vdots\\ t_T & x_1 \\ \vdots & \vdots\\ t_1 & x_n \\ \vdots & \vdots\\ t_T & x_n \end{array} \right), Y_{obs} = \left( \begin{array}{c} f_1(t_1) \\ \vdots\\ f_1(t_T) \\ \vdots\\ f_n(t_1) \\ \vdots\\ f_n(t_T) \end{array} \right) \]
We can similarly define \( X_{*} \), the locations of unobserved function values and \( Y_{*} \) the unobserved function values.
We can model the unobserved outcomes.
Lifted straight out of Rasmussen & Williams (R&W 2006)
\[ \begin{align*} &Y_{*}|X_{obs}, Y_{obs}, X_{*} \sim N(\nu^*, \Sigma^*)\\ &\nu^* = K(X_{*}, X_{obs})\left[K(X_{obs}, X_{obs}) + \sigma^2 I\right]^{-1}Y_{obs}\\ &\Sigma^* = K(X_{*}, X_{*}) + \sigma^2 I \\ &\ \ \ \ \ - K(X_{*}, X_{obs})\left[K(X_{obs}, X_{obs}) + \sigma^2 I\right]^{-1}K(X_{obs}, X_{*}) \end{align*} \]
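A minimal NumPy sketch of the standard zero-mean GP predictive equations from R&W (2006, Ch. 2): the predictive covariance is the prior covariance at the new locations, plus noise, minus the part explained by the observations. To keep it short it uses only the first, squared-exponential term of the kernel above; the names `sq_exp_kernel`, `gp_predict`, `noise_var` and the default hyperparameters are illustrative assumptions, not fitted values.

```python
import numpy as np

def sq_exp_kernel(A, B, sigma1=1.0, l1=1.0, l2=1.0):
    """Squared-exponential kernel on (t, x) pairs; rows of A and B are (t, x)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dt = A[:, None, 0] - B[None, :, 0]
    dx = A[:, None, 1] - B[None, :, 1]
    return sigma1 ** 2 * np.exp(-dt ** 2 / l1 ** 2 - dx ** 2 / l2 ** 2)

def gp_predict(X_obs, Y_obs, X_new, noise_var=0.1, **kern):
    """Predictive mean and covariance of Y at X_new given (X_obs, Y_obs)."""
    Y_obs = np.asarray(Y_obs, dtype=float)
    K_oo = sq_exp_kernel(X_obs, X_obs, **kern) + noise_var * np.eye(len(X_obs))
    K_no = sq_exp_kernel(X_new, X_obs, **kern)
    K_nn = sq_exp_kernel(X_new, X_new, **kern)
    L = np.linalg.cholesky(K_oo)                         # K_oo = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y_obs))
    V = np.linalg.solve(L, K_no.T)                       # V^T V = K_no K_oo^{-1} K_on
    mean = K_no @ alpha
    cov = K_nn + noise_var * np.eye(len(X_new)) - V.T @ V
    return mean, cov
```

The Cholesky factor is reused for both the mean and the covariance, which avoids forming the inverse explicitly.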
Challenges:
Solutions:
Simple construction of average estimate
\[ \widehat \mu_{GP}(t) = \frac{1}{N} \left(\sum_{i \in {\bf{S}}} f_i(t) + \sum_{i \notin {\bf{S}}} \widehat f_i(t) \right) \]
Our model can quantify many levels of uncertainty
\[ Y_{*}|X_{obs}, Y_{obs}, X_{*} \sim N(\nu^*, \Sigma^*) \]
1. Repeat the following \( J \) times:
\[ Y_{*}^j\ |\ X_{obs}, Y_{obs}, X_{*} \sim N(\nu^*, \Sigma^*) \]
\[ \widehat \mu_{GP}^{\ j}(t) = \frac{1}{N} \left(\sum_{i \in {\bf{S}}} f_i(t) + \sum_{i \notin {\bf{S}}} \widetilde f_i^{\ j}(t) \right) \]
2. Calculate a confidence band from the quantiles of \( \ \{\widehat\mu^{\ j}_{GP}(t)\} \)
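The two steps can be sketched as follows, assuming the predictive mean and covariance of the unsampled curves are already available and small enough to hold in memory. The name `gp_mean_band`, the stacking order (unit by unit, all \( T \) time points for one unit before the next), and the default \( J \) and level are assumptions for illustration. The band is pointwise in \( t \), not simultaneous.

```python
import numpy as np

def gp_mean_band(f_sample, pred_mean, pred_cov, N, J=1000, level=0.95, rng=None):
    """Simulation-based band for the population mean curve.

    f_sample  : (n, T) observed curves of the sampled units
    pred_mean : (m*T,) predictive mean of the m = N - n unsampled curves
    pred_cov  : (m*T, m*T) predictive covariance of the unsampled curves
    """
    rng = np.random.default_rng() if rng is None else rng
    f_sample = np.asarray(f_sample, dtype=float)
    n, T = f_sample.shape
    m = N - n
    # Step 1: J joint draws of all unsampled curves, then one mean curve per draw
    draws = rng.multivariate_normal(pred_mean, pred_cov, size=J).reshape(J, m, T)
    mu_draws = (f_sample.sum(axis=0) + draws.sum(axis=1)) / N   # shape (J, T)
    # Step 2: pointwise quantiles across the J simulated mean curves
    a = 1.0 - level
    lower, upper = np.quantile(mu_draws, [a / 2, 1 - a / 2], axis=0)
    return mu_draws.mean(axis=0), lower, upper
```

Because the sampled curves are held fixed, all the spread in the band comes from uncertainty about the unsampled units.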
Thanks!