3/29/2016
(This was largely expanded from notes from Pierre Jacob at Harvard)
There are two posterior distributions we could choose to target.
The easy one:
\[ \pi\left(\theta_{m}\mid \mathcal{M}_{m}, Y\right) \propto p\left(Y \mid \theta_{m},\mathcal{M}_{m}\right)p\left(\theta_{m}\mid\mathcal{M}_{m}\right) \]
The hard one:
\[ \pi\left(\mathcal{M}_{m},\theta_{m}\mid Y\right) \propto p\left(Y \mid \theta_{m},\mathcal{M}_{m}\right)p\left(\theta_{m}\mid\mathcal{M}_{m}\right)p\left(\mathcal{M}_{m}\right) \]
We'll set the first one aside for today, since we've already seen lots of methods for it. Also, denote the second posterior as: \[ \pi\left(m,\theta_{m}\right) \]
\( (m,\theta_{m})\in \mathcal{H} \)
Consider proposals of the form: \[ q(m\to m^{\prime})q_{m\to m^{\prime}}(\theta\to\theta^{\prime})d\theta^{\prime} \]
Propose \( m^{\prime} \) from \( q(m\to m^{\prime}) \)
\( \theta \) and \( \theta^{\prime} \) can be of different dimensions, so use auxiliary variables to match the dimensions: \[ \dim\left((\theta,u)\right) = \dim\left((\theta^{\prime},u^{\prime})\right) \]
Overall idea is then to transform old variables \( (\theta,u) \) into new variables \( (\theta^{\prime},u^{\prime}) \)
Auxiliary variables are used to match dimensions; we can draw them from arbitrary distributions \[ u \sim \varphi_{m\to m^{\prime}}(\cdot) \ \text{and} \ u' \sim\varphi_{m^{\prime}\to m}(\cdot). \]
Recall the acceptance probability: it involves the Jacobian of this transformation, so choose a nice one! \[ (\theta^{\prime},u^{\prime})=G_{m\to m^{\prime}}(\theta,u) \]
Taking \( G_{m\to m^{\prime}} \) to be a diffeomorphism, the acceptance probability is:
\[ \min\left(1,\frac{\pi(m^{\prime},\theta^{\prime})q(m^{\prime}\to m\text{)}\varphi_{m^{\prime}\to m}(u^{\prime})}{\pi(m,\theta)q(m\to m^{\prime})\varphi_{m\to m^{\prime}}(u)}\left\vert \frac{\partial G_{m\to m^{\prime}}(\theta,u)}{\partial(\theta,u)}\right\vert \right) \]
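As a sanity check on that Jacobian term, here is a small sketch that compares an analytic Jacobian determinant against finite differences. The map \( G \) used here is a hypothetical example chosen for illustration, not one from the lecture:

```python
import numpy as np

# Hypothetical dimension-matching map G(theta, u) = (theta * u, theta / u).
# By hand: J = [[u, theta], [1/u, -theta/u^2]], so |det J| = 2 * theta / u.
def G(theta, u):
    return np.array([theta * u, theta / u])

def numerical_jacobian(f, x, eps=1e-6):
    # Central finite differences; column j holds df/dx_j.
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(*(x + e)) - f(*(x - e))) / (2 * eps)
    return J

theta, u = 1.5, 0.7
J = numerical_jacobian(G, np.array([theta, u]))
print(abs(np.linalg.det(J)), 2 * theta / u)  # the two should agree closely
```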
This is just the regular old Hastings acceptance probability, but with a Jacobian term to account for the change of variables.
More ingredients (the between-model proposal, the auxiliary-variable distributions, and the transformation) play a role in the efficiency of this algorithm than in other MCMC methods.
Want to fit the data \( (y_1, ..., y_n) \) to one of two models, either \[ y_i \sim Exp(\lambda)\\ y_i \sim Gamma(\alpha, \beta) \] Let's be Bayesian: \[ \lambda \sim Gamma(a_1, b_1)\\ \alpha \sim Gamma(a_2, b_2)\\ \beta \sim Gamma(a_3, b_3) \]
We can also put priors on our models:
We want to sample from \( \pi(m, \theta_m) \), where
\( m\in\{1, 2\} = \{Exp, Gamma\} \),
\( \theta_1 = \lambda, \theta_2 = (\alpha, \beta) \)
First direction, \( q(Exp \to Gamma) \): draw an auxiliary variable \( u \sim q(\cdot) \) and map \( (\lambda, u) \) to \( (\alpha, \beta) \) through the chosen transformation; the Jacobian of that transformation enters the acceptance ratio.
If we force alternating between models and use equal model priors, the model-proposal terms cancel, leaving
\[ A = \frac{L_{G}(x\mid\alpha, \beta)\,p(\alpha)\,p(\beta)}{L_{Exp}(x\mid\lambda)\,p(\lambda)\,q(u)}\,u, \]
where the trailing \( u \) is the Jacobian of the transformation.
The other direction, \( q(Gamma \to Exp) \), applies the inverse transformation (no auxiliary variable is needed, since we move down in dimension), and the acceptance ratio is the reciprocal.
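Putting the example together, here is a minimal sketch in Python. Everything concrete in it is an assumption layered on the notes: the dimension-matching map \( (\alpha,\beta) = (u, \lambda u) \) (chosen so its Jacobian is \( u \) and the Gamma mean \( \alpha/\beta = 1/\lambda \) matches the Exp mean), \( q(u) = Exp(1) \), unit hyperparameters, and simulated data:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=0.5, size=50)   # simulated data (hypothetical)

# Hypothetical hyperparameter choices for the Gamma priors
a1, b1 = 1.0, 1.0   # lambda ~ Gamma(a1, b1)
a2, b2 = 1.0, 1.0   # alpha  ~ Gamma(a2, b2)
a3, b3 = 1.0, 1.0   # beta   ~ Gamma(a3, b3)

def log_gamma_pdf(x, a, b):
    # Gamma(shape=a, rate=b) log-density
    return a * np.log(b) - math.lgamma(a) + (a - 1) * np.log(x) - b * x

def log_post_exp(lam):
    # log pi(theta_1 | M_1, Y), up to a constant
    return y.size * np.log(lam) - lam * y.sum() + log_gamma_pdf(lam, a1, b1)

def log_post_gamma(alpha, beta):
    # log pi(theta_2 | M_2, Y), up to a constant
    return (log_gamma_pdf(y, alpha, beta).sum()
            + log_gamma_pdf(alpha, a2, b2)
            + log_gamma_pdf(beta, a3, b3))

def rj_step(m, theta):
    """One between-model (reversible jump) move, alternating Exp <-> Gamma."""
    if m == 1:
        lam = theta[0]
        u = rng.exponential(1.0)           # q(u) = Exp(1), so log q(u) = -u
        alpha, beta = u, lam * u           # assumed map G; |Jacobian| = u
        log_A = (log_post_gamma(alpha, beta) - log_post_exp(lam)
                 + u + np.log(u))          # ... - log q(u) + log |J|
        if np.log(rng.uniform()) < log_A:
            return 2, np.array([alpha, beta])
        return 1, theta
    alpha, beta = theta
    lam, u = beta / alpha, alpha           # deterministic inverse of G
    log_A = (log_post_exp(lam) - log_post_gamma(alpha, beta)
             - u - np.log(u))              # reciprocal of the forward ratio
    if np.log(rng.uniform()) < log_A:
        return 1, np.array([lam])
    return 2, theta

m, theta = 1, np.array([1.0])
models = []
for _ in range(2000):
    m, theta = rj_step(m, theta)
    # within-model multiplicative random walk (log-normal proposal)
    prop = theta * np.exp(0.2 * rng.standard_normal(theta.size))
    lp_cur = log_post_exp(theta[0]) if m == 1 else log_post_gamma(*theta)
    lp_new = log_post_exp(prop[0]) if m == 1 else log_post_gamma(*prop)
    # the log(prop/theta) term is the log-normal proposal correction
    if np.log(rng.uniform()) < lp_new - lp_cur + np.log(prop / theta).sum():
        theta = prop
    models.append(m)

print("P(M = Exp | Y) approx:", np.mean(np.array(models) == 1))
```

Note that because \( Gamma(1, \beta) \) is exactly \( Exp(\beta) \), the two models overlap, which makes it easy for the chain to jump between them.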
We can generally write: \[ y_i \sim \sum_{j = 1}^k w_j \ f(\cdot\ |\ \theta_j) \]
The generative-model formulation is, for each observation: \[ z_i \sim \text{Categorical}(w_1, \ldots, w_k), \qquad y_i \mid z_i \sim f(\cdot \mid \theta_{z_i}) \]
Reversible jump MCMC can play a role here when the number of components \( k \) is unknown.
The full Bayesian model as a generative hierarchy:
\[ p(y, \theta, z, w, k) = p(k)p(w\ |\ k)p(z\ |\ w, k)p(\theta\ |\ z, w, k)p(y\ |\ \theta, z, w, k) \]
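To make the hierarchy concrete, here is a sketch that samples from it once, top to bottom, with \( k \) held fixed and hypothetical hyperparameter values (and unit component variances for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical choices: k fixed, and placeholder hyperparameters
k = 3
delta = 1.0          # symmetric Dirichlet weight prior
xi, kappa = 0.0, 2.0 # mu_j ~ N(xi, kappa)
n = 10               # number of observations

w = rng.dirichlet(np.full(k, delta))   # w | k
z = rng.choice(k, size=n, p=w)         # z | w, k
mu = rng.normal(xi, kappa, size=k)     # theta | k (component means only)
y = rng.normal(mu[z], 1.0)             # y | theta, z (unit variances)
print(z, y.round(2))
```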
For posterior sampling (using the Sneaky Bayes Rule):
\[ \begin{align} p(\theta, z, w, k\ |\ y) &= p(w\ |\ z, \theta, k, y)\, p(\theta, z, k \ |\ y)\\ &= p(\theta\ |\ z, w, k, y)\, p(z, w, k \ |\ y)\\ &= p(z\ |\ \theta, w, k, y)\, p(\theta, w, k \ |\ y)\\ &= p(k\ |\ z, w, \theta, y)\, p(\theta, z, w \ |\ y) \end{align} \]
We cycle through these equalities to end up with Gibbs Samplers… but not for everything.
Let's find out why. Together.
If we assume a Normal \( f(\cdot\ |\ \theta_j) \), then \( \theta_j = (\mu_j, \sigma_j^2) \), and we can put conjugate priors on these:
\[ \mu_j\sim N(\xi, \kappa), \quad \sigma_j^2 \sim \mathrm{Gamma}^{-1}(\alpha, \beta) \]
We can also put priors on the things that depend on \( k \):
\[ w = (w_1, ..., w_k)\sim Dir(\delta, \delta, ..., \delta) \]
Again, some of these moves can be done with traditional MCMC, like Gibbs (a nice example was posted on Piazza, Thanks Elena!).
Let \( n_j = \#\{i : z_i = j\} \), the number of observations assigned to component \( j \). These full conditionals are the posteriors for fixed \( k \).
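A sketch of one Gibbs sweep through those full conditionals for fixed \( k \), assuming known unit component variances and hypothetical hyperparameters (`xi`, `kappa2`, `delta` are placeholders, not values from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)

def gibbs_sweep(y, z, mu, w, sigma2=1.0, xi=0.0, kappa2=4.0, delta=1.0):
    """One Gibbs sweep for a fixed-k normal mixture with known variances."""
    k = mu.size
    # z_i | rest: Categorical with p_j proportional to w_j * N(y_i; mu_j, sigma2)
    logp = np.log(w) - 0.5 * (y[:, None] - mu[None, :]) ** 2 / sigma2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=pi) for pi in p])
    # w | z: Dirichlet(delta + n_1, ..., delta + n_k)
    n_j = np.bincount(z, minlength=k)
    w = rng.dirichlet(delta + n_j)
    # mu_j | rest: conjugate normal update
    for j in range(k):
        yj = y[z == j]
        prec = 1 / kappa2 + yj.size / sigma2
        mean = (xi / kappa2 + yj.sum() / sigma2) / prec
        mu[j] = rng.normal(mean, np.sqrt(1 / prec))
    return z, mu, w

# Toy run on well-separated simulated data
y = np.concatenate([rng.normal(-2, 1, 40), rng.normal(3, 1, 60)])
z = rng.choice(2, size=y.size)
mu, w = np.zeros(2), np.array([0.5, 0.5])
for _ in range(100):
    z, mu, w = gibbs_sweep(y, z, mu, w)
print(mu, w)
```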
(Death to Traditional MCMC): Some of the parameters change dimension with \( k \) \[ |\theta| = 2k\\ |w| = k\\ z_i \in \{1, ..., k\} \]
Richardson and Green (1997) propose the following pairs of moves for sampling \( k \):
For the Split Move \( \alpha = \min(1, A) \)
For the combine move it's \( \alpha = \min(1, A^{-1}) \), with the “obvious substitutions” o_0. See Richardson and Green (1997).