Component Separation Using Bayesian Methods

This notebook combines my own homework with Duncan's answers and the standard solutions.

Chapter 1 - Simple Test Case

Before we move on to any real data, let us take a step back and look at a couple of ways to solve for parameters from a distribution.

First let us draw 1000 samples from a Gaussian distribution and look at a histogram of the result.
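A minimal sketch of this step (the mean and standard deviation below are placeholders of my own choosing; the original notebook's values are not shown here):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)   # seeded for reproducibility
mu_true, sigma_true = 5.0, 2.0    # illustrative values, not from the text
x = rng.normal(mu_true, sigma_true, size=1000)

plt.hist(x, bins=40, density=True)
plt.xlabel("x")
plt.ylabel("normalized counts")
plt.show()
```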

As a reminder, a Gaussian distribution for parameter $x$ with mean $\mu$ and standard deviation $\sigma$ is given by

\begin{equation} p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \end{equation}

Now let us suppose we only have the data vector defined above, and want to find the mean value of that data. How do we go about that?

The first step is always to craft a data model. Let us switch to an observational standpoint. We start by assuming that the true value of the data $\mu'$ is a single value whose observed value has been obscured by some underlying Gaussian noise, i.e. $x = \mu' + n$, where $x$ is the data and $n$ is some Gaussian noise. We will generalize this to the following equation, which ought to be baked into your memory

\begin{equation} d = s + n \end{equation}

where $d$ is the observed data, $s$ is the true underlying signal, and $n$ is again a noise term. This formalism is the basis of our component separation methods and will be expanded upon further down the road.

Maximum-Likelihood Solution

Now that we have a simple data model, how does one go about actually determining the true underlying signal $s$? A good starting point is finding what is called the maximum-likelihood solution, using maximum likelihood estimation (MLE). A cursory Google search for maximum likelihood estimation will turn up a mix of math-heavy resources (like Wikipedia) and practical examples (like towardsdatascience).

Here we will stick with a Gaussian assumption, i.e. that the noise inherent in the data has a Gaussian form.

For any observed point $x_i$, the probability of observing that data point, given the underlying mean ($\mu$) and standard deviation ($\sigma$) of the distribution is given by

$p(x_i|\mu,\,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$,

and the probability of observing say 3 data points $x_1$, $x_2$, and $x_3$ is given by

$p(x_1,x_2,x_3|\mu,\,\sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\times\frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_2-\mu)^2}{2\sigma^2}}\times\frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_3-\mu)^2}{2\sigma^2}}$.

Let us take all of our observations into account and condense our probability expression (the likelihood) to be

\begin{equation} p(\vec{x}|\mu,\,\sigma) = {\displaystyle \prod_{i=1}^{1000}}\frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \end{equation}

From our calculus experience, we may note that a surefire way to determine the maximum or minimum of a function is to take its derivative. However, we can tell by looking at the above equation that taking the derivative of this thing would be a cumbersome operation.

Those versed in MLE will be quick to point out that taking the natural logarithm of the above expression gives us something nice to work with!

To be explicit, finding the maximum likelihood solution involves finding $\mu$ where

\begin{equation} \frac{\partial \ln (p(\vec{x}|\mu,\,\sigma))}{\partial \mu} = 0. \end{equation}

Before we move on, let us touch on a couple of final pieces of information which will be important for the future. First let's consider the expression we are evaluating here, $p(\vec{x}|\mu,\,\sigma)$. This expression reads "the probability of the data given the parameters $\mu$ and $\sigma$." Since we already know the data, and don't know the values of $\mu$ and $\sigma$ a priori, this expression by itself doesn't appear very useful. However, the likelihood of having $\mu$ and $\sigma$ given some data $\vec{x}$ is defined as

\begin{equation} L(\mu,\,\sigma|\vec{x}) = p(\vec{x}|\mu,\,\sigma). \end{equation}

Though the equations are equal, they ask different questions. The left-hand side is concerned about the likelihood of the parameters given the data, while the right-hand side is asking about what the probability of observing the data is, given the parameters. Hence the phrasing Maximum Likelihood Solution.

Finally, as we will see often later on, let us clearly define the log-likelihood as

\begin{equation} \mathcal{L}(\mu,\,\sigma|\vec{x}) = \ln\,p(\vec{x}|\mu,\,\sigma) \end{equation}

Compiled with help from towardsdatascience.com.

Task 1:

$$ \begin{align} \mathcal{L}(\mu,\sigma|\vec{x}) &=\ln\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right] \\&=-N\ln\sqrt{2\pi\sigma^2}-\sum_{i=1}^{N}{\frac{(x_i-\mu)^2}{2\sigma^2}} \\&=-\frac{N}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2 \end{align} $$

where $N=1000$ (hereafter).

$$ \begin{align} \frac{\partial \mathcal{L}(\mu,\sigma|\vec{x})}{\partial\mu} &=\frac{1}{2\sigma^2}\sum_{i=1}^N 2(x_i-\mu) =\frac{1}{\sigma^2}\sum_{i=1}^N (x_i-\mu)=0 \end{align} $$

Since $\sigma$ is a non-zero constant, the mean of the data can be expressed as

$$ \mu = \frac{1}{N}\sum_{i=1}^N x_i $$
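In code, this closed-form estimate is just the sample mean (a one-line check, reusing the `x` drawn above):

```python
# ML estimate of the mean: the closed-form result derived above
mu_hat = np.sum(x) / len(x)
print(mu_hat, np.mean(x))  # identical by construction
```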

Multi-frequency example

As we will see further on, determining the mean of an observable can be aided by (or in practice requires) the addition of multiple data points. As we move towards a more realistic example, we will say that we have observations at multiple frequencies. For a simple example, let's consider some observable which scales linearly with frequency.

Let's create a new data set, where our observable has a mean value $a_{obs} = \mu$, and each observation frequency has its own noise characteristic $n_{\nu}$.

When carrying out this multi-frequency analysis, we need to determine a reference frequency $\nu_{\rm ref}$, at which we evaluate the recovered amplitude $a_{obs}$.

Therefore, our new data model will be written as

$d_{\nu} = a_{obs}(\nu/\nu_{\rm ref}) + n_{\nu}$.

What does this data look like for us?

We can see that it would be simple to draw a straight line through the points, though many straight lines may fit through all of the error bars. It is vital that we make an estimate of the best fit to the data, not just a fit.
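A minimal sketch of simulating and plotting such data (the frequencies, amplitude, and noise levels below are illustrative placeholders, not the notebook's actual values):

```python
# Multi-frequency data model: d_nu = a_obs * (nu / nu_ref) + n_nu
nu = np.array([30., 44., 70., 100., 143., 217., 353.])  # GHz, Planck-like
nu_ref = 100.0
a_obs_true = 10.0
sigma_nu = np.full(len(nu), 2.0)  # per-frequency noise rms

d = a_obs_true * (nu / nu_ref) + rng.normal(0.0, sigma_nu)

plt.errorbar(nu, d, yerr=sigma_nu, fmt="o")
plt.xlabel(r"$\nu$ [GHz]")
plt.ylabel(r"$d_\nu$")
plt.show()
```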

Task 2:

$$ \mathcal{L}(a_{obs}|\boldsymbol{d})=\sum_\nu\mathcal{L}(a_{obs}|d_\nu) $$

The likelihood for "multi-frequency" is sharper which indicates a more precise measurement of $a_{obs}$.

Interjection - Some Important Information

Generalization

Up until now we have taken a very simple example, where we have a data vector containing samples, and we find the maximum-likelihood solution. Before we move on to drawing samples from noisy data, we will define some more general nomenclature. Our original data model looked something like

\begin{equation} d = s + n. \end{equation}

We then expanded the actual signal into an amplitude and a frequency scaling

\begin{equation} s = a_{obs}(\nu/\nu_{\rm ref}). \end{equation}

This expansion of the signal term can be generalized in the following way

\begin{equation} s = Ta, \end{equation}

where $T$ is a matrix containing the scaling relation information (in this simple case $T=(\nu/\nu_{\rm ref})$), and $a$ is a vector of the amplitudes we are looking for, $a = a_{obs}$.

In essence, by finding the ML solution, we are solving the equation

\begin{equation} (T^t N^{-1} T)a = T^t N^{-1} d, \end{equation}

where $N$ is the noise covariance matrix.

To be explicit, throughout the remainder of this notebook we will be looking at multi-frequency observations (vital in cosmological component separation) and aim to solve for the true sky signal $s_{\nu}$ with the data model

\begin{equation} d_{\nu} = s_{\nu} + n_{\nu}, \end{equation}

by solving the linear equation

\begin{equation} {\displaystyle \sum_{\nu}(T_{\nu}^t N_{\nu}^{-1} T_{\nu})a = \sum_{\nu}T_{\nu}^t N_{\nu}^{-1} d_{\nu}}. \end{equation}

This equation will be expanded upon to include sampling and prior terms throughout this notebook.

Note that when we move to more complex systems (many data sets with many pixels), filling and multiplying these matrices can be very expensive, both computationally and in memory usage. An interested student should read "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" by Jonathan Shewchuk. It is excellent. Additionally, Numerical Recipes is always your friend when trying to solve complex linear systems computationally.
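A minimal sketch of solving these normal equations for the one-amplitude model above (with diagonal $T$ and $N$ the matrix products reduce to sums; for a genuinely multi-dimensional $a$ one would call `np.linalg.solve`):

```python
# ML solution of (T^t N^-1 T) a = T^t N^-1 d for the one-amplitude
# straight-line model. T and N are diagonal, so everything reduces to sums.
T = nu / nu_ref                  # mixing "matrix" (one column)
Ninv = 1.0 / sigma_nu**2         # inverse noise variances

lhs = np.sum(T * Ninv * T)       # T^t N^-1 T  (a 1x1 system here)
rhs = np.sum(T * Ninv * d)       # T^t N^-1 d
a_ml = rhs / lhs                 # larger systems: np.linalg.solve(lhs, rhs)
print(a_ml)
```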

Bayes' Theorem

Before getting too far into the weeds here, we will introduce Bayes' Theorem as well as a practical definition for our purposes. Given a set of parameters $\vec{\omega} = \{\omega_1,...,\omega_k\}$, we will define the posterior distribution of $\boldsymbol \omega$ with respect to the data as

\begin{equation} P(\vec{\omega}\mid\boldsymbol{d}) = \frac{P(\boldsymbol{d} \mid \vec{\omega}) P(\vec{\omega})}{P(\boldsymbol{d})} \propto \mathcal{L}(\vec{\omega}) P(\vec{\omega}), \end{equation}

where $ P(\boldsymbol{d}\mid\vec{\omega}) \equiv \mathcal{L}(\vec{\omega})$ is the likelihood function (as we have talked about above), $P(\vec{\omega})$ is the prior, and $P(\boldsymbol{d})$ is a normalization factor which we disregard here as it is independent of the parameters $\vec{\omega}$.

Chapter 2 - Sampling Methods

In Chapter 1 we have simply drawn Gaussian samples of the data using numpy, having already known the mean and standard deviation of our data. In map-space analysis, we work with sky maps that are characterized by a total sky signal amplitude $s_{\nu}$, with an estimate of the noise $n_{\nu}$, giving us our data

\begin{equation} d_{\nu} = s_{\nu} + n_{\nu}. \end{equation}

This has been stated before and will be stated again in hopes of making you see the above equation when you close your eyes. We now wish to sample for $s_{\nu}$, the true sky signal, seeing as we know the data $d_{\nu}$ and have an estimate of the noise $n_{\nu}$.

Here we introduce three methods which we can use for sampling parameters. The Metropolis and Metropolis-Hastings methods are broadly used in MCMC and are introduced first; they are applicable in the general case. Inversion sampling is also introduced, which is used in the Commander/Cosmoglobe framework for sampling non-linear parameters, i.e., spectral indices. Finally, sampling from a Gaussian distribution is introduced. This is a core method to learn for sampling linear parameters, and it will be expanded to the multivariate case in the next chapter. In the Cosmoglobe/Commander framework this method is used to sample Galactic foreground amplitudes.

Metropolis and Metropolis-Hastings Methods

The Metropolis method is by far the most well-known of all of the MCMC sampling methods due to its ease of implementation, tunability, and effectiveness. The Metropolis and Metropolis-Hastings algorithms are effective because all samples with a higher likelihood are accepted, while some samples with lower likelihoods are accepted from time to time (this helps avoid becoming stuck around a local likelihood peak).

Let us say we wish to sample some parameter $\omega$. Let $\omega_j$ represent the $j$th sample of $\omega$ in a Markov chain. We also need to define a stochastic proposal probability distribution, $T(\omega_{j+1}\mid \omega_{j})$, which we tune in order to help with parameter convergence. In the Metropolis method the proposal distribution is symmetric, i.e., $T(\omega_{j+1}\mid \omega_{j})=T(\omega_{j}\mid \omega_{j+1})$; in the Metropolis-Hastings method it need not be.

The next thing we need to define is the acceptance criterion, which tells us which samples we will accept and which ones we will not. This acceptance criterion is defined by the ratio of the densities, which is given by the ratio of the posterior probabilities as defined by Bayes' theorem ($\frac{P(\boldsymbol d \mid \omega_{j+1})}{P(\boldsymbol d \mid \omega_{j})}$). If a prior is used to inform our sampling, this ratio is weighted by the ratio of the prior probabilities. Therefore our total definition of the acceptance probability $q$ is defined as

\begin{equation} q = \frac{P(\boldsymbol d \mid \omega_{j+1})}{P(\boldsymbol d \mid \omega_{j})}\frac{P(\omega_{j+1})}{P(\omega_j)}. \end{equation}

With these defined, we can now outline the algorithm itself (a minimal sketch in code follows the list):

  1. Initialize the chain at some parameter value $\omega_0$.
  2. Draw a random sample from the proposal probability distribution, i.e., $\omega_{j+1} \leftarrow T(\omega_{j+1}\mid\omega_j)$.
  3. Compute the acceptance probability $q$.
  4. Draw a random number $\eta$ from a uniform distribution $U[0,1]$.
  5. Accept the proposed $\omega_{j+1}$ if $\eta < q$. Otherwise, set $\omega_{j+1} = \omega_{j}$.
  6. Repeat steps 2-5 until convergence.
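A minimal Metropolis sketch applied to the mean $\mu$ of our Gaussian data from Chapter 1 (reusing `x` and the known `sigma_true`; the proposal width and chain length are arbitrary tuning choices of mine). The acceptance ratio is computed in log-space for numerical stability:

```python
def log_post(mu, x, sigma):
    # Flat prior on mu: the log-posterior equals the log-likelihood
    # up to an additive constant.
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2

n_steps, step = 5000, 0.1
chain = np.empty(n_steps)
chain[0] = 0.0                                    # 1. initialize the chain
for j in range(n_steps - 1):
    prop = chain[j] + rng.normal(0.0, step)       # 2. symmetric proposal
    log_q = (log_post(prop, x, sigma_true)
             - log_post(chain[j], x, sigma_true)) # 3. acceptance probability
    eta = rng.uniform()                           # 4. uniform variate
    chain[j + 1] = prop if np.log(eta) < log_q else chain[j]  # 5. accept/reject
```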

Inversion sampling

Now that we have sampled the two amplitude parameters, we want to sample the spectral parameters, which do not depend linearly on the data $d$ and are not Gaussian-distributed, hence demanding a different approach. To sample these conditional distributions, we employ the "inversion sampler", which is completely general and works for all univariate distributions. In our case, we have the distribution \begin{align} P(\beta\mid d, a) &\propto P(d\mid a, \beta) P(\beta)\\ &\propto \left[\prod_{\nu} \exp\left(-\frac{1}{2}\left(d_{\nu}-T(\beta)a\right)^t N_{\nu}^{-1}\left(d_{\nu}-T(\beta)a\right)\right)\right]P(\beta). \end{align}

Using this for $P(x)$, we can sample with the inversion sampler by the following steps (a minimal sketch in code follows the list):

  1. Compute $P(x)$ over a grid in $x$, making sure to probe the tails to sufficient accuracy.
  2. Compute the cumulative probability distribution, ${F(x) = \int_{-\infty}^{x} P(x')\,\mathrm dx'}$.
  3. Draw a random uniform variate, $\eta \sim U[0,1]$.
  4. Solve the nonlinear equation $\eta = F(x)$ for $x$.
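A minimal sketch of the inversion sampler for a gridded univariate distribution (a helper of my own; it assumes the grid is uniform and fine enough to resolve the tails):

```python
def inversion_sample(x_grid, logP, rng):
    # 1. P(x) is supplied on a grid; normalize in log-space for stability
    P = np.exp(logP - logP.max())
    # 2. cumulative distribution F(x) (uniform grid, so cumsum suffices)
    F = np.cumsum(P)
    F /= F[-1]
    # 3. uniform variate eta ~ U[0,1]
    eta = rng.uniform()
    # 4. solve eta = F(x) for x by interpolating the inverse CDF
    return np.interp(eta, F, x_grid)
```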

Sampling a Gaussian

We may write $d = Ta+n = f(\omega)\cdot A + n$, where $\omega$ is the set of spectral indices (such as $\beta_{\rm s}$ for synchrotron or $T_{\rm d}$ for dust). Rewriting this as $n = d - Ta$ and assuming that the noise is Gaussian, we can express the distribution as \begin{align} P(a\mid d, \omega\setminus a) &\propto P(d\mid\omega)P(a)\\ &\propto P(d\mid a)P(a)\\ &\propto \left[\prod_{\nu} \exp\left(-\frac{1}{2}(d_{\nu}-T_{\nu}a)^t N_{\nu}^{-1}(d_{\nu}-T_{\nu}a)\right)\right]\cdot P(a), \end{align} which can be sampled by solving for $a$ in $$\biggl(S^{-1} + \sum_{\nu}T^t_{\nu}N_{\nu}^{-1}T_{\nu}\biggr)\,a = \sum_\nu T_{\nu}^t N_\nu^{-1}d_{\nu}+S^{-1}m + \sum_{\nu}T_{\nu}^tN_{\nu}^{-1/2}\eta_{\nu} + S^{-1/2}\eta_{0},$$ where $S$ is the prior covariance, $N$ is the noise covariance, $m$ is the prior mean, and the $\eta$ are standard-normal random variates. For a full explanation, look at BeyondPlanck 1, appendix 2.

Note here that the simplified case (no prior)

\begin{equation} \biggl(\sum_{\nu}T^t_{\nu}N_{\nu}^{-1}T_{\nu}\biggr)\,a = \sum_\nu T_{\nu}^t N_\nu^{-1}d_{\nu} + \sum_{\nu}T_{\nu}^tN_{\nu}^{-1/2}\eta_{\nu}, \end{equation}

is the same as the equation at the end of the previous section, with an additional sampling term.

We can also think of the above equation as

\begin{equation} a = \biggl(\sum_{\nu}T^t_{\nu}N_{\nu}^{-1}T_{\nu}\biggr)^{-1}\sum_\nu T_{\nu}^t N_\nu^{-1}d_{\nu} + \biggl(\sum_{\nu}T^t_{\nu}N_{\nu}^{-1}T_{\nu}\biggr)^{-1}\sum_{\nu}T_{\nu}^tN_{\nu}^{-1/2}\eta_{\nu}, \end{equation}

which is convenient in the simplest cases.
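A minimal sketch of this no-prior case, reusing the diagonal `T`, `Ninv`, and data `d` from the earlier examples, where a single Gaussian sample of $a$ is drawn as mean plus fluctuation:

```python
def sample_amplitude(T, Ninv, d, rng):
    lhs = np.sum(T * Ninv * T)                    # sum_nu T^t N^-1 T
    rhs = np.sum(T * Ninv * d)                    # sum_nu T^t N^-1 d
    # fluctuation term: sum_nu T^t N^(-1/2) eta, eta ~ N(0,1)
    fluct = np.sum(T * np.sqrt(Ninv) * rng.standard_normal(len(d)))
    return (rhs + fluct) / lhs                    # mean + fluctuation

a_sample = sample_amplitude(T, Ninv, d, rng)
```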

Task 3:

Some notes for the code. There are $m$ frequencies. $$ P(\boldsymbol{d}|a,\beta)=\frac{1}{(2\pi)^{m/2}|N|^{1/2}}\exp\left[-\frac{1}{2}(\boldsymbol{d}-Ta)^tN^{-1}(\boldsymbol{d}-Ta) \right] $$

$$ \mathcal{L}(a,\beta|\boldsymbol{d})=-\frac{1}{2}\ln\left[(2\pi)^m |N|\right]-\frac{1}{2}(\boldsymbol{d}-Ta)^tN^{-1}(\boldsymbol{d}-Ta) $$

Since $T$ and $N$ are both diagonal here, I treat them as vectors in later calculations.
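A sketch of that diagonal-case log-likelihood (here `N` holds the vector of noise variances, a naming choice of mine):

```python
def loglike_diag(d, T, a, N):
    # log-likelihood with diagonal T and N stored as vectors;
    # N is the vector of per-frequency noise variances
    r = d - T * a                          # residual d - Ta
    return -0.5 * (np.sum(np.log(2 * np.pi * N)) + np.sum(r**2 / N))
```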

Chapter 3 - A Step Closer to the Microwave Sky - Introducing More Foregrounds

In this chapter we will simulate a simplified example of the Microwave Sky. We will consider a single-pixel example that includes the CMB, synchrotron, and thermal dust radiation, with frequencies similar to that of the Planck satellite.

Simulate a Single Pixel Data Set

Let's begin by importing a couple of useful functions and constants. We will also introduce an array of frequencies corresponding to the Planck satellite bands' central frequencies, as well as their corresponding polarization sensitivities. We will use these for the remainder of the exercises. You are of course free to add your own frequencies and sensitivities if you wish.

Simulated sky data

Now we can simulate our data using this sky model for a given set of frequencies. Below we set the model parameters, the set of frequencies, the level of noise for each frequency, and which frequency we use as the pivot frequency (reference frequency) for dust and synchrotron.

We have added all the Planck frequencies to illustrate the coverage of the experiment. We will begin with an example similar to our last one. Synchrotron radiation is, with both physical and observational support, assumed to have a power-law frequency scaling (in $\mu K_{\rm RJ}$ units, see the Synchrotron lecture). This is essentially the same model we just ran with, with one additional term. Explicitly, the synchrotron sky signal at frequency $\nu$, $s_{\rm s,\,\nu}$, is given by

\begin{equation} s_{\rm s,\,\nu} = A_{\rm s} \Big(\frac{\nu}{\nu_{\rm 0,\,s}}\Big)^{\beta_{\rm s}}, \end{equation}

where $\beta_{\rm s}$ has typical values between -3.0 and -3.2. Let us begin by defining a simple class to store our synchrotron information.
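A minimal sketch of such a class (the attribute and method names are my own choices, not necessarily those of the original notebook):

```python
class Synchrotron:
    def __init__(self, A_s, beta_s, nu_ref):
        self.A_s = A_s            # amplitude at the reference frequency
        self.beta_s = beta_s      # power-law spectral index
        self.nu_ref = nu_ref      # reference frequency [GHz]

    def signal(self, nu):
        """Synchrotron signal s_nu = A_s * (nu / nu_ref)^beta_s."""
        return self.A_s * (nu / self.nu_ref) ** self.beta_s

sync = Synchrotron(A_s=10.0, beta_s=-3.1, nu_ref=30.0)
```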

Now we will simulate the synchrotron sky signal across the Planck frequency channels, and sample the corresponding parameters (i.e. $A_{\rm s}$ and $\beta_{\rm s}$).

If we print out the data, we can see that we quickly hit the "noise floor", where the magnitude of the synchrotron emission is lower than the uncertainty in our measurement. Gathering high signal-to-noise observations is vital in CMB experiments, especially in polarization, as the CMB signal is very weak. Hopefully we will see some of the difficulties this can cause through these exercises.

Task 4:

For the data cuts, I will choose only the [33, 44, 70, 100, 143] bands for the $A_s$ estimation, where the noise is not significantly larger than the synchrotron signal.

The maximum-likelihood $A_s$ doesn't change much after the data cuts, and the sample chains don't seem to change much either. So the data cuts didn't improve the $A_s$ estimation in my results.

I think that's because the noise realization in the higher-frequency bands didn't stray far from the typical noise level, and when we did the ML estimation of $A_s$, the matrix $N^{-1}$ had already given these data an appropriately smaller weight; thus the data in the higher-frequency bands didn't affect my results much.

If a noise realization lies far from the typical noise level, I believe it is better to make data cuts, especially when we have only a few data points.

Task 5

We can see that the distribution of the spectral index is not Gaussian.

Gibbs Sampling

Now that we have sampled two parameters separately, each while holding the other constant, we can introduce Gibbs sampling. Gibbs sampling is the core of the $\texttt{Commander}$ and Cosmoglobe frameworks, and is a useful tool for mapping out multi-dimensional parameter spaces. Consider a data model with some set of parameters $\vec{\omega} = \{\omega_0,\ldots,\omega_k\}$. We utilize Bayes' theorem, as defined earlier, to explore this parameter space. Gibbs sampling works by sampling each parameter $\omega_{i}$ individually, while holding all of the other parameters constant. Therefore, each Gibbs iteration consists of one such step per parameter, drawing samples in the following way

\begin{align} \omega_{0} &\leftarrow P(\omega_{0}\mid \vec{d},\,\omega_{1},\,\omega_{2},\cdots,\,\omega_{k})\\ \omega_{1} &\leftarrow P(\omega_{1}\mid \vec{d},\,\omega_{0},\,\omega_{2},\,\cdots,\,\omega_{k})\\ &\vdots\\ \omega_{k} &\leftarrow P(\omega_{k}\mid \vec{d},\,\omega_{0},\,\omega_{1},\,\cdots,\,\omega_{k-1}). \end{align}

Let's try a simple example here with our two synchrotron parameters. Seeing as we have simulated the data ourselves, we already know what the correct answer should be, but in practice we don't know what the true sky signal looks like a priori. Let's say that we think the true synchrotron amplitude is $A_{\rm s} = 10.0$, and the synchrotron spectral index is $\beta_{\rm s} = -3.0$.
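A minimal sketch of such a two-parameter Gibbs loop, reusing `sample_amplitude` and `inversion_sample` from above, and assuming `d`, `nu`, and `Ninv` now hold the simulated synchrotron data and its inverse noise variances (the grid range, chain length, and 30 GHz reference frequency are my own illustrative choices):

```python
beta_grid = np.linspace(-3.5, -2.5, 200)
A_chain, beta_chain = [10.0], [-3.0]          # initial guesses from the text

for _ in range(1000):
    # 1) A_s | beta_s : Gaussian sampler, with T evaluated at current beta_s
    T_beta = (nu / 30.0) ** beta_chain[-1]
    A_chain.append(sample_amplitude(T_beta, Ninv, d, rng))
    # 2) beta_s | A_s : inversion sampler over the gridded conditional
    logP = np.array([
        -0.5 * np.sum((d - A_chain[-1] * (nu / 30.0) ** b) ** 2 * Ninv)
        for b in beta_grid
    ])
    beta_chain.append(inversion_sample(beta_grid, logP, rng))
```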

Task 6

The lines are the $1\sigma$, $2\sigma$, and $3\sigma$ contours.

The red lines were computed on a grid, and they agree reasonably well with the black sample points. In general, however, one should estimate the $1\sigma$, $2\sigma$, and $3\sigma$ levels from the samples themselves. This is a somewhat flawed demonstration: when I got to this point in the homework, I did not yet know how to plot proper confidence contours.

Chapter 4 - Now with More Components - Expanding the Gibbs sampler

Excellent! Well done!

Now we're really getting some practical stuff done! Hopefully you've gained some understanding of the theory behind sampling parameters, and how it is done in practice. Now that we have the base infrastructure built, we want to expand to more and more realistic examples. Let's introduce some more foregrounds, for which you can create classes, and simulate a more complex data set. We will add two more components: thermal dust emission and the CMB. These are good components to introduce as they are prevalent in total intensity as well as in polarization. Recall that we are working in Rayleigh-Jeans units; we will introduce the parametric models for these components in such units. And as a consistent reminder, details on microwave sky foregrounds can be found in the online lectures.

CMB

For the CMB radiation we have $$s_{\mathrm{CMB},\,\nu} = a_{\mathrm{CMB}} \frac{x^2e^{x}}{\left(e^{x}-1\right)^2},$$ where $a_{\mathrm{CMB}}$ is the strength (amplitude) of the signal and the rest is a conversion factor between Rayleigh-Jeans units and CMB units, with $x=h\nu/kT_0$.

Thermal dust

Thermal dust dominates the higher frequencies. Dust is modelled with a modified blackbody function, $$s_{\mathrm{d},\,\nu} = a_{\mathrm{d}}\,\left(\frac{\nu}{\nu_{0,\mathrm{d}}}\right)^{\beta_{\mathrm{d}}+1}\frac{e^{h\nu_{0,\mathrm{d}}/kT_{\mathrm{d}}}-1}{e^{h\nu /kT_{\mathrm{d}}}-1},$$ where $a_{\mathrm{d}}$ is the strength of the signal at the reference frequency $\nu_{0,\mathrm{d}}$, $\beta_{\mathrm{d}}$ is the spectral index, and $T_{\mathrm{d}}$ is the dust temperature.

Full sky model

The sky model $s_{\mathrm{RJ},\,\nu}$ that we will use is thus given by $$s_{\mathrm{RJ},\,\nu} =s_{\mathrm{CMB},\,\nu} + s_{\mathrm{s},\,\nu}+ s_{\mathrm{d},\,\nu},$$ containing a total of six parameters to be sampled: three amplitudes ($a_{\mathrm{CMB}}$, $a_{\mathrm{s}}$, and $a_{\mathrm{d}}$), two spectral indices ($\beta_{\mathrm{s}}$ and $\beta_{\mathrm{d}}$), and the thermal dust temperature ($T_{\mathrm{d}}$). Given the total degrees of freedom, we need at least six frequency bands to constrain this model.
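A minimal sketch of this full sky model as a function (the 30 GHz synchrotron and 353 GHz dust reference frequencies are my own choices, not fixed by the text; physical constants come from scipy):

```python
from scipy.constants import h, k   # Planck and Boltzmann constants [SI]

T_cmb = 2.7255  # CMB temperature T_0 [K]

def sky_model(nu_ghz, a_cmb, a_s, beta_s, a_d, beta_d, T_d,
              nu0_s=30.0, nu0_d=353.0):
    """Total RJ sky signal: CMB + synchrotron + thermal dust."""
    nu = nu_ghz * 1e9                                # GHz -> Hz
    # CMB: amplitude times the CMB-to-RJ conversion factor
    x = h * nu / (k * T_cmb)
    cmb = a_cmb * x**2 * np.exp(x) / (np.exp(x) - 1) ** 2
    # Synchrotron: power law relative to nu0_s
    sync = a_s * (nu_ghz / nu0_s) ** beta_s
    # Thermal dust: modified blackbody relative to nu0_d
    xd = h * nu / (k * T_d)
    xd0 = h * (nu0_d * 1e9) / (k * T_d)
    dust = (a_d * (nu_ghz / nu0_d) ** (beta_d + 1)
            * (np.exp(xd0) - 1) / (np.exp(xd) - 1))
    return cmb + sync + dust
```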

Task 7

Now that we have simulated a more complex data model, something that better represents the real microwave sky, we can Gibbs sample the components!

Task 8

When sampling different parameters, we should choose the sampling method according to the character of each parameter. The amplitudes $A$ should be drawn with the Gaussian sampler, while $\beta$ and $T_\mathrm{d}$ should be drawn with the more general inversion sampler.

We can clearly see the strong degeneracy between $A_\mathrm{cmb}$ and $A_s$.