Sampling Techniques
Tags: CS 228, Inference
Sampling?
In the following sections, we will use sampling methods to do approximate inference. This often requires drawing samples from a distribution. We will first talk generally about when things are tractable and when they are not, and then we will cover specific structures.
Tractability
Generally speaking, if you have two sets of variables $X$ and $E$ and we want $p(X \mid E)$, it is not tractable because computing the normalizer $p(E)$ requires marginalizing over $X$. We can estimate this distribution using Monte Carlo; we will talk about this later. Essentially, you will need to sample from a different distribution. But how do you do that?
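As a preview of the Monte Carlo idea, here is a minimal sketch: given samples from a joint distribution, we can estimate a conditional by counting. The joint table and its values are hypothetical, made up for illustration.

```python
import random

# Toy joint over two binary variables (A, B); the numbers are hypothetical.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def sample_joint(rng):
    """Draw one (a, b) pair from the joint by inverse-CDF sampling."""
    r, acc = rng.random(), 0.0
    for ab, p in joint.items():
        acc += p
        if r < acc:
            return ab
    return ab  # guard against floating-point round-off

def estimate_conditional(n=100_000, seed=0):
    """Monte Carlo estimate of p(A=1 | B=1) by counting samples."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n):
        a, b = sample_joint(rng)
        if b == 1:
            total += 1
            hits += a
    return hits / total

print(estimate_conditional())  # should be close to 0.4 / 0.5 = 0.8
```

The estimate converges to the exact conditional $0.4 / (0.1 + 0.4) = 0.8$ as the number of samples grows.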
Bayes network
In general, Bayes networks are pretty easy to sample from. Given a fully observed assignment, you can also evaluate the joint probability very easily as a product of CPD entries.
Special case: when it’s a tree (and no observation)
If it’s a tree, just start by sampling at the root and propagate the values downwards, plugging each sampled value into its children’s CPDs.
No observation
More generally, you sort the nodes topologically and then sample each node in that order, conditioning on its already-sampled parents (this is known as ancestral sampling).
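A minimal sketch of ancestral sampling on a tiny chain-structured Bayes net; the network and its CPD values are hypothetical.

```python
import random

# A tiny Bayes net A -> B -> C with hypothetical CPDs.
# cpds[var] maps a tuple of parent values to p(var = 1 | parents).
parents = {"A": (), "B": ("A",), "C": ("B",)}
cpds = {
    "A": {(): 0.6},
    "B": {(0,): 0.3, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.1},
}

def ancestral_sample(rng):
    """Sample every variable in topological order, plugging the
    already-sampled parent values into each CPD as we go."""
    sample = {}
    for var in ["A", "B", "C"]:  # a topological order of the DAG
        pa = tuple(sample[p] for p in parents[var])
        sample[var] = int(rng.random() < cpds[var][pa])
    return sample

rng = random.Random(0)
print(ancestral_sample(rng))
```

Because each variable is sampled after its parents, every CPD lookup has all the values it needs.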
With observed nodes
If some nodes $E$ are observed, you need to compute $p(X \mid E)$, which can be intractable. It’s even worse with sampling, because you need to propagate the observation backwards. For example, in a chain $x_1 \rightarrow x_2 \rightarrow x_3$ with $x_3$ observed, you would need $p(x_1 \mid x_3) \propto \sum_{x_2} p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$, and then $p(x_2 \mid x_1, x_3)$. None of these are easy to compute at all!
Markov Random Field
MRFs are pretty hard to sample from, but we have a few techniques!
Brute force
Find the joint factor, marginalize down to one variable, sample it, plug the sampled value back in, then sample another variable, and so on. This is intractable in general, but it’s the baseline algorithm.
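The baseline can be sketched on a two-variable case; the potential table is hypothetical.

```python
import random

# A hypothetical pairwise MRF over two binary variables x1, x2 with a
# single unnormalized potential (the "joint factor").
def factor(x1, x2):
    return [[1.0, 2.0], [3.0, 4.0]][x1][x2]

def brute_force_sample(rng):
    """Marginalize down to x1, sample it, plug it back in, sample x2."""
    # p(x1) is proportional to the sum over x2 of factor(x1, x2)
    m = [sum(factor(x1, x2) for x2 in (0, 1)) for x1 in (0, 1)]
    x1 = int(rng.random() * (m[0] + m[1]) >= m[0])
    # p(x2 | x1) is proportional to factor(x1, x2) with x1 plugged in
    f = [factor(x1, 0), factor(x1, 1)]
    x2 = int(rng.random() * (f[0] + f[1]) >= f[0])
    return x1, x2
```

With many variables the joint factor is exponentially large, which is exactly why this is only a baseline.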
Special case: when there are no cycles
If there are no cycles (i.e. it’s a tree), you can use the following algorithm:
- Run message passing up to the root
- Sample from the root
- Use the root value and the upward messages to get distributions for the root’s children. Sample from those
- Propagate this recursively until you reach the leaves
Note that you can’t just do message passing and sample each variable from its individual marginal, because that assumes independence. Rather, you must propagate the sampled values down the tree.
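The two-pass procedure above can be sketched on a small chain MRF; the pairwise potentials are hypothetical, and the root is taken to be $x_3$.

```python
import random

# Hypothetical chain MRF x1 - x2 - x3 over binary variables, with
# unnormalized pairwise potentials phi12 and phi23.
phi12 = [[1.0, 2.0], [3.0, 1.0]]
phi23 = [[2.0, 1.0], [1.0, 2.0]]

def draw(weights, rng):
    """Sample an index in proportion to a list of nonnegative weights."""
    r = rng.random() * sum(weights)
    return 0 if r < weights[0] else 1

def sample_chain(rng):
    """Upward pass of messages to the root x3, then downward sampling."""
    # Upward messages toward the root.
    m1 = [sum(phi12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
    m2 = [sum(phi23[x2][x3] * m1[x2] for x2 in (0, 1)) for x3 in (0, 1)]
    # Sample the root from its marginal, then condition each child on
    # the sampled parent value plus the remaining upward message.
    x3 = draw(m2, rng)
    x2 = draw([phi23[x2][x3] * m1[x2] for x2 in (0, 1)], rng)
    x1 = draw([phi12[x1][x2] for x1 in (0, 1)], rng)
    return x1, x2, x3
```

Note how each downward step multiplies the factor to the sampled parent by the upward message from below, rather than sampling each marginal independently.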
With observed nodes
The key insight is that messages stop at observed nodes. In other words, a message $m_{i \to j}(x_j)$ is typically expressed in sum-product notation as $\sum_{x_i} \phi(x_i, x_j) \prod_k m_{k \to i}(x_i)$, but if $x_i$ is observed, there is no sum: we only need the binary factor $\phi(x_i, x_j)$ with the observed value $\bar{x}_i$ plugged in.
At the risk of sounding a bit repetitive, if you are given a non-cyclical MRF, you can run message passing but every time you encounter an evidence node, you replace the message with a direct plug-in of the observation.
When you reach the root, sample as usual. Now, every sampled value becomes part of the “evidence”: you plug these sampled values into the factors of the children, then sample from the children, and so on and so forth.
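A minimal sketch of the plug-in idea, assuming a hypothetical chain $x_1 - x_2 - x_3$ with $x_3$ observed and $x_2$ used as the root:

```python
import random

# Hypothetical chain x1 - x2 - x3 with x3 observed. Potentials are
# unnormalized tables; evidence simply plugs a value into its factor.
phi12 = [[1.0, 2.0], [3.0, 1.0]]
phi23 = [[2.0, 1.0], [1.0, 2.0]]
x3_obs = 1  # the evidence value

def draw(weights, rng):
    """Sample an index in proportion to a list of nonnegative weights."""
    r = rng.random() * sum(weights)
    return 0 if r < weights[0] else 1

def sample_given_evidence(rng):
    """The 'message' from the observed x3 is just phi23 with x3 plugged
    in; everything else is the usual upward pass plus downward sampling."""
    ev = [phi23[x2][x3_obs] for x2 in (0, 1)]           # plug-in, no sum
    m1 = [sum(phi12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
    x2 = draw([m1[x2] * ev[x2] for x2 in (0, 1)], rng)  # root here: x2
    x1 = draw([phi12[x1][x2] for x1 in (0, 1)], rng)
    return x1, x2
```

The only difference from the unobserved case is that the factor touching the evidence node contributes a plugged-in row instead of a summed-out message.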
The diagram below illustrates sampling with two evidence nodes. The third picture shows how a sampling operation turns a variable into an evidence node.
Junction trees
The reason we create junction trees is that the tree structure lends itself well to marginalization and sampling, while cyclical structures are less convenient.
If we do have a junction tree, you would augment the sampling process in this way:
- When you sample from a junction tree node, you sample all the variables in the node at the same time. This is done by finding the joint factor, marginalizing down to one variable, sampling it, then propagating the sampled value to help sample the next variable, and so on. Because each junction node contains only a bounded number of variables, we don’t have to worry about exponential complexity
- In essence, you’re just doing brute-force sampling on the small junction nodes
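The per-clique step can be sketched generically: enumerate all assignments of a clique's variables and draw one in proportion to the clique factor. The helper name and the example factor are made up for illustration.

```python
import itertools
import random

def sample_clique(factor, card, rng):
    """Brute-force sample a full assignment to a clique: enumerate every
    assignment of the clique's variables (card[i] = cardinality of the
    i-th variable), then draw one in proportion to the unnormalized
    factor value. Feasible because a clique holds few variables."""
    assignments = list(itertools.product(*[range(c) for c in card]))
    weights = [factor(a) for a in assignments]
    r, acc = rng.random() * sum(weights), 0.0
    for a, w in zip(assignments, weights):
        acc += w
        if r < acc:
            return a
    return assignments[-1]  # guard against floating-point round-off

# Hypothetical clique over two binary variables.
clique_factor = lambda a: [[1.0, 3.0], [2.0, 4.0]][a[0]][a[1]]
print(sample_clique(clique_factor, (2, 2), random.Random(0)))
```

Enumeration costs the product of the cardinalities, which stays small as long as cliques stay small.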