Sampling Techniques
Tags: CS 228, Inference
Sampling?
In the following sections, we will use sampling methods to do approximate inference. This often requires drawing samples from a distribution. We will first talk generally about when things are tractable and when they are not, and then we will cover specific structures.
Tractability
Generally speaking, if you have two sets of variables $X$ and $E$ and we want $p(X \mid E)$, it is not tractable because computing the normalizer $p(E)$ requires marginalizing over $X$. We can estimate this distribution using Monte Carlo; we will talk about this later. Essentially, you will need to sample from a different distribution. But how do you do that?
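As a preview of the Monte Carlo idea, here is a minimal sketch: given samples from a joint distribution, we can estimate a conditional by counting. The joint table and its values are hypothetical, made up for illustration.

```python
import random

# Toy joint over two binary variables (A, B); the numbers are hypothetical.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def sample_joint(rng):
    """Draw one (a, b) pair from the joint by inverse-CDF sampling."""
    r, acc = rng.random(), 0.0
    for ab, p in joint.items():
        acc += p
        if r < acc:
            return ab
    return ab  # guard against floating-point round-off

def estimate_conditional(n=100_000, seed=0):
    """Monte Carlo estimate of p(A=1 | B=1) by counting samples."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n):
        a, b = sample_joint(rng)
        if b == 1:
            total += 1
            hits += a
    return hits / total

print(estimate_conditional())  # should be close to 0.4 / 0.5 = 0.8
```

The estimate converges to the exact conditional $0.4 / (0.1 + 0.4) = 0.8$ as the number of samples grows.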
Bayes network
In general, Bayes networks are pretty easy to sample from. Given a fully observed assignment, you can also evaluate the joint probability very easily as a product of CPD entries.
Special case: when it’s a tree (and no observation)
If it’s a tree, just start by sampling at the root and propagate the values downwards, plugging each sampled value into its children’s CPDs.
No observation
More generally, you sort the nodes topologically and then sample each node in that order, conditioning on its already-sampled parents (this is known as ancestral sampling).
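A minimal sketch of ancestral sampling on a tiny chain-structured Bayes net; the network and its CPD values are hypothetical.

```python
import random

# A tiny Bayes net A -> B -> C with hypothetical CPDs.
# cpds[var] maps a tuple of parent values to p(var = 1 | parents).
parents = {"A": (), "B": ("A",), "C": ("B",)}
cpds = {
    "A": {(): 0.6},
    "B": {(0,): 0.3, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.1},
}

def ancestral_sample(rng):
    """Sample every variable in topological order, plugging the
    already-sampled parent values into each CPD as we go."""
    sample = {}
    for var in ["A", "B", "C"]:  # a topological order of the DAG
        pa = tuple(sample[p] for p in parents[var])
        sample[var] = int(rng.random() < cpds[var][pa])
    return sample

rng = random.Random(0)
print(ancestral_sample(rng))
```

Because each variable is sampled after its parents, every CPD lookup has all the values it needs.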
With observed nodes
If some nodes $E$ are observed, you need to compute $p(X \mid E)$, which can be intractable. It’s even worse with sampling, because you need to propagate the observation backwards. For example, in a chain $x_1 \rightarrow x_2 \rightarrow x_3$ with $x_3$ observed, you would need $p(x_1 \mid x_3) \propto \sum_{x_2} p(x_3 \mid x_2)\, p(x_2 \mid x_1)\, p(x_1)$, and then $p(x_2 \mid x_1, x_3)$. None of these are easy to compute at all!
Markov Random Field
MRFs are pretty hard to sample from, but we have a few techniques!
Brute force
Find the joint factor, marginalize down to one variable, sample it, plug the sampled value back in, then sample another variable, and so on. This is intractable in general, but it’s the baseline algorithm.
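The baseline can be sketched on a two-variable case; the potential table is hypothetical.

```python
import random

# A hypothetical pairwise MRF over two binary variables x1, x2 with a
# single unnormalized potential (the "joint factor").
def factor(x1, x2):
    return [[1.0, 2.0], [3.0, 4.0]][x1][x2]

def brute_force_sample(rng):
    """Marginalize down to x1, sample it, plug it back in, sample x2."""
    # p(x1) is proportional to the sum over x2 of factor(x1, x2)
    m = [sum(factor(x1, x2) for x2 in (0, 1)) for x1 in (0, 1)]
    x1 = int(rng.random() * (m[0] + m[1]) >= m[0])
    # p(x2 | x1) is proportional to factor(x1, x2) with x1 plugged in
    f = [factor(x1, 0), factor(x1, 1)]
    x2 = int(rng.random() * (f[0] + f[1]) >= f[0])
    return x1, x2
```

With many variables the joint factor is exponentially large, which is exactly why this is only a baseline.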
Special case: when there are no cycles
If there are no cycles (i.e. it’s a tree), you can use the following algorithm:
- Run message passing up to the root
- Sample from the root
- Use the root value and the upward messages to get distributions for the root’s children. Sample from those
- Propagate this recursively until you reach the leaves
Note that you can’t just do message passing and sample each variable from its individual marginal, because that assumes independence. Rather, you must propagate the sampled values down the tree.
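The two-pass procedure above can be sketched on a small chain MRF; the pairwise potentials are hypothetical, and the root is taken to be $x_3$.

```python
import random

# Hypothetical chain MRF x1 - x2 - x3 over binary variables, with
# unnormalized pairwise potentials phi12 and phi23.
phi12 = [[1.0, 2.0], [3.0, 1.0]]
phi23 = [[2.0, 1.0], [1.0, 2.0]]

def draw(weights, rng):
    """Sample an index in proportion to a list of nonnegative weights."""
    r = rng.random() * sum(weights)
    return 0 if r < weights[0] else 1

def sample_chain(rng):
    """Upward pass of messages to the root x3, then downward sampling."""
    # Upward messages toward the root.
    m1 = [sum(phi12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
    m2 = [sum(phi23[x2][x3] * m1[x2] for x2 in (0, 1)) for x3 in (0, 1)]
    # Sample the root from its marginal, then condition each child on
    # the sampled parent value plus the remaining upward message.
    x3 = draw(m2, rng)
    x2 = draw([phi23[x2][x3] * m1[x2] for x2 in (0, 1)], rng)
    x1 = draw([phi12[x1][x2] for x1 in (0, 1)], rng)
    return x1, x2, x3
```

Note how each downward step multiplies the factor to the sampled parent by the upward message from below, rather than sampling each marginal independently.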
With observed nodes
The key insight is that messages stop at observed nodes. In other words, a message $m_{i \to j}(x_j)$ is typically expressed in sum-product notation as $\sum_{x_i} \phi(x_i, x_j) \prod_k m_{k \to i}(x_i)$, but if $x_i$ is observed, there is no sum: we only need the binary factor $\phi(x_i, x_j)$ with the observed value $\bar{x}_i$ plugged in.
At the risk of sounding a bit repetitive, if you are given a non-cyclical MRF, you can run message passing but every time you encounter an evidence node, you replace the message with a direct plug-in of the observation.
When you reach the root, sample as usual. Now, every sampled value becomes part of the “evidence”: you plug these sampled values into the factors of the children, then sample from the children, and so on and so forth.
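A minimal sketch of the plug-in idea, assuming a hypothetical chain $x_1 - x_2 - x_3$ with $x_3$ observed and $x_2$ used as the root:

```python
import random

# Hypothetical chain x1 - x2 - x3 with x3 observed. Potentials are
# unnormalized tables; evidence simply plugs a value into its factor.
phi12 = [[1.0, 2.0], [3.0, 1.0]]
phi23 = [[2.0, 1.0], [1.0, 2.0]]
x3_obs = 1  # the evidence value

def draw(weights, rng):
    """Sample an index in proportion to a list of nonnegative weights."""
    r = rng.random() * sum(weights)
    return 0 if r < weights[0] else 1

def sample_given_evidence(rng):
    """The 'message' from the observed x3 is just phi23 with x3 plugged
    in; everything else is the usual upward pass plus downward sampling."""
    ev = [phi23[x2][x3_obs] for x2 in (0, 1)]           # plug-in, no sum
    m1 = [sum(phi12[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]
    x2 = draw([m1[x2] * ev[x2] for x2 in (0, 1)], rng)  # root here: x2
    x1 = draw([phi12[x1][x2] for x1 in (0, 1)], rng)
    return x1, x2
```

The only difference from the unobserved case is that the factor touching the evidence node contributes a plugged-in row instead of a summed-out message.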
The diagram below illustrates sampling with two evidence nodes. The third picture shows how a sampling operation turns a variable into an evidence node.
Junction trees
The reason we create junction trees is that the tree structure lends itself well to marginalization and sampling, while cyclical structures are less convenient.
If we do have a junction tree, you would augment the sampling process in this way:
- When you sample from a junction tree node, you sample all the variables in the node at the same time. This is done by finding the joint factor, marginalizing down to one variable, sampling it, then propagating the sampled value to help sample the next variable, and so on. Because each junction node contains only a bounded number of variables, we don’t have to worry about exponential complexity
- In essence, you’re just doing brute-force sampling on the small junction nodes
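The per-clique step can be sketched generically: enumerate all assignments of a clique's variables and draw one in proportion to the clique factor. The helper name and the example factor are made up for illustration.

```python
import itertools
import random

def sample_clique(factor, card, rng):
    """Brute-force sample a full assignment to a clique: enumerate every
    assignment of the clique's variables (card[i] = cardinality of the
    i-th variable), then draw one in proportion to the unnormalized
    factor value. Feasible because a clique holds few variables."""
    assignments = list(itertools.product(*[range(c) for c in card]))
    weights = [factor(a) for a in assignments]
    r, acc = rng.random() * sum(weights), 0.0
    for a, w in zip(assignments, weights):
        acc += w
        if r < acc:
            return a
    return assignments[-1]  # guard against floating-point round-off

# Hypothetical clique over two binary variables.
clique_factor = lambda a: [[1.0, 3.0], [2.0, 4.0]][a[0]][a[1]]
print(sample_clique(clique_factor, (2, 2), random.Random(0)))
```

Enumeration costs the product of the cardinalities, which stays small as long as cliques stay small.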