A very different paradigm for inference in graphical models is the use of what are called sampling, or particle-based, methods. In these methods, rather than trying to manipulate the exponentially or even infinitely large probability distribution as a whole, we randomly sample instances from that distribution and then use those instances as a sparse representation. We can then use those instances to estimate various quantities that we care about regarding the statistics of the overall distribution. Before we show how this is used in the context of graphical models, let's go ahead and see how it applies in a simple setting. So first of all, how do we use samples for estimating quantities that we care about? Imagine that somehow, and we're going to talk about how that might happen, somebody gives us, or we manage to construct, a data set D which consists of a bunch of samples from a distribution P. For the moment we're going to assume that these samples are what's called IID, which stands for independent and identically distributed. And now let's imagine that we're trying to compute, or rather estimate, because this is all going to be approximate, some quantity of the distribution P. For the moment, let's focus on the really simple case where these are binary-valued random variables, so the distribution P that we're sampling from is one where the probability that x = 1 is p. Think of this like tossing a coin. Now, what is a reasonable estimator for this parameter p that we're trying to estimate, the probability that X comes up heads? There's a fairly intuitive answer here: this estimator, which we're going to call T sub D, is simply obtained by looking at all of the samples that we got and counting the fraction of ones. And if you think about it, that's a perfectly reasonable approximation. Right?
Now, more generally, for any distribution P and any function f whose statistics we're trying to estimate, we can approximate the expectation of the function f relative to the distribution P in terms of the average of the value of f on the samples. This is often called the empirical expectation. So this is how we might, if we had some way, which we have not yet talked about, of sampling from a distribution, use those samples for approximating various properties relative to the distribution P, whether it's the probability of an event or the expectation of a function. And note, by the way, that the expectation of a function subsumes the probability of an event, because you can take the expectation of an indicator function, and that corresponds directly to a probability. So for example, this probability can be viewed as the expectation, relative to P, of the indicator function representing x1 = 1; that is, the function that takes the value one when x1 equals one and zero otherwise. So, let's think for a moment, just to be concrete, about how one might go about sampling from a discrete distribution. Imagine that we have a discrete random variable x that takes on k values, x1 up to xk, and let's assume that each value xi occurs with probability theta i, so we have theta 1 up to theta k. How might we go about using computational tools to generate samples from this distribution? Well, most computers give us a random number generator function that generates samples uniformly in the interval [0, 1]. And if we want to convert that into something that samples from this discrete distribution, the simplest way of doing that is to basically split up the range: make this point theta 1, this point theta 1 plus theta 2, then theta 1 plus theta 2 plus theta 3, and let's leave it at that, so, four values of the random variable. And now we use the random number function to sample from this space.
So let's say it comes out here. We basically say, okay, this was the range that corresponds to x1, this to x2, this is x3, and this is x4. And since the point fell here, I'm going to declare that my sampled value is x3. If I sample again, and the point falls over here, I'm going to say, oh, okay, x1. So that's how you would use a random number generator to sample from a discrete distribution, something that we're going to end up using a lot later in this lecture. So now that we know how to sample, let's think about some properties of sampling-based estimation. And it turns out that sampling-based estimation has what appear, at least initially, to be some very compelling performance guarantees. The first that I'll show is something called the Hoeffding bound. Let's parse this horribly complicated statement that we see over here, working from the inside out. This says my estimator, which is T_D, remember, this is my estimator for p, is more than epsilon away from p. So this says: I'm badly wrong. My estimator is not close to the true value, which is p; I'm at least an epsilon away on one side or the other. Now, notice that this is a property of the data set, the set of samples that I generated. One set of samples is going to give me one estimate, and a different set of samples is going to give me a different estimate. That brings us to the next layer out: this data set is not a fixed quantity. It's sampled randomly from the distribution. And so I might get good data sets, data sets where the estimator is close to p, and I might get data sets where the estimator is far from p. So this asks: what is the probability of a bad data set, or sample set? Okay, so that parses the left-hand side. The right-hand side is a bound on this probability of getting a bad data set.
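This splitting of the unit interval can be sketched in a few lines of Python (the probabilities used here are illustrative, not from the lecture):

```python
import random

def sample_discrete(thetas):
    """Sample an index from a discrete distribution given by thetas.

    Splits [0, 1) into consecutive intervals of lengths theta_1, ..., theta_k
    and returns the index of the interval that a uniform draw falls into.
    """
    u = random.random()          # uniform draw in [0, 1)
    cumulative = 0.0
    for i, theta in enumerate(thetas):
        cumulative += theta
        if u < cumulative:
            return i
    return len(thetas) - 1       # guard against floating-point round-off

# Illustrative example: four values with probabilities theta_1..theta_4.
thetas = [0.2, 0.3, 0.4, 0.1]
counts = [0, 0, 0, 0]
for _ in range(100000):
    counts[sample_discrete(thetas)] += 1
# The empirical fractions counts[i] / 100000 should be close to thetas[i].
```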
A data set where the resulting estimator is highly inaccurate. And we can see that that probability shrinks exponentially in the number of samples M. On the other hand, it also depends on epsilon: the lower the tolerance that we have for errors, the higher the probability of making an error of that magnitude. So if we need really, really tight bounds, we're not going to get them with very high probability. Nevertheless, a probability that shrinks exponentially with the number of samples looks really good, right? The second bound, which has a very similar form, is the Chernoff bound. The Chernoff bound has exactly the same composition. Here's our estimator again, and here it is epsilon away from p, but whereas before this was an additive distance, here it's a multiplicative distance: p times 1 minus epsilon is the lower bound, and p times 1 plus epsilon is the upper bound. Once again this is the probability of getting a bad sample set, and once again we have a bound on the error, which is written over here, and once again the number of samples appears in the exponent, which makes us happy. We have the same type of epsilon-squared term, which shows the dependence on the required tolerance. But here we have one other term, which is the actual value p that we're trying to estimate. So if you give me a tolerance epsilon, a bound on how far you're willing to be from p, and also a bound on the probability with which you get a bad sample set, we can now say: for a given error probability that we want, that is, if we want to guarantee that we're within epsilon with probability greater than one minus delta, we need this many samples.
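Written out, the Hoeffding and Chernoff bounds just discussed take their standard textbook forms (the constants below are the usual ones and may differ slightly from the slide):

```latex
% Hoeffding bound: additive error for the estimator T_D built from M IID samples
P_D\big(T_D \notin [\,p - \epsilon,\; p + \epsilon\,]\big) \;\le\; 2\, e^{-2 M \epsilon^2}

% Chernoff bound: multiplicative (relative) error; note the extra factor of p
P_D\big(T_D \notin [\,p(1 - \epsilon),\; p(1 + \epsilon)\,]\big) \;\le\; 2\, e^{-M p \epsilon^2 / 3}
```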
And that's fairly straightforward algebra. You simply say this quantity is less than or equal to delta, take logs, move things around, and it all comes out to be exactly this. For the Chernoff bound we have a similar expression that gives us the same kind of bound on M as a function of epsilon and delta, and in this case p as well. So this looks great, right? You give me an epsilon, which is your error tolerance, and a delta, which is the probability with which you're willing to be wrong, and I can say: if you sample this many samples M, then I can give you those probabilistic guarantees. They're not deterministic, but they're pretty solid. Why is this not a perfect solution to our inference problems? Because each of these bounds has significant limitations. So, let's think about the first, which is our additive bound. And let's imagine that you're going in to a doctor and asking, what is the probability that I have some horrible disease? Well, that probability, hopefully for you, is still pretty low; maybe, if you're unlucky, it's 1% or 2%. Well, think about an additive error epsilon on 1%: the epsilon that you need to get something that's meaningfully bounded when the true probability p is 1% needs to be really, really small. You can't use epsilon equals 0.01, because that could move you from 1% up to 2%, and that's a factor of two increase in your probability, which is something people really care about, the difference between 1% and 2%. So you'd need something more like 0.0001, or maybe 0.00001, depending on how confident you want to feel. And now this epsilon squared over here begins to look pretty scary in terms of the number of samples required to get a reasonable bound. So you might come and say: well, fine, let's use the Chernoff bound, because that gives me relative errors on epsilon.
Sorry, relative errors on p. So if p is small, then by all means: it's a multiplicative factor of p, so I can say p plus or minus 1% of p. Well, unfortunately, there's no free lunch, because if p is small, notice that it appears here in the denominator, and so once again we need a number of samples that could potentially be quite large when we're dealing with small probabilities. So the main message from this is that sampling-based estimation is a reasonable thing to do when p is not too small. When p is not too small, this works fine; when p begins to get smaller, the tractability of this is more in doubt. Now that we understand when we might expect this to work, let's think about how we might apply it in the context of Bayesian networks. So here is our little baby network that we've used for the student example. What we'd like to do is generate samples from this distribution, which is, remember, P(D, I, G, S, L). We'd like to sample from this high-dimensional distribution. Not that high, but still. And the way in which this is done is actually very natural when you think about the causal intuitions, or forward flow, of a Bayesian network. We start out, say, by sampling the difficulty variable, which is sampled from this distribution over here, 0.6 versus 0.4. So we use the trick that we showed a couple of slides ago, and say it comes out to be d0; I'm going to write down the d0 over here. Now I sample I: I toss a coin with probabilities 0.7 and 0.3, and say I get i1. Now I get to sample grade. And because I previously sampled difficulty and intelligence, I know exactly which distribution grade needs to be sampled from: it's the distribution for i1, d0. And so I now sample one of the three values of grade from the distribution in this row that I've picked out in the CPD.
And say it comes out g1. I proceed to do the same: S is sampled from this distribution, and say it comes out s0. And L is sampled from this distribution, and say it comes out l1. And that's a sample. And I can do the whole thing all over again: I erase all of the decisions that I made before and do the exact same thing, so the blue sample might end up, for example, with d1 and i1; now I use this distribution and end up, say, with g2; I use this distribution, let's say I get s1; and I use this distribution and get a value for l. I can use this procedure to generate as many samples as I want, using a very efficient process where I sample in what's called topological order, meaning I start from the roots and go down, and I always sample parents before their children. So if I want to use this process, which is very naturally called forward sampling, to estimate the probability of some assignment, little y, to some set of query variables Y, then what I'm going to do is generate a bunch of samples from the Bayes net, as many as I think is adequate. And then, if I'm interested in some particular event, I simply compute the fraction of my samples where Y equals little y. And I can use that same procedure for computing other expectations: if I have any other function of the sample, I can compute the function on each sample and average it out in exactly the same way that we showed on the first slide. And now we can go ahead and apply the bounds that we showed before. For an additive bound we need this many samples, and for a multiplicative bound we need this many samples; these are just the two bounds from before, copied over. So that's great. But notice that this gave me queries without evidence.
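A minimal sketch of forward sampling on the student network might look as follows. The priors P(d1) = 0.4 and P(i1) = 0.3 follow the lecture; the grade, SAT, and letter CPDs below are illustrative placeholders, not the lecture's exact numbers:

```python
import random

def bernoulli(p):
    """Return 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def forward_sample():
    """One forward sample (d, i, g, s, l) from the student network,
    visiting variables in topological order so every parent is
    sampled before its children."""
    d = bernoulli(0.4)                        # Difficulty: P(d1) = 0.4
    i = bernoulli(0.3)                        # Intelligence: P(i1) = 0.3
    # Grade takes three values; pick the CPD row selected by (i, d).
    grade_cpd = {                             # placeholder rows
        (0, 0): [0.30, 0.40, 0.30],
        (0, 1): [0.05, 0.25, 0.70],
        (1, 0): [0.90, 0.08, 0.02],
        (1, 1): [0.50, 0.30, 0.20],
    }
    u, g, cum = random.random(), 2, 0.0
    for k, theta in enumerate(grade_cpd[(i, d)]):
        cum += theta
        if u < cum:
            g = k
            break
    s = bernoulli(0.8 if i == 1 else 0.05)    # SAT depends on I (placeholder)
    l = bernoulli([0.9, 0.6, 0.1][g])         # Letter depends on G (placeholder)
    return d, i, g, s, l

# Estimate P(L = 1) as the fraction of samples in which l = 1.
samples = [forward_sample() for _ in range(50000)]
p_l1 = sum(s[4] for s in samples) / len(samples)
```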
And what do we do if we want to ask a question not just about the probability of Y, but about the probability of Y given some evidence, E = e? Well, now it's not that easy to sample from a Bayesian network if I have evidence that's not at the root. If I have evidence at the root, that's fine: I know the value of that variable, so I can just stick that in and ignore everything else. But if I observe a variable somewhere in the middle or at the bottom of the network, then what happens when I get there? Do I sample it? Do I ignore it? It seems a bit unclear. So the simple solution to that, and we'll talk about others, is an algorithm called rejection sampling. And what it does is the most naive thing that you could possibly imagine: I generate samples from the Bayes net, and if they don't agree with my evidence, I throw them out. And of course the remaining samples, after I throw out all the ones that are inconsistent with my evidence, are actually samples from the conditional distribution, because this is the same as basically reducing the distribution to ignore the parts that are not consistent with my evidence. That's great, because now I have samples from the right distribution, and once I have samples from the right distribution I can go ahead and compute the fraction where Y equals little y, or for that matter any other expectation that I care about. Good. What's the problem with this? Think about how likely I am to keep a sample. A sample is going to be consistent with the evidence with a probability which is the probability of the evidence, right? That's sort of the definition. So the expected fraction of samples that I keep is exactly P(e). So now let's go back to, for example, a medical diagnosis setting, or for that matter the message decoding example that we talked about previously. How likely is your average evidence?
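Rejection sampling can be sketched on a tiny two-variable network X -> Y (the numbers here are illustrative, not from the lecture): forward-sample, discard anything inconsistent with the evidence, and average what survives.

```python
import random

def bernoulli(p):
    return 1 if random.random() < p else 0

def sample_chain():
    """One forward sample from an illustrative network X -> Y with
    P(x1) = 0.5, P(y1 | x1) = 0.9, P(y1 | x0) = 0.2."""
    x = bernoulli(0.5)
    y = bernoulli(0.9 if x == 1 else 0.2)
    return {"X": x, "Y": y}

def rejection_sample(n, evidence):
    """Draw n forward samples and keep only those consistent with the
    evidence; the kept samples are distributed as P(. | evidence)."""
    kept = []
    for _ in range(n):
        s = sample_chain()
        if all(s[var] == val for var, val in evidence.items()):
            kept.append(s)
    return kept

# Estimate P(X = 1 | Y = 1); the exact answer is
# 0.9 * 0.5 / (0.9 * 0.5 + 0.2 * 0.5) = 9/11 ~ 0.818.
kept = rejection_sample(100000, {"Y": 1})
p_x1_given_y1 = sum(s["X"] for s in kept) / len(kept)
# Here the acceptance rate is P(Y = 1) = 0.55; with many observed
# variables it decays exponentially and almost everything is rejected.
```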
So let's think about medical diagnosis. A person comes in, and they have their age, and their weight, and their gender, and, you know, 32 symptoms, and ten test results, and the fact that they went on a trip to Europe last week. All of these are pieces of evidence. And the question is, how likely is that configuration of evidence? The answer is: vanishingly small. How likely is your random person off the street to exhibit this exact combination? Almost nil. And similarly with the message example: how likely are you to get a particular configuration of noisy bits? Exponentially small also. So the number of samples needed grows exponentially with the number of observed variables. The more variables you've observed, the lower the probability of the evidence, and this is an exponential decay. And so that means that if you observe more than a couple of variables, this is basically not a good algorithm. So to summarize: algorithmically, this is a very simple procedure. It's very easy to generate samples from a Bayesian network, and we showed how all these pieces fit together. And once you've done that, there are really elegant and theoretically compelling epsilon-delta bounds, as we've shown before; but their usefulness, unfortunately, is limited. The additive bounds tend to be useless for low-probability events. The multiplicative bounds are not good either, because the number of samples grows as one over the probability of the event that we're trying to estimate. And all of this is completely irrelevant as soon as you have evidence, because as soon as you have evidence the number of required samples grows exponentially with the number of observed variables, which means this is not a feasible solution for most practical settings. And one final note that's worth highlighting: this is the only case in, I think, the entire section on inference where I talk about Bayesian networks specifically.
And that is because forward sampling, unlike any of the other inference algorithms that we're going to talk about, is just not feasible for Markov networks. There's no notion of starting at the root; there is no root. There is no variable that you would naturally start with whose probability you actually know. And so we don't apply forward sampling to Markov networks.