So far we've talked a lot about the representation of probabilistic graphical models: we defined different classes of probabilistic graphical models, Bayes nets and Markov nets, and talked about their independence assumptions. Now let's operationalize probabilistic graphical models and figure out how to use this declarative representation to answer actual queries. It turns out there are many queries that one can use a PGM to answer, but perhaps one of the most commonly used is what's called a conditional probability query. So let's define a conditional probability query. In a conditional probability query we have some set of observations, little e, about a set of variables, big E. These are the variables that we happened to observe. We also have a particular query that we care about, which we are going to denote as a set of variables Y. So our goal here is to compute the conditional probability of the variables Y given the evidence, that is, P(Y | E = e). This type of query is useful for a variety of different applications. For example, in the medical or fault diagnosis domains that we've spoken about, we might have observations about certain symptoms or test results, and we're interested in predicting the probability of different failure modes or different diseases. In the pedigree analysis example that we also talked about, we might have observations about the phenotype, or maybe even the genotype, of some individuals in a family, and we are interested in reaching conclusions about other individuals. So, unfortunately, like most interesting problems, the problem of inference in probabilistic graphical models is NP-hard. Before we talk about that, let's first remind ourselves what NP-hardness is. I'm not going to define it formally in this course, but as a rough intuition, if a problem has been shown to be NP-hard, it means that it is extremely unlikely to have an efficient solution.
Slightly more formally, it means that all algorithms that people have devised for this problem, and for a bunch of problems that are equally hard, require at the very least time exponential in the size of the representation of the problem, which means that the problem is unlikely to be solvable for anything bigger than a small handful of, say, variables in our context. So which particular problems in the context of PGM inference are NP-hard? Well, first, the basic problem of exact inference in a PGM is NP-hard. Given a PGM P_Φ defined by a set of factors Φ, a particular target variable X, and a value little x, computing the probability of the event X = x is NP-hard. In fact, even a simpler special case of this problem, one where we just want to figure out whether this probability is even positive, is also NP-hard. So one might then say, well, okay, exact inference is intractable, but what about if we compromise a little bit on accuracy? What if we're just looking for an approximate answer? After all, we don't necessarily care about the tenth significant digit in this probability. Does that make the problem easier? Unfortunately, it turns out that the answer is no, and there are many different variants of NP-hardness results for approximate inference; this is just one of them. This one says that, again, given a PGM, an event X = x, and some set of observations, the problem of finding an approximate answer p, for which we are guaranteed that this approximate answer is within epsilon of the truth, is NP-hard. And this is true for any epsilon that's less than 0.5. Note that epsilon equal to 0.5 can be obtained by simply guessing at random, or by always guessing the value 0.5. So any non-trivial approximation for this kind of conditional probability query is also intractable. This is a rather depressing observation, and might make us want to give up on using PGMs for inference.
But it turns out that one shouldn't get depressed too quickly, because, importantly, this is a worst-case result, which means we can construct bizarre PGMs for which exponential time is the best we can do. But that doesn't mean that in the general case one can't do better. And the rest of the inference section of this course will be devoted to showing a variety of algorithms, whether for exact inference or for approximate inference, that can do considerably better than this worst-case result would suggest. So first, to understand where these algorithms might be getting their power, let's drill down into the inference question a little bit more. Here is a slightly elaborated version of our student network, where we now have additional variables; for example, the coherence of a course influences its difficulty, and there are a variety of other variables that we're not going to talk about. Inference in this model, and in general inference for probabilistic graphical models, uses the notion of factors that we've talked about before, and it turns out that it's convenient to use factors because it means that the algorithms apply equally well to Bayes nets and Markov networks; it's a very useful abstraction. So let's think about this Bayesian network in the context of a set of factors. Here, for example, we had initially a prior probability over the C variable, P(C); that translates into a factor whose scope is C. We have, for example, P(G | I, D), and that translates into a factor whose scope is G, I, and D. In general, each one of the CPDs in this network converts to a factor over the scope of the family, that is, the variable and its parents. So now let's assume that our goal is to compute the probability of the variable J, and so we'd like to compute P(J). And what we see here is the joint distribution.
So this is the joint distribution, which we have defined using the chain rule for Bayesian networks. In order to compute P(J), what we need to do is to eliminate, or marginalize out, all of the variables except for J, and so we end up with a summation that looks like this, which is why this problem of conditional probability inference is often called sum-product: we have a sum over a product of factors. The inference problem for Markov networks takes exactly the same form. Here we have, once again, a product of factors, and in this case the factors are the form in which the network is originally defined. So we have the factors φ1 over A, B; φ2; φ3; and φ4; and if we are interested in computing the probability P(D), then once again we need to compute the product of these factors and then marginalize out. But what I wrote here is not actually quite right, because in Markov networks the product of the factors isn't actually P(A, B, C, D); rather, it's P̃(A, B, C, D), which is the unnormalized measure, and in order to get the normalized measure we need to normalize, or divide by, the partition function. So how do we deal with that if our goal is to compute P(D)? Well, the point is that if we ignore the partition function and instead compute P̃(D), we can infer that P(D) is actually equal to (1/Z) P̃(D), because the normalizing constant is a constant. And so if we've computed P̃(D), we can obtain P(D) by a simple process of renormalization of P̃(D). What about evidence? Well, it turns out that evidence can be handled by a simple preprocessing step of factor reduction. So here, if we're trying to compute the probability of a set of variables Y given evidence, that, by definition, is the ratio between the probability of Y and the evidence, divided by the probability of the evidence.
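The renormalization trick can be sketched in a few lines of Python. This is only a toy illustration, not part of the course materials: the four pairwise factors φ1(A, B) through φ4(D, A) over binary variables carry made-up values. We compute P̃(D) by brute-force summation over A, B, C, and then recover P(D) by dividing by the partition function Z at the end:

```python
import itertools

# A toy pairwise Markov network over binary variables A, B, C, D with
# factors phi1(A,B), phi2(B,C), phi3(C,D), phi4(D,A).
# The numeric values are made up purely for illustration.
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}
phi3 = {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0}
phi4 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}

def unnormalized_marginal_d():
    """P~(D) = sum over A, B, C of phi1 * phi2 * phi3 * phi4."""
    p_tilde = {0: 0.0, 1: 0.0}
    for a, b, c, d in itertools.product([0, 1], repeat=4):
        p_tilde[d] += phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]
    return p_tilde

p_tilde = unnormalized_marginal_d()
Z = sum(p_tilde.values())                      # partition function
p_d = {d: v / Z for d, v in p_tilde.items()}   # renormalize: P(D) = P~(D) / Z
```

Note that Z never has to be computed separately: because P̃(D) already sums over every other variable, summing its two entries gives the partition function for free, which is exactly why deferring normalization to the end is harmless.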
If we look at the numerator of the expression over here, and define W to be all the variables that are neither query nor evidence, then we can once again view this as a sum-product expression. So, for P(Y, e), we are going to introduce the variables W into this expression: this probability over here is the sum of the joint probability, marginalizing out W. Now if we rewrite this expression over here, we can view it as a product of factors, and that's the case whether it's a Bayesian network or a Markov network. And that product of factors uses only the components of those factors that are consistent with my evidence E = e, which means I reduce the factors by the evidence. To understand what that means, let's look at this example over here, and let's imagine, for example, that we have an observation, say A = a0, and we want to compute the probability distribution in the context of this observation. What that means is we're going to take every single one of those factors and remove, from every factor that involves A, the entries that correspond to A = a1, because those are not going to be consistent with our observation A = a0. Once we've reduced those factors to be consistent with the evidence, we still have a product of factors, and we can go ahead and treat it in exactly the same way as we did before. And so we now have a sum, over the variables W that need to be eliminated, of the product of the reduced factors, and once again we can ignore the partition function by computing this probability and then renormalizing at the end. This applies equally well in the context of Bayesian networks. So here we might have, say, the observation I = i and H = h. And now this is no longer an equality, because we have not yet conditioned on the evidence.
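The factor-reduction step described above is mechanically very simple, and a small sketch may make it concrete. This is an illustrative helper of my own, not course code: a factor is a dict from assignment tuples to values, and reducing on evidence just keeps the rows consistent with the observed value and drops the evidence variable from the scope:

```python
def reduce_factor(factor, var_index, value):
    """Keep only the entries of `factor` consistent with the observation
    (variable at position `var_index` equals `value`); the reduced
    factor's scope no longer contains the evidence variable."""
    return {tuple(k for i, k in enumerate(key) if i != var_index): v
            for key, v in factor.items() if key[var_index] == value}

# A hypothetical factor over (A, B); observing A = a0 (encoded as 0)
# removes the rows where A = a1 (encoded as 1).
phi = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
phi_reduced = reduce_factor(phi, var_index=0, value=0)
# phi_reduced == {(0,): 30.0, (1,): 5.0}, a factor over B alone
```

After reducing every factor that mentions an evidence variable, what remains is again just a product of factors, so the same sum-product machinery applies unchanged.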
And so if we want to make this an equality, we have to reduce each one of the factors involving I and H based on the evidence. This turns into φI of little i, which, as it happens, is a constant because there are no other variables in the scope, and similarly here and here. For H = h we do the same thing, which in this case involves only this factor over here. And now it's back to being an equality. And if we want to compute the probability of J given I = i and H = h, we take this summation and renormalize, just like before. So, to summarize, the sum-product algorithm can be described as follows. You convert the conditional probability into a ratio; the numerator of this ratio is a product of reduced factors, summed out over the remaining variables W. The denominator is simply the numerator, summed out over the query variables Y. And if we divide these two, we realize that we can get away with computing simply this product of reduced factors, and normalizing at the end. There are many algorithms for computing conditional probability queries. One of those involves pushing the summations into the factor product; this gives rise to an algorithm called variable elimination, which turns out to be a special case of a class of algorithms called dynamic programming, and it's a form of exact inference. A generalization of this algorithm performs message passing over a graph, and it also effectively deals with factor products and factor summations. There are many variants of this algorithm, some of which are exact and others approximate. Here are some names of algorithms, only some of which we'll have an opportunity to discuss. And then, finally, there's a very different class of algorithms that uses random sampling as the key technique; it samples complete instantiations, or assignments, in a variety of different ways.
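The idea of "pushing the summations into the factor product" can be sketched on a tiny example. This is a minimal illustration of my own, assuming a three-node chain A → B → C with made-up CPD values: instead of building the full joint over A, B, C and then summing, we sum A out early to get an intermediate factor τ(B), and only then multiply in the last factor, which is exactly the dynamic-programming structure of variable elimination:

```python
import itertools
from collections import defaultdict

def factor_product(f1, s1, f2, s2):
    """Multiply two factors (dicts keyed by assignment tuples) over the
    union of their scopes; all variables are assumed binary."""
    scope = list(s1) + [v for v in s2 if v not in s1]
    out = {}
    for assign in itertools.product([0, 1], repeat=len(scope)):
        env = dict(zip(scope, assign))
        out[assign] = (f1[tuple(env[v] for v in s1)] *
                       f2[tuple(env[v] for v in s2)])
    return out, scope

def sum_out(f, scope, var):
    """Marginalize `var` out of factor f."""
    i = scope.index(var)
    out = defaultdict(float)
    for assign, val in f.items():
        out[assign[:i] + assign[i + 1:]] += val
    return dict(out), scope[:i] + scope[i + 1:]

# Made-up CPD factors for a chain A -> B -> C.
phi_A  = {(0,): 0.6, (1,): 0.4}                                 # P(A)
phi_AB = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}   # P(B | A)
phi_BC = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}   # P(C | B)

# Eliminate A first, then B: the summations are pushed inside the product.
f, s = factor_product(phi_A, ['A'], phi_AB, ['A', 'B'])
tau_B, s = sum_out(f, s, 'A')                 # tau(B) = sum_A P(A) P(B|A)
f, s = factor_product(tau_B, s, phi_BC, ['B', 'C'])
p_C, s = sum_out(f, s, 'B')                   # P(C) = sum_B tau(B) P(C|B)
```

Here each intermediate factor has scope over at most two variables, whereas the naive approach materializes the full joint; on longer chains that gap is what turns exponential work into linear work.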
It then uses those assignments to approximate the probability of a particular query. So here we have both exact and approximate methods, and this last one is an approximate method. We'll talk about some of these in the remaining sections of the course.