So we've previously seen the notion of pairwise Markov networks, but now we're going to define a much more general notion that is considerably more expressive than the pairwise case. And that definition is called the Gibbs distribution. So in order to motivate the notion of a Gibbs distribution, let's look at the most expressive Markov network that we could possibly define in the context of pairwise interactions. So here we have four random variables, A, B, C, D, and I've introduced all of the possible pairwise edges between them. And the question that we'd like to ask ourselves is, is this good enough? That is, is this fully expressive? Or in other words, can it represent any probability distribution over four random variables? One way to convince ourselves of whether it can or it can't is to go and do the general case and just look at a little bit of asymptotics. So consider a fully connected pairwise Markov network over n random variables, and let's assume that each variable Xi has d values. How many parameters does the network have? For the moment, let's just focus on the pairwise potentials and ignore the potentials associated with singleton nodes, and do a little bit of analysis. If we have n variables, how many edges are there in the network, and how many parameters per edge? There are O(n^2) edges, and each edge carries a d-by-d table, so d^2 parameters per edge. So the total number of parameters in a fully connected pairwise Markov network is O(n^2 d^2). Now, can we represent any probability distribution using O(n^2 d^2) parameters? How many parameters are there in a general probability distribution over n random variables, where each has d values? How many free parameters do we have? We have d^n - 1, which is much, much bigger than that. Which means that, even just thinking about it intuitively without getting into a formal argument, pairwise Markov networks are not sufficiently expressive to capture all probability distributions.
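This counting argument can be sketched in a few lines of Python; the particular values n = 10 and d = 2 below are arbitrary illustrations, not numbers from the lecture:

```python
from math import comb

def pairwise_params(n, d):
    # A fully connected network has C(n, 2) edges, each with a d x d table.
    return comb(n, 2) * d * d

def full_joint_params(n, d):
    # A general joint distribution over n d-valued variables has d^n entries,
    # minus one free parameter because the entries must sum to 1.
    return d ** n - 1

print(pairwise_params(10, 2))    # 45 edges * 4 entries = 180
print(full_joint_params(10, 2))  # 2^10 - 1 = 1023
```

Even at these small sizes the gap is clear, and the full joint grows exponentially in n while the pairwise count grows only quadratically.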
So how do we increase the coverage of this undirected representation? We need to move away from pairwise edges. So, in order to parameterize what we call a general Gibbs distribution, we're going to parameterize it using general factors, each of which has a scope that might contain more than two variables. So whereas before we just had factors over pairs, now we have factors over triplets, quadruplets, and anything else. Can we now represent every probability distribution? Well, sure, because we can have a factor over all n variables together, and that immediately allows us to define a general probability distribution. In fact, we even said that a probability distribution is a factor whose scope is X1 up to Xn. So in order to define this general framework more formally, a Gibbs distribution is parameterized by a set of factors Phi. And we're going to define this distribution in two steps. Three steps, even. The first thing we're going to do, just like in the case of pairwise Markov networks, is take all of the factors, so here we have k factors, and multiply them. And this is just the familiar operation of factor product, which we've seen in multiple contexts before. Now, this is perfectly fine, but the problem is that this is not necessarily a probability distribution. In fact, it almost never will be a probability distribution, because we have put no constraints on the factors. That's why we have this tilde here, which highlights the fact that this is what we called previously an unnormalized measure. And so, if we want to turn the unnormalized measure into a probability distribution, just like in the pairwise case, we're going to define what's called a partition function, which is just a normalizing constant that we get by summing the unnormalized measure over all possible assignments to the variables X1 up to Xn.
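A minimal brute-force sketch of these steps (factor product, then normalization by the partition function), assuming factors are represented as tables over small discrete domains; the factor values below are made-up illustrations:

```python
import itertools

def gibbs_distribution(factors, domains):
    """factors: list of (scope, table), where scope is a tuple of variable
    names and table maps value tuples to nonnegative reals.
    domains: dict mapping each variable name to its list of values."""
    variables = list(domains)
    unnorm = {}
    for assignment in itertools.product(*(domains[v] for v in variables)):
        full = dict(zip(variables, assignment))
        # Step 1: unnormalized measure = product of all factor values.
        p = 1.0
        for scope, table in factors:
            p *= table[tuple(full[v] for v in scope)]
        unnorm[assignment] = p
    # Step 2: the partition function Z sums over all assignments.
    Z = sum(unnorm.values())
    # Step 3: divide by Z to get a proper probability distribution.
    return {a: p / Z for a, p in unnorm.items()}

# Two factors over binary variables; the numbers are arbitrary.
phi1 = (("A", "B"), {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10})
phi2 = (("B", "C"), {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100})
P = gibbs_distribution([phi1, phi2], {"A": [0, 1], "B": [0, 1], "C": [0, 1]})
assert abs(sum(P.values()) - 1.0) < 1e-9  # normalized, by construction
```

This enumerates every assignment, so it is only feasible for tiny examples, but it follows the definition exactly.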
That partition function can then be used to divide all of the entries in the unnormalized measure to give us something that no longer has a tilde on it, but rather is a probability distribution, P sub Phi of X1 up to Xn. Now, that was just a definition of the distribution in terms of the set of factors. Where is the Markov network in all this? So let's think about what is the Markov network that we would like to have for a Gibbs distribution with a certain set of factors Phi. In order to get the intuition, let's look at a distribution that involves two factors: phi 1, whose scope is A, B, and C, and phi 2, whose scope is B, C, and D. And I'm going to use a different color for phi 2. So what edges should the Markov network have when we want to encode the fact that A, B, and C all get to interact with each other together? Intuitively, what we'd like to have is a network that has an edge between A and B, an edge between B and C, and an edge between A and C, because that captures the fact that there are direct probabilistic relationships between all of them. What about the other one? Here we have edges between B and C, between C and D, and between B and D. And this is the induced Markov network in this particular case. More generally, if we have a set of factors Phi, where each phi i has a particular scope Di, the induced Markov network, which we're going to call H sub Phi, has an edge between a pair of variables Xi and Xj whenever there exists a factor phi m in Phi such that Xi and Xj are both in the scope of phi m. That is, two variables are connected whenever they appear together in the same scope. Now we can go ahead and turn this around and define a notion, just like we have for Bayesian networks, of when a probability distribution P factorizes over a graph H. That is, at what point can I represent P over a particular graph H?
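The induced-network rule is mechanical enough to write down directly: connect every pair of variables that co-occur in some scope. A small sketch, using the lecture's two scopes {A, B, C} and {B, C, D}:

```python
from itertools import combinations

def induced_markov_network(scopes):
    """Edge (Xi, Xj) whenever Xi and Xj appear together in some factor scope."""
    edges = set()
    for scope in scopes:
        for x, y in combinations(sorted(scope), 2):
            edges.add((x, y))
    return edges

edges = induced_markov_network([{"A", "B", "C"}, {"B", "C", "D"}])
print(sorted(edges))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```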
And this basically asks the question of whether there is a set of factors Phi that is going to let me represent the probability distribution P. So, this is just a straightforward going through the definition that we had before. Does there exist a Phi such that P is equal to P sub Phi, where P sub Phi is defined, as we defined previously, as a normalized product of factors, and such that H is the induced graph for the set of factors Phi? So P factorizes over H if there exists a set of factors Phi such that I can represent P using those factors, and H is the induced graph for that set of factors. So that's when I can encode a distribution P over a graph H. So now let's ask ourselves the question: if you give me a graph H, can I tell you what the factorization of a distribution P over that graph would be? That is, which representation of the distribution did I have in mind when I drew this graph? So here we have this graph over A, B, C, and D. Which Gibbs distribution would induce the graph H? Well, let's look at them one at a time. So here we have phi 1 of A, B, D and phi 2 of B, C, D. And we see that the first factor gives us edges between A and B, between A and D, and between B and D, and the second gives us edges between B and C, between B and D, and between C and D. So the answer here is yes: this set of factors would induce this graph. Okay, what about the next one? Ask yourself the same question. Well, the factor over A, B would induce this edge. B, C, yep. C, D, yep. And then A, D, and B, D. Well, huh. Here's another distribution, with a different set of factors, that induces the exact same graph. The third one is the same principle. The factor over A, B, D gives us the edges between A and B, between B and D, and between A and D, and then we have B, C and C, D. So here's yet another distribution that induces the exact same graph. What that tells us is that we cannot, and this is important, cannot read the factorization from the graph.
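We can check this claim mechanically with the same edge-collection rule; the three scope lists below are my reading of the slide's three factorizations, and all three produce the identical edge set:

```python
from itertools import combinations

def induced_edges(scopes):
    # Same rule as before: variables in a common scope get an edge.
    edges = set()
    for scope in scopes:
        edges.update(combinations(sorted(scope), 2))
    return edges

f1 = [{"A", "B", "D"}, {"B", "C", "D"}]                             # two triplets
f2 = [{"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "D"}, {"B", "D"}]   # all pairwise
f3 = [{"A", "B", "D"}, {"B", "C"}, {"C", "D"}]                      # mixed

print(induced_edges(f1) == induced_edges(f2) == induced_edges(f3))  # True
```

Three different factorizations, one graph: the graph alone cannot tell you which factorization produced it.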
That is, we have different factorizations, which are quite different in their expressive power, all of which induce the exact same graph. And we've already seen an example of that: when we had the fully connected pairwise Markov network, we had one parameterization that had O(n^2 d^2) parameters, and we had another parameterization that had a full factor over all n variables, so it had d^n parameters. And these are two very different representations, with very different expressive power, that nevertheless induce the exact same graph. But we must ask the question of why, then, is the graph the same? What does the graph really tell us, given that it's not telling us the structure of the factorization? So, going back to the example on the previous slide, we have these two factorizations, one of which uses triplet factors and the second of which uses pairwise factors. And let's think about what is the flow of influence in these factors. So when can one variable influence another? Let's think about this intuitively: when can B influence D? Is this different in the two distributions? And the answer is, well, not really. Once we have a factor, in one case phi 1 and in the other case phi 5, that ties B and D directly, the fact is that B can influence D. What about A, can A influence C? A can influence C via D, by going through, in one case, the A, B, D factor and then subsequently utilizing the dependencies within the B, C, D factor; and in the other case, it can use the A, B factor.
And then the C, D factor. And so the point is, although the parameterizations of the two distributions are different, the trails in the graphs through which influence can flow are the same, regardless of this finer-grained structure of the factorization, which is why the graphs in those two cases are the same. So let's formalize this definition. We're going to define a notion of an active trail in a Markov network. And this is actually a very simple definition; it's much simpler than the analogous definition in the context of Bayesian networks. We have that a trail going from X1 up to Xn is active, given a set of observed variables Z, if basically no Xi along the trail is in Z, because an active trail only flows through variables that are unobserved. Once we observe a variable along the trail, influence stops, because that variable is now set and so you can't really influence it. And if you can't influence it, you can't influence anything subsequently along that path. So for example, the trail from B to D via A is active, but not if A is observed. Once I observe A, B can no longer influence D via A. So, to summarize, we defined the notion of a Gibbs distribution, which represents a distribution P as a normalized product of factors. We connected that to a graph structure, the induced Markov network, which connects every pair of nodes that appear in the same factor. And although we noted that the Markov network structure doesn't fully specify the factorization of P, the justification for why the graphs for different factorizations are the same is that the active trails in a graph depend only on the graph structure.
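The active-trail definition translates almost word for word into code. A sketch, using the trail B, A, D from the example (the check treats every variable on the trail uniformly, as the definition states):

```python
def is_active_trail(trail, observed):
    """A trail X1 - ... - Xn in a Markov network is active given Z
    iff no variable on the trail is in the observed set Z."""
    return not any(x in observed for x in trail)

# The trail B - A - D is active with nothing observed...
print(is_active_trail(["B", "A", "D"], observed=set()))    # True
# ...but observing A blocks it: B can no longer influence D via A.
print(is_active_trail(["B", "A", "D"], observed={"A"}))    # False
```

Note how much simpler this is than the Bayesian network case: there is no v-structure rule, just a single membership check along the trail.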