One very important variant of Markov networks, probably more commonly used at this point than anything that is not of this type, is what's called a conditional random field. A conditional random field looks very much like a Markov network, but it serves a somewhat different purpose. So let's think about what we're trying to do here. This class of models is intended to deal with what we call task-specific prediction, where we have a set of input, or observed, variables X and a set of target variables Y that we're trying to predict. The class of models is designed for those cases where we always have the same types of variables as the inputs and the same types of variables as the targets, so that we're always solving what is essentially the exact same type of problem. Let's consider two examples along these lines. Going back to image segmentation: here the input is always a set of pixels with their values, which we can process to produce more expressive features like histograms of colors and textures; that's not quite relevant yet, although it will be in a minute. So my input X is the pixel values and the features computed from them, and my target variables Y assign a class to every pixel, where by class I mean, you know, grass, sky, cow, water, and so on. We design the model specifically to solve this problem of going from X to Y. In some of the text processing, or NLP, problems that we've discussed, the input X is the words in a sentence, and the output we're trying to derive is, for example, a label for each word, where the labels might be things like person, location, organization, and so on. Once again, the input is always the words and the output is always the labels.

So what's the problem? We want to solve this kind of prediction problem; why not just use a regular old graphical model in the same way that we've done so far? Let's think about what some of the issues might be. Imagine that we're trying to predict Ci, the label of a particular superpixel i. We process the features of that superpixel to produce a range of different color and texture histograms that represent different ways of measuring the appearance of the superpixel. In the notation of the previous slide, this label is my target Y, and these features down here are my X's. So what's the problem with this? The problem is that these features, which are often very informative about the class of the superpixel, are also very correlated with each other. So, for example, consider the texture histogram, which tells us the directions in which lines go in a particular superpixel.
These histograms are very redundant in terms of the kind of texture that they measure. So if you have features that are very correlated with each other, where two features carry a lot of redundant information, and you represent them in a naive Bayes model as I drew here, where the features are independent given the label, then you are effectively ignoring that correlation structure. Now, why is that bad? It's bad because it allows me to count the same feature again and again and again. If I have five copies of a feature, or five very correlated features that effectively measure the same thing, I count that evidence five times, which means I get really confident because of what is essentially one feature. If I have a hundred copies of that feature, it gets counted a hundred times. Again, this pushes me toward very skewed probability distributions that are not really good reflections of my true beliefs, because they make incorrect independence assumptions. Well, fine. So let's make correct independence assumptions: let's add a bunch of edges to capture the correlations. That turns out to be really hard, because once you start using non-trivial features, figuring out how correlated they are is hard, and furthermore, trying to put in those correlations gives rise to densely connected models once I start adding all these edges between the features.

A completely different solution to this problem basically says: I don't care about the image features themselves. I'm not trying to predict a probability distribution over those pixels; I'm not trying to do image synthesis. I don't care about the probability of having, you know, a green pixel next to another green pixel next to a brown pixel. The features X are something that is already given to me, and the only thing I really care about is modeling the distribution over Y. So I reformulate my problem: instead of modeling the joint distribution of X and Y together, I'm going to model the conditional distribution of Y given X. We're no longer trying to capture the distribution over X, and if we're not trying to capture the distribution over X, then I don't really care about the correlations within it. We'll see that a little more clearly when we look at an example, but before we do that, let's first give ourselves the formal definition of this thing, which is called a CRF, which stands for conditional random field. A conditional random field at first glance looks just like a Gibbs distribution. Just like a Gibbs distribution, we have a set of factors, each with its own scope, and I multiply the factors together to get an unnormalized measure. So far it's exactly the same, except that the arguments of the unnormalized measure are X and Y rather than just one set of random variables, and this is where the difference comes in. If I want to model a conditional distribution of Y given X, I'm going to have to put X on the right-hand side of the conditioning bar, which means I'm going to need a separate normalization constant, a partition function, that is a function of X.
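Written out, the definition just described looks like this, with $\Phi = \{\phi_1(D_1), \dots, \phi_K(D_K)\}$ the set of factors and $D_k$ the scope of the $k$-th factor (this is only a restatement, in symbols, of the verbal definition above):

$$
\tilde P_\Phi(X, Y) = \prod_{k=1}^{K} \phi_k(D_k), \qquad
Z_\Phi(X) = \sum_{Y} \tilde P_\Phi(X, Y), \qquad
P_\Phi(Y \mid X) = \frac{\tilde P_\Phi(X, Y)}{Z_\Phi(X)}.
$$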
And so what does that mean? It means that for any given x, I sum over y: for any given x, I take the sum of the unnormalized measure over all the y's that go with that x, and I construct the conditional distribution over Y, for that given x, by dividing by this x-specific partition function. (The normalized distribution on the left shouldn't have a tilde.) To be a little more concrete, let's imagine that I have a table here, and to make life simple for myself, it's a table over two binary variables, with entries for X0 Y0, X0 Y1, X1 Y0, and X1 Y1. For any given x, say these two entries right here for X0, I sum them up, and that gives me the normalizing constant for those two entries together; then I divide each of them by that Z(X0). That guarantees that these two entries now sum to one, because I've simply divided by their appropriate normalizing constant. Similarly, I add up the two entries for X1 and divide each of them by Z(X1). So what we've defined is a family of conditional distributions, one for each value of X.

Now, it turns out that CRFs are highly related to a model that we have previously seen in many machine learning classes, and which also comes up as a CPD representation in Bayesian networks, which is the logistic function, or the logistic model. In order to understand that connection, let's look at a very, very simple case where both the Xi's and Y are binary random variables, so they take on values in {0, 1}. And we're going to use in our log-linear model a simple representation of the features, which is just an indicator function: my feature fi(Xi, Y) is 1 when Xi and Y are both 1, and 0 otherwise, and it has a parameter wi. Now let's plug that into the CRF representation and see what happens. Let's look at phi_i(Xi, Y = 1). In this case Y is always 1, and we have two choices: either Xi is 1, in which case the indicator function is 1, or Xi is 0, in which case the indicator function is 0. So you can rewrite this factor as exp(wi Xi), because if Xi is 0, you get exp of wi times 0, and if Xi is 1, you get exp of wi. On the other hand, if Y = 0, then my indicator function is 0 regardless of Xi, and at that point the factor is exp of 0, which is equal to 1. So these are my two cases: the value of this factor for Y = 1 and its value for Y = 0. Now I can go ahead and compute my unnormalized measure, and again I'm going to do two separate cases, Y = 1 and Y = 0. My unnormalized measure in the case Y = 1 is just the product of all of these factors up here, and because a product of a bunch of exponentials is the exponential of the sum, it comes out to the exponential of the sum of the wi Xi. For Y = 0, the product of a bunch of ones is one, so that's pretty straightforward. And so now, putting this all together to produce a normalized distribution: remember, my goal is P(Y | X), the probability of my binary Y given all of the features X.
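In symbols, the two cases just worked out are

$$
\phi_i(X_i, Y = 1) = \exp(w_i X_i), \qquad \phi_i(X_i, Y = 0) = 1,
$$

and multiplying the factors gives the unnormalized measure

$$
\tilde P(Y = 1, x_1, \dots, x_n) = \exp\Big(\sum_i w_i x_i\Big), \qquad
\tilde P(Y = 0, x_1, \dots, x_n) = 1.
$$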
And if you just stick these together, you get the unnormalized measure for Y = 1 and X, divided by the unnormalized measure for Y = 1 and X plus the unnormalized measure for Y = 0 and X, which is my normalizing constant. This is a very familiar shape: it's the sigmoid function. If we call the sum in the exponent z, this is e to the z divided by one plus e to the z, which is exactly the sigmoid. So what we've concluded from this is that the logistic model is a very simple CRF.

Broadening from this particular mathematical derivation: if we were to use this same set of potentials to represent the joint distribution of Y and X together, it would be a kind of naive Bayes model, because all we have are pairwise terms that relate Y to each Xi. If we weren't doing the conditional normalization, we would have a model with Y and a pairwise feature relating Y to X1, Y to X2, up to Y to Xn. Since we don't have any pairwise features that connect the Xi's to each other, we would effectively have a model where the Xi's are independent given Y. So if I weren't doing a conditional model, if I just used this set of potentials to represent the joint distribution of X and Y, it would effectively be a naive Bayes model, and it would make very strong independence assumptions. But because I am modeling a conditional distribution, I've effectively removed from this analysis any notion of correlations between the X's, and I'm just modeling how the X's come together to affect the probability of Y. And that is really the difference between a naive Bayes model and a logistic regression model. That same intuition extends to much richer classes of models, where I don't just have binary variables and a single Y, but rather a very rich set of Y's and X's. Nevertheless, this ability to ignore the distribution over the features and focus on the target variables allows me to exploit rich, correlated features without worrying about whether they're independent of each other or not.

So, for example, going back to our notion of CRFs for image segmentation, here we typically have very rich features of the variables. When we think about the individual node potentials that relate the features Xi to the class label Yi, we don't worry about how correlated the features are. We can have color histograms and texture features. We can have discriminative patches, like one that looks for the eye of a cow, for example. All of these are going to be really correlated with each other, but I don't care. We can even look at features that are outside the superpixel: we can say, well, if it's green underneath, in a completely different superpixel, then maybe this one is more likely to be a cow or a sheep, because cows and sheep tend to be on grass. That's cool, too. These features are definitely correlated, because you're counting the same feature for my superpixel as well as for a different superpixel. That's fine, I don't care, because I'm not worried about the correlations between the superpixels' features. The correlations don't matter.
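As a quick sanity check on the equivalence between this simple binary CRF and logistic regression, here is a minimal Python sketch; the weights and the feature vector are made-up toy numbers for illustration, not anything from the example above:

```python
import math

# Binary CRF with indicator features f_i(x_i, y) = 1{x_i = 1, y = 1} and
# weights w_i; its conditional distribution reduces to a sigmoid.
w = [0.5, -1.2, 2.0]    # one weight per indicator feature (toy values)
x = [1, 0, 1]           # an observed binary feature vector (toy values)

# Unnormalized measure: product of factors phi_i(x_i, y) = exp(w_i * 1{x_i = 1, y = 1}).
unnorm_y1 = math.exp(sum(wi * xi for wi, xi in zip(w, x)))  # Y = 1 case
unnorm_y0 = 1.0                                             # Y = 0 case: every factor is exp(0)

# Conditional normalization: divide by the x-specific partition function Z(x).
p_y1_crf = unnorm_y1 / (unnorm_y1 + unnorm_y0)

# The same quantity via the sigmoid of z = sum_i w_i * x_i.
z = sum(wi * xi for wi, xi in zip(w, x))
p_y1_sigmoid = 1.0 / (1.0 + math.exp(-z))

print(p_y1_crf, p_y1_sigmoid)   # the two numbers agree
```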
Going further, and this is very commonly done, you can even train your favorite discriminative classifier, a support vector machine, boosting, random forests, anything that you like, to predict the probability of Yi given a whole bunch of image features X, and that's fine too. In fact, that is how one achieves high performance on most of these tasks: by training very strong classifiers, in most cases for your node potentials, that is, the predictors for the individual variables, and then adding on top of that pairwise features, or not just pairwise but higher-order features, between the Y's. One important point that is useful to make here is that the word "feature" is overloaded in this framework, which can be confusing, and rightfully so. There are features in the sense of image features, like these over here, and that's one use of the word. The other use, which is this one, is the features in my log-linear model, the f's that we use to define it. The word is used for both, which is unfortunate, but there it is. Hopefully the meaning will be clear from context.

The same kind of idea applies when we do CRFs for language. Again, we usually have features that are very highly correlated with each other. For example, whether the word is capitalized is correlated with whether the word appears in some atlas or name list; both are definitely correlated with whether the previous word is "Mrs." or "Mr."; and there are all sorts of other features that are often very correlated with each other. You can even see that the same word is used as a feature for more than one label. That's fine, it's not a problem, because we're not trying to model the distribution over the words in the sentence, but rather the probability of the labels given the words.

So, to summarize: a CRF looks deceptively like any other Gibbs distribution, but a subtle yet critical difference is that it's normalized differently. It's normalized so that you're creating a conditional distribution of Y given X. As we've seen, it subsumes the standard logistic regression model as a special case, but it has much richer expressive power. A key feature is that we don't need to model the distribution over variables we don't care about, only over the ones we actually care about predicting. And a critical benefit of this is that you can design really, really expressive predictors for pieces of the model without worrying about incorrect independencies between different variables, which would be inevitable if you actually tried to model the joint distribution over these expressive features.
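To make the summary concrete, here is a minimal sketch of a chain-structured CRF for word labeling. The label set, the feature functions, and the weights are all hypothetical choices of mine for illustration; the point is only that the node factors can use rich, overlapping features of the observed words, and that normalization happens per sentence x:

```python
import math
from itertools import product

# Toy chain CRF for word labeling (labels and weights invented for illustration).
LABELS = ["PER", "LOC", "O"]

def node_score(word, prev_word, label):
    """Log of the node factor: overlapping, correlated features of x are fine."""
    score = 0.0
    if word[0].isupper() and label in ("PER", "LOC"):
        score += 1.0                       # capitalization feature
    if prev_word in ("Mr.", "Mrs.") and label == "PER":
        score += 2.0                       # previous-word feature (overlaps with capitalization)
    if word in ("Paris", "London") and label == "LOC":
        score += 2.0                       # gazetteer-style feature
    if label == "O":
        score += 0.5                       # mild preference for the background label
    return score

def edge_score(label1, label2):
    """Log of the pairwise factor between adjacent labels."""
    return 0.8 if label1 == label2 else 0.0

def conditional_distribution(words):
    """P(y | x) by brute-force enumeration of label sequences (fine for a tiny sentence)."""
    unnorm = {}
    for ys in product(LABELS, repeat=len(words)):
        logp = sum(node_score(w, words[i - 1] if i else "", y)
                   for i, (w, y) in enumerate(zip(words, ys)))
        logp += sum(edge_score(a, b) for a, b in zip(ys, ys[1:]))
        unnorm[ys] = math.exp(logp)
    z_x = sum(unnorm.values())             # partition function specific to this x
    return {ys: v / z_x for ys, v in unnorm.items()}

dist = conditional_distribution(["Mrs.", "Smith", "visited", "Paris"])
best = max(dist, key=dist.get)
print(best, dist[best])
```

The brute-force sum over label sequences is only there to keep the sketch short; in practice one would compute the per-sentence partition function and the best labeling with a proper inference algorithm over the chain.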