Hi and welcome to today's lecture about evaluation. In this second part of the course, we want to talk about how to evaluate machine translation, before we then talk about statistical machine translation in the next part. First, we want to find out how we can measure whether a translation system performs well or badly. So we want to measure the quality of the translations generated by the system.

Why do we need machine translation evaluation? There are mainly three situations where we need it. The first is the application scenario. If you now want to apply machine translation in a particular use case, say to help tourists or to translate newspaper articles, you want to know what the performance of the current system is. Is it good enough to give it to the customer, or should we not deliver it yet but continue developing the system? So the first main reason to use MT evaluation is to test whether a system's quality is good enough to hand it over to the customer.

A second important reason to use MT evaluation is research. If you are working on machine translation, you want to know whether you improve over time. You will have to show whether you improved the quality of MT in the last year, and in general it is common knowledge that you can only improve what you can measure. So it is very important to measure progress over time and to decide whether approach A is better than approach B: should we work on approach A, or should we work on approach B? Furthermore, you want to know where the errors of the system are, so you know directly which parts of the system you have to work on and which parts you have to improve to get rid of these errors.

A third point where we want to use machine translation evaluation is system building. When we are building the systems, we will use machine learning techniques, and we can directly train these systems towards a metric, so we can train the systems in a way that optimizes this metric.

During today's lecture, we will see that the evaluation of machine translation is a quite difficult problem of its own. It is not easy to decide whether a translation is good or bad, and we therefore face similar problems as in machine translation itself. We will first look at these difficulties and then present first approaches for evaluating machine translation. In general, there are two main approaches to evaluation. One is human evaluation, where humans are asked to judge how good the machine translation is. The second is automatic evaluation, where we try to develop programs which can automatically estimate whether the quality is good or bad. In today's lecture, we will present different methods for doing this evaluation by humans or in an automatic fashion.

So let us start today's lecture with the difficulties of evaluating machine translation output. One main reason why it is difficult is that language is ambiguous. When you are generating a translation of a source text, there is not one single correct answer; there are different ways to translate it into the target language, and often several of these translations are correct. An example is given here on the right side. If we have a source sentence whose reference translation is "Everything is different now", this is one correct translation, but it is not the only correct translation. For example, the system might generate the translation "Everything has changed".
This is a correct translation and should therefore be rated as good system output. But if we just compare the words used in the reference sentence and in the hypothesis, we see that there is not really a lot in common: except for the word "everything", they use completely different words. So it is not easy for a computer to decide that these two sentences have the same meaning.

On the other hand, we can have very small changes in a sentence which completely change its meaning. In this case, it is also difficult to decide whether the meaning is the same, and whether it is a good translation or not. Again, we have an example here on the right side: for the source sentence whose reference translation is again "Everything is different now", a translation system outputs "Nothing is different now". This sentence has a completely different meaning than the original sentence, but only one word is different. Just by changing "everything" into "nothing", we change the complete meaning of the sentence.

A third difficulty when evaluating machine translation is that it is often subjective. It is not always clear whether the meaning is still the same or already slightly different. If we again look at our sentence, and then look here on the right side at the example "Some things have changed", it is not completely the same as the reference, but it has at least a similar meaning. So it is no longer clear: is it a correct translation, or is it not? Some people might say, "Oh, it still means mainly the same", whereas other people will argue that it is a different meaning and therefore a wrong translation.

And finally, the evaluation also depends on the application scenario. How good or bad a translation is, or how useful it is, also depends on where you apply your machine translation system. If you forget to put a "not" into your sentence, the meaning is completely changed. Therefore, if we use an MT system to translate some data, the output may be useless, and we should not use the system in such a scenario. If we use the system, for example, as a preprocessing step, and a human post-editor then generates the final translation, they can quite easily add the missing "not", and the translation might still be helpful. So whether a translation is considered useful or not really depends on your application scenario.

When distinguishing different approaches to evaluation of machine translation, one main distinction is between human evaluation on the one side and automatic evaluation on the other side. When we talk about human evaluation, we mean the following scenario, sketched here on the right side: we use a machine translation system to translate our German data into English, and then a human expert rates whether this translation is good or bad. This, of course, is the gold standard, because what we are finally interested in is whether the output generated by the machine translation system is useful for a human. So this scenario is what we want to achieve in the end. On the other hand, this human evaluation also has some disadvantages. For one thing, we already saw that it is somewhat subjective.
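To make the first two difficulties concrete, here is a small, purely illustrative Python sketch. The word_overlap function is a hypothetical toy score, not any standard MT metric: it measures only which words a hypothesis shares with the reference, and on the example sentences from the slides it rewards the meaning-reversing translation much more than the correct paraphrase.

```python
def word_overlap(hypothesis: str, reference: str) -> float:
    """Toy score: fraction of hypothesis words that also appear in the reference."""
    hyp_words = hypothesis.lower().split()
    ref_words = set(reference.lower().split())
    if not hyp_words:
        return 0.0
    return sum(1 for w in hyp_words if w in ref_words) / len(hyp_words)

reference = "everything is different now"

# A correct paraphrase, but with almost no words in common with the reference.
print(word_overlap("everything has changed", reference))    # 0.33

# Only one word changed and the meaning is reversed, yet the overlap is high.
print(word_overlap("nothing is different now", reference))  # 0.75
```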
When we look into that, we will see that even for a human it is a very difficult task, and humans often do not agree whether something is a correct translation or a bad translation; and if they have to rank system A against system B, a lot of people will not agree either. A second disadvantage is that this approach is very expensive and time-consuming. So we do not have the resources to do a real human evaluation in every scenario, especially if we want to try a lot of different experiments; we cannot do a human evaluation for all of them.

Therefore, there is a second group of evaluations, which we call automatic evaluation. In automatic evaluation, we try to replace the human at the end by a computer. So we again generate an automatic translation from German to English, and then we use a computer to evaluate whether this is a good translation or a bad translation. Of course, this is now really difficult: how should a computer program judge whether this is a good or a bad translation? If we had such a computer program, we could perhaps use it directly during translation and hopefully generate a better translation in the first place. Therefore, in this scenario, we normally allow ourselves additional resources. In addition to the translation generated by the computer, we also have a translation generated by a human, which we accept as the gold standard. The computer can then compare the human translation and the computer-generated translation to estimate whether the computer-generated translation is good or bad.

Of course, the question now is what we gain, since we again need a human to generate a translation. The main advantage is that we only have to generate these translations once. So for one test set, we have a translator who translates it into English, and then we can evaluate a lot of different MT systems and always compare against this gold standard, so we can run a lot of experiments without needing additional human effort each time. Therefore, compared to a human evaluation, the automatic evaluation is often cheaper and a lot faster. On the other hand, the question is of course whether the performance is as good, since in the end it is not a human who evaluates how good the output is, but a computer model. It can only approximate the human, and maybe we introduce additional errors here: the model might tell us, "Oh, this is a very good translation", when in reality it is a quite bad translation.

In addition to this main categorization into human and automatic evaluation, there are further differences between evaluation approaches. The first thing in which evaluation approaches differ is the granularity. Sometimes you are interested in a sentence-based evaluation: for every sentence you want to know whether it is a good translation or a bad translation. So for a large document, we do not only have one score saying this is a good translation; for every sentence we know whether this sentence is translated well or badly. This is often done in human evaluation, and it allows more error analysis, because you can look in more detail at where errors happen in order to improve your systems.
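To make the automatic side of this concrete, below is a minimal, hedged sketch of the workflow described above, using the same kind of toy overlap score as before rather than any real metric. The reference translations are created by a human only once and can then be reused to score any number of systems, both per sentence and averaged over the whole test set; the sentences and system names are invented for illustration.

```python
def word_overlap(hypothesis: str, reference: str) -> float:
    """Toy score: fraction of hypothesis words that also appear in the reference."""
    hyp = hypothesis.lower().split()
    ref = set(reference.lower().split())
    return sum(1 for w in hyp if w in ref) / max(len(hyp), 1)

# The gold standard: translated by a human only once, then reused for every experiment.
references = ["everything is different now", "the weather is nice today"]

# Outputs of two hypothetical MT systems on the same test set.
systems = {
    "system_A": ["everything has changed", "the weather is nice today"],
    "system_B": ["nothing is different now", "the weather was bad today"],
}

for name, outputs in systems.items():
    sentence_scores = [word_overlap(h, r) for h, r in zip(outputs, references)]
    document_score = sum(sentence_scores) / len(sentence_scores)
    print(name, [round(s, 2) for s in sentence_scores], round(document_score, 2))
```

The per-sentence numbers of such a simple score are quite unreliable, which is one reason why, as discussed next, automatic metrics are usually applied at the document level.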
For automatic evaluation, however, sentence-level scoring is often very complicated, because for a single sentence it is much harder to decide whether it is a good or a bad translation than to average over a large document and decide for the whole document whether the translation is good or bad. Therefore, automatic metrics often use document-based evaluation. In this case, we do not give a score to every sentence; we only evaluate the whole document. So we only say, "This document is a good translation", or "This document is a bad translation". At this level, automatic metrics often work better, and that is the reason why in a lot of cases we use document-based metrics for automatic machine translation evaluation.

And finally, there is task-based evaluation. In this case, we no longer ask whether the translation itself is really good or really bad; we just look at the overall application and ask: can the user fulfill his task, or can he not? So imagine you are using a translation system in a patient-doctor scenario, where the doctor and the patient communicate only through a machine translation system. In this case, you might not be interested in whether the translation is really good or bad; in the end, you want to know whether the doctor could help the patient or not, and therefore you just evaluate whether the treatment was successful. Or imagine a machine translation system is used in a lecture scenario: the lecture is translated by a machine translation system, and later you simply ask the students questions about the lecture, which they have to answer. If they understood the lecture, the MT was successful; if they didn't understand the lecture, the MT wasn't successful.

The main advantage here is that this is the final goal. Often you are not really interested in whether it is a good or a bad translation, but in whether the MT system is helpful for the user or not. So with this approach you are really evaluating your final goal. The problem is that it is very time-consuming and often needs a lot of effort, because you not only need to build your MT system, but you need to build a whole infrastructure around it. Furthermore, the result might not depend only on the machine translation system; it might depend on a previously used automatic speech recognition system, or the students simply weren't interested in the lecture and so didn't follow it. So there can be other reasons for failure, and it is not a good indicator for finding errors. But since it is very important for the end application, the whole task is often evaluated as well.

Another point when looking at machine translation evaluation is which properties of the machine translation output are evaluated. The two main properties which often get evaluated are, on the one hand, adequacy and, on the other hand, fluency. Let us first have a look at adequacy. Adequacy here means the question: is the meaning translated correctly? If we, for example, look on the right side, we again have our reference "Everything is different now", and then we might have the translation "Different is everything now". Of course, this is not really fluent English, but the meaning is translated correctly. So in this case, we have an adequate translation, but it is not a fluent one. The other property of the MT output which we often look at is fluency: is the generated output actually fluent? Often you do not want only an adequate translation; it should also sound like real English. Again, we can look here on the right side at one possible translation, "Nothing is different now". We have seen this one already: it is not an adequate translation, but of course it is a completely fluent one.
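In a human evaluation, adequacy and fluency are often judged separately for each sentence, for example on a five-point scale. The following is only a minimal sketch of how such judgments could be stored and aggregated; the scale, the sentences, and the scores are made-up placeholders, not real evaluation data.

```python
# Hypothetical per-sentence judgments on a 1-5 scale (5 = best),
# adequacy and fluency rated independently by a human judge.
judgments = [
    {"hypothesis": "different is everything now", "adequacy": 5, "fluency": 2},
    {"hypothesis": "nothing is different now",    "adequacy": 1, "fluency": 5},
    {"hypothesis": "everything has changed",      "adequacy": 5, "fluency": 5},
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Aggregating the two dimensions separately shows where the system needs work:
# low adequacy points to meaning errors, low fluency to ungrammatical output.
print("avg adequacy:", mean(j["adequacy"] for j in judgments))
print("avg fluency: ", mean(j["fluency"] for j in judgments))
```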
So sometimes it makes sense to analyze adequacy and fluency independently, to find out which errors occur in your MT system, so that you can then work on these errors directly.

And finally, you can evaluate your machine translation output by doing an error analysis. Why do we want to do an error analysis? First of all, we may want interpretable results. We want to really know what the errors are, and not only have one score telling us, "Okay, the quality is 0.9"; we really want to see which types of errors the system makes. So if we want to identify the most prominent errors, in order to then develop methods to fix them, it is often helpful to do an error analysis of your MT system.

One thing you can do here is an error classification. That means you look at the output, and you do not only say this is a good or a bad translation; for all the translations that contain errors, you classify the errors. Typical errors that machine translation systems make are, for example, missing words, incorrect word order, or incorrect translations, which means a word was translated with the wrong word on the target side, and so on. These are typical errors of machine translation systems, and once you have identified which errors your system makes, you can work on them directly.

Another way to do this error analysis is to use test suites. In this case, you generate a test suite in which every sentence contains different phenomena which are typical for the language at hand. Then you can look at how your translation system translates these sentences, to see whether it makes the typical errors on them or whether it is able to avoid them. For example, you can have some sentences with negation, to check whether your system translates these negations correctly, or you can have test sentences with pronouns, to see whether the system is able to translate these pronouns correctly. It is also good to continuously test your system and prevent it from making errors you have already corrected previously: if your test suite contains all the errors you have corrected, then you can make sure that these errors do not occur again in your system.
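As an illustration, here is a minimal sketch of how such a test suite could be automated, assuming we can call the MT system through a hypothetical translate function (stubbed out here with canned outputs). Each test case names the phenomenon it probes and a simple string check on the output; real test suites are of course larger and use more careful checks.

```python
def translate(source: str) -> str:
    """Placeholder for the real MT system; returns a canned output here."""
    canned = {
        "Er kommt nicht.": "He is not coming.",
        "Sie hat ihr Buch verloren.": "She has lost her book.",
    }
    return canned[source]

# Each test case: source sentence, the phenomenon it probes,
# and a word that must appear in the translation for the check to pass.
test_suite = [
    {"source": "Er kommt nicht.",            "phenomenon": "negation", "must_contain": "not"},
    {"source": "Sie hat ihr Buch verloren.", "phenomenon": "pronouns", "must_contain": "her"},
]

for case in test_suite:
    output = translate(case["source"])
    passed = case["must_contain"] in output.lower().split()
    print(f"{case['phenomenon']:>9}: {'PASS' if passed else 'FAIL'}  ->  {output}")
```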
So in summary, today we saw that MT evaluation is itself a very difficult problem. The main reason for this is, as for machine translation itself, that language is ambiguous. So there is not one correct translation; there may be several correct translations. And on the other hand, there may be translations which are very similar on the surface but whose meaning is completely different. The two main methods for evaluating MT systems are, on the one hand, human evaluation and, on the other hand, automatic evaluation.

Furthermore, we saw that there are different granularities at which machine translation evaluation can be done. You can do a sentence-wise evaluation, where for every sentence you say whether it is a good or a bad translation; this is very useful if you want to do an error analysis, for example. On the other hand, you have document-wise evaluation, where you only say for the whole document whether it is a good or a bad translation, or whether system A or system B is better on this document. That does not mean that system A is better on every single sentence, but that on average system A is better than system B. And finally, we also have task-based evaluation, where we do not evaluate the MT system itself but the whole task: is the user able to fulfill his task with the help of the machine translation system?

Lastly, we looked at two important properties of the MT output: on the one hand, adequacy, meaning the translation should have the same meaning as the source sentence, and on the other hand, fluency, meaning the output should be fluent and not, for example, have words in the wrong order.