We're going to talk about an area of machine learning called reinforcement learning. This is an area of machine learning that has recently generated tremendous excitement, in that it was one of the fundamental technologies in a machine learning solution for playing the game Go. Go is an ancient game from Asia, played on a board with stones of two colors, which you can see on the cover of the Nature article here. It is a very complex board game, one that was previously thought to be impossible for a machine to play as well as a human. What was demonstrated using reinforcement learning and deep learning was that it is indeed possible to develop a machine that can play the game Go with proficiency that exceeds that of humans. So this was a fundamental breakthrough in machine learning, and it has generated a lot of the excitement that we now see in the field. Interestingly, it was based largely on reinforcement learning, using techniques that we're now going to discuss. So this hopefully gives us some motivation for trying to understand reinforcement learning. One problem you might realize quickly if you try to study reinforcement learning is that it is a rather mathematical field, with some rather complex mathematical equations, which you can see here. The point of this slide is not to scare you; it's to show that if we attack the problem of reinforcement learning head on, by trying to understand the mathematics first, there's a danger that we may lose the opportunity to understand the underlying foundations of what reinforcement learning is doing, which are actually far simpler than these equations might imply. So what I'd like to do is walk through an example that should be intuitive to most people, which will give us a sense of what reinforcement learning is doing.
So let's consider the situation of a medical doctor, an MD, who is tasked with treating a patient. The goal here is that the doctor would like to develop a policy, a framework by which she can optimally treat the patient. By optimally, we mean most effectively from the standpoint of improving the health of the patient, but also in a cost-effective manner. We assume in our setup that the doctor has a set of actions that she may perform. Actions in the case of the doctor might be medications that could be prescribed or procedures that might be performed. Each of those actions will, in general, impact the state of the patient. Here, the state of the patient is characterized by the health data presented to the clinician. So what the doctor desires is a policy that achieves a good balance between achieving good outcomes for the patient and doing so at low cost. This could be applied in many medical settings, for example in the control of diabetes, or in optimizing workflow in an operating room. So this is a very fundamental problem in medicine and, as we'll see, in many fields. The way we're going to set this problem up is that our doctor observes the patient, and in particular observes the state of the patient. That state is represented by a vector, which we'll call s. The state of the patient consists of whatever variables we have to characterize their health: for example, temperature, blood pressure, perhaps laboratory tests, et cetera. So the state s is whatever data we have to characterize the health of the patient, and it is observable to the doctor. Then we assume that the doctor has a set of actions that she may take. We'll represent that set by A.
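As a concrete sketch of this setup, the state s can be thought of as a vector of health measurements and A as a finite set of actions. The variable names and values below are illustrative assumptions, not clinical data from the lecture:

```python
# Hypothetical patient state s: a vector of whatever measurements we have
# to characterize health (names and values are illustrative assumptions).
state = {
    "temperature_c": 38.4,      # body temperature
    "systolic_bp": 142,         # blood pressure reading
    "white_cell_count": 11.2,   # a laboratory test result
}

# The action set A = {a_1, ..., a_m} available to the doctor
# (again, illustrative labels only).
actions = ["prescribe_medication", "suggest_procedure", "observe_only"]
```

The doctor observes `state` and must choose one element of `actions`; everything that follows builds on this pairing of states and actions.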
The set A contains actions a_1 through a_m that the doctor may take. These actions may be, again, prescribing a particular medication, suggesting a particular procedure, et cetera. Now, there is randomness in how a patient in state s will respond: given a patient in state s, there is some uncertainty in how that patient will react to the action that is selected. That randomness is present partly because there may be some randomness in the disease itself. But in addition, the state s, which is a vector of parameters that characterize the health of the patient, may not be sufficient to characterize every aspect of the patient's health. Those missing aspects of health, the ones not captured by s, manifest as randomness, because those missing characteristics differ from patient to patient. So we introduce a probability distribution, which we represent by P: the probability that, when the patient is in state s and the doctor takes action a, the patient will transition to state s'. So s' is a new state of health of the patient. For example, the patient presents to the doctor in a certain state of health s, characterized by parameters such as, again, the temperature, blood pressure, et cetera, of the patient. The doctor then takes an action, for example prescribes a medication, and then the doctor sees what happens: the state of health of the patient changes to s', which might mean a change in the temperature, a change in the blood pressure, or in whatever other laboratory tests we have. So given this probability P(s, a, s'), what we would like to do is devise optimal actions for the doctor to take.
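To make the transition distribution P(s, a, s') concrete, here is a minimal sketch with a handful of discrete states. The state names, actions, and probabilities are illustrative assumptions, not clinical values; the key idea is that sampling from P(s, a, ·) models both the randomness of the disease and the aspects of health not captured by s:

```python
import random

# Toy transition model: P[(s, a)] maps each possible next state s' to its
# probability. States, actions, and numbers are illustrative assumptions.
P = {
    ("fever", "prescribe_antibiotic"): {"healthy": 0.7, "fever": 0.3},
    ("fever", "wait"): {"healthy": 0.3, "fever": 0.6, "worse": 0.1},
}

def sample_next_state(s, a, rng=random.random):
    """Draw s' according to P(s, a, .) by inverting the cumulative
    distribution. The draw is random even though (s, a) is fixed."""
    u, cumulative = rng(), 0.0
    for s_next, prob in P[(s, a)].items():
        cumulative += prob
        if u < cumulative:
            return s_next
    return s_next  # guard against floating-point round-off
```

Note that for each (s, a) pair the probabilities over s' sum to one: the same patient state and the same action can still lead to different outcomes on different draws, which is exactly the randomness described above.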
So here, to define what we mean by optimal, we're going to define a reward function: R(s, a, s') is the reward provided whenever the patient is initially in state s, the doctor suggests action a, and the patient then goes to state s'. This is what we call a reward. For example, if the state of health improves, in other words if s' is better than s after taking action a, then the reward should be high. If the state s' is worse for the patient than it was at state s, the reward should be low. This reward should take into account the state of health of the patient, initially state s and subsequently state s', but it should also take into account the cost of the action. So this reward accounts for the benefit to the patient from the standpoint of health, but also the cost of providing that care. The thing to note is that this reward function R(s, a, s') reflects the immediate impact to the patient of taking action a when the patient is in state s and the patient then transitions to state s'.
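A minimal sketch of such a reward function, continuing with hypothetical discrete states: health improvement contributes positively, deterioration contributes negatively, and the cost of the chosen action is subtracted. The health scores and action costs are illustrative assumptions, not real clinical weightings:

```python
# Illustrative health scores for each state (higher = healthier) and
# costs for each action. All numbers are assumptions for the sketch.
HEALTH_SCORE = {"worse": 0.0, "fever": 1.0, "healthy": 2.0}
ACTION_COST = {"prescribe_antibiotic": 0.5, "wait": 0.0}

def reward(s, a, s_next):
    """R(s, a, s'): immediate reward only. It is high when health
    improves, low when it worsens, and it subtracts the action's cost."""
    return HEALTH_SCORE[s_next] - HEALTH_SCORE[s] - ACTION_COST[a]
```

So, for instance, curing a fever with an antibiotic yields a positive reward (the health gain outweighs the drug's cost), while waiting and seeing the patient worsen yields a negative one. This captures only the immediate impact of a single step; reinforcement learning is concerned with choosing actions that do well over sequences of such steps.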