Once we have completed the extraction phase of the ETL process, we can move on to transformation. During transformation, we make changes to the data to improve its quality. We may initially look at the data and try to identify anything that looks wrong and can be corrected, while in other cases we may not be able to make certain adjustments or corrections until later in the process, after we have done some analysis of the data and better understand what needs to be transformed and how. So there may be some obvious problems to fix, and others that require a little more research. There are several different reasons why we might want to transform the data, but in pretty much every case, the end goal is to make it more suitable for use in data science.

The two most common tasks we perform during data transformation are preparing the data and then cleaning it. Data preparation involves altering the data so that it works more effectively with our data science and machine learning algorithms. The goal is to improve our ability to analyze the data and to build mathematical models that can be used to create estimations; for a model to produce accurate, good-quality estimations, this preparation and cleaning needs to be done. Many different tasks may go into preparing the data, but the basic purpose is to identify and correct issues before we reach the point of actually loading the data into its final destination. The issues we find could be within individual values in particular fields, or they could involve the overall structure or format of certain features.

Data cleaning is really just a subset of preparation. In it, we try to deal with any inaccuracies and problems that exist in the data. This could include many different things: duplicated data; data stored in the wrong data type, such as a value we need as a floating point number being stored as an integer; incorrect formatting, such as dates stored in the wrong format; corrupted data; missing data; and so on. When we talk about data cleaning, we are trying to address all of those issues, which may involve changing the offending data, or simply removing a record if it is too difficult to repair the missing or corrupted values. Either approach can be taken, and there is not always one right way to do this. Depending on the situation, it may be better to impute a value that corrects or replaces the missing one, or it may be better to simply drop the record with those issues. For example, if a particular feature contains the same incorrect value across many records, it may be fairly easy to correct that problem for all of those records once we have identified the correct value. On the other hand, if the problematic values do not follow any identifiable pattern, that makes them more difficult to correct. And dropping records might be fine if the dataset is large enough.
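To make those cleaning tasks a little more concrete, here is a minimal sketch using pandas. The DataFrame and its column names (order_date, price, quantity) are hypothetical examples, not from any particular dataset; the sketch shows removing duplicates, fixing a wrong data type, reformatting dates, and the two approaches just described: imputing a replacement value versus dropping the record.

```python
import pandas as pd

# Hypothetical example data with the kinds of problems described above:
# a duplicate row, a numeric column stored as strings, dates stored as
# text, and missing/unparseable values.
df = pd.DataFrame({
    "order_date": ["2023-01-05", "2023-01-05", "2023-02-10", None],
    "price":      ["19.99",      "19.99",      "bad",        "7.25"],
    "quantity":   [1, 1, 3, 2],
})

# Remove exact duplicate records.
df = df.drop_duplicates()

# Fix the data type: we need price as a floating point value, not a string.
# Unparseable entries become NaN instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Fix the formatting: convert text dates into proper datetime values.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Approach 1: impute a replacement for a missing or invalid value.
df["price"] = df["price"].fillna(df["price"].median())

# Approach 2: drop records whose values are too difficult to repair.
df = df.dropna(subset=["order_date"])

print(df)
```

Which approach to use for each column is the judgment call discussed next: imputing keeps the record but introduces an estimated value, while dropping keeps only verified data at the cost of losing information.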
But if dropping records would significantly reduce the amount of information our model has available for training, then that approach is probably not desirable either. Because of this, we have to determine which approach is more feasible, and we also want to minimize any negative effects created by adjusting, correcting, or replacing values, or by simply dropping the records that contain the errant values. The process of cleaning data can take a lot of time; in fact, data preparation is often the most time-consuming task in data science, at least from a human perspective. Because of the scale of the task and the challenge at hand, it is sometimes referred to as data wrangling or data munging. When you do these tasks manually, there is a lot of work involved. Thankfully, there are software libraries with functions that can help automate many of these tasks, and those are very valuable when you need to repeat the cleanup process, perhaps when you receive new data, or for another project with another dataset. But remember as well that as we go through all of this data preparation and cleaning, we are making changes to the data that may not be reversible. So, in case you do something that creates an unwanted side effect, it is always a good idea to make sure you have a backup copy of your data before you start to modify it.
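As a rough sketch of both of those points, repeating the cleanup on new data and keeping a backup before modifying anything, the steps from the earlier example can be wrapped in a reusable function. The file names and column names here are hypothetical and only for illustration.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same cleanup steps to any batch of extracted data."""
    df = df.drop_duplicates()
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    return df.dropna(subset=["price", "order_date"])

# Hypothetical file names: keep an untouched backup before modifying anything.
raw = pd.read_csv("extracted_data.csv")
raw.to_csv("extracted_data_backup.csv", index=False)

# Work on a copy so the in-memory original also stays intact,
# then write out the prepared data for the load phase.
cleaned = clean(raw.copy())
cleaned.to_csv("prepared_data.csv", index=False)
```

Wrapping the steps in a function like this makes it easy to rerun the exact same cleanup when a new batch of data arrives, and the backup file means an unwanted side effect can always be undone by starting over from the original extract.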