Now, one of the most critical parts of data analysis happens way before you build your first visualization or your machine learning models. And while not necessarily the most glamorous, data preparation and transformation can't be ignored. In this module, we're going to cover the elements of what makes a dataset good, and then look at two different ways to process, clean, and prepare your data. One, as you might expect, is through SQL in BigQuery. The other is through a graphical user interface in Cloud Dataprep.

The majority of data quality lectures begin with a statement like this: garbage in, garbage out. What that means for our specific examples is that if you want to build amazing machine learning models, you can't feed them garbage. You've got to feed them really good data to begin with. And if that is your end goal, to build awesome ML models after you master this course and the data engineering course after it, then this is your real first step. You need extremely good, clean data before something like an ML model, or even just insights from SQL, can operate at that caliber, because ML specifically relies on consistent patterns in the data. If you start feeding it garbage data, it can overfit your models and cause a whole bunch of other problems you don't want.

So, to prevent that, let's talk about some of the things we can do. High-quality datasets follow five strict rules. They're valid: they conform to business rules. They're accurate: they match some true, objective value. They're complete: you can see the whole picture, and you're not just getting a subset of the data and being fooled. They're consistent: you can derive insights you can actually rely on. And they're uniform. There's a fun example we'll go through for each of these. First up, validity.
Valid data follows constraints on uniqueness. So what do these three images have in common? They're all unique identifiers, right? That telephone represents a unique number where I can reach you and you can reach me. If somebody else had my same number, I'd be worried; I don't even know how that would work. You would call me, and two people would answer? It just wouldn't work. And for the same reason that sounds ludicrous, having multiple physical addresses be the same, or people sharing the same license plate, presents a massive problem. The same principle applies to your data. When we talk later in the course about joining data together, or like we saw earlier with the EIN, where a charitable organization had multiple filings within the same year, that presents complications. When you're summing things and making assumptions about individual records, if they're not unique, you can run into serious trouble, especially when we cover joins later and you unintentionally cross join your data between the two tables.

Second, valid data conforms to range constraints. So here we go: which values are out of range? You've got some dice, and you've got the rolls here, the occurrences. If you're staring at number seven, that could potentially be right. Say you're rolling one die: is it possible to get a seven? No, it's not possible; that's out of bounds. Alternatively, since I did mention dice, say you're rolling two dice. What would be the out-of-range value then? That would be roll number five, with a value of one. If you roll two dice, you can't have one as the total.
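Both validity checks above, uniqueness of an identifier and values falling inside a known range, are easy to express as reusable tests. Here's a minimal sketch in Python with hypothetical data (the EIN values and dice rolls are made up for illustration); in BigQuery SQL, the duplicate check would typically be a `GROUP BY ... HAVING COUNT(*) > 1` query.

```python
def find_duplicates(values):
    """Return the set of values that appear more than once."""
    seen, dupes = set(), set()
    for v in values:
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return dupes

def out_of_range(values, low, high):
    """Return values outside the inclusive [low, high] range."""
    return [v for v in values if not (low <= v <= high)]

# Uniqueness: a duplicate EIN would signal multiple filings for one org.
eins = ["12-3456789", "98-7654321", "12-3456789"]
print(find_duplicates(eins))        # {'12-3456789'}

# Range constraint: two six-sided dice can only total 2 through 12.
rolls = [7, 11, 1, 5, 12]
print(out_of_range(rolls, 2, 12))   # [1]
```

Running checks like these before any aggregation is what keeps a double-counted record from silently skewing your sums.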
So the meta-point here is this: if you know ahead of time that your data smells funny, or if, during the exploratory analysis you perform, you find that, hey, this should always be X or this should always be Y, or it should always fall in one of these ranges, you can set those rules up as tests, much like they do in software development. You can do that within SQL, and you can do it within one of the other tools I'm going to show you, Cloud Dataprep. For any values that make your dataset look weird or give you pause, you can have automatic notifiers that basically say: hey, this dirty data right here, I'm going to go ahead and clean it up and transform it. In this particular case, you could say, well, if the value is one, I want to exclude that roll entirely, or maybe I'll invalidate the results completely, because something is obviously wrong with the dice. There are many things you can do, but setting up validation in an automated, repeatable way is the end goal here.

Moving on: data has to match a source of truth, right? That's what makes it accurate. Say you're in the United States and you're presented with this list: Washington, Oregon, California, Hot Dogs, Florida, and Maine. And you're like, wait a minute, I don't remember a US state called Hot Dogs. It doesn't match what we know as a source of truth. As a string, it's a perfectly valid data value, but against a known lookup source, it doesn't match. And again, there's debate, as you're going to see in your Cloud Dataprep lab, over how many U.S. states there are. Are there 50, as is commonly known, or are you also including U.S. territories, or, you know, post offices that potentially exist in other countries? You'll see that come up again soon. What about this?
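The source-of-truth check above amounts to a membership test against a trusted lookup table. Here's a hedged sketch in Python; the state set is truncated for brevity (a real check would carry all 50 states, plus territories if your business rules include them), and in SQL you'd typically express the same idea as a join against a reference table.

```python
# Trusted lookup set: truncated for illustration, not the full state list.
KNOWN_STATES = {"Washington", "Oregon", "California", "Florida", "Maine"}

def invalid_values(values, lookup):
    """Return the values that don't appear in the trusted lookup set."""
    return [v for v in values if v not in lookup]

incoming = ["Washington", "Oregon", "California", "Hot Dogs",
            "Florida", "Maine"]
print(invalid_values(incoming, KNOWN_STATES))  # ['Hot Dogs']
```

Notice that "Hot Dogs" passes every *type* check, it's a perfectly good string; only the comparison against a source of truth exposes it as inaccurate.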
So we're going to talk about completeness. Thanks in large part to the ubiquity of the image on the right, you'll probably be able to discern that it's actually Big Ben. But here you're comparing a subset, or a sample, of your data and trying to make insights or assumptions about the entirety of your data. Going back to the last slide, you could say, oh, my dataset is just a bunch of funny-looking, low-resolution clocks and lamps. When in reality, the greater picture of your dataset, because pixels are just numbers, right, is much, much more complex, and it's actually, oh no, it's this location, this is Big Ben, this is London. It's funny how we can immediately notice something wrong with a picture; a picture is worth a thousand words. But within a dataset, if someone hands you a CSV file or a bunch of JSON, then unless we've practiced it, we generally don't ask: is this all the data? Is this a subset? Am I missing fields, or rows, or columns? Is this complete, or do I need to do additional work to make sure I'm collecting everything I need to collect? That's especially true when you're mashing up data from a variety of different silos. So don't be afraid to ask those questions. Is my data complete? Do I have everything? Are all the rows accounted for?

Last but not least, we have consistency. Here's the dilemma: you potentially have two owners for a house, across two different tables. The houses table has the address 123 ABC Street with an owner ID of 12. And the owners table has owner ID 15, but their address is 123 ABC Street. In Database 101, we call this a referential integrity issue. So who actually owns that house? Whose name is on the deed? Having consistent data, what we call one fact in one place, gives you harmony across many different systems, especially when it comes to joining the data later.
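The referential integrity problem in the houses-and-owners example can be caught mechanically: look for rows on one side whose foreign key has no match on the other. Here's a minimal Python sketch using the hypothetical records from the slide; in SQL, this is the classic `LEFT JOIN ... WHERE right_side IS NULL` anti-join pattern.

```python
# Hypothetical tables from the slide: the house points at owner 12,
# but the owners table only knows about owner 15.
houses = [{"address": "123 ABC Street", "owner_id": 12}]
owners = [{"owner_id": 15, "address": "123 ABC Street"}]

owner_ids = {o["owner_id"] for o in owners}

# Houses whose owner_id has no matching row in the owners table.
orphans = [h for h in houses if h["owner_id"] not in owner_ids]
print(orphans)  # [{'address': '123 ABC Street', 'owner_id': 12}]
```

An empty `orphans` list is what "one fact in one place" looks like in practice; anything else means two systems disagree about who owns the house.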
You don't have any issues. Oh, I lied, that wasn't the last one. Uniformity is the last one, because this is the example I really like to share. Just before the year 2000, in 1999, NASA unfortunately lost its $125 million Mars Climate Orbiter, which burned up in the Martian atmosphere because of an issue with one team using the English system of feet and yards, while another team, working on a different part of the system, used the metric system. And while it might have seemed like an innocuous problem to begin with, you know, English versus metric, who's right?, at the end of the day, when you're building a system as complex as this, you need to make sure that everyone is using a uniform system of measurement.
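The cheapest defense against the orbiter's failure mode is to tag every measurement with its unit and normalize everything to one uniform unit before combining. A small sketch, with hypothetical readings and field layout (the conversion factors are the standard ones):

```python
# Conversion factors to the chosen uniform unit, meters.
TO_METERS = {"m": 1.0, "ft": 0.3048, "yd": 0.9144}

def normalize(value, unit):
    """Convert a length to meters; an unknown unit fails loudly (KeyError)."""
    return value * TO_METERS[unit]

# Readings from different teams, each carrying its unit.
readings = [(100, "ft"), (50, "yd"), (25, "m")]
total_m = sum(normalize(v, u) for v, u in readings)
print(round(total_m, 2))  # 101.2
```

The key design choice is that a value never travels without its unit, so a mixed-unit sum is impossible to write by accident, and an unrecognized unit raises an error instead of silently corrupting the total.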