Welcome back. This is the third course in the data analyst specialization. Here we're going to achieve advanced insights in BigQuery. In the first module, we continue our journey in SQL by delving into a few of the more advanced concepts, like statistical approximation functions and user-defined functions. Then we will explore how to break apart really complex data questions into step-by-step, modular pieces in SQL by using common table expressions and subqueries.

I'll start by revisiting the SQL functions that we've covered so far. It's time to pick up some of those functions that we skipped over in previous courses as we move into more and more advanced SQL concepts. The three on the agenda are: statistical functions, which include some pretty fun approximations; analytical functions, if you've heard of things like window functions or partitioning within a query; and last but not least, UDFs, or user-defined functions, which can be written in SQL or JavaScript. Let's tackle them one at a time.

First up, some pretty generic statistical functions that you can run on your data, like a standard deviation or a correlation analysis. This is a great example of applying some basic statistics. Here we're looking at the IRS dataset, and we're getting the standard deviation of the employee counts. For the second one, a correlation, we're asking: how correlated is the revenue that one of these charities brings in to the amount of functional expenses it's actually spending on its charitable programs and offerings? That's the ultimate question: how correlated do you think those two are? The closer it is to one, naturally, the tighter that one-to-one correlation. Just running that SQL there, you'll see the standard deviation is about 1,500 employees. Then you can see that revenue and expenses for these non-profit charities are very tightly correlated.
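The statistics described above can be sketched in a single query. This is a hedged example, not the exact query from the slide: the table and column names (`irs_990_2015`, `noemplyeesw3cnt` for the employee count, `totrevenue`, and `totfuncexpns`) assume the schema of the IRS 990 public dataset on BigQuery.

```sql
-- Sketch: basic statistical functions on the IRS 990 (2015) public dataset.
-- Column names are assumptions based on the IRS 990 schema.
SELECT
  STDDEV(noemplyeesw3cnt)       AS stdev_employee_count,   -- spread of employee counts
  CORR(totrevenue, totfuncexpns) AS revenue_expense_corr   -- -1.0 to 1.0; near 1 = tightly correlated
FROM
  `bigquery-public-data.irs_990.irs_990_2015`;
```

CORR returns the Pearson coefficient, so a result near 1.0 is the tight one-to-one relationship described above.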
Again, this is for all of those records in the 2015 file and dataset, and with that 0.97, they're extremely highly correlated. That's enough for the basic statistical functions. Let's move into some of the more otherworldly ones.

Here are some approximate aggregation functions. Say you wanted an approximate count over your dataset, and you have an absurd number of rows. What you can use is APPROX_COUNT_DISTINCT. Much like you've seen before, with the COUNT DISTINCT there in that second line you can get the exact count, but you can also get the approximation, which can be good enough to get the job done. Now you might be thinking in the back of your head: well, COUNT DISTINCT ran in under five seconds on the dataset we were using in our labs. Why on earth would I ever want to use APPROX_COUNT_DISTINCT instead of getting the exact count? Coming from a finance background myself, I'd want everything down to the penny. But imagine a scenario where you work at Google and you have to count all of the logs, or all the user logins, over a multi-year period. You're talking about petabytes of data, and you don't have all the time in the world. You want to sacrifice a little bit of accuracy for the speed with which you get your query back. Especially when you're counting all the logs that are processed or all the ads that are served, it's very popular to use that APPROX_COUNT_DISTINCT.

Let's look at an example. This one comes from one of our BigQuery Developer Advocates, Steven; he's an amazing SQL writer. What we're going to invoke here is APPROX_COUNT_DISTINCT. The task at hand: the GitHub dataset is public on BigQuery, and it includes all of the user logins and all the public code on GitHub. It's a very fun public dataset to play around with. Here, what we're doing is counting the approximate number of logins as an approximate value.
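A hedged sketch of that kind of approximate count, per year, might look like the following. The project, dataset, and field names (`githubarchive.year.*`, `actor.login`) are assumptions about the GitHub Archive public tables, not necessarily Steven's exact query.

```sql
-- Sketch: approximate distinct user logins per year over wildcard tables.
-- Table and field names are assumptions about the GitHub Archive dataset.
SELECT
  _TABLE_SUFFIX AS year,                              -- the year portion of each matched table
  APPROX_COUNT_DISTINCT(actor.login) AS approx_users  -- fast estimate instead of COUNT(DISTINCT ...)
FROM
  `githubarchive.year.*`                              -- one table per year, unioned via the wildcard
GROUP BY
  year
ORDER BY
  year;
```

Swapping `APPROX_COUNT_DISTINCT(actor.login)` for `COUNT(DISTINCT actor.login)` gives the exact figure, at the cost of a slower, more resource-hungry query on data this large.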
What we're looking at is seeing it by year. Here you'll see those concepts that we covered in previous courses: the union table wildcard, that asterisk from Steven, and then the concatenation with that _TABLE_SUFFIX, again one of the concepts we covered in the merging datasets topic. We now have an approximate count of users.

But what you might be asking in the back of your mind is: how accurate is that compared to the actual count? And say we wanted a more realistic count of all of the users over that period of time, because this data right here could include the same user logged in across multiple years. How do we parse that out? To do that takes a little bit more of a complex query, probably the most complex one that you've seen to date as part of this specialization. Let's take a deep breath and just walk through it step by step, line by line.

The first thing you're going to notice is that we're invoking what's called a common table expression all the way at the top. We're going to cover that WITH clause shortly, but in essence you can think of it as a temporary table. It's basically saying: anything within the parentheses of that WITH clause, the GitHub year sketches, that SELECT statement, we're going to store like a temporary table that we can use in a subsequent query. The reason you do that, again, is to break apart a very complex question into multiple steps. You can do a processing layer on the data and then query those results within that same query. Those WITH statements, or common table expressions as they're called, are extremely useful.

Let's get into the guts of the query. In that first part, what we want to do is get the approximate number of logins. We're going to invoke a function called HyperLogLog for another approximation. What that's going to do is also estimate the number of logins, just using a different function that's available.
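Before the full HyperLogLog query, here is a minimal sketch of the WITH clause pattern on its own, using the IRS dataset from earlier. The column names (`ein`, `totrevenue`) are assumptions about that dataset's schema; the point is the shape of the query, not the specific fields.

```sql
-- Sketch: a WITH clause (common table expression) acting as a temporary table.
-- The named query runs first; the final SELECT then queries its results.
WITH high_revenue_orgs AS (
  SELECT
    ein,          -- charity identifier (assumed column name)
    totrevenue
  FROM
    `bigquery-public-data.irs_990.irs_990_2015`
  WHERE
    totrevenue > 1000000   -- the "processing layer": filter first
)
SELECT
  COUNT(*) AS org_count    -- then aggregate over the temporary result
FROM
  high_revenue_orgs;
```

The CTE exists only for the lifetime of the query, which is what lets you break one complex question into readable, modular steps.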
There's a long white paper written on HyperLogLog. It's an externally available algorithm, and it's used all the time in things like statistical estimation as well. Once we've done that, the great thing is that we can estimate year over year over year. We're doing this for however many years of GitHub data we have; on the previous slide it was roughly 2000 to current, and we're getting an estimate for each year. In the HyperLogLog namespace, those estimates are actually called sketches. We're getting a collection of sketches, which in a sense are estimates of counts.

In that second block of code at the bottom, we can add those sketches together, summing them up. You can do that through HLL_COUNT.MERGE. That merges all those estimations together into an approximate number of unique users. As you see in the last line there, it's selecting from that common table expression, the GitHub year sketches that we created earlier. There's a lot to cover, but at the end of the day we ended up with approximately 11 million unique GitHub users.

Then in a different query, which we're not showing here, we ran it to get the exact count, and we found that this approximation was 99.7 percent accurate. Again, here's the trade-off: whether you want 100 percent accuracy and have your queries run a little bit longer, or you're fine with just executing the approximation functions. If you see these in code along the way, or you start writing these approximations and estimations on very large datasets, just know that they're out there and available for you to use. Feel free to read a blog article for a little bit more detail on that. Approximation functions are another tool in your toolkit for very, very large datasets. Feel free to get familiar with them, and we'll come back to those WITH clauses a little bit more.
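Putting the two steps together, the sketch-then-merge pattern described above can be sketched as follows. `HLL_COUNT.INIT` and `HLL_COUNT.MERGE` are BigQuery's HyperLogLog functions; the table and field names (`githubarchive.year.*`, `actor.login`) and the CTE name are assumptions, not the exact query from the slide.

```sql
-- Sketch: two-step HyperLogLog pattern.
-- Step 1 (the CTE): build one HLL sketch of distinct logins per year.
-- Step 2: merge the yearly sketches into a single de-duplicated estimate,
-- so a user active in several years is counted only once.
WITH github_year_sketches AS (
  SELECT
    _TABLE_SUFFIX AS year,
    HLL_COUNT.INIT(actor.login) AS login_sketch  -- a sketch, not a number
  FROM
    `githubarchive.year.*`
  GROUP BY
    year
)
SELECT
  HLL_COUNT.MERGE(login_sketch) AS approx_unique_users
FROM
  github_year_sketches;
```

Merging sketches is much cheaper than re-scanning the raw data, which is why this pattern shows up whenever distinct counts must be combined across partitions or time periods.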