Hi, I'm Evan Jones, a Technical Curriculum Developer here at Google. Welcome to the chapter on building a data lake. Let's start with a discussion about what data lakes are, and then where they fit in as a critical component of your overall data engineering ecosystem.

So what is a data lake? Well, it's a fairly broad term, but it generally describes a place where you can securely store various types of data, of all scales, for processing and analytics. Data lakes are typically used to drive data analytics, data science and ML workloads, or batch and streaming pipelines. Data lakes will accept all types of data. Finally, data lakes are portable, on-premises or in the cloud.

Now, here's where data lakes fit into the overall data engineering ecosystem for your team. You have to start with some originating system or systems that are the source of all of your data. Those are your data sources. Then, as a data engineer, you need to build reliable ways of retrieving and storing that data. Those are your data sinks. The first line of defense in an enterprise data environment is your data lake. Again, it's the central "give me whatever data you have, in any format, volume, and velocity, and I'll take care of it" place. We'll cover the key considerations and options for building a data lake in this module.

Once your data is off the source systems and inside of your environment, generally a ton of cleanup and processing is required to massage that data into a useful format for your business. It will then likely end up in your data warehouse; that's the focus of our next module. What actually performs the cleanup and processing of data? Those are your data pipelines. They're responsible for doing the transformations and processing on your data at scale, and they bring your entire system to life with fresh, newly processed data available for analysis.

Now, an additional abstraction layer above your pipelines is what I like to call the entire workflow. You'll often need to coordinate efforts between many of the different components at a regular or an event-driven cadence. While your data pipeline may process data from your lake to your warehouse, your overall orchestration workflow may be the one responsible for kicking off that data pipeline in the first place, when it notices that there's a new raw data file available from one of your sources.

Before we move into which cloud products fit each of these roles, I want to leave you with an analogy that has helped me disambiguate each of these components. All right, just for a second, take off your data engineering hat and put on a civil engineering hat for a moment. You're tasked with building an amazing skyscraper in a downtown city somewhere. Before you break ground, you need to ensure that you have all the raw materials you're going to need to achieve your end objective. Sure, some materials can be sourced later in the project, but let's keep this example simple and say you have all the materials on your job site to begin with. The act of bringing the steel, the concrete, the water, the wood, the sand, the glass, whatever it is, from whichever sources elsewhere in the city onto your construction site is analogous to data coming from all of those different source systems into your data lake. Great, now you have a ton of raw materials, but you can't use them as-is to build your building.
You've got to cut the wood, corrugate the metal, measure and cut the glass, and format everything before it's suited for the purpose of building your building. The end result, that cut glass and shaped metal, is the formatted data that's then stored inside of your data warehouse. It's ready to be used to directly add value to your business, which in our analogy is actually building the building itself.

How do you transform those raw materials into useful pieces? Continuing with the analogy, on the construction site that's the job of the worker. As you'll see later when we talk about data pipelines, it's actually pretty funny: the individual unit behind the scenes on Cloud Dataflow is literally called a worker, and a worker is just a virtual machine that takes some small piece of data and transforms that piece for you.

And you might be asking, what about the building itself? Well, that's whatever end goal or goals you have for this engineering project. In the data engineering world, the shiny new building could be a brand new analytical insight that wasn't possible before, or a machine learning model, or whatever else you want to achieve now that you have that cleaned data available.

The last piece of the analogy is the orchestration layer. On a construction site, you generally have a manager or supervisor who directs when work is to be done, and if there are any dependencies they could say: hey, once that new metal gets here, send it to this area of the site for cutting and shaping, and then alert this other team that it's available for them to start building with. In the data engineering world, that's your orchestration layer, or your overall workflow. So you might say: every time a CSV file, or any new data, drops into this Google Cloud Storage bucket, I want you to automatically pass it to our data pipeline for processing, and once it's done processing, I want the data pipeline to stream it into our data warehouse. And we're not done yet: once it's in the data warehouse, I want you to notify the machine learning model that new, cleaned training data is now available for training and retraining, and then I can direct it to start training a new model version. Can you see, mentally, the graph of actions that we're building out here? What if one step fails, or what if you want to run this every day, or every hour, or triggered on an event? You're beginning to see the need for an orchestrator, which in our solutioning will be Apache Airflow running in a Cloud Composer environment later.

Let's bring back one example solution architecture diagram that you saw earlier in the course. The data lake here is Google Cloud Storage buckets, right in the center of that diagram. It's your consolidated location for raw data, and it's durable and highly available. In this example our data lake is those Google Cloud Storage buckets, but that does not mean that Google Cloud Storage is your only option for data lakes on GCP. I'll say it again: Cloud Storage is one of a few good options to serve as a data lake, but it's not the only one. In other examples we'll look at, BigQuery may be both your data lake and your data warehouse, and you're not using Google Cloud Storage buckets at all. This is why it's so important to first understand what you want to do, and then find which of the solutions best meets your needs.
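Coming back to the orchestration workflow I described a moment ago, here's a minimal sketch of what that could look like as an Apache Airflow DAG running in Cloud Composer. The bucket name, the Dataflow template path, and the retraining step are all hypothetical placeholders for illustration, not the solution you'll build in the labs.

```python
# A minimal Cloud Composer / Apache Airflow DAG sketch of the workflow described
# above. Bucket, template, and table names are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="raw_csv_to_warehouse",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # or None if you prefer to trigger the DAG from an event
    catchup=False,
) as dag:

    # 1. Watch the data lake bucket for a newly arrived raw file.
    wait_for_raw_file = GCSObjectExistenceSensor(
        task_id="wait_for_raw_file",
        bucket="my-raw-data-lake",        # hypothetical bucket
        object="incoming/orders.csv",     # hypothetical object
    )

    # 2. Kick off the Dataflow pipeline that cleans the raw data and writes it
    #    into the data warehouse (BigQuery).
    run_pipeline = DataflowTemplatedJobStartOperator(
        task_id="run_cleaning_pipeline",
        template="gs://my-templates/clean_and_load",  # hypothetical template
        parameters={"input": "gs://my-raw-data-lake/incoming/orders.csv"},
        location="us-central1",
    )

    # 3. Placeholder for notifying or starting model retraining once fresh,
    #    cleaned training data has landed in the warehouse.
    trigger_retraining = EmptyOperator(task_id="trigger_model_retraining")

    wait_for_raw_file >> run_pipeline >> trigger_retraining
```

The important part is the dependency line at the bottom: the sensor gates the pipeline, and the pipeline gates the retraining step. That's exactly the graph of actions described above, with retries and scheduling handled by the orchestrator instead of by hand.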
Regardless of which tools and technologies you use in the cloud, your data lake generally serves as that single consolidated place for all of your raw data. I like to think of it as a durable staging area: everything gets collected here and then sent out elsewhere. Now, this data may end up in many other different places, like a transformation pipeline that cleans it up and moves it to the data warehouse, where it's then read by a machine learning model, but it all starts with getting that data into your data lake first.

And now let's do a quick overview of some of the core Google Cloud big data products that you need to know as a data engineer, and that you'll get hands-on practice with inside your labs later. Here is a list of the big data and ML products, organized by where you will likely find them in a typical big data processing workflow: from storing the data on the left, to ingesting it into your cloud-native tools for analysis, training machine learning models, and ultimately serving up some kind of insights. In this data lake module, we'll focus on two of the foundational storage products which can make up your data lake: Google Cloud Storage, and Cloud SQL if you're using relational data. Later on in the course, you will practice with Cloud Bigtable as well, if you want to do high-throughput streaming pipelines. You may have been surprised, as I was when I first started learning Google Cloud Platform, to not see BigQuery in the storage column; generally, BigQuery is used as a data warehouse.

So let's remind ourselves: what's the core difference between a data lake and a data warehouse, then? A data lake is essentially that place where you've captured every aspect of your business's operations, raw. Because you want to capture every aspect, you tend to store the data in its natural, raw format, and that's whatever format is thrown out by your applications. You might have a log file, and all those log files and other raw data files will get jammed together inside of your data lake. You can basically store anything that you want, in any format, with all the flexibility that you want, so you tend to store things like object blobs or files. The advantage of the data lake's flexibility as the central collection point is also its problem. With a data lake, the data format is very much driven by the application that writes the data in, and it's whatever format that ends up as. The advantage is that whenever the application gets upgraded, it can start writing the new data immediately, because the lake is just a capture of whatever raw data exists. But how do you take this flexible and large amount of raw data and ultimately do something useful with it for your business?

Enter the data warehouse. The data warehouse, on the other hand, is much more thoughtful than a data lake. You might load the data into the data warehouse only after you have a schema clearly defined and a use case identified, so there's no garbage data in the data warehouse. You might take the raw data that exists in the data lake, transform it, organize it, process it, clean it up, and then store it as immediately useful data inside of your warehouse. Why load it into the data warehouse? Well, maybe because the data in the data warehouse is used to generate charts, reports, and dashboards, or as a back-end for your machine learning models. Whatever the use case for your business is, that data is immediately able to be used from the warehouse.
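To make that lake-versus-warehouse contrast a bit more concrete, here's a small sketch using the Python clients for Cloud Storage and BigQuery. The bucket, dataset, table, and file names are hypothetical, and the schema is just an illustrative example, not something from the course labs.

```python
# A small sketch contrasting the two sides described above.
# Bucket, dataset, table, and file names are hypothetical.
from google.cloud import bigquery, storage

# Data lake side: drop the raw file into Cloud Storage exactly as the
# application emitted it -- no schema, no cleanup required up front.
storage.Client().bucket("my-raw-data-lake").blob(
    "logs/2024-01-01/app.log"
).upload_from_filename("app.log")

# Data warehouse side: load only after the schema and use case are defined.
bq = bigquery.Client()
load_job = bq.load_table_from_uri(
    "gs://my-curated-data/orders_cleaned.csv",   # output of the cleaning pipeline
    "my_project.analytics.orders",               # dataset.table in the warehouse
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        schema=[
            bigquery.SchemaField("order_id", "STRING"),
            bigquery.SchemaField("order_date", "DATE"),
            bigquery.SchemaField("amount", "NUMERIC"),
        ],
    ),
)
load_job.result()  # wait for the load job to finish
```

Notice the asymmetry: the raw file goes into the lake exactly as it arrived, while the warehouse load forces you to commit to a schema and a use case up front.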
Now the idea is that because the schema is consistent and shared across all the applications, someone like a data scientist or data analyst could go right in and derive insights much, much faster. So a data warehouse tends to store structured and semi-structured data that's organized and placed into a format that makes it conducive to immediate querying and analysis.
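And to show what "immediate querying and analysis" means in practice, here's a hedged sketch of an analyst querying that same hypothetical warehouse table with the BigQuery Python client; the project, dataset, and column names carry over from the illustrative example above.

```python
# Hypothetical follow-on to the load above: once cleaned data is in the
# warehouse, an analyst can query it directly, with no preprocessing needed.
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query(
    """
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM `my_project.analytics.orders`
    GROUP BY order_date
    ORDER BY order_date
    """
).result()

for row in rows:
    print(row.order_date, row.daily_revenue)
```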