Next, we'll discuss how and why you should consider processing the same Hadoop job code in the cloud using Dataproc on Google Cloud. Dataproc lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. Compared to traditional on-premises products and competing cloud services, Dataproc has unique advantages for clusters of three to hundreds of nodes. There is no need to learn new tools or APIs to use Dataproc, making it easy to move existing projects into Dataproc without redevelopment. Spark, Hadoop, Pig, and Hive are frequently updated.

Here are some of the key features of Dataproc. Low cost: Dataproc is priced at one cent per virtual CPU per cluster per hour, on top of the other Google Cloud resources you use. In addition, Dataproc clusters can include preemptible instances that have lower compute prices. You use and pay for things only when you need them, so Dataproc charges second-by-second billing with a one-minute minimum billing period. Super fast: Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less on average. Resizable clusters: clusters can be created and scaled quickly with a variety of virtual machine types, disk sizes, numbers of nodes, and networking options. Open source ecosystem: you can use Spark and Hadoop tools, libraries, and documentation with Dataproc. Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive, so there is no need to learn new tools or APIs, and it is possible to move existing projects or ETL pipelines without redevelopment. Integrated: built-in integration with Cloud Storage, BigQuery, and Cloud Bigtable ensures data will not be lost.
This, together with Cloud Logging and Cloud Monitoring, provides a complete data platform and not just a Spark or Hadoop cluster. For example, you can use Dataproc to effortlessly ETL terabytes of raw log data directly into BigQuery for business reporting. Managed: easily interact with clusters and Spark or Hadoop jobs, without the assistance of an administrator or special software, through the Cloud Console, the Cloud SDK, or the Dataproc REST API. When you're done with a cluster, simply turn it off so money isn't spent on an idle cluster. Versioning: image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools. Highly available: run clusters with multiple primary nodes and set jobs to restart on failure to ensure your clusters and jobs are highly available. Developer tools: multiple ways to manage a cluster, including the Cloud Console, the Cloud SDK, RESTful APIs, and SSH access. Initialization actions: run initialization actions to install or customize the settings and libraries you need when your cluster is created. And automatic or manual configuration: Dataproc automatically configures hardware and software on clusters for you, while also allowing for manual control.

Dataproc has two ways to customize clusters: optional components and initialization actions. Preconfigured optional components can be selected when deploying from the Console or via the command line, and include Anaconda, Hive WebHCat, Jupyter Notebook, Zeppelin Notebook, Druid, Presto, and Zookeeper. Initialization actions let you customize your cluster by specifying executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Here's an example of how you can create a Dataproc cluster using the Cloud SDK, specifying a bash shell script to run on the cluster's initialization.
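The command described above might look something like the following sketch. The cluster name, region, bucket, and script path are all placeholders; the initialization script would be a bash script you've uploaded to Cloud Storage.

```shell
# Sketch: create a Dataproc cluster that runs a bash initialization
# script on every node after setup. All names here are placeholders.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-masters=3 \
    --num-workers=4 \
    --initialization-actions=gs://my-bucket/init-script.sh
```

Note the `--num-masters` and `--num-workers` flags, which control the number of master and worker nodes in the cluster.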
There are a lot of prebuilt startup scripts that you can leverage for common Hadoop cluster setup tasks, like Flink, Jupyter, and more. You can check out the GitHub repo link in the course resources to learn more. Do you see the additional parameters for the number of master and worker nodes in the script?

Let's talk more about the architecture of the cluster. A Dataproc cluster can contain either preemptible secondary workers or non-preemptible secondary workers, but not both. The standard setup architecture is much like you would expect on-premises: you have a cluster of virtual machines for processing, and persistent disks for storage via HDFS. You've also got your primary node VMs and a set of worker nodes. Worker nodes can also be part of a managed instance group, which is just another way of ensuring that the VMs within that group are all created from the same template. The advantage is that you can spin up more VMs than you need and automatically resize your cluster based on demand. It also only takes a few minutes to upgrade or downgrade your cluster.

Generally, you shouldn't think of a Dataproc cluster as long-lived. Instead, you should spin one up when you need compute processing for a job and then simply turn it down, although you can also persist clusters indefinitely if you want to. What happens to HDFS storage on disk when you turn those clusters down? The storage goes away too, which is why it's a best practice to use storage that's off-cluster by connecting to other Google Cloud products. Instead of using native HDFS on a cluster, you could simply use buckets on Cloud Storage via the HDFS connector. It's pretty easy to adapt existing Hadoop code to use Cloud Storage instead of HDFS: change the prefix for this storage from hdfs:// to gs://. What about HBase off-cluster? Consider writing to Cloud Bigtable instead. What about large analytical workloads?
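The prefix change above can be illustrated with a small, runnable sketch. The cluster and bucket names are hypothetical; the point is that only the URI scheme and authority change, while the rest of the path stays the same.

```shell
# Sketch: rewriting an HDFS URI to the equivalent Cloud Storage URI.
# "my-cluster" and "my-bucket" are placeholder names.
echo "hdfs://my-cluster/logs/2024/part-0000" \
    | sed 's|^hdfs://[^/]*|gs://my-bucket|'
# → gs://my-bucket/logs/2024/part-0000
```

In practice you would make the same substitution in your job configuration or code wherever input and output paths are defined.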
Consider reading that data into BigQuery and doing those analytical workloads there.

Using Dataproc involves this sequence of events: setup, configuration, optimization, utilization, and monitoring. Setup means creating a cluster, and you can do that through the Cloud Console or from the command line using the gcloud command. You can also export a YAML file from an existing cluster or create a cluster from a YAML file. You can create a cluster from a Terraform configuration, or use the REST API. The cluster can be set up as a single VM, which is usually done to keep costs down for development and experimentation; standard has a single primary node, and high availability has three primary nodes. You can choose a region and zone, or select the global region and allow the service to choose the zone for you. The cluster defaults to a global endpoint, but defining a regional endpoint may offer increased isolation and, in certain cases, lower latency.

The primary node is where the HDFS NameNode runs, as well as the YARN resource manager and job drivers. HDFS replication defaults to two in Dataproc. Optional components from the Hadoop ecosystem include Anaconda (a Python distribution and package manager), Hive WebHCat, Jupyter Notebook, and Zeppelin Notebook. Cluster properties are runtime values that can be used by configuration files for more dynamic startup options, and user labels can be used to tag the cluster for your own solutions or reporting purposes. The primary node, worker nodes, and preemptible worker nodes, if enabled, have separate VM options such as vCPU, memory, and storage. Preemptible nodes include the YARN NodeManager, but they do not run HDFS. There is a minimum number of worker nodes; the default is two. The maximum number of worker nodes is determined by a quota and the number of SSDs attached to each worker. You can also specify initialization actions, such as initialization scripts, that can further customize the worker nodes.
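The YAML export and import workflow mentioned above might look like this sketch; the cluster names, region, and file path are placeholders.

```shell
# Sketch: capture an existing cluster's configuration as YAML,
# then create a new cluster from it. Names are placeholders.
gcloud dataproc clusters export my-cluster \
    --region=us-central1 \
    --destination=my-cluster-config.yaml

gcloud dataproc clusters import new-cluster \
    --region=us-central1 \
    --source=my-cluster-config.yaml
```

This is a convenient way to reproduce a cluster configuration, for example when treating clusters as ephemeral, job-scoped resources.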
And metadata can be defined so that the VMs can share state information.

Preemptible VMs can be used to lower costs. Just remember that they can be pulled from service at any time and within 24 hours, so your application might need to be designed for resilience to prevent data loss. Custom machine types allow you to specify the balance of memory and CPU to tune the VM to the load, so you're not wasting resources. A custom image can be used to pre-install software, so that it takes less time for the customized node to become operational than if you installed the software at boot time using an initialization script. You can also use a persistent SSD boot disk for faster cluster startup.

Jobs can be submitted through the Cloud Console, the gcloud command, or the REST API. They can also be started by orchestration services such as Dataproc workflow templates and Cloud Composer. Don't use Hadoop's direct interfaces to submit jobs, because the metadata will not be available to Dataproc for job and cluster management; for security, they are disabled by default. By default, jobs are not restartable; however, you can create restartable jobs through the command line or REST API. Restartable jobs must be designed to be idempotent and to detect successorship and restore state.

Lastly, after you submit your job, you'll want to monitor it. You can do so using Cloud Monitoring, or you can build a custom dashboard with graphs and set up alerting policies to send notifications, by email for example, if incidents happen. Any details from HDFS or YARN, metrics about a particular job, or overall metrics for the cluster like CPU utilization, disk, and network usage can all be monitored and alerted on with Cloud Monitoring.
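Submitting a restartable job from the command line might look like the following sketch; the cluster name, region, and job arguments are placeholders, and `--max-failures-per-hour` is the flag that makes the job restartable on failure.

```shell
# Sketch: submit a Spark job to an existing Dataproc cluster and
# allow it to restart on failure. All names are placeholders.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --max-failures-per-hour=5 \
    -- 1000
```

Because a restart simply re-runs the job, the job itself must be idempotent for this to be safe, as described above.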