[SOUND] In this lesson we are going to cover Apache Spark. Before talking about Spark, though, I would like to talk about the motivation behind Spark and the sort of computations that it tries to cover. It's not the only framework that does this sort of thing, and I'll mention a bunch of others, but Spark is the big one these days. The goal of Spark is to cover iterative algorithms and interactive data exploration. These are commonly used in a lot of domains today: data science and scientific computing do a lot of data exploration, and iterative algorithms show up in many, many domains, so we want to be able to cover them well.

Now, traditional MapReduce, and by traditional I mean the things we talked about in our second week of lectures, Hadoop, MapReduce, that sort of framework, cannot efficiently handle iterative algorithms. The question is why. Consider something like Hadoop, and suppose you want to write an iterative algorithm. What do I mean by iterative? The same algorithm run once, and then run again on the same data set, or an almost identical data set, again and again, until you reach an answer. If you write this on top of Hadoop, what you will end up doing is repeated access to data that is stored on HDFS. That means repeated access, in each iteration, to hard drives, to physical, slow, high-latency devices. And Hadoop doesn't provide any optimization for caching the data. It says: this is one MapReduce job, it starts from HDFS, it ends in HDFS. It doesn't support any sort of data caching between iterations. I'll talk about this a little bit more in my upcoming slide. Now let's say you take just bare-bones MPI instead. There is no natural support for fault tolerance within an MPI program, and the programming interface is quite complicated. The programmer, meaning you, has to consider all of the implications of writing the application, whether it's reliability, whether it's fault tolerance, all sorts of things that you, the programmer, have to handle yourself. So none of these traditional frameworks handles this class of algorithms in a good way.

So what if you try to go ahead and retrofit iterations on top of MapReduce? Out of the box it doesn't support iterations, but we still want iterative algorithms. What was one iterative algorithm, quite famous, that we covered in one of our lectures before? PageRank, right? PageRank, you remember, consisted of two different MapReduce jobs: there was the propagation phase, and then there was the aggregation phase. You would run the first job, then the second job, then the first job again, then the second job, and keep doing that. And that was actually the first use case for Hadoop MapReduce itself, because, remember, Hadoop actually came out of the Nutch project, which was doing exactly that, running the PageRank algorithm. So iterative algorithms have been retrofitted on top of MapReduce frameworks: as I mentioned, Apache Nutch, the indexing and search engine, or Apache Mahout, which does a lot of iterative computations. The obvious solution, if you want to implement this on top of MapReduce, is to split the iteration into multiple MapReduce jobs and write a driver program for orchestration: one job, then another job, then another job. Each job is really the same thing, just with a slightly modified input data set. Why slightly modified? Because the last iteration modified it slightly and brought it a little bit closer to the answer.
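To make that driver-program idea concrete, here is a minimal sketch of what such an orchestration loop could look like against Hadoop's Job API, using PageRank's two phases as the example. The mapper and reducer class names (PropagationMapper, AggregationReducer, and so on) and the HDFS paths are hypothetical placeholders for illustration, not anything Hadoop or the lecture provides; a real driver would also set key/value types and clean up intermediate outputs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String ranks = "/pagerank/iter0";              // initial ranks, already on HDFS

    for (int i = 0; i < 20; i++) {                 // fixed iteration count, for simplicity
      String contrib = "/pagerank/contrib" + i;    // intermediate output of job 1
      String next    = "/pagerank/iter" + (i + 1); // output of job 2 = input of next round

      // Job 1: propagation phase -- each page sends rank contributions along its links.
      Job propagate = Job.getInstance(conf, "pagerank-propagate-" + i);
      propagate.setJarByClass(PageRankDriver.class);
      propagate.setMapperClass(PropagationMapper.class);   // hypothetical user classes
      propagate.setReducerClass(PropagationReducer.class);
      FileInputFormat.addInputPath(propagate, new Path(ranks));
      FileOutputFormat.setOutputPath(propagate, new Path(contrib));
      if (!propagate.waitForCompletion(true)) System.exit(1);

      // Job 2: aggregation phase -- sum the incoming contributions into new ranks.
      Job aggregate = Job.getInstance(conf, "pagerank-aggregate-" + i);
      aggregate.setJarByClass(PageRankDriver.class);
      aggregate.setMapperClass(AggregationMapper.class);   // hypothetical user classes
      aggregate.setReducerClass(AggregationReducer.class);
      FileInputFormat.addInputPath(aggregate, new Path(contrib));
      FileOutputFormat.setOutputPath(aggregate, new Path(next));
      if (!aggregate.waitForCompletion(true)) System.exit(1);

      ranks = next;  // every iteration starts from HDFS and ends back on HDFS
    }
  }
}
```

Notice that nothing in this loop tells Hadoop that most of the data never changes: every pass launches two fresh jobs and rereads its input from HDFS.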
So let's take a look at what would be involved in running an iterative algorithm on top of MapReduce. The sort of algorithm we're talking about, you can see here. Inside the basic loop we iterate, and in each iteration we do, say, a Map phase and a Reduce phase, which together become one MapReduce job. In the map, basically what you do is run a certain computation over all the input elements, and then in the reduce you do a reduction. And you basically repeat this until convergence is achieved. So: job one, job two, job three, all the same. What happens in each iteration is that you use the dataset inside your mapper functions, come up with a slightly modified answer, hopefully a little bit closer to the final answer, then write the answer back and repeat.

And there are many problems in this model. You have repeated reads of constant data, basically data that doesn't change at all from iteration to iteration. Typically it's the larger of the two kinds of data: you have the constant data set, and you have a small amount of data that changes from iteration to iteration. Each time, you have to read the constant data set again and again. For each iteration you have one MapReduce job, so there is the overhead of launching every MapReduce job. In our previous weeks, when you were running the programming assignments, if you were adventurous and just tried to see what happens with an empty map task and an empty reduce task, you would have seen that, even with no computation, launching a MapReduce job in Hadoop takes 15 or 20 seconds to do, basically, nothing. So when you have an iterative set of MapReduce jobs, in each iteration you have twenty seconds of nothingness happening. Well, of course, there are lots of things happening for the framework, but for the user's computation, nothing happens. Aside from that, with MapReduce there is a lot of intermediate communication, remember, there is a shuffle and sort phase, and that puts a lot of traffic on the network subsystem of the cluster. And each time around you do almost the same thing, again and again.

So, now that you've seen all of the issues in implementing an iterative computation, or an interactive computation for that matter, on top of MapReduce or a MapReduce-like framework such as Hadoop or YARN, let's see what we have for solving these issues. In the next video I will introduce the Apache Spark project, which allows you to do lots of cool things with interactive and iterative algorithms. [MUSIC]
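To show the convergence side of that loop, here is a second sketch, again a hypothetical illustration rather than anything from the lesson's materials. The driver resubmits essentially the same job every round, rereads both the large constant dataset and the small evolving state from HDFS, and decides when to stop by reading a user-defined counter that the (assumed) reducer increments whenever values are still changing. The IterationMapper/IterationReducer classes, the paths, and the counter names are all placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterateUntilConvergence {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String constantData = "/data/graph";     // large input that never changes
    String state = "/data/state0";           // small state updated each iteration

    long stillChanging = Long.MAX_VALUE;
    for (int i = 0; stillChanging > 0 && i < 50; i++) {
      String nextState = "/data/state" + (i + 1);

      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterateUntilConvergence.class);
      job.setMapperClass(IterationMapper.class);    // hypothetical user classes; the
      job.setReducerClass(IterationReducer.class);  // mapper must tell the inputs apart

      // Both the constant data and the current state are reread from HDFS every round.
      FileInputFormat.addInputPath(job, new Path(constantData));
      FileInputFormat.addInputPath(job, new Path(state));
      FileOutputFormat.setOutputPath(job, new Path(nextState));

      if (!job.waitForCompletion(true)) System.exit(1);

      // The reducer is assumed to increment this user-defined counter for every value
      // that is still far from its previous estimate; zero means we have converged.
      stillChanging = job.getCounters()
                         .findCounter("Convergence", "VALUES_STILL_CHANGING")
                         .getValue();
      state = nextState;
    }
  }
}
```

Even in this small sketch you can see the costs the lesson lists: a fresh job launch, a full reread of the constant data, and a full shuffle on every single iteration.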