Sometimes I come across the question “Is Apache Spark going to replace Hadoop MapReduce?”. The answer depends on your use case. Here I try to explain the features of Apache Spark and Hadoop MapReduce as data processing engines. I hope this blog post helps answer some of the questions that may have come to your mind lately.
Apache Spark keeps working data in memory, i.e. it loads data into RAM and processes it there, using disk only when it has to; Hadoop MapReduce writes its intermediate data to disk.
Spark is memory (RAM) based computing with partial use of disk; MapReduce is disk-based computing with partial use of memory.
Spark performs best when all the data fits in memory, especially on dedicated clusters; MapReduce is designed for data that doesn’t fit in memory.
Spark supports interactive queries; MapReduce is for batch processing only.
If a Spark process fails in the middle of execution, the in-memory results it held are lost and have to be recomputed, in the worst case from the beginning of the job; MapReduce relies on disk, so if a process fails in the middle of execution, the job can continue from where it left off.
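Spark uses Resilient Distributed Datasets (RDDs) for fault tolerance: an RDD records the lineage of transformations that produced it, so lost partitions can be recomputed rather than restored from a copy. MapReduce relies on replication to achieve fault tolerance. Replication is a costly process because it uses disk I/O and network bandwidth.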
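To make the lineage idea concrete, here is a minimal single-machine sketch in plain Python. It is not the Spark API: `ToyRDD`, `lose_partition` and the sample data are hypothetical names invented for illustration. Each dataset remembers its parent and the function that derived it, so a lost partition can be rebuilt on demand instead of being read back from a replica.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD (illustration only, not Spark)."""

    def __init__(self,
                 source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None) -> None:
        self.source = source   # base partitions, set only on the root dataset
        self.parent = parent   # lineage: the dataset this one was derived from
        self.fn = fn           # lineage: the per-element transformation
        self.cache: Dict[int, List[int]] = {}  # materialized partitions

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Record the transformation instead of running it right away.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self) -> int:
        if self.source is not None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, i: int) -> List[int]:
        # Serve from memory if present, otherwise rebuild from lineage.
        if i in self.cache:
            return self.cache[i]
        if self.source is not None:
            part = list(self.source[i])
        else:
            part = [self.fn(x) for x in self.parent.compute(i)]
        self.cache[i] = part
        return part

    def lose_partition(self, i: int) -> None:
        # Simulate a failure that wipes one in-memory partition.
        self.cache.pop(i, None)


base = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
squared = base.map(lambda x: x * x)
result = squared.map(lambda x: x + 1)

before = [result.compute(i) for i in range(result.num_partitions())]

# Partition 1 is lost at two levels; only that partition is recomputed.
result.lose_partition(1)
squared.lose_partition(1)
recovered = result.compute(1)
print(recovered)  # [10, 17]
```

Note that only the lost partition is rebuilt; partitions 0 and 2 stay in memory untouched. The trade-off is that recovery costs CPU time for recomputation instead of the disk and network cost of keeping replicas.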
Spark is best suited for iterative processing that needs to pass over the same data many times; but when it comes to one-pass, ETL-like jobs (each run with a new data set), for example data transformation or data integration, MapReduce is the better fit.
MapReduce is inefficient for applications that repeatedly reuse the same set of data. Spark keeps working sets in memory for efficient reuse (e.g. by caching data in RAM).
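The effect of keeping a working set in memory can be sketched with a toy model in plain Python. Nothing here is Spark or Hadoop code: `make_loader`, `iterate_without_cache` and `iterate_with_cache` are hypothetical helpers, and the "disk read" is just a counter. In Spark itself the equivalent step would be calling `cache()` or `persist()` on an RDD before iterating over it.

```python
from typing import Callable, Dict, List, Tuple


def make_loader(data: List[int]) -> Tuple[Callable[[], List[int]], Dict[str, int]]:
    """Return a loader that simulates a disk read, plus a read counter."""
    stats = {"reads": 0}

    def load() -> List[int]:
        stats["reads"] += 1      # each call stands for one pass over disk
        return list(data)

    return load, stats


def iterate_without_cache(load: Callable[[], List[int]], iterations: int) -> int:
    # Re-read the input on every pass, as a chain of independent jobs would.
    total = 0
    for _ in range(iterations):
        total += sum(load())
    return total


def iterate_with_cache(load: Callable[[], List[int]], iterations: int) -> int:
    # Read once, then reuse the in-memory working set on every pass.
    working_set = load()
    total = 0
    for _ in range(iterations):
        total += sum(working_set)
    return total


data = [1, 2, 3, 4]
load_a, stats_a = make_loader(data)
load_b, stats_b = make_loader(data)

total_a = iterate_without_cache(load_a, 10)
total_b = iterate_with_cache(load_b, 10)

print(total_a == total_b)                  # True
print(stats_a["reads"], stats_b["reads"])  # 10 1
```

Both variants compute the same result, but the cached one touches the simulated disk once instead of ten times, which is the saving that grows with the number of iterations.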
Spark is very efficient for iterative computations that need to pass over the same data many times. But what about, for instance, ETL-type computations where result sets are large and may exceed the aggregate RAM of the cluster by an order of magnitude? Hadoop MapReduce is likely to outperform Spark in this case.
Apache Spark comes with some drawbacks, such as difficulty handling intermediate data that is larger than the memory of a node, problems in case of node failure and, most important of all, the cost factor. That said, with RAM priced at about 5 USD per GB, roughly 1 TB of RAM costs about 5K USD, which makes memory a very minor fraction of the overall node cost.
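The RAM cost estimate above is simple arithmetic, and a few lines of Python make it easy to redo with your own prices. The 5 USD per GB figure is the one quoted in the text; treating 1 TB as 1024 GB is an assumption, and `ram_cost_usd` and `affordable_gb` are hypothetical helper names.

```python
PRICE_PER_GB_USD = 5   # price quoted in the text
GB_PER_TB = 1024       # assumption: binary terabyte


def ram_cost_usd(capacity_gb: int, price_per_gb: int = PRICE_PER_GB_USD) -> int:
    """Cost of the given amount of RAM at a flat per-GB price."""
    return capacity_gb * price_per_gb


def affordable_gb(budget_usd: int, price_per_gb: int = PRICE_PER_GB_USD) -> int:
    """Whole gigabytes of RAM a budget buys at a flat per-GB price."""
    return budget_usd // price_per_gb


cost_1tb = ram_cost_usd(GB_PER_TB)
gb_for_5k = affordable_gb(5000)

print(cost_1tb)   # 5120
print(gb_for_5k)  # 1000
```

So a 5K USD budget buys 1000 GB, just short of a full binary terabyte, which matches the "near about 1 TB" wording.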
Spark loads a process into memory and keeps it there until further notice. MapReduce, however, kills its processes as soon as a job is done.
Nonetheless, because MapReduce relies on hard drives, if a process crashes in the middle of execution, the job can continue where it left off, whereas Spark may have to redo the lost work, in the worst case from the beginning. Resuming from disk can save time.