Is Hadoop Obsolete?

Let us first understand What is Hadoop, MapReduce, Big Data Framework and Spark

Hadoop is an Open Source framework for writing applications to processes structured, semi-structured or unstructured data that are stored in HDFS. Essentially Haddop is a distributed data infrastructure: Hadoop MapReduce is designed  to distribute massive data collections across multiple nodes within a cluster of commodity servers, which means you don’t need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously.

Hadoop is Scalable, and Fault tolerant framework written in Java. Hadoop is not only a storage system but is a platform for large data storage as well as processing.

Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn’t do distributed storage.

Apache Spark is an Open Source Big Data framework. It is faster and more general purpose data processing engine and is basically designed for fast computation. It covers a wide range of workloads such as batch, interactive, iterative and streaming.

So essentially Spark is a In-Memory Cluster Processing Framework while MapReduce involves multiple read-writes on disks.

Spark performs better than Hadoop when:

  1. data size ranges from GBs to PBs
  2. there is a varying algorithmic complexity, from ETL to SQL to machine learning
  3. low-latency streaming jobs to long batch jobs
  4. processing data regardless of storage medium, be it disks, SSDs, or memory

But if the size of the data is small (~100 MB) Hadoop can sometimes be faster when performing mapping in the data nodes.

Hadoop is used for Batch processing whereas Spark can be used for both. In this regard, Hadoop users can process using MapReduce tasks where batch processing is required. In theory, Spark can perform everything that Hadoop can and more. Thus it becomes a matter of comfort when it comes to choosing Hadoop or Spark.

Lets do some point wise comparison:

1.    Speed:

Apache Spark – Spark is lightning fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. Because of reducing the number of read/write cycle to disk and storing intermediate data in-memory Spark makes it possible.

Hadoop MapReduce – MapReduce reads and writes from disk, as a result, it slows down the processing speed.

2.    Difficulty:

Apache Spark – Spark is easy to program as it has tons of high-level operators with RDD – Resilient Distributed Dataset.

Hadoop MapReduce – In MapReduce, developers need to hand code each and every operation which makes it very difficult to work.

3.    Easy to Manage

Apache Spark – Spark is capable of performing batch, interactive and Machine Learning and Streaming all in the same cluster. As a result makes it a complete data analytics engine. Thus, no need to manage different component for each need. Installing Spark on a cluster will be enough to handle all the requirements.

Hadoop MapReduce – As MapReduce only provides the batch engine. Hence, we are dependent on different engines. For example- Storm, Giraph, Impala, etc. for other requirements. So, it is very difficult to manage many components.

4.    Real-time analysis

Apache Spark – It can process real time data i.e. data coming from the real-time event streams at the rate of millions of events per second, e.g. Twitter data for instance or Facebook sharing/posting. Spark’s strength is the ability to process live streams efficiently.

Hadoop MapReduce – MapReduce fails when it comes to real-time data processing as it was designed to perform batch processing on voluminous amounts of data.

5.    Fault tolerance

Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart the application from scratch in case of any failure.

Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so there is no need to restart the application from scratch in case of any failure.

6.    Security

Apache Spark – Spark is little less secure in comparison to MapReduce because it supports the only authentication through shared secret password authentication.

Hadoop MapReduce – Apache Hadoop MapReduce is more secure because of Kerberos and it also supports Access Control Lists (ACLs) which are a traditional file permission model.

Hadoop Structure

Hadoop is a Big Data tool which has three major layers:

  1. HDFS – Storage Layer
  2. Mapreduce – Processing Layer
  3. Yarn – Resource Management Layer (Yet Another Resource Negotiator)

Spark on the other hand is another Processing Layer. It uses HDFS as storage layer and Yarn for Resource management. Spark does not have it’s Storage layer and it is dependent on third party i.e. HDFS.

Spark1

Spark2

Hadoop is still the Backbone of Big Data and in Good demand.

Spark is the next stage in the evolution of this. The fundamental thinking is that fine grained mutable state is a very low level abstraction and building block for ML algorithms ; Hence Spark was an attempt to raise this abstraction to coarse grained immutable data called RDD’s ( Resilient DIstributed DataSets) ;

Spark3

Since HDFS never really supported multiple writer concurrent appends anyway , it follows that RDD’s are not giving up much by being immutable – whereas you gain a lot by  having both immutability and a higher level of abstraction to begin with for big data.

Use-cases where Spark fits best:

Real-Time Big Data Analysis:

Real-time data analysis means processing data generated by the real-time event streams coming in at the rate of millions of events per second, Twitter data for instance. The strength of Spark lies in its abilities to support streaming of data along with distributed processing. This is a useful combination that delivers near real-time processing of data. MapReduce is handicapped of such an advantage as it was designed to perform batch cum distributed processing on large amounts of data. Real-time data can still be processed on MapReduce but its speed is nowhere close to that of Spark.

Spark claims to process data 100x faster than MapReduce, while 10x faster with the disks.

Graph Processing:

Most graph processing algorithms like page rank perform multiple iterations over the same data and this requires a message passing mechanism. We need to program MapReduce explicitly to handle such multiple iterations over the same data. Roughly, it works like this: Read data from the disk and after a particular iteration, write results to the HDFS and then read data from the HDFS for next the iteration. This is very inefficient since it involves reading and writing data to the disk which involves heavy I/O operations and data replication across the cluster for fault tolerance. Also, each MapReduce iteration has very high latency, and the next iteration can begin only after the previous job has completely finished.

Also, message passing requires scores of neighboring nodes in order to evaluate the score of a particular node. These computations need messages from its neighbors (or data across multiple stages of the job), a mechanism that MapReduce lacks. Different graph processing tools such as Pregel and GraphLab were designed in order to address the need for an efficient platform for graph processing algorithms. These tools are fast and scalable, but are not efficient for creation and post-processing of these complex multi-stage algorithms.

Introduction of Apache Spark solved these problems to a great extent. Spark contains a graph computation library called GraphX which simplifies our life. In-memory computation along with in-built graph support improves the performance of the algorithm by a magnitude of one or two degrees over traditional MapReduce programs. Spark uses a combination of Netty and Akka for distributing messages throughout the executors. Let’s look at some statistics that depict the performance of the PageRank algorithm using Hadoop and Spark.

Iterative Machine Learning Algorithms:

Almost all machine learning algorithms work iteratively. As we have seen earlier, iterative algorithms involve I/O bottlenecks in the MapReduce implementations. MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavy for iterative algorithms. Spark with the help of Mesos – a distributed system kernel, caches the intermediate dataset after each iteration and runs multiple iterations on this cached dataset which reduces the I/O and helps to run the algorithm faster in a fault tolerant manner.

Spark has a built-in scalable machine learning library called MLlib which contains high-quality algorithms that leverages iterations and yields better results than one pass approximations sometimes used on MapReduce.

Fast data processing. As we know, Spark allows in-memory processing. As a result, Spark is up to 100 times faster for data in RAM and up to 10 times for data in storage.

Iterative processing. Spark’s RDDs allow performing several map operations in memory, with no need to write interim data sets to a disk.

Near real-time processing. Spark is an excellent tool to provide immediate business insights. This is the reason why Spark is used in credit card’s streaming system.

It should be borne in mind that this answer leans heavily on fast, speed-of-thought analytics and ML perspective. Also, I will only consider scenarios that I actually observed people trying to use Spark for.

ETL+wrangling:

Do you have engineers that are capable of writing decent, most importantly, legible Scala? If no, probably stay with Hadoop. Unless there are other, non-technical goals (like a renewed marketing message).

Do you run, by any chance, interactive, selective, speed-of-thought queries (think OLAP)?

If yes, then no, Spark is not a spatial indexer. It can do it, but it will be locked to a “full table scan” solution, which means  it will do it, on average, at a 1000x higher cost than necessary, with the QPS 1000x lower than actually is feasible, for any average pivoting UI scenario.

Oh, but then, there are no true distributed, “big data” MOLAP solutions with a good distributed spatial index scanning in OSS domain today (no, e.g. Impala does not qualify as spatial scan engine), so… maybe; but try commercial vendors perhaps instead. What you gain in software license costs, you most likely will lose on hardware and programming effort tenfold, if you do not.

Either way, MapReduce cannot do it either.

If you only run queries that are always best optimized with a “full table scan” (i.e. low selectivity queries), or do ETL type of things only, sure, go ahead. SparkQL is pretty good for that and is fairly easy.

ML (Machine Learning):

Is the speed or interactivity of the ML computations important?

If yes, then most likely one needs to move beyond Spark. Spark, as far as numerical, iterative, shared-nothing platforms go, is about the slowest platform there is. Practically everything else that exists for that purpose (except for the Hadoop variety of MapReduce of course), has far better strong scaling properties than Spark; even in the free software realm.

There are two main problems with Spark that prevent it form being on par with the segment leaders for performance:  (a) a fine-grain, centralized, heavy-weight task scheduling, and (b) lack of efficient multicast programming model (as in MPI, GraphLab). This is illustrated, for example, here: Large Scale Machine Learning and Other Animals, and is true for any numerical solution that needs to iterate till convergence (which is almost everything there is). This may be a bit dated, but nothing has changed materially to date in this department.

This is especially bad with “super-step” architectures (i.e. GraphX vs. GraphLab).

If no, the described performance issues may be acceptable in your case upon evaluation; but keep in mind that there are much (much!) faster things around here that may collapse your hardware requirements potentially 100x times for a solution of equivalent volume and accuracy.

Are you designing your own ML algorithms?

If yes, you also probably need to move beyond Spark. Semantically, better choices for tensor math include BidMat/BidMach, SystemML and Mahout Samsara.

If you are going to do off-the-shelf work only, then maybe staying with Spark is fine iff your needs are completely covered with what exists today in MLLib or any other math that runs on Spark (there is more than one choice, by now).

Both ML + ETL:

Realistically, the only reason to switch to Spark rather than anything else  for any decision branch that includes ML component is when operational aspects trump performance requirements.

I.e. if you can say, “I am ok with my thing running for hours on dozens of nodes, while it really computes on couple of machines in the same time elsewhere, but i have single machine cluster deployment to care about for all of ETL, batch analytics, and slow-ish, mostly non-interactive ML”, then yes. Switch to Spark.

In either case, ML on Hadoop variety of the Mapreduce is dead. Has been for at least past 3 years.

Streaming maybe? Intra-day metric aggregators? Some online inference even?

Yes, sure. Spark; but it may be worth to look at Flink, since it is said to be a “true streaming” platform, whereas Spark is said to be a microbatch emulation of streaming. In  the end this distinction may not be significant to you enough though. (disclosure: I do not have a first hand experience with the Apache Flink).

That said, as others have mentioned, even if you switch to Spark, you do away with MR ecosystem (MapReduce, Pig, Hive etc.) but not some form of a Hadoop HDFS substrate. Which means, you’d still most likely be heavily married to one of Hadoop distributions out there.

So, the correct statement is MapReduce is being Replaced by Spark and not Hadoop as a whole. IF you are planning to learn Hadoop or make a future in it, please go ahead, it is not dying or dead. Big data is growing fast in leaps and bounds, it’s time to switch. Read also: The Digital World

For more information on Big Data consulting write to me atul@nexcen.in

 

Advertisements

Big Data Revolution

BigData1As we welcome the new year 2018, I clearly see the emerging trends leading to Big Data. Calling it a revolution as it has changed the whole perspective and paradigm to look and treat the data. With high speed internet becoming more and more handy with masses with FTTH (Fiber To The Home) technology coupled with falling prices of broadband data connection, AI, Augmented Intelligence with Machine Learning getting wider acceptance across Enterprise IT World, the apprehensions about data security getting busted and low cost of high speed and storage resulting in high volumes of data; all lead to growing demand of Big Data.

BigData2

Started with 4Vs (Volume, Velocity, Variety and Veracity) Big Data has come a long way and is still evolving, evolving fast, adding more and more tools and techniques into its ecosystem.  The misconceptions, the myths about Big Data limitations and usage are fast fading out and the world is gearing up to embrace it wholeheartedly.

Let’s talk about the myths about limitations of Big Data:

Myth 1:  Hadoop is only for batch processing. The fact is it does provides real time analytics and interacts well with other real time big data tools like Scala.

Myth 2:  Data Security is not enough. With the evolution, Big data has come a long way and the ecosystem has built enterprise grade security onto Hadoop platforms including Cloudera, Hortonworks, MapR etc. There are now excellent data governance capabilities

Myth 3:  Big Data is for unstructured data only. Big Data widely used for unstructured data, providing alternate strategies and solutions to store huge volumes of unstructured data, providing parallel processing to reduce storage and access time, but is equally good for structured data.

In fact many of these myths are limited to MapReduce usage but as the Big Data evolves, there are plenty of alternate technologies and tools become part of ecosystem. While MapR, a Java-based tool is powerful enough to chomp Big Data and flexible enough to allow for good progress doing so, the coding is anything other than easy. Giving below a few effective and poplular alternatives to MapR:

  1. PigLatin, originally developed by Yahoo to maximize productivity and accommodate a complex procedural data flow, eventually became an Apache project, and has characteristics that resemble both scripting languages (like Python and Pearl) and SQL. In fact, many of the operations look like SQL: load, sort, aggregate, group, join, etc. It just isn’t as limited as SQL. Pig allows for input from multiple databases and output into a single data set.
  2. Hive also looks a lot like SQL at first glance. It accepts SQL-like statements and uses those statements to output Java MapReduce code. It requires little in the way of actual programming, so it’s a useful tool for teams that don’t have high-level Java skills or have fewer programmers with which to produce code. Initially developed by the folks at Facebook, Hive is now an Apache project.
  3. Spark is widely been hailed as the end of MapReduce. Born in the AMPLab at the University of California in Berkley, Spark, unlike Pig and Hive, which are merely programming interfaces for the execution framework, replaces the execution framework of MapReduce entirely. Spark provides most effective memory and resource usage, almost 100 times faster the MapR. Spark also provides many features, including stream processing, data transfer, fast fault recovery, optimized scheduling, and a lot more.

These alternatives have effectively reduced an organization’s dependency on Java.

If you are a late entrant into Big Data space, one benefit is that you won’t have to waddle through all of the platforms and products that came and went during the early years.

Initially it was just Hadoop Distributed File System (HDFS) then came MapReduce, Yarn and a plethora of various products, some of which blossomed and became mature parts of the Hadoop ecosystem. Others petered out or are still puttering around wondering what they’re going to be when they grow up.

BigDataEcosystem

Today, there are quite a number of Big Data products and platforms to pick from to assemble an infrastructure that meets your needs.

Giving below a few significant players, who have made a niche in the Big Data space:

Tez:  A generalized data-flow programming framework, built on Hadoop YARN, Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

Hive:  A Datawarehouse infrastructure that provides data summarization, to query HDFS, Database / support SQL, for Structured DB only. Best for BI Apps.  It can be used independently as well,

PigLatin: A high level data flow language, to query HDFS, shell, can be used independently as well. Best used for ETL

HBase: Scalable distributed database that supports structured data storage for large tables, No Sql database, Database / NoSQL, can be used independently as well. Can have downtime. Speed is fast. Best used for NoSql database

Cassandra:  Scalable multi-master database with no single points of failure, Uses No Sql database, Database / NoSQL, can be used independently as well. Limitation of slow read and write but highly scalable and high availability , Best used for Large/Sensitive NoSQL database.

Mahout:  Scalable machine learning and data mining library, Machine Learning, ML programming framework.

Oozie:  Workflow scheduler system, it is used for managing Hadoop apache jobs, Workflow Management.

Sqoop:   Import/Export utility, to import/export data from Hadoop to various database.

Flume: Robust, mature and proven tool for streaming data. Used to import/export real time data.

Hue: Hadoop user Experience, It is an web interface that supports Hadoop and its ecosystem.

Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Framework can be used independently as well.

Scala: Language in which spark is written. Language, can be used independently as well. Can run unmodified hive queries on existing Hadoop deployment, An alternative to HQL.

Spark streaming: Spark module for performing streaming analytics, Enables analytical and interactive apps for live streaming data. It’s a faster alternative of Flume.

Mlib: Spark module for performing Machine Learning, Machine Learning libraries being built on top of Spark, ML programming framework, can be used independently as well. It is a replacement of Mahout from Hadoop,

GraphX: Graph Computation engine. Combine data parallel and graph parallel concepts, engine.

SparkR: Package for R language to enable R users to leverage spark power from R shell, To access Spark data from R. A package.

PySpark:, To access spark data from Python, It’s  a shell.

Julia: Analytical/Mathematical Language used for Data Modeling/Visualization. It’s a language, can be used independently as well, for structured and unstructured data.

R: Analytical/Mathematical Language used for Data Modeling/Visualization, It’s a language, can be used independently as well, for Structured Database only.

Python: A General Purpose language. Can be used for data modeling. Can be used independently as well, for structured/unstructured data.

While Hive, PigLatin, HBase, Cassandra, Mahout, Oozie, Sqoop, Flume, Hue are Hadoop components,  Many of these tools like R, Python and Julia are originated independently and now part of ecosystem.

For more information contact me at atul@nexcen.in.