Big Data Revolution

BigData1As we welcome the new year 2018, I clearly see the emerging trends leading to Big Data. Calling it a revolution as it has changed the whole perspective and paradigm to look and treat the data. With high speed internet becoming more and more handy with masses with FTTH (Fiber To The Home) technology coupled with falling prices of broadband data connection, AI, Augmented Intelligence with Machine Learning getting wider acceptance across Enterprise IT World, the apprehensions about data security getting busted and low cost of high speed and storage resulting in high volumes of data; all lead to growing demand of Big Data.

BigData2

Started with 4Vs (Volume, Velocity, Variety and Veracity) Big Data has come a long way and is still evolving, evolving fast, adding more and more tools and techniques into its ecosystem.  The misconceptions, the myths about Big Data limitations and usage are fast fading out and the world is gearing up to embrace it wholeheartedly.

Let’s talk about the myths about limitations of Big Data:

Myth 1:  Hadoop is only for batch processing. The fact is it does provides real time analytics and interacts well with other real time big data tools like Scala.

Myth 2:  Data Security is not enough. With the evolution, Big data has come a long way and the ecosystem has built enterprise grade security onto Hadoop platforms including Cloudera, Hortonworks, MapR etc. There are now excellent data governance capabilities

Myth 3:  Big Data is for unstructured data only. Big Data widely used for unstructured data, providing alternate strategies and solutions to store huge volumes of unstructured data, providing parallel processing to reduce storage and access time, but is equally good for structured data.

In fact many of these myths are limited to MapReduce usage but as the Big Data evolves, there are plenty of alternate technologies and tools become part of ecosystem. While MapR, a Java-based tool is powerful enough to chomp Big Data and flexible enough to allow for good progress doing so, the coding is anything other than easy. Giving below a few effective and poplular alternatives to MapR:

  1. PigLatin, originally developed by Yahoo to maximize productivity and accommodate a complex procedural data flow, eventually became an Apache project, and has characteristics that resemble both scripting languages (like Python and Pearl) and SQL. In fact, many of the operations look like SQL: load, sort, aggregate, group, join, etc. It just isn’t as limited as SQL. Pig allows for input from multiple databases and output into a single data set.
  2. Hive also looks a lot like SQL at first glance. It accepts SQL-like statements and uses those statements to output Java MapReduce code. It requires little in the way of actual programming, so it’s a useful tool for teams that don’t have high-level Java skills or have fewer programmers with which to produce code. Initially developed by the folks at Facebook, Hive is now an Apache project.
  3. Spark is widely been hailed as the end of MapReduce. Born in the AMPLab at the University of California in Berkley, Spark, unlike Pig and Hive, which are merely programming interfaces for the execution framework, replaces the execution framework of MapReduce entirely. Spark provides most effective memory and resource usage, almost 100 times faster the MapR. Spark also provides many features, including stream processing, data transfer, fast fault recovery, optimized scheduling, and a lot more.

These alternatives have effectively reduced an organization’s dependency on Java.

If you are a late entrant into Big Data space, one benefit is that you won’t have to waddle through all of the platforms and products that came and went during the early years.

Initially it was just Hadoop Distributed File System (HDFS) then came MapReduce, Yarn and a plethora of various products, some of which blossomed and became mature parts of the Hadoop ecosystem. Others petered out or are still puttering around wondering what they’re going to be when they grow up.

BigDataEcosystem

Today, there are quite a number of Big Data products and platforms to pick from to assemble an infrastructure that meets your needs.

Giving below a few significant players, who have made a niche in the Big Data space:

Tez:  A generalized data-flow programming framework, built on Hadoop YARN, Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

Hive:  A Datawarehouse infrastructure that provides data summarization, to query HDFS, Database / support SQL, for Structured DB only. Best for BI Apps.  It can be used independently as well,

PigLatin: A high level data flow language, to query HDFS, shell, can be used independently as well. Best used for ETL

HBase: Scalable distributed database that supports structured data storage for large tables, No Sql database, Database / NoSQL, can be used independently as well. Can have downtime. Speed is fast. Best used for NoSql database

Cassandra:  Scalable multi-master database with no single points of failure, Uses No Sql database, Database / NoSQL, can be used independently as well. Limitation of slow read and write but highly scalable and high availability , Best used for Large/Sensitive NoSQL database.

Mahout:  Scalable machine learning and data mining library, Machine Learning, ML programming framework.

Oozie:  Workflow scheduler system, it is used for managing Hadoop apache jobs, Workflow Management.

Sqoop:   Import/Export utility, to import/export data from Hadoop to various database.

Flume: Robust, mature and proven tool for streaming data. Used to import/export real time data.

Hue: Hadoop user Experience, It is an web interface that supports Hadoop and its ecosystem.

Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. Framework can be used independently as well.

Scala: Language in which spark is written. Language, can be used independently as well. Can run unmodified hive queries on existing Hadoop deployment, An alternative to HQL.

Spark streaming: Spark module for performing streaming analytics, Enables analytical and interactive apps for live streaming data. It’s a faster alternative of Flume.

Mlib: Spark module for performing Machine Learning, Machine Learning libraries being built on top of Spark, ML programming framework, can be used independently as well. It is a replacement of Mahout from Hadoop,

GraphX: Graph Computation engine. Combine data parallel and graph parallel concepts, engine.

SparkR: Package for R language to enable R users to leverage spark power from R shell, To access Spark data from R. A package.

PySpark:, To access spark data from Python, It’s  a shell.

Julia: Analytical/Mathematical Language used for Data Modeling/Visualization. It’s a language, can be used independently as well, for structured and unstructured data.

R: Analytical/Mathematical Language used for Data Modeling/Visualization, It’s a language, can be used independently as well, for Structured Database only.

Python: A General Purpose language. Can be used for data modeling. Can be used independently as well, for structured/unstructured data.

While Hive, PigLatin, HBase, Cassandra, Mahout, Oozie, Sqoop, Flume, Hue are Hadoop components,  Many of these tools like R, Python and Julia are originated independently and now part of ecosystem.

For more information contact me at atul@nexcen.in.

Advertisement