Is Hadoop Obsolete?

Let us first understand what Hadoop, MapReduce, and Spark are as Big Data frameworks.

Hadoop is an open source framework for writing applications that process structured, semi-structured, or unstructured data stored in HDFS. Essentially, Hadoop is a distributed data infrastructure: it is designed to distribute massive data collections across multiple nodes within a cluster of commodity servers, which means you don’t need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously.

Hadoop is a scalable, fault-tolerant framework written in Java. It is not only a storage system but also a platform for processing large volumes of data.

Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn’t do distributed storage.

Apache Spark is an open source Big Data framework. It is a fast, general-purpose data processing engine designed for fast computation. It covers a wide range of workloads such as batch, interactive, iterative, and streaming.

So essentially, Spark is an in-memory cluster processing framework, while MapReduce involves multiple reads and writes to disk.

Spark performs better than Hadoop across a wide range of conditions:

  1. data sizes ranging from GBs to PBs
  2. varying algorithmic complexity, from ETL to SQL to machine learning
  3. workloads from low-latency streaming jobs to long batch jobs
  4. data on any storage medium, be it disks, SSDs, or memory

But if the data is small (~100 MB), Hadoop can sometimes be faster when performing mapping in the data nodes.

Hadoop is used for batch processing, whereas Spark can be used for both batch and real-time processing. Hadoop users can therefore use MapReduce tasks where batch processing is required. In theory, Spark can do everything that Hadoop MapReduce can, and more, so choosing between them becomes largely a matter of comfort.

Let’s do a point-by-point comparison:

1.    Speed:

Apache Spark – Spark is a lightning-fast cluster computing tool. According to the Spark project, it runs applications up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce. It achieves this by reducing the number of read/write cycles to disk and storing intermediate data in memory.

Hadoop MapReduce – MapReduce reads from and writes to disk between steps, which slows down processing.

2.    Difficulty:

Apache Spark – Spark is easy to program, as its RDDs (Resilient Distributed Datasets) come with many high-level operators.

Hadoop MapReduce – In MapReduce, developers need to hand-code each and every operation, which makes it much harder to work with.
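To make the contrast concrete, here is a minimal sketch in plain Python (not the actual Hadoop or Spark APIs). The first half writes out the map, shuffle, and reduce phases of a word count by hand; the second half does the same job with a single high-level operation. The function names and the sample lines are made up for illustration, and `Counter` only stands in for the kind of high-level operator Spark offers.

```python
from collections import Counter, defaultdict

lines = ["spark and hadoop", "hadoop and hdfs"]  # hypothetical input

# Hand-coded MapReduce style: every phase is written out explicitly.
def map_phase(line):
    # Emit one (word, 1) pair per word, as a mapper would.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word, as a reducer would.
    return {word: sum(counts) for word, counts in grouped.items()}

mapped = [pair for line in lines for pair in map_phase(line)]
mr_counts = reduce_phase(shuffle_phase(mapped))

# High-level operator style: the same job as one expression.
op_counts = dict(Counter(word for line in lines for word in line.split()))

print(mr_counts == op_counts)  # True
```

Both versions produce the same counts; the difference is how much plumbing the developer has to write and maintain.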

3.    Easy to Manage

Apache Spark – Spark can perform batch, interactive, machine learning, and streaming workloads all in the same cluster, which makes it a complete data analytics engine. There is no need to manage a different component for each need; installing Spark on a cluster is enough to handle all of these requirements.

Hadoop MapReduce – MapReduce provides only a batch engine, so we depend on other engines (for example Storm, Giraph, or Impala) for other requirements. Managing that many components is difficult.

4.    Real-time analysis

Apache Spark – It can process real-time data, i.e. data coming from real-time event streams at the rate of millions of events per second, such as Twitter data or Facebook sharing/posting. Spark’s strength is its ability to process live streams efficiently.

Hadoop MapReduce – MapReduce fails when it comes to real-time data processing as it was designed to perform batch processing on voluminous amounts of data.

5.    Fault tolerance

Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart the application from scratch in case of any failure.

Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so there is no need to restart the application from scratch in case of any failure.

6.    Security

Apache Spark – Spark is a little less secure than MapReduce because it supports authentication only through a shared secret (password authentication).

Hadoop MapReduce – Apache Hadoop MapReduce is more secure because it supports Kerberos as well as Access Control Lists (ACLs), a traditional file permission model.

Hadoop Structure

Hadoop is a Big Data tool which has three major layers:

  1. HDFS – Storage Layer
  2. MapReduce – Processing Layer
  3. YARN (Yet Another Resource Negotiator) – Resource Management Layer

Spark, on the other hand, is another processing layer. It can use HDFS as its storage layer and YARN for resource management. Spark does not have its own storage layer and depends on a third party such as HDFS.

Hadoop is still the backbone of Big Data and is in good demand.

Spark is the next stage in this evolution. The fundamental thinking is that fine-grained mutable state is a very low-level abstraction and building block for ML algorithms; hence Spark was an attempt to raise this abstraction to coarse-grained immutable data called RDDs (Resilient Distributed Datasets).

Since HDFS never really supported concurrent appends by multiple writers anyway, it follows that RDDs are not giving up much by being immutable, whereas you gain a lot by having both immutability and a higher level of abstraction to begin with for big data.
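The idea of coarse-grained, immutable data can be illustrated with a toy class. This is only a sketch in plain Python, not Spark's RDD implementation: `ToyDataset` and its methods are hypothetical names. Each transformation returns a new object and records what was done, so any result can be rebuilt from the source by replaying that record.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: immutable, built from coarse-grained steps."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # the original input is never modified
        self._lineage = tuple(lineage)  # ordered record of transformations

    def map(self, fn):
        # Return a NEW dataset; the current one is left untouched.
        return ToyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Replay the recorded steps over the source to produce the result.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


base = ToyDataset([1, 2, 3, 4])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(base.collect())           # [1, 2, 3, 4]
print(evens_squared.collect())  # [4, 16]
```

Because `base` never changes, it can be shared freely, and a lost result only needs its recorded steps to be recomputed.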

Use-cases where Spark fits best:

Real-Time Big Data Analysis:

Real-time data analysis means processing data generated by real-time event streams coming in at the rate of millions of events per second, Twitter data for instance. The strength of Spark lies in its ability to support streaming of data along with distributed processing, a useful combination that delivers near real-time processing. MapReduce lacks this advantage, as it was designed to perform batch, distributed processing on large amounts of data. Real-time data can still be processed on MapReduce, but its speed is nowhere close to that of Spark.

Spark claims to process data up to 100x faster than MapReduce in memory, and up to 10x faster on disk.

Graph Processing:

Most graph processing algorithms, like PageRank, perform multiple iterations over the same data, and this requires a message passing mechanism. We need to program MapReduce explicitly to handle such multiple iterations over the same data. Roughly, it works like this: read data from the disk and, after a particular iteration, write results to HDFS, and then read data from HDFS for the next iteration. This is very inefficient since it involves reading and writing data to the disk, which means heavy I/O operations and data replication across the cluster for fault tolerance. Also, each MapReduce iteration has very high latency, and the next iteration can begin only after the previous job has completely finished.
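To show what "multiple iterations over the same data" looks like, here is a simplified PageRank in plain Python. It is a single-machine sketch, not a MapReduce or Spark job; the three-page graph, the damping factor of 0.85, and the assumption that every page has at least one outgoing link are all choices made for illustration. Each pass reuses the same link structure and feeds its output straight into the next pass, which is the step a chain of MapReduce jobs would route through HDFS.

```python
def pagerank(links, iterations=20, damping=0.85):
    """links maps each page to the pages it links to (no dangling pages assumed)."""
    pages = list(links)
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Each page sends an equal share of its rank to every neighbour.
        contributions = {page: 0.0 for page in pages}
        for page, neighbours in links.items():
            share = ranks[page] / len(neighbours)
            for neighbour in neighbours:
                contributions[neighbour] += share
        # The new ranks feed directly into the next iteration, in memory.
        ranks = {
            page: (1 - damping) / len(pages) + damping * contributions[page]
            for page in pages
        }
    return ranks


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical graph
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The neighbour-to-neighbour shares in the inner loop play the role of the messages mentioned above.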

Also, message passing requires scores of neighboring nodes in order to evaluate the score of a particular node. These computations need messages from its neighbors (or data across multiple stages of the job), a mechanism that MapReduce lacks. Different graph processing tools such as Pregel and GraphLab were designed in order to address the need for an efficient platform for graph processing algorithms. These tools are fast and scalable, but are not efficient for creation and post-processing of these complex multi-stage algorithms.

The introduction of Apache Spark solved these problems to a great extent. Spark contains a graph computation library called GraphX, which simplifies our life. In-memory computation along with built-in graph support improves the performance of the algorithm by one or two orders of magnitude over traditional MapReduce programs. Spark uses a combination of Netty and Akka for distributing messages throughout the executors. Let’s look at some statistics that depict the performance of the PageRank algorithm using Hadoop and Spark.

Iterative Machine Learning Algorithms:

Almost all machine learning algorithms work iteratively. As we have seen earlier, iterative algorithms involve I/O bottlenecks in MapReduce implementations. MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavy for iterative algorithms. Spark, with the help of Mesos (a distributed systems kernel), caches the intermediate dataset after each iteration and runs multiple iterations on this cached dataset, which reduces I/O and helps run the algorithm faster in a fault-tolerant manner.

Spark has a built-in scalable machine learning library called MLlib, which contains high-quality algorithms that leverage iterations and can yield better results than the one-pass approximations sometimes used on MapReduce.
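The effect of caching on an iterative algorithm can be sketched in a few lines of plain Python. This is an illustration only, not MLlib or Spark code: `load_points` is a hypothetical stand-in for an expensive read from disk or HDFS, and the counter simply records how often that read happens while a tiny gradient descent fits the slope of y = 2x.

```python
io_stats = {"loads": 0}

def load_points():
    """Hypothetical expensive read; the counter stands in for disk or HDFS I/O."""
    io_stats["loads"] += 1
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points on the line y = 2x

def fit_slope(get_points, iterations=50, learning_rate=0.05):
    # Gradient descent for y = w * x, minimising mean squared error.
    w = 0.0
    for _ in range(iterations):
        points = get_points()
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * gradient
    return w

# Without caching: the data is re-read on every iteration.
w_uncached = fit_slope(load_points)
loads_without_cache = io_stats["loads"]

# With caching: read once, then every iteration reuses the in-memory copy.
io_stats["loads"] = 0
cached = load_points()
w_cached = fit_slope(lambda: cached)
loads_with_cache = io_stats["loads"]

print(loads_without_cache, loads_with_cache)  # 50 1
```

The fitted slope is the same either way; only the number of reads changes, which is where the time goes on a real cluster.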

Fast data processing. As we know, Spark allows in-memory processing. As a result, Spark is up to 100 times faster for data in RAM and up to 10 times for data in storage.

Iterative processing. Spark’s RDDs allow performing several map operations in memory, with no need to write interim data sets to a disk.

Near real-time processing. Spark is an excellent tool for providing immediate business insights, which is why it is used in credit card companies’ streaming systems.

It should be borne in mind that this answer leans heavily on fast, speed-of-thought analytics and ML perspective. Also, I will only consider scenarios that I actually observed people trying to use Spark for.

ETL+wrangling:

Do you have engineers who are capable of writing decent and, most importantly, legible Scala? If not, probably stay with Hadoop, unless there are other, non-technical goals (like a renewed marketing message).

Do you run, by any chance, interactive, selective, speed-of-thought queries (think OLAP)?

If yes, then no, Spark is not a spatial indexer. It can do it, but it will be locked to a “full table scan” solution, which means it will do it, on average, at a 1000x higher cost than necessary, with a QPS 1000x lower than is actually feasible, for any average pivoting UI scenario.

Oh, but then, there are no true distributed, “big data” MOLAP solutions with a good distributed spatial index scanning in OSS domain today (no, e.g. Impala does not qualify as spatial scan engine), so… maybe; but try commercial vendors perhaps instead. What you gain in software license costs, you most likely will lose on hardware and programming effort tenfold, if you do not.

Either way, MapReduce cannot do it either.

If you only run queries that are always best optimized with a “full table scan” (i.e. low selectivity queries), or do ETL type of things only, sure, go ahead. Spark SQL is pretty good for that and is fairly easy.

ML (Machine Learning):

Is the speed or interactivity of the ML computations important?

If yes, then most likely one needs to move beyond Spark. Spark, as far as numerical, iterative, shared-nothing platforms go, is about the slowest platform there is. Practically everything else that exists for that purpose (except for the Hadoop variety of MapReduce of course), has far better strong scaling properties than Spark; even in the free software realm.

There are two main problems with Spark that prevent it from being on par with the segment leaders for performance: (a) fine-grained, centralized, heavyweight task scheduling, and (b) the lack of an efficient multicast programming model (as in MPI or GraphLab). This is illustrated, for example, here: Large Scale Machine Learning and Other Animals, and is true for any numerical solution that needs to iterate until convergence (which is almost everything there is). This may be a bit dated, but nothing has changed materially to date in this department.

This is especially bad with “super-step” architectures (i.e. GraphX vs. GraphLab).

If no, the described performance issues may be acceptable in your case upon evaluation; but keep in mind that there are much (much!) faster things around that may collapse your hardware requirements by potentially 100x for a solution of equivalent volume and accuracy.

Are you designing your own ML algorithms?

If yes, you also probably need to move beyond Spark. Semantically, better choices for tensor math include BIDMat/BIDMach, SystemML and Mahout Samsara.

If you are going to do off-the-shelf work only, then maybe staying with Spark is fine iff your needs are completely covered by what exists today in MLlib or any other math that runs on Spark (there is more than one choice, by now).

Both ML + ETL:

Realistically, the only reason to switch to Spark rather than anything else for any decision branch that includes an ML component is when operational aspects trump performance requirements.

I.e., if you can say, “I am OK with my thing running for hours on dozens of nodes, while it really computes on a couple of machines in the same time elsewhere, but I have a single machine cluster deployment to care about for all of ETL, batch analytics, and slow-ish, mostly non-interactive ML”, then yes. Switch to Spark.

In either case, ML on the Hadoop variety of MapReduce is dead. It has been for at least the past 3 years.

Streaming maybe? Intra-day metric aggregators? Some online inference even?

Yes, sure: Spark. But it may be worth looking at Flink, since it is said to be a “true streaming” platform, whereas Spark is said to be a micro-batch emulation of streaming. In the end, this distinction may not be significant enough to you, though. (Disclosure: I do not have first-hand experience with Apache Flink.)
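The difference between the two models can be sketched in plain Python. This is a toy illustration, not the Spark Streaming or Flink API: the event data, the one-second interval, and the function name are assumptions. Micro-batching collects events into small time windows and processes each window as a unit, while record-at-a-time processing updates its result on every single event.

```python
from collections import defaultdict
from itertools import accumulate

def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    # Emit the non-empty windows in time order.
    return [batches[key] for key in sorted(batches)]

# Hypothetical click counts arriving at the given times (in seconds).
events = [(0.2, 3), (0.9, 1), (1.1, 4), (2.5, 2), (2.7, 5)]

# Micro-batch style: one result per window.
batch_totals = [sum(batch) for batch in micro_batches(events, interval=1.0)]

# Record-at-a-time style: the running total is updated on every event.
running_totals = list(accumulate(value for _, value in events))

print(batch_totals)    # [4, 4, 7]
print(running_totals)  # [3, 4, 8, 10, 15]
```

With micro-batches, a result is only available once its window closes, so the interval sets a floor on latency; whether that matters depends on the use case.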

That said, as others have mentioned, even if you switch to Spark, you do away with the MR ecosystem (MapReduce, Pig, Hive, etc.) but not with some form of Hadoop HDFS substrate, which means you’d still most likely be heavily married to one of the Hadoop distributions out there.

So, the correct statement is that MapReduce is being replaced by Spark, not Hadoop as a whole. If you are planning to learn Hadoop or build a future in it, please go ahead; it is not dying or dead. Big data is growing by leaps and bounds; it’s time to switch.

For more information on Big Data consulting, write to me at atul@nexcen.in

 

The Digital World

The digital world is expected to grow 10 times in the next 2 years, from 5B connected devices to 50B connected devices and 4T GB of data to 40T GB, by 2020. This 10X leap in such a short span of time is a first of its kind technology transformation. Digital technologies today have already started to change the fundamental behavior of people, customers, industries, companies, products and services. More than 50 percent of high performing companies have been experimenting and learning from early deployments of technological advancements.

With fast-developing digital solutions come many challenges in ensuring quality deployment. Test strategies and approaches, from development to deployment, need to be reconsidered. Solutions are evolving through experimentation, and yet we are compelled to make rapid deployments to grab competitive advantage. Solutions involve multiple technologies, tools, interfaces, and industry standards. Complexity is growing with the involvement of a larger number of ecosystem players; these players are still evolving, and some of them did not need to interact closely before and are collaborating and defining roles for the first time. A compelling user experience demands an innumerable number of use cases in complex systems, an ability to adapt to newer interfaces and technologies that will continue to emerge, and so on. These open up tremendous opportunities for us to bring in innovative test services.

Big Data and Cloud Computing are two of the few emerging trends and technologies in the digital world.

  1. Big Data Analytics is here to address the 5 Vs: Velocity (Speed), Volume, Value, Variety (Structured, Unstructured, Semi-structured), and Veracity (Conformity & Relevance). High-speed internet is becoming more accessible as broadband prices fall; AI and Augmented Intelligence with Machine Learning are gaining wider acceptance across the enterprise IT world; apprehensions about data security are being dispelled; and the low cost of speed and storage is resulting in high volumes of data. All of these lead to growing demand for Big Data. Read more at NexCen Blogs here: https://nexcenblog.wordpress.com/2018/01/28/big-data-revolution/ or contact us at consulting@nexcen.in for a delightful experience transforming big data into valuable information. Read our Big Data service offerings here https://nexcen.in/bigdata.php.
  2. Cloud Computing: Simply put, cloud computing is the delivery of computing services (servers, storage, databases, networking, software, analytics, and more) over the Internet (“the cloud”) on a pay-as-you-use basis. Companies offering these computing services are called cloud providers, and they charge for cloud computing services based on usage. Apart from trading capital expense for variable expense, you can benefit from the economies of scale of hundreds of thousands of customers aggregated in the cloud. IT resources are only ever a click away, which means you reduce the time it takes to make those resources available to your developers from weeks to just minutes. This results in a dramatic increase in agility for the organization, since the cost and time it takes to experiment and develop is significantly lower. You can go global in minutes by deploying your application in multiple regions around the world with just a few clicks. Contact us at consulting@nexcen.in for our Cloud Computing service offerings or visit https://nexcen.in/cloud.php.

Big Data Revolution

As we welcome the new year 2018, I clearly see the emerging trends leading to Big Data. I call it a revolution because it has changed the whole perspective and paradigm for looking at and treating data. High-speed internet is reaching the masses through FTTH (Fiber To The Home) technology as broadband prices fall; AI and Augmented Intelligence with Machine Learning are gaining wider acceptance across the enterprise IT world; apprehensions about data security are being dispelled; and the low cost of speed and storage is resulting in high volumes of data. All of these lead to growing demand for Big Data.

Having started with the 4 Vs (Volume, Velocity, Variety, and Veracity), Big Data has come a long way and is still evolving fast, adding more and more tools and techniques to its ecosystem. The misconceptions and myths about Big Data’s limitations and usage are fast fading, and the world is gearing up to embrace it wholeheartedly.

Let’s talk about the myths about limitations of Big Data:

Myth 1:  Hadoop is only for batch processing. In fact, it does provide real-time analytics and interacts well with other real-time big data tools like Spark.

Myth 2:  Data security is not enough. Big Data has come a long way, and the ecosystem has built enterprise-grade security onto Hadoop platforms including Cloudera, Hortonworks, MapR, etc. There are now excellent data governance capabilities.

Myth 3:  Big Data is for unstructured data only. Big Data is widely used for unstructured data, providing alternative strategies and solutions to store huge volumes of it and parallel processing to reduce storage and access time, but it is equally good for structured data.

In fact, many of these myths are limited to MapReduce usage, but as Big Data evolves, plenty of alternative technologies and tools have become part of the ecosystem. While MapReduce, a Java-based tool, is powerful enough to chomp Big Data and flexible enough to allow for good progress doing so, the coding is anything but easy. Given below are a few effective and popular alternatives to MapReduce:

  1. Pig Latin, originally developed by Yahoo to maximize productivity and accommodate complex procedural data flows, eventually became an Apache project. It has characteristics that resemble both scripting languages (like Python and Perl) and SQL. In fact, many of the operations look like SQL: load, sort, aggregate, group, join, etc. It just isn’t as limited as SQL. Pig allows for input from multiple databases and output into a single data set.
  2. Hive also looks a lot like SQL at first glance. It accepts SQL-like statements and uses those statements to output Java MapReduce code. It requires little in the way of actual programming, so it’s a useful tool for teams that don’t have high-level Java skills or have fewer programmers with which to produce code. Initially developed by the folks at Facebook, Hive is now an Apache project.
  3. Spark has been widely hailed as the end of MapReduce. Born in the AMPLab at the University of California, Berkeley, Spark replaces the execution framework of MapReduce entirely, unlike Pig and Hive, which are merely programming interfaces for that execution framework. Spark makes effective use of memory and resources and is claimed to be up to 100 times faster than MapReduce. Spark also provides many features, including stream processing, data transfer, fast fault recovery, optimized scheduling, and a lot more.

These alternatives have effectively reduced an organization’s dependency on Java.

If you are a late entrant into the Big Data space, one benefit is that you won’t have to wade through all of the platforms and products that came and went during the early years.

Initially there was just the Hadoop Distributed File System (HDFS); then came MapReduce, YARN, and a plethora of other products, some of which blossomed and became mature parts of the Hadoop ecosystem. Others petered out or are still puttering around wondering what they’re going to be when they grow up.

Today, there are quite a number of Big Data products and platforms to pick from to assemble an infrastructure that meets your needs.

Given below are a few significant players that have carved out a niche in the Big Data space:

Tez:  A generalized data-flow programming framework, built on Hadoop YARN, Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

Hive: A data warehouse infrastructure that provides data summarization and SQL support for querying data in HDFS. For structured data only; best for BI apps. It can be used independently as well.

Pig Latin: A high-level data flow language and shell for querying HDFS. Best used for ETL. It can be used independently as well.

HBase: A scalable, distributed NoSQL database that supports structured data storage for large tables. It is fast but can have downtime. Best used as a NoSQL database. It can be used independently as well.

Cassandra: A scalable, multi-master NoSQL database with no single point of failure. Reads and writes can be slow, but it is highly scalable and highly available. Best used for large or sensitive NoSQL databases. It can be used independently as well.

Mahout: A scalable machine learning and data mining library; an ML programming framework.

Oozie: A workflow scheduler system used for managing Apache Hadoop jobs.

Sqoop: An import/export utility for moving data between Hadoop and various databases.

Flume: A robust, mature, and proven tool for streaming data, used to import/export real-time data.

Hue: Hadoop User Experience, a web interface that supports Hadoop and its ecosystem.

Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. The framework can be used independently as well.

Scala: The language in which Spark is written. It can be used independently as well. Through Spark SQL, Spark can run unmodified Hive queries on an existing Hadoop deployment, as an alternative to HQL.

Spark Streaming: A Spark module for performing streaming analytics. It enables analytical and interactive apps for live streaming data and is a faster alternative to Flume.

MLlib: A Spark module for machine learning; ML libraries built on top of Spark. It is a replacement for Mahout from Hadoop and can be used independently as well.

GraphX: A graph computation engine that combines data-parallel and graph-parallel concepts.

SparkR: A package for the R language that lets R users leverage Spark’s power from the R shell and access Spark data from R.

PySpark: A shell for accessing Spark data from Python.

Julia: An analytical/mathematical language used for data modeling and visualization. It can be used independently as well, for structured and unstructured data.

R: An analytical/mathematical language used for data modeling and visualization. It can be used independently as well, for structured data only.

Python: A general-purpose language that can be used for data modeling. It can be used independently as well, for structured and unstructured data.

While Hive, Pig Latin, HBase, Cassandra, Mahout, Oozie, Sqoop, Flume, and Hue are Hadoop components, tools like R, Python, and Julia originated independently and are now part of the ecosystem.

For more information contact me at atul@nexcen.in.

Big Data Analytics

Big Data Analytics is fast becoming a reality and a necessity for many sectors. The definition of big data holds the key to understanding big data analysis. According to the Gartner IT Glossary, Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Several factors drive the need for Big Data analytics. The first is the high volume of data: sensor and machine-generated data, networks, social media, and much more. Enterprises are awash with terabytes and, increasingly, petabytes of big data. As infrastructure improves along with storage technology, it has become easier for enterprises to store more data than ever before. The second is that data is unstructured and unsystematic. Big data extends beyond structured data such as numbers, dates, and strings to include unstructured data such as text, video, audio, click streams, 3D data, and log files. The more sources that data is collected from, the more variety will be found within data assets. The third is the speed needed to process huge volumes of data. The pace at which data streams in from sources such as mobile devices, click-streams, high-frequency stock trading, and machine-to-machine processes is massive and continuously fast moving. The faster that pace becomes, the more data can be analyzed for discovering new insights.

Like conventional analytics and business intelligence solutions, big data mining and analytics helps uncover hidden patterns, unknown correlations, and other useful business information. However, big data tools can analyze high-volume, high-velocity, and high-variety information assets far better than conventional tools and relational databases that struggle to capture, manage, and process big data within a tolerable elapsed time and at an acceptable total cost of ownership.

 Big Data Analytics is here now

Big data and big data tools offer many benefits. The main business advantages of big data generally fall into one of three categories: cost savings, competitive advantage, or new business opportunities.

Cost Savings

Big data tools like Hadoop allow businesses to store massive volumes of data at a much cheaper price tag than a traditional database. Companies utilizing big data tools for this benefit typically use Hadoop clusters to augment their current data warehouse, storing long-term data in Hadoop rather than expanding the data warehouse. Data is then moved from Hadoop to the traditional database for production and analysis as needed. Versatile big data tools can also function as multiple tools at once, saving organizations on the cost of needing to purchase more tools for the same tasks.

Competitive Advantage

According to a survey of 540 enterprise decision makers involved in big data purchases by Webopedia’s parent company QuinStreet, about half of all respondents said they were applying big data and analytics to improve customer retention, help with product development, and gain a competitive advantage. One of the major advantages of big data analytics is that it gives businesses access to data that was previously unavailable or difficult to access. With increased access to data sources such as social media streams and clickstream data, businesses can better target their marketing efforts to customers, better predict demand for a certain product, and adapt marketing and advertising messaging in real-time. With these advantages, businesses are able to gain an edge on their competitors and act more quickly and decisively when compared to what rival organizations do. Needless to say, a business that effectively utilizes big data analytics tools will be much better prepared for the future than one that doesn’t understand how important those tools are.

New Business Opportunities

The final benefit of big data analytics tools is the possibility of exploring new business opportunities. Entrepreneurs have taken advantage of big data technology to offer new services in AdTech and MarketingTech. Mature companies can also take advantage of the data they collect to offer add-on services or to create new product segments that offer additional value to their current customers. In addition to those benefits, big data analytics can pinpoint new or potential audiences that have yet to be tapped by the enterprise. Finding whole new customer segments can lead to tremendous new value.

These are just a few of the actionable insights made possible by available big data analytics tools. Whether an organization is looking to boost sales and marketing results, uncover new revenue opportunities, improve customer service, optimize operational efficiency, reduce risk, improve security, or drive other business results, big data insights can help.

Use cases for big data analysis

Big data analytics lends itself well to a large variety of use cases spread across multiple industries. Financial institutions can quickly find that big data analysis is adept at identifying fraud before it becomes widespread, preventing further damage. Governments have turned to big data analytics to increase their security and combat outside cyber threats. The healthcare industry uses big data to improve patient care and discover better ways to manage resources and personnel. Telecommunications companies and others utilize big data analytics to prevent customer churn while also planning the best ways to optimize new and existing wireless networks. Marketers have quite a few ways they can use big data. One involves sentiment analysis, where marketers can collect data on how customers feel about certain products and services by analyzing what consumers post on social media sites like Facebook and Twitter.

Use cases are plentiful, and no industry should think that analytics couldn’t be used in some way to improve its business. That type of versatility is part of what has made big data so popular. And these are only a few examples of use cases. As companies and other organizations become more familiar with all of the capabilities granted through big data analytics, more use cases will likely be discovered, adding to big data’s overall value. As with any developing technology, the process may take some time, but eventually its widespread use will lead to the discovery of even more benefits and uses.

Top Big Data Tools Overview:

Apache Hadoop

Hadoop is an open source software framework originally developed by Doug Cutting and Mike Cafarella in 2006. It was specifically built to handle very large data sets. Hadoop is made up of two main parts: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is the storage component of Hadoop. Hadoop stores data by splitting files into large blocks and distributing them across nodes. MapReduce is the processing engine of Hadoop. Hadoop processes data by delivering code to the nodes, which process it in parallel.

Apache Spark

Apache Spark is quickly growing as a data analytics tool. It is an open source framework for cluster computing. Spark is frequently used as an alternative to Hadoop’s MapReduce because it is able to analyze data up to 100 times faster for certain applications. Common use cases for Apache Spark include streaming data, machine learning and interactive analysis.

Apache Hive

Apache Hive is a SQL-on-Hadoop data processing engine. Apache Hive excels at batch processing of ETL jobs and SQL queries. Hive utilizes a query language called HiveQL. HiveQL is based on SQL, but does not strictly follow the SQL-92 standard.

NoSQL Databases

NoSQL databases have grown in popularity. These “Not Only SQL” databases are not bound by traditional schema models, which allows them to store unstructured datasets. The flexibility of NoSQL databases such as MongoDB, Cassandra, and HBase makes them a popular option for big data analytics.

Column-oriented databases

Traditional, row-oriented databases are excellent for online transaction processing with high update speeds, but they fall short on query performance as the data volumes grow and as data becomes more unstructured. Column-oriented databases store data with a focus on columns, instead of rows, allowing for huge data compression and very fast query times. The downside to these databases is that they will generally only allow batch updates, having a much slower update time than traditional models.
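The difference between the two layouts can be sketched in a few lines of Python. The sample records and helper names below are made up for illustration, and run-length encoding is just one simple compression scheme chosen to show why storing similar values together compresses well; it is not a description of how any particular database works.

```python
# Row-oriented layout: each record is stored together (hypothetical sample data).
rows = [
    {"id": 1, "city": "Pune", "amount": 120},
    {"id": 2, "city": "Pune", "amount": 80},
    {"id": 3, "city": "Delhi", "amount": 200},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate touches only the one column it needs.
total_amount = sum(columns["amount"])

# Repeated values sitting next to each other compress well.
city_rle = run_length_encode(columns["city"])

print(total_amount)
print(city_rle)
```

The flip side is visible too: adding one new record means appending to every column list, which hints at why updates tend to be slower in this layout.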

Schema-less databases, or NoSQL databases

There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases, such as read-write consistency, in exchange for scalability and distributed processing.

MapReduce

This is a programming paradigm that allows job execution to scale massively across thousands of servers or clusters of servers. Any MapReduce implementation consists of two tasks:

  • The “Map” task, where an input dataset is converted into a different set of key/value pairs, or tuples;
  • The “Reduce” task, where several of the outputs of the “Map” task are combined to form a reduced set of tuples (hence the name).
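The two tasks above can be sketched on a single machine with a classic word count. This is a toy illustration in plain Python, not Hadoop code: the function names and sample lines are assumptions, and the grouping step in the middle stands in for the shuffle that a real framework performs between the two phases.

```python
from collections import defaultdict


def map_task(line):
    """Map: turn one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key so each key reaches a single reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; in a cluster each line could be handled on a different node.
lines = ["big data big insights", "data lakes store data"]

mapped = [pair for line in lines for pair in map_task(line)]
counts = dict(reduce_task(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Since each call to map_task depends only on its own line, and each call to reduce_task only on its own key, both phases can be spread over many machines.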

Hadoop

Hadoop is by far the most popular implementation of MapReduce, being an entirely open source platform for handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.

Hive

Hive is a “SQL-like” bridge that allows conventional BI applications to run queries against a Hadoop cluster. It was developed originally by Facebook, but has been made open source for some time now, and it’s a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users.

PIG

PIG is another bridge that tries to bring Hadoop closer to the realities of developers and business users, similar to Hive. Unlike Hive, however, PIG uses a “Perl-like” scripting language (Pig Latin) rather than a “SQL-like” one to run queries over data stored on a Hadoop cluster. PIG was developed by Yahoo!, and, just like Hive, has also been made fully open source.

WibiData

WibiData is a combination of web analytics with Hadoop, being built on top of HBase, which is itself a database layer on top of Hadoop. It allows web sites to better explore and work with their user data, enabling real-time responses to user behavior, such as serving personalized content, recommendations and decisions.

PLATFORA

Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation of MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing and running jobs, a full cycle can take hours, eliminating the interactivity that users enjoyed with conventional databases. PLATFORA is a platform that turns users’ queries into Hadoop jobs automatically, thus creating an abstraction layer that anyone can exploit to simplify and organize datasets stored in Hadoop.

Storage Technologies

As the data volumes grow, so does the need for efficient and effective storage techniques. The main evolutions in this space are related to data compression and storage virtualization.

SkyTree

SkyTree is a high-performance machine learning and data analytics platform focused specifically on handling Big Data. Machine learning, in turn, is an essential part of Big Data, since the massive data volumes make manual exploration, or even conventional automated exploration methods, unfeasible or too expensive.

Big Data in the cloud

As we can see, most of these technologies are closely associated with the cloud. Most cloud vendors already offer hosted Hadoop clusters that can be scaled on demand according to their users’ needs. Also, many of the products and platforms mentioned are either entirely cloud-based or have cloud versions themselves.

Big Data and cloud computing go hand-in-hand. Cloud computing enables companies of all sizes to get more value from their data than ever before, by enabling blazing-fast analytics at a fraction of previous costs. This, in turn, drives companies to acquire and store even more data, creating more need for processing power and driving a virtuous circle.

What is big data in the cloud?

Taking big data to the cloud offers up a number of advantages. Improvements come in the form of better performance, targeted cloud optimizations, more reliability, and greater value. Big data in the cloud gives businesses the type of organizational scale many are searching for. This allows many users, sometimes in the hundreds, to query data while only being overseen by a single administrator. That means little supervision is required.

Big data in the cloud also allows organizations to scale quickly and easily. This scaling is done according to the customer’s workload. If more clusters are needed, the cloud can give them the extra boost. During times of less activity, everything can be scaled down. This added flexibility is particularly valuable for companies that experience varying peak times. Big data in the cloud also takes advantage of the benefits of cloud infrastructure, whether they be from Amazon Web Services, Microsoft Azure, Google Cloud Platform, or others.
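Workload-based scaling can be pictured as a simple rule that maps pending work to a cluster size. The sketch below is a hypothetical rule written for illustration only; the function name, its parameters and the limits are assumptions, and it does not describe the autoscaling logic of any particular cloud provider.

```python
import math


def desired_nodes(pending_tasks, tasks_per_node, min_nodes=1, max_nodes=10):
    """Return how many nodes to run for the current backlog (assumed policy).

    Scale up when there is more work than the cluster can absorb,
    scale down when it is idle, and always stay within the given bounds.
    """
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical backlog sizes at quiet, normal and peak times.
for backlog in (0, 17, 1000):
    print(backlog, "pending tasks ->", desired_nodes(backlog, tasks_per_node=4), "nodes")
```

The lower bound keeps a minimal cluster available during quiet periods, while the upper bound caps cost during peaks.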

What are data lakes?

Gathering data from various sources is, of course, only one part of the big data analytics process. All that data needs to be stored somewhere, and that repository is often referred to as a data lake. Data lakes are where data is kept in its raw form, before any organizational structure is applied and before any analytics are performed. Data lakes don’t use the traditional structure of files or folders but rather a flat architecture where each element has its own identifier, making it easy to find when queried. Data lakes are a type of object storage that Hadoop uses, which makes the term an effective way to describe where Hadoop-supported platforms pull their data from. One major benefit of having a data lake is the ability to store massive amounts of data. As big data continues to grow, the need for that near limitless storage capability has grown with it. Data lakes also allow for added processing power while providing the ability to handle numerous jobs at the same time. These are all capabilities that have been increasingly in demand as more enterprises use big data analytics tools.
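The flat, identifier-based layout described above can be sketched as a tiny in-memory object store. Everything here is hypothetical and for illustration only: the class name, its methods and the metadata fields are assumptions, and a real data lake would sit on distributed storage rather than a Python dictionary.

```python
import uuid


class FlatObjectStore:
    """Toy data lake: raw objects in one flat namespace, each with its own ID."""

    def __init__(self):
        self._objects = {}

    def put(self, raw, **metadata):
        """Store raw bytes unchanged and return a newly generated identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, metadata)
        return object_id

    def get(self, object_id):
        """Fetch the raw bytes for one identifier."""
        return self._objects[object_id][0]

    def find(self, **criteria):
        """Return the IDs of all objects whose metadata matches every criterion."""
        return [
            object_id
            for object_id, (_, metadata) in self._objects.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        ]


lake = FlatObjectStore()

# Data of different shapes goes in as-is; no schema is imposed up front.
log_id = lake.put(b'{"event": "click"}', source="web", fmt="json")
csv_id = lake.put(b"id,amount\n1,120\n", source="billing", fmt="csv")

print(lake.find(source="billing") == [csv_id])
print(lake.get(log_id))
```

There are no folders to walk in this sketch: an object is reached directly through its identifier or located through its metadata, and structure is applied only later, when the data is read for analysis.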

How does NexCen support big data analytics?

From simple spreadsheets to advanced analytics and marketing solutions to analytics engines, NexCen’s self-service big data platform provides effortless integration so you can analyze your data centrally, all in one place.

Spreadsheets and Analytics Tools: Through ODBC connectors, NexCen customers can connect to Microsoft Excel and tools from leading analytics vendors such as Tableau, Qlik, MicroStrategy, and TIBCO Jaspersoft. In addition, the R statistical programming language can be integrated through ODBC/REST APIs.

Analytics Engines: NexCen offers connectors for massively parallel processing databases such as Vertica as well as relational database engines such as Microsoft SQL Server and the MySQL open source database, and NoSQL databases such as MongoDB.

CRM and Online Marketing Solutions: NexCen also connects to leading CRM and online marketing platforms such as Salesforce.com and online marketing and web analytics solutions such as Omniture and Google Analytics.

NexCen ways of Processing Big Data

NexCen offers a unified interface for the many use cases and workloads that a data-driven organization will face, including ad hoc analysis, predictive analysis, machine learning, streaming and MapReduce, to name a few. Users without software development skills can leverage the QDS workbench through our SmartQuery interface without even knowing how to write a SQL query.

Workbench

The NexCen Workbench enables data scientists and analysts to drive their ad hoc workloads through their processing engines of choice using an easy-to-use SQL query composer or SmartQuery builder tool.

Custom Applications

Developers can configure their applications to drive various workloads using a number of common language options.

BI & Visualization Systems

Drive QDS workloads using industry-leading, ODBC-compliant BI and visualization tools, including Tableau, Birst, Qlik, and Pentaho.

Looking for more information on how to be effective with big data analytics tools? Write to us at consulting@nexcen.in