Cloud War Intensifying!

While choosing between private, hybrid and public cloud has become easier now that plenty of awareness and information is available to address security apprehensions about cloud storage, choosing the right cloud partner has become the bigger question.

In the public cloud computing market, three big vendors dominate: Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP). When it comes to infrastructure as a service (IaaS) and platform as a service (PaaS), these three have a huge lead over the rest of the market. However, each vendor’s approach is largely driven by its strengths and background. Amazon was the first mover among the three and hence has a wealth of experience in managing large volumes of data. Microsoft, being a technology giant, has vast experience in computing. Google, relatively new to the cloud space, draws its strength from data and analytics. So you will have to weigh your options against their strengths and offerings, and you may have to settle for a combination of providers depending on your organization’s goals and long-term objectives.

We’ll analyze the three giants in three major categories: Features, Implementation and Pricing.

 

Features

AWS, GCP and Azure all use different terminology, codes and nomenclature to define their cloud products, and each has its own way of categorizing the different elements. Hence it’s important to define your organization’s requirements in terms of current and future cloud capacity and data needs. Once you have a clear strategy in place, you can start working with cloud partners to devise the best solution for your organization.

AWS

AWS solutions span a wide range of categories, including:

  • Websites
  • Backup and Recovery
  • Archiving
  • Disaster Recovery
  • DevOps
  • Big Data

Azure

Microsoft came late to the cloud market but gave itself a jump start by essentially taking its on-premises software (Windows Server, Office, SQL Server, SharePoint, Dynamics, Active Directory, .NET and others) to the cloud.

Azure is tightly integrated with these other applications, so existing Microsoft customers and enterprises that use a lot of Microsoft software find it logical to go for Azure. There are also significant discounts on service contracts for existing Microsoft enterprise customers.

Azure operates by number of users

Azure boasts the largest number of certifications from industry leaders, along with engagements with large organizations. Azure also offers competitive pricing on committed usage.

Microsoft believes that these certifications, and the security they attest to, can persuade organizations to place their trust in it. Like AWS, Azure provides an enormous array of features, and it adds value by providing certain capabilities based on the number of users.

GCP (Google Cloud Platform)

Lastly, there is GCP (Google Cloud Platform), which, while not the longest-established cloud provider, has entered enterprise computing with a bang. Google offers future-proof infrastructure with multiple regions and zones across the globe, very strong data and analytics capabilities and, moreover, a serverless, “just code” approach.

Google Cloud Platform has three major constituents:

  1. Data Center
  2. POP : Points Of Presence
  3. GGC: Google Global Cache

Implementation

AWS

AWS provides a simple getting-started page (https://aws.amazon.com/developers/getting-started/) for its services. The services are categorized by platform, and sample code is provided to begin the integration.

Azure

With developers and operations teams in mind, Azure also makes it simple for users to get started with its services through detailed guides.

Microsoft Azure ranks highest in development and testing tools, with excellent links to Microsoft on-premises systems such as Windows Server, System Center and Active Directory. It also has strong PaaS capabilities; its downsides are outages and limited support for other platforms. Azure offers Virtual Machines, Cloud Services and Resource Manager for app development and auto-scaling.

GCP

Like AWS and Azure, GCP also provides getting-started documentation and lists some useful benefits.

Pricing

The best part about cloud pricing is that all three providers price competitively, as they fight to migrate as many workloads as possible to their clouds. This war has commoditized the space to a great extent, which benefits users, while the providers get recurring revenue in return. Open, transparent, pay-per-use pricing, billed per workload and by the minute, is particularly helpful to small and medium business users.

Also, as an organization you just have to register your activities, and the cloud will adjust accordingly.
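To make the per-minute, pay-per-use point concrete, here is a minimal sketch of the arithmetic. The hourly rate and the usage figure are purely hypothetical and are not taken from any provider's price list; the function names are made up for illustration.

```python
import math

def cost_billed_per_minute(minutes_used, hourly_rate):
    """Pay only for the minutes actually used."""
    return round(minutes_used * hourly_rate / 60, 4)

def cost_billed_per_hour(minutes_used, hourly_rate):
    """Older-style billing: every started hour is charged in full."""
    return round(math.ceil(minutes_used / 60) * hourly_rate, 4)

if __name__ == "__main__":
    # Hypothetical workload: 95 minutes on a machine priced at 0.12 per hour.
    print(cost_billed_per_minute(95, 0.12))
    print(cost_billed_per_hour(95, 0.12))
```

For short or bursty workloads the gap between the two billing styles is what makes fine-grained billing attractive to smaller users.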

AWS

The three-tier storage pricing model of Amazon Web Services is very helpful if you just need to put some data in the cloud. However, when it comes to storing 50 TB to 500 TB, the price difference isn’t that large, and the choice comes down more to feature differences. So Amazon is a good fit for large databases.

Helpful Resources

Total cost of ownership (TCO) is important in building a business case and in getting a better estimate of what is needed to meet organisational needs.

Azure

Azure also provides a breakdown of the various pricing scenarios. When it comes to moving applications into the cloud, its pricing is more aggressive than that of Amazon and Google, owing to Microsoft’s desire to lead this segment of the cloud.

Helpful Resources

For Azure, the TCO calculator answers the following questions:

Would you like to lower the total cost of ownership of your on-premises infrastructure?

What are the estimated cost savings of migrating application workloads to Microsoft Azure?

GCP

Being a late entrant, Google has a pricing model that attempts to beat or go head-to-head with its core competitors, billing based on exact usage. Google also offers a $300 credit to anyone starting with GCP.

Here are some of their core pricing values:

PROS & CONS

AWS

  • Pros: Massive scope of operations. Has a huge and growing array of available services with a comprehensive network of worldwide data centers.
  • Cons: Costly, with an ambiguous cost structure.

Azure

  • Pros: Tightly integrated with Microsoft applications, so existing Microsoft customers and enterprises that use a lot of Microsoft software find it logical to go for Azure. There are significant discounts on service contracts for existing Microsoft enterprise customers.
  • Cons: Some customers find that Microsoft Azure’s service experience feels less enterprise-ready than they expected. Customers cite issues with technical support, documentation, training and the breadth of the ISV partner ecosystem. Also, Azure doesn’t offer as much support for DevOps approaches, such as integrated automation, as some of the other cloud platforms.

GCP

  • Pros: Google has a strong offering in containers, since Google developed the Kubernetes standard that AWS and Azure now offer. GCP specializes in high-compute offerings like Big Data, analytics and machine learning. It also offers considerable scale and load balancing; Google knows data centers and fast response times. Google offers deep discounts and flexible contracts, with a commitment to open source and DevOps expertise.
  • Cons: Google entered late and is a distant third in market share. It doesn’t offer as many different services and features as AWS and Azure, although it is quickly expanding, and GCP is increasingly chosen as a strategic alternative to AWS by customers whose businesses compete with Amazon or who are more open-source-centric or DevOps-centric.

AWS, Azure, Google: Vendor Pages

The following are links to AWS’s, Azure’s and Google’s own pages about a variety of tools, from compute to storage to advanced tools:

Category | AWS | Azure | GCP
Regions | Global Infrastructure | Regions | Regions and Zones
Pricing | Cloud Services Pricing | Pricing | Pricing
Basic Compute | EC2 | Virtual Machines | Compute Engine
Containers | ECS, EKS | AKS, Container Instances | Kubernetes Engine
Serverless | Lambda | Functions | Cloud Functions
App Hosting | Elastic Beanstalk | App Service, Service Fabric, Cloud Services | App Engine
Batch Processing | Batch | Batch | N/A
Object Storage | S3 | Blob Storage | Cloud Storage
Block Storage | EBS | N/A | Persistent Disk
File Storage | EFS | File Storage | N/A
Hybrid Storage | Storage Gateway | StorSimple | N/A
Offline Data Transfer | Snowball, Snowball Edge, Snowmobile | N/A | Transfer Appliance
Relational/SQL Database | RDS, Aurora | SQL Database, Database for MySQL, Database for PostgreSQL | Cloud SQL, Cloud Spanner
NoSQL Database | DynamoDB | Cosmos DB, Table Storage | Cloud Bigtable, Cloud Datastore
In-Memory Database | Elasticache | Redis Cache | N/A
Archive/Backup | Glacier | Backup | N/A
Disaster Recovery | N/A | Site Recovery | N/A
Machine Learning | SageMaker, AML, Apache MXNet on AWS, TensorFlow on AWS | Machine Learning | Cloud Machine Learning Engine
Cognitive Services | Comprehend, Lex, Polly, Rekognition, Translate, Transcribe | Cognitive Services | Cloud Natural Language, Cloud Speech API, Cloud Translation API, Cloud Video Intelligence
IoT | IoT Core | IoT Hub, IoT Edge | Cloud IoT Core
Networking | Direct Connect | Virtual Network | Cloud Interconnect, Network Service Tiers
Content Delivery | CloudFront | CDN | Cloud CDN
Big Data Analytics | Athena, EMR, Kinesis | HDInsight, Stream Analytics, Data Lake Analytics, Analysis Services | Cloud Dataflow, Cloud Dataproc
Authentication and Access Management | IAM, Directory Service, Organizations, Single Sign-On | Active Directory, Multi-Factor Authentication | Cloud IAM, Cloud IAP
Security | GuardDuty, Macie, Shield, WAF | Security Center | Cloud DLP, Cloud Security Scanner
Application Lifecycle Management | CodeStar, CodePipeline | Visual Studio Team Services, Visual Studio App Center | N/A
Cloud Monitoring | CloudWatch, CloudTrail | Monitor, Log Analytics | Stackdriver
Cloud Management | Systems Manager, Management Console | Portal, Policy, Cost Management | Stackdriver
AR & VR | Sumerian | N/A | N/A
Virtual Private Cloud | VPC | N/A | Virtual Private Cloud
Training | Training and Certification | Training | Training Programs
Support | Support | Support | Support
3rd Party Software and Services | Marketplace | Marketplace | Cloud Launcher, Partner Directory

So there you have it: features, implementation, pricing, pros and cons, and useful links, a fairly lengthy blog for a serious read and consideration.

While you may not easily come to a conclusion, at least you will hopefully have the knowledge to make a balanced decision.

In general, focus on the priorities that provide the most value to your organization.

Unfortunately, we often see organisations that are so committed to AWS, for example, that they fail to recognize possibly more economical and efficient alternatives. The cloud tiers do vary greatly, and it’s good to compare them against your requirements.

Well, that’s our summary. Make sure to let us know below which of AWS, Azure and Google Cloud wins in your mind. For more information write to me at atul@nexcenglobal.com

All the best!

Is Hadoop Obsolete?

Let us first understand what Hadoop, MapReduce, a Big Data framework and Spark are.

Hadoop is an open-source framework for writing applications that process structured, semi-structured or unstructured data stored in HDFS. Essentially, Hadoop is a distributed data infrastructure: it distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don’t need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was previously possible.
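For readers new to the MapReduce model that Hadoop uses for processing, the following is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to a word count. It is plain Python, not the Hadoop API; the function names and sample documents are made up for illustration, and on a real cluster each phase would run in parallel across nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    mapped = []
    # On a cluster, each document (or file split) would be mapped on a different node.
    for document in documents:
        mapped.extend(map_phase(document))
    return reduce_phase(shuffle_phase(mapped))

if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```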

Hadoop is a scalable, fault-tolerant framework written in Java. It is not only a storage system but a platform for large-scale data storage as well as processing.

Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn’t do distributed storage.

Apache Spark is an open-source Big Data framework. It is a faster, more general-purpose data processing engine, designed for fast computation. It covers a wide range of workloads such as batch, interactive, iterative and streaming.

So essentially Spark is an in-memory cluster processing framework, while MapReduce involves multiple reads and writes on disk.
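The difference between the two styles can be sketched in plain Python. This is only an illustration of the idea, not how either engine is implemented: the `step` function is a hypothetical processing stage, and the disk-based variant simply counts how many times intermediate data is written to or read from a temporary file.

```python
import json
import os
import tempfile

def step(values):
    """One hypothetical processing stage: add 1 to every value."""
    return [v + 1 for v in values]

def run_disk_based(values, iterations):
    """MapReduce-style: every stage reads its input from disk and writes its output back."""
    disk_ops = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage.json")
        with open(path, "w") as f:
            json.dump(values, f)          # initial input lands on disk
        disk_ops += 1
        for _ in range(iterations):
            with open(path) as f:
                data = json.load(f)       # read the previous stage's output
            disk_ops += 1
            with open(path, "w") as f:
                json.dump(step(data), f)  # write this stage's output
            disk_ops += 1
        with open(path) as f:
            result = json.load(f)         # final read of the result
        disk_ops += 1
    return result, disk_ops

def run_in_memory(values, iterations):
    """Spark-style: intermediate results stay in memory between stages."""
    data = list(values)
    for _ in range(iterations):
        data = step(data)
    return data, 0

if __name__ == "__main__":
    print(run_disk_based([1, 2, 3], 3))
    print(run_in_memory([1, 2, 3], 3))
```

Both variants produce the same result; the disk-based one pays two extra disk operations per iteration, which is the overhead the in-memory model avoids.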

Spark performs better than Hadoop across:

  1. data sizes ranging from GBs to PBs
  2. varying algorithmic complexity, from ETL to SQL to machine learning
  3. job types ranging from low-latency streaming to long batch jobs
  4. data processing on any storage medium, be it disks, SSDs or memory

But if the size of the data is small (~100 MB) Hadoop can sometimes be faster when performing mapping in the data nodes.

Hadoop is used for batch processing, whereas Spark can be used for both batch and real-time processing. Hadoop users can therefore use MapReduce tasks wherever batch processing is required. In theory, Spark can do everything that Hadoop can and more, so choosing between Hadoop and Spark becomes a matter of comfort.

Let’s do a point-by-point comparison:

1.    Speed:

Apache Spark – Spark is a lightning-fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop. It achieves this by reducing the number of read/write cycles to disk and storing intermediate data in memory.

Hadoop MapReduce – MapReduce reads from and writes to disk, which slows down processing.

2.    Difficulty:

Apache Spark – Spark is easy to program as it has tons of high-level operators with RDD – Resilient Distributed Dataset.

Hadoop MapReduce – In MapReduce, developers need to hand-code each and every operation, which makes it very difficult to work with.

3.    Easy to Manage

Apache Spark – Spark is capable of performing batch, interactive, machine learning and streaming workloads all in the same cluster, which makes it a complete data analytics engine. There is no need to manage a different component for each need; installing Spark on a cluster is enough to handle all the requirements.

Hadoop MapReduce – MapReduce only provides a batch engine, so we depend on different engines (for example Storm, Giraph and Impala) for other requirements. Managing that many components is very difficult.

4.    Real-time analysis

Apache Spark – It can process real-time data, i.e. data coming from real-time event streams at the rate of millions of events per second, such as Twitter data or Facebook sharing/posting. Spark’s strength is its ability to process live streams efficiently.

Hadoop MapReduce – MapReduce falls short when it comes to real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.

5.    Fault tolerance

Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart the application from scratch in case of any failure.

Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so there is no need to restart the application from scratch in case of any failure.

6.    Security

Apache Spark – Spark is a little less secure than MapReduce because it supports authentication only through a shared secret (password authentication).

Hadoop MapReduce – Apache Hadoop MapReduce is more secure because it supports Kerberos, as well as Access Control Lists (ACLs) and a traditional file permission model.

Hadoop Structure

Hadoop is a Big Data tool which has three major layers:

  1. HDFS – Storage Layer
  2. MapReduce – Processing Layer
  3. YARN – Resource Management Layer (Yet Another Resource Negotiator)

Spark, on the other hand, is another processing layer. It uses HDFS as its storage layer and YARN for resource management. Spark does not have its own storage layer and depends on a third party, i.e. HDFS.

Hadoop is still the backbone of Big Data and in good demand.

Spark is the next stage in this evolution. The fundamental thinking is that fine-grained mutable state is a very low-level abstraction to use as a building block for ML algorithms; hence Spark was an attempt to raise this abstraction to coarse-grained immutable data called RDDs (Resilient Distributed Datasets).

Since HDFS never really supported multiple-writer concurrent appends anyway, it follows that RDDs are not giving up much by being immutable, whereas you gain a lot by having both immutability and a higher level of abstraction to begin with for big data.
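The idea of coarse-grained, immutable datasets can be illustrated with a toy class. This is a sketch only, not Spark's RDD implementation or API: `ToyRDD` is a hypothetical name, and the "lineage" here is simply the list of functions applied so far. Each transformation returns a new object and leaves the old one untouched, and the data can always be recomputed by replaying the lineage over the original source, which is the intuition behind rebuilding lost data after a failure.

```python
class ToyRDD:
    """A toy stand-in for an RDD: immutable source data plus the lineage to rebuild results."""

    def __init__(self, source, transforms=()):
        self._source = tuple(source)          # original input, never modified
        self._transforms = tuple(transforms)  # lineage: functions applied in order

    def map(self, fn):
        # A transformation returns a NEW ToyRDD; the current one is left untouched.
        return ToyRDD(self._source, self._transforms + (fn,))

    def collect(self):
        # Replay the lineage over the source to (re)compute the data on demand.
        data = self._source
        for fn in self._transforms:
            data = tuple(fn(x) for x in data)
        return list(data)

if __name__ == "__main__":
    base = ToyRDD([1, 2, 3])
    doubled = base.map(lambda x: x * 2)
    print(base.collect(), doubled.collect())
```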

Use-cases where Spark fits best:

Real-Time Big Data Analysis:

Real-time data analysis means processing data generated by real-time event streams coming in at the rate of millions of events per second, Twitter data for instance. The strength of Spark lies in its ability to support streaming of data along with distributed processing. This is a useful combination that delivers near real-time processing of data. MapReduce lacks this advantage, as it was designed to perform batch and distributed processing on large amounts of data. Real-time data can still be processed on MapReduce, but its speed is nowhere close to that of Spark.

Spark claims to process data 100x faster than MapReduce in memory, and 10x faster on disk.

Graph Processing:

Most graph processing algorithms, such as PageRank, perform multiple iterations over the same data, and this requires a message-passing mechanism. We need to program MapReduce explicitly to handle such multiple iterations over the same data. Roughly, it works like this: read data from the disk and, after a particular iteration, write the results to HDFS, then read the data from HDFS again for the next iteration. This is very inefficient, since it involves reading and writing data to disk, which means heavy I/O operations and data replication across the cluster for fault tolerance. Also, each MapReduce iteration has very high latency, and the next iteration can begin only after the previous job has completely finished.
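To show why such algorithms are iteration-heavy, here is a minimal single-machine PageRank sketch in plain Python. It is not GraphX or a MapReduce job; the damping factor of 0.85, the iteration count and the tiny three-page graph are assumptions chosen for illustration, and every linked page is assumed to appear as a key in the input. Each pass reuses the ranks from the previous pass, which is exactly the data a disk-based engine would have to write out and read back between iterations.

```python
def pagerank(links, iterations=20, damping=0.85):
    """Iterative PageRank over a dict of {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts the round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Pass this page's rank to its neighbours in equal shares.
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks  # the next iteration reads what this one produced
    return ranks

if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```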

Also, message passing requires scores of neighboring nodes in order to evaluate the score of a particular node. These computations need messages from its neighbors (or data across multiple stages of the job), a mechanism that MapReduce lacks. Different graph processing tools such as Pregel and GraphLab were designed in order to address the need for an efficient platform for graph processing algorithms. These tools are fast and scalable, but are not efficient for creation and post-processing of these complex multi-stage algorithms.

The introduction of Apache Spark solved these problems to a great extent. Spark contains a graph computation library called GraphX, which simplifies our life. In-memory computation along with built-in graph support improves the performance of the algorithm by one or two orders of magnitude over traditional MapReduce programs. Spark uses a combination of Netty and Akka for distributing messages throughout the executors. Let’s look at some statistics that depict the performance of the PageRank algorithm using Hadoop and Spark.

Iterative Machine Learning Algorithms:

Almost all machine learning algorithms work iteratively. As we have seen earlier, iterative algorithms involve I/O bottlenecks in MapReduce implementations. MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavy for iterative algorithms. Spark, with the help of Mesos (a distributed systems kernel), caches the intermediate dataset after each iteration and runs multiple iterations on this cached dataset, which reduces I/O and helps run the algorithm faster in a fault-tolerant manner.

Spark has a built-in scalable machine learning library called MLlib, which contains high-quality algorithms that leverage iterations and yield better results than the one-pass approximations sometimes used on MapReduce.

Fast data processing. As we know, Spark allows in-memory processing. As a result, Spark is up to 100 times faster for data in RAM and up to 10 times for data in storage.

Iterative processing. Spark’s RDDs allow performing several map operations in memory, with no need to write interim data sets to a disk.

Near real-time processing. Spark is an excellent tool for providing immediate business insights, which is why it is used in credit card streaming systems.

It should be borne in mind that this answer leans heavily on the perspective of fast, speed-of-thought analytics and ML. Also, I will only consider scenarios in which I have actually observed people trying to use Spark.

ETL+wrangling:

Do you have engineers who are capable of writing decent and, most importantly, legible Scala? If not, probably stay with Hadoop, unless there are other, non-technical goals (like a renewed marketing message).

Do you run, by any chance, interactive, selective, speed-of-thought queries (think OLAP)?

If yes, then no, Spark is not a spatial indexer. It can do it, but it will be locked to a “full table scan” solution, which means  it will do it, on average, at a 1000x higher cost than necessary, with the QPS 1000x lower than actually is feasible, for any average pivoting UI scenario.
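The cost gap between a full scan and an index lookup for a selective query can be sketched as follows. This is plain Python with a hash index on a single column, which is far simpler than the spatial index discussed above; the table contents and function names are hypothetical. The point is only to count how many rows each approach has to touch.

```python
def full_scan(rows, key, wanted):
    """Examine every row, keeping the ones that match."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row[key] == wanted:
            matches.append(row)
    return matches, examined

def build_index(rows, key):
    """Build a simple hash index once: column value -> rows with that value."""
    index = {}
    for row in rows:
        index.setdefault(row[key], []).append(row)
    return index

def indexed_lookup(index, wanted):
    """Touch only the rows that actually match."""
    matches = index.get(wanted, [])
    return matches, len(matches)

if __name__ == "__main__":
    # Hypothetical table: 1000 rows spread evenly over 100 cities.
    table = [{"city": f"c{i % 100}", "sales": i} for i in range(1000)]
    print(full_scan(table, "city", "c7")[1])
    print(indexed_lookup(build_index(table, "city"), "c7")[1])
```

The more selective the query, the larger the ratio between rows scanned and rows actually needed, which is why interactive pivoting workloads suffer on a scan-only engine.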

Oh, but then, there are no true distributed, “big data” MOLAP solutions with a good distributed spatial index scanning in OSS domain today (no, e.g. Impala does not qualify as spatial scan engine), so… maybe; but try commercial vendors perhaps instead. What you gain in software license costs, you most likely will lose on hardware and programming effort tenfold, if you do not.

Either way, MapReduce cannot do it either.

If you only run queries that are always best optimized with a “full table scan” (i.e. low-selectivity queries), or do ETL-type things only, sure, go ahead. Spark SQL is pretty good for that and is fairly easy.

ML (Machine Learning):

Is the speed or interactivity of the ML computations important?

If yes, then most likely one needs to move beyond Spark. Spark, as far as numerical, iterative, shared-nothing platforms go, is about the slowest platform there is. Practically everything else that exists for that purpose (except for the Hadoop variety of MapReduce of course), has far better strong scaling properties than Spark; even in the free software realm.

There are two main problems with Spark that prevent it from being on par with the segment leaders for performance: (a) fine-grained, centralized, heavyweight task scheduling, and (b) the lack of an efficient multicast programming model (as in MPI, GraphLab). This is illustrated, for example, here: Large Scale Machine Learning and Other Animals, and is true for any numerical solution that needs to iterate till convergence (which is almost everything there is). This may be a bit dated, but nothing has changed materially to date in this department.

This is especially bad with “super-step” architectures (i.e. GraphX vs. GraphLab).

If no, the described performance issues may be acceptable in your case upon evaluation; but keep in mind that there are much (much!) faster things around that may shrink your hardware requirements by potentially 100x for a solution of equivalent volume and accuracy.

Are you designing your own ML algorithms?

If yes, you also probably need to move beyond Spark. Semantically, better choices for tensor math include BidMat/BidMach, SystemML and Mahout Samsara.

If you are going to do off-the-shelf work only, then maybe staying with Spark is fine iff your needs are completely covered by what exists today in MLlib or any other math that runs on Spark (there is more than one choice, by now).

Both ML + ETL:

Realistically, the only reason to switch to Spark rather than anything else, for any decision branch that includes an ML component, is when operational aspects trump performance requirements.

I.e. if you can say, “I am OK with my thing running for hours on dozens of nodes, while it really computes on a couple of machines in the same time elsewhere, but I have a single machine cluster deployment to care about for all of ETL, batch analytics, and slow-ish, mostly non-interactive ML”, then yes. Switch to Spark.

In either case, ML on the Hadoop variety of MapReduce is dead. It has been for at least the past 3 years.

Streaming maybe? Intra-day metric aggregators? Some online inference even?

Yes, sure. Spark; but it may be worth looking at Flink, since it is said to be a “true streaming” platform, whereas Spark is said to be a micro-batch emulation of streaming. In the end this distinction may not be significant enough to you, though. (Disclosure: I do not have first-hand experience with Apache Flink.)
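The distinction between per-event streaming and micro-batching can be sketched in plain Python. This is neither Flink's nor Spark's API; the event tuples of `(timestamp_in_seconds, value)` and the one-second batch interval are assumptions for illustration. Both functions maintain a running total, but the per-event version emits a result as each event arrives, while the micro-batch version emits one result per time window.

```python
def process_per_event(events):
    """'True streaming' style: update and emit the running total for every event."""
    total = 0
    outputs = []
    for _, value in events:
        total += value
        outputs.append(total)
    return outputs

def process_micro_batches(events, batch_seconds):
    """Micro-batch style: group events into fixed time windows, emit once per window."""
    batches = {}
    for timestamp, value in events:
        window = int(timestamp // batch_seconds)
        batches.setdefault(window, []).append(value)
    total = 0
    outputs = []
    for window in sorted(batches):
        total += sum(batches[window])
        outputs.append(total)
    return outputs

if __name__ == "__main__":
    stream = [(0.1, 5), (0.4, 3), (1.2, 2), (2.7, 4)]
    print(process_per_event(stream))           # four results, one per event
    print(process_micro_batches(stream, 1.0))  # three results, one per window
```

The final totals agree; what differs is latency, since a micro-batch result is only available once its window closes.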

That said, as others have mentioned, even if you switch to Spark, you do away with the MR ecosystem (MapReduce, Pig, Hive etc.) but not with some form of a Hadoop HDFS substrate. This means you’d still most likely be heavily married to one of the Hadoop distributions out there.

So, the correct statement is that MapReduce, not Hadoop as a whole, is being replaced by Spark. If you are planning to learn Hadoop or build a career in it, please go ahead; it is not dying or dead. Big data is growing by leaps and bounds, and it’s time to make the switch. Read also: The Digital World

For more information on Big Data consulting write to me at atul@nexcen.in

 

The Digital World

The digital world is expected to grow 10 times in the next 2 years, from 5B connected devices to 50B connected devices and 4T GB of data to 40T GB, by 2020. This 10X leap in such a short span of time is a first of its kind technology transformation. Digital technologies today have already started to change the fundamental behavior of people, customers, industries, companies, products and services. More than 50 percent of high performing companies have been experimenting and learning from early deployments of technological advancements.

With digital solutions developing so fast, the challenges of ensuring quality deployment of these solutions are also many. Test strategies and approaches, from development through deployment, need to be revisited. Solutions are still evolving through experimentation, and yet we are compelled to make rapid deployments to grab competitive advantage. Solutions involve multiple technologies, tools, interfaces and industry standards. Complexity is growing with the involvement of a larger number of ecosystem players; these players are still evolving, and some of them did not have the need to interact closely before and are collaborating and defining roles for the first time. A compelling user experience demands an innumerable number of use cases in complex systems, an ability to adapt to newer interfaces and technologies that will continue to emerge, and so on. These open up tremendous opportunities for us to bring in innovative test services.

Big Data and Cloud Computing are two of the few emerging trends and technologies in the digital world.

  1. Big Data Analytics is here to address the 5 Vs: Velocity (Speed), Volume, Value, Variety (Structured, Unstructured, Semi-structured) and Veracity (Conformity & Relevance). Several trends are driving the growing demand for Big Data: high-speed internet is becoming more widely available as broadband prices fall; AI, augmented intelligence and machine learning are gaining wider acceptance across the enterprise IT world; apprehensions about data security are being dispelled; and the low cost of high-speed connectivity and storage is producing high volumes of data. Read more at NexCen Blogs here: https://nexcenblog.wordpress.com/2018/01/28/big-data-revolution/ or contact us at consulting@nexcen.in for a delightful experience transforming big data into valuable information. Read our Big Data service offerings here https://nexcen.in/bigdata.php.
  2. Cloud Computing: Simply put, cloud computing is the delivery of computing services (servers, storage, databases, networking, software, analytics and more) over the Internet (“the cloud”) on a pay-as-you-use basis. Companies offering these computing services are called cloud providers, and they charge for cloud computing services based on usage. Apart from trading capital expense for variable expense, you benefit from the economies of scale of hundreds of thousands of customers aggregated in the cloud. IT resources are only ever a click away, which means you reduce the time it takes to make those resources available to your developers from weeks to just minutes. This results in a dramatic increase in agility for the organization, since the cost and time it takes to experiment and develop are significantly lower. You can go global in minutes by deploying your application in multiple regions around the world with just a few clicks. Contact us at consulting@nexcen.in for our Cloud Computing service offerings or visit here https://nexcen.in/cloud.php.

Big Data Revolution

As we welcome the new year 2018, I clearly see the emerging trends leading to Big Data. I call it a revolution because it has changed the whole perspective and paradigm of how we look at and treat data. High-speed internet is reaching the masses through FTTH (Fiber To The Home) technology, coupled with falling prices of broadband data connections; AI, augmented intelligence and machine learning are gaining wider acceptance across the enterprise IT world; apprehensions about data security are being dispelled; and the low cost of high-speed connectivity and storage is producing high volumes of data. All of these lead to growing demand for Big Data.

Having started with the 4 Vs (Volume, Velocity, Variety and Veracity), Big Data has come a long way and is still evolving fast, adding more and more tools and techniques to its ecosystem. The misconceptions and myths about Big Data’s limitations and usage are fast fading, and the world is gearing up to embrace it wholeheartedly.

Let’s talk about the myths about limitations of Big Data:

Myth 1:  Hadoop is only for batch processing. The fact is that it does provide real-time analytics and interacts well with other real-time big data tools, such as the Scala-based Spark.

Myth 2:  Data security is not enough. Big Data has come a long way, and the ecosystem has built enterprise-grade security onto Hadoop platforms including Cloudera, Hortonworks, MapR etc. There are now excellent data governance capabilities.

Myth 3:  Big Data is for unstructured data only. Big Data is widely used for unstructured data, providing alternative strategies and solutions to store huge volumes of it and parallel processing to reduce storage and access time, but it is equally good for structured data.

In fact, many of these myths are limited to MapReduce usage, but as Big Data evolves, plenty of alternative technologies and tools have become part of the ecosystem. While MapReduce, a Java-based tool, is powerful enough to chomp through Big Data and flexible enough to allow for good progress doing so, the coding is anything but easy. Below are a few effective and popular alternatives to MapReduce:

  1. Pig Latin, originally developed by Yahoo to maximize productivity and accommodate a complex procedural data flow, eventually became an Apache project, and has characteristics that resemble both scripting languages (like Python and Perl) and SQL. In fact, many of the operations look like SQL: load, sort, aggregate, group, join, etc. It just isn’t as limited as SQL. Pig allows for input from multiple databases and output into a single data set.
  2. Hive also looks a lot like SQL at first glance. It accepts SQL-like statements and uses those statements to output Java MapReduce code. It requires little in the way of actual programming, so it’s a useful tool for teams that don’t have high-level Java skills or have fewer programmers with which to produce code. Initially developed by the folks at Facebook, Hive is now an Apache project.
  3. Spark has been widely hailed as the end of MapReduce. Born in the AMPLab at the University of California, Berkeley, Spark replaces the execution framework of MapReduce entirely, unlike Pig and Hive, which are merely programming interfaces for the execution framework. Spark makes very effective use of memory and resources and is claimed to be almost 100 times faster than MapReduce. Spark also provides many features, including stream processing, data transfer, fast fault recovery, optimized scheduling and a lot more.

These alternatives have effectively reduced an organization’s dependency on Java.
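To make the comparison concrete, here is a minimal pure-Python sketch of the map, shuffle and reduce steps that tools like Pig and Hive hide behind a few SQL-like statements. It is not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample lines are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases one after another on a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


# Hypothetical sample input.
sample_lines = ["big data is big", "data is everywhere"]
counts = word_count(sample_lines)
```

In a SQL-like tool the same aggregation is typically a single grouping statement, which is where the productivity gain over hand-written Java comes from.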

If you are a late entrant into the Big Data space, one benefit is that you won’t have to wade through all of the platforms and products that came and went during the early years.

Initially there was just the Hadoop Distributed File System (HDFS); then came MapReduce, YARN and a plethora of other products, some of which blossomed and became mature parts of the Hadoop ecosystem. Others petered out or are still puttering around wondering what they’re going to be when they grow up.

Today, there are quite a number of Big Data products and platforms to pick from to assemble an infrastructure that meets your needs.

Below are a few significant players that have carved out a niche in the Big Data space:

Tez:  A generalized data-flow programming framework, built on Hadoop YARN, Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

Hive: A data warehouse infrastructure that provides data summarization and SQL-style querying of data in HDFS. It is suited to structured data only and works best for BI applications. It can be used independently as well.

Pig Latin: A high-level data-flow language and shell for querying HDFS. It can be used independently as well. Best used for ETL.

HBase: A scalable, distributed NoSQL database that supports structured data storage for large tables. It can be used independently as well. It is fast but can have downtime. Best used as a NoSQL database.

Cassandra: A scalable, multi-master NoSQL database with no single point of failure. It can be used independently as well. Its limitation is relatively slow reads and writes, but it is highly scalable and highly available. Best used for large or sensitive NoSQL databases.

Mahout: A scalable machine learning and data mining library; an ML programming framework.

Oozie: A workflow scheduler system used for managing Apache Hadoop jobs.

Sqoop: An import/export utility for moving data between Hadoop and various databases.

Flume: A robust, mature and proven tool for streaming data. Used to import/export real-time data.

Hue: Hadoop User Experience. It is a web interface that supports Hadoop and its ecosystem.

Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. The framework can be used independently as well.

Scala: The language in which Spark is written; it can be used independently as well. Through Spark SQL, unmodified Hive queries can run on an existing Hadoop deployment, providing an alternative to HQL.

Spark Streaming: A Spark module for streaming analytics. It enables analytical and interactive apps for live streaming data and is a faster alternative to Flume.

MLlib: A Spark module for machine learning, with ML libraries built on top of Spark. It is an ML programming framework that can be used independently as well, and a replacement for Hadoop’s Mahout.

GraphX: A graph computation engine that combines data-parallel and graph-parallel concepts.

SparkR: An R package that lets R users leverage the power of Spark and access Spark data from the R shell.

PySpark: A shell for accessing Spark data from Python.

Julia: An analytical/mathematical language used for data modeling and visualization. It can be used independently as well, for structured and unstructured data.

R: An analytical/mathematical language used for data modeling and visualization. It can be used independently as well, for structured data only.

Python: A general-purpose language that can be used for data modeling. It can be used independently as well, for structured and unstructured data.

While Hive, Pig Latin, HBase, Cassandra, Mahout, Oozie, Sqoop, Flume and Hue are Hadoop components, tools like R, Python and Julia originated independently and are now part of the ecosystem.

For more information contact me at atul@nexcen.in.

Automation To Succeed

Why HR/Payroll/Accounting must be automated

Process (HR/Payroll/Accounting) Automation

Behind every successful business is an effective Human Resource department. HR ensures that employees’ goals are aligned with the organization’s goals, maximizing the output of the company’s most important asset: its people!

HR is responsible for hiring the right candidates, managing employee details, tracking time, handling employee benefits, measuring employee performance, planning succession and much more. For a small or medium-sized business, deploying HR software for these activities might seem expensive. In reality, it’s not! Rather, it’s a great accelerator. Automating your HR can help you save time and money in a lot of ways. It also helps you focus your energies on your core business activities.

An automated payroll system enables the employer to process its payroll through a computerized system, making payroll processing simpler, defect-free, and faster.

Accounting involves not just preparing statutory reports, ledgers, the Trial Balance, the P&L Statement or the Balance Sheet but also filing all the requisite forms and returns with the various authorities on time.

Human Resource, Payroll and Accounting systems are automated, integrated software products designed specifically to manage the complex Human Resource and Accounting activities of an organization. Together they take care of the company’s entire Human Capital Management, automating HRMS and Accounting functions such as staffing, tests, induction, training, appraisals, attendance and leave management, payroll, accounting, tax filing, and communication with candidates and employees.

HRIS:

Get all the employee information at your fingertips. You can query details such as the employee name, address, telephone numbers, email address, branch, department, designation, qualifications, experience, birth date, marital status, blood group, allergies, serious illnesses, and other emergency information. So when you want a particular piece of work done, you know exactly whom to get in touch with, in seconds. It can generate HR & Payroll MIS reports, with approximately 150 reports readily available. It also has a powerful search engine and report designers in all modules, so you can create any number of additional reports.

Attendance:

Managing employee attendance is easy since any report can be exported to an MDB file or a text file. These files can then be imported into spreadsheets for graphical reports and analysis. Overtime and benefits can be easily calculated. Late arrivals and early departures can be factored into the salary.

Leave:

Employees can see their leave status and apply online. Leave authorization can be done online as well. Attendance entry and shift details are fully automated, so manual calculation of leave is no longer required. All you have to do is enter the from date and to date for the leave taken and its type, such as casual leave (CL) or sick leave (SL). The software automatically calculates the opening balance of each type of leave due at the beginning of the period.
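The leave calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the opening balances are hypothetical, the helper names (`leave_days`, `apply_leave`) are made up for this example, and the count uses plain calendar days without excluding weekends or holidays, which a real HR system would normally handle.

```python
from datetime import date

# Hypothetical opening balances per leave type (CL = casual leave, SL = sick leave).
opening_balance = {"CL": 12, "SL": 8}


def leave_days(from_date, to_date):
    """Count calendar days in the leave period, both endpoints included."""
    if to_date < from_date:
        raise ValueError("to_date must not be earlier than from_date")
    return (to_date - from_date).days + 1


def apply_leave(balances, leave_type, from_date, to_date):
    """Return a new balance dict after deducting the requested leave."""
    days = leave_days(from_date, to_date)
    if days > balances[leave_type]:
        raise ValueError(f"insufficient {leave_type} balance")
    updated = dict(balances)  # copy so the opening balance stays untouched
    updated[leave_type] -= days
    return updated


# Three days of casual leave, entered only as a from date and a to date.
closing_balance = apply_leave(opening_balance, "CL", date(2024, 3, 4), date(2024, 3, 6))
```

The closing balance of one period can then be carried forward as the opening balance of the next.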

Payroll:

Payroll is processed in a few easy steps, yet the module is in-depth and capable of handling complicated conditions. TDS calculation is accurate and compliant with statutory requirements. The system calculates increments and arrears, and generates all payroll and statutory reports, such as the salary slip, salary register, IT estimation report, Form 16, e-Form 24Q, and PF / PT / ESIC related reports.

Appraisal:

Customize appraisals for each position whenever required. The solution cuts down the time taken for appraisals and saves on printing and courier costs. Weightages can be assigned to each KRA, and the resulting recommendations can help identify top performers.
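As a rough illustration of how weightages on KRAs turn into a single appraisal score, consider the sketch below. The KRA names, the weightages, the 1 to 5 rating scale and the employee labels are all hypothetical and not taken from any particular product.

```python
# Hypothetical KRAs with weightages (must sum to 1.0); ratings use a 1-5 scale.
kra_weights = {"delivery": 0.5, "quality": 0.3, "teamwork": 0.2}


def weighted_score(weights, ratings):
    """Combine per-KRA ratings into one score using the assigned weightages."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weightages must sum to 1.0")
    return sum(weights[kra] * ratings[kra] for kra in weights)


ratings_by_employee = {
    "employee_a": {"delivery": 4, "quality": 5, "teamwork": 3},
    "employee_b": {"delivery": 3, "quality": 4, "teamwork": 5},
}

# Score every employee, then pick the highest scorer as a recommendation.
scores = {name: weighted_score(kra_weights, r) for name, r in ratings_by_employee.items()}
top_performer = max(scores, key=scores.get)
```

Changing the weightages per position is what lets the same rating form emphasize different priorities for different roles.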

Manual systems are slow and error-prone because all the data must be tracked by hand. Automation not only makes the processes simpler, faster and better managed but also greatly reduces the possibility of defects. However, it’s important to maintain the accuracy of the input. For example, if a terminated employee is due severance pay but the payroll representative neglects to make the entry, the system will not pay it. Typically, the system is reliable so long as the entries are correct.

You can derive the following major benefits from automation:

  1. HR Automation: Automating HR allows you to free up your human resources workers. This doesn’t mean that you won’t need them anymore – automating human resources just means that they will actually be able to do their job much more efficiently. An HR manager who wastes time looking through time-log spreadsheets, files or emails might end up doing nothing else but that! An average manager spends more than 3 hours a week just sorting out employee timesheets. Automation tools help in tracking and calculating employee time, running it through approvals and sending it to payroll or billing. This can also be done in real-time and by configuring multi-level approvals. Time is not just saved, but is made productive. For example, instead of having to spend hours or even full days keying in payroll information, a few simple steps will be all that they have to do in order to complete the payroll process thanks to automating human resources. Instead of mindlessly typing numbers into the system they can focus on other efforts that benefit your company, like designing new HR strategies that can lead your company into the future.
  2. Productivity Boost: Once you begin automating human resources you’ll see an increase in both productivity and profit. Not only are your HR employees free to get more done thanks to automating human resources, they’ll also be more motivated to work. When people feel like they’re really accomplishing goals, they have higher morale. And since they’ll get more done in the same amount of time, you’ll get more for your money. Plus, automated human resources processes are much more precise.
  3. Precision: Precision makes a huge difference. If you take the steps for automating human resources you’ll notice big differences in payroll, attendance, and benefit info. And along with precision comes simplicity. Automating human resources like employee benefits, for example, makes the process of managing and checking health insurance or vacation days as simple as making a few clicks with a mouse. No pestering the HR department, no need to pull files or check forms. All of the information is easily accessible to anybody who needs it thanks to automating human resources.
  4. Make your first impression count: Recent research indicates that an employee decides to stay or leave an organization in the first 90 days. In Fortune 500 companies alone it has been estimated that close to 50% of the outside hires quit in less than 2 years.
  5. Minimize Attrition: Onboarding ensures minimal attrition. However, it includes many forms, induction programs, salary contracts, IT system allocations and new hire training. With automation, you can structure workflows to trigger multiple actions. For instance, automatically send out requests for IT equipment, ID cards and provide employees access to directly add or edit their personal data. This streamlines new hire onboarding and reduces the time taken to induct a new employee.
  6. Security: Security is another benefit of automating processes, and probably the most important. When automating human resources you can choose to back up your data to online servers, ensuring that in a fire or computer failure you won’t lose years of important information. You’ll also get security for your company, since human resources errors can lead to tax issues, legal troubles, and unnecessary expenses. The rate of error drops tremendously when automating human resources, and you’ll be able to help your company avoid issues that may arise in the future, in some cases without even realizing it. Automating human resources is an investment that can help lead your company into the future, and one that you shouldn’t ignore. For the best systems for automating human resources you can trust Unicorn HRO. Automating human resources will never be simpler or more effective.
  7. Employee Empowerment: Every employee has different requirements. Some travel and have to apply for travel requests and submit expense reports. Others may contact HR constantly to update their personal and professional information. As an organization grows, more and more employees will require HR assistance. Managing hundreds of changing employee records and other requests manually can be difficult. Nowadays, HR systems let employees manage all these activities themselves. For instance, an employee applies for an internal training directly on the HR system. HR can create a workflow to automatically add the training details to the employee record when the training is completed. This way the employee is empowered and the HR burden is reduced tremendously.
  8. Interact with any third-party system: Automation is key to third-party integration. With the help of APIs (Application Programming Interfaces) and webhooks, information can be easily exchanged with any third-party application. For example, when a travel expense record is approved, an instant notification is sent to the accounting software to process the reimbursement. This is also useful where organizations use more than one system to manage their HR activities.
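The workflow triggers and webhook notifications mentioned in the list above can be pictured with a small event-dispatch sketch. Everything named here (the event names, the `on` and `trigger` helpers, the payload fields) is a hypothetical illustration rather than the API of any HR product, and the sketch only builds the JSON body instead of sending it over the network.

```python
import json
from collections import defaultdict

# Registry mapping an event name to the actions it should trigger.
handlers = defaultdict(list)


def on(event_name):
    """Decorator that registers a function as an action for an event."""
    def register(func):
        handlers[event_name].append(func)
        return func
    return register


def trigger(event_name, record):
    """Run every action registered for the event and collect the results."""
    return [action(record) for action in handlers[event_name]]


@on("expense_approved")
def notify_accounting(record):
    # Build the JSON body a webhook call to the accounting system might carry.
    # Actually sending it (for example over HTTP) is left out of this sketch.
    return json.dumps(
        {"event": "expense_approved",
         "employee_id": record["employee_id"],
         "amount": record["amount"]},
        sort_keys=True,
    )


@on("employee_hired")
def request_it_equipment(record):
    return f"IT equipment request for {record['employee_id']}"


@on("employee_hired")
def request_id_card(record):
    return f"ID card request for {record['employee_id']}"


# One hiring event fans out into several onboarding actions.
onboarding_actions = trigger("employee_hired", {"employee_id": "E100"})
webhook_body = trigger("expense_approved", {"employee_id": "E100", "amount": 250.0})[0]
```

The point of the registry is that new actions can be attached to an event without touching the code that raises it.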

Other than these major direct benefits, there are plenty of intangible process benefits, to name a few:

  1. HR, Payroll & Accounting software is completely web-based, which brings tremendous cost benefits. There is no maintenance headache on client PCs and no reinstalling of executables when the package is updated.
  2. Employee / Manager self-service makes approvals fast, without the need to go through multiple pages. Some organizations still manage their leave requests through traditional means: emails, word of mouth and sometimes even sticky notes. Now, what if the managerial hierarchy demands approval at multiple levels? Or HR gets left out of the application and approval emails? The entire process becomes time-consuming, with loads of untraceable requests and approvals. This is why 50% of the global HR workforce today prefers automated time-off management that updates employee records instantly. This can be managed through mobile HR apps as well. Now, that’s automation for you!
  3. All statutory reports and challans for PF, ESIC and income tax are available and fully integrated with e-TDS.
  4. PF Trust, superannuation and gratuity management.
  5. Claim management: very complicated cases of exempted and carry-forwardable claim heads, based on monthly and yearly calculations, can be handled easily.
  6. Salary processing in very easy steps, as it is highly automated.
  7. Highly automated MIS.

Service providers like PayBooks, PenSoft, Z-Pay, Ultipro and Sage Peachtree serve the SME segment, calculating gross-to-net earnings based on the data the payroll representative inputs, while AON Hewitt, KPMG, Accenture, ADP and Mercer give you total outsourcing options for complete HR and Payroll operations. Preeminent Business Solutions (PBS) offers complete HR, Payroll and Accounting services that go beyond the usual HR & Payroll support and handle all your accounting and tax filings as well.

Automating your HR and Accounting departments helps reduce the strain on your HR and accounting personnel, increase productivity and improve employee participation. Companies all over the globe are increasingly adopting HR automation tools to keep their employees ticking and bring them closer to the organizational goals. It’s time you did too.

The Road to Failure

According to industry estimates, more than 29% of ERP implementations fail to achieve even half the planned business benefits. The implementation problems these ERP systems face are driven by the complexity, risk, and integrated nature of the business processes they automate.

Successful ERPs not only automate the business but also handle complex requirements with ease, giving you time and energy to focus on your core strengths and strategic areas. This is increasingly true because ERP software is finally being made for small businesses, with smaller products launched to suit the budgets of small and medium businesses and to emphasize ease of use. Vendors of software such as HRP for HR & Payroll have been making their powerful products suitable for a start-up company’s needs and budget. ERP systems today touch almost every aspect of a company, so whether it is a completely new system or just a major upgrade, here are seven common pitfalls companies can avoid:

  1. Doing it right in the first place.

Even before implementation, a company is often in a dilemma over whether it really needs a new ERP system at all. Often large ERP implementation projects fail before they even start. Companies unhappy with their current system become convinced their reporting, integration, or efficiency problems lie in the software they are using. Convinced the grass is greener on the other side of the fence, they embark on a large, risky, and expensive ERP replacement project, when a simple tune-up of their current system, or a small add-on application, such as a better reporting system or employee portal, would address the problem at a fraction of the cost. Even a reimplementation of the same software is usually less costly than switching to another software vendor.

Once an organization makes the decision to implement a new ERP system, the first step is to have a clear definition of success. Often, lack of consensus on the problems being solved, the outcome desired, or the specific financial justification of the project, leads to challenges later controlling the scope and maintaining executive sponsorship. Having a clear destination means defining the important business processes, financial benefits, and deadlines up front and making certain stakeholders agree how to address them. Without a strong definition of success, the end point becomes a moving target.

  2. Mindset

Many companies, even large and long-established ones, have a dearth of quality ERP project managers, so they put software project managers in charge of ERP implementations. ERP implementation is a different ballgame from software development: the approach, the team composition and the skill set required are all different. Many such managers do not even differentiate between Business Analysts and Functional Consultants. Identifying candidates for a switch to best practices, process mapping, design and redesign, and change management get missed, creating huge trouble in the end and resulting in delays, chaos and failures.

An experienced project manager makes a lot of difference. ERP projects need their own dedicated, experienced project managers. Asking the executive sponsor or the business owner to also manage the project as a part-time adjunct to their main role means neither job will be done well. Not just a scorekeeper, the project manager needs to be an active leader pushing for accountability, transparency, and decisiveness.

  3. Not sticking to the plan

When implementing an ERP solution, a series of activities is a must, starting with business process analysis, identifying process changes that move toward best practices, and identifying and engaging process champions to analyze and design the desired system. All of this gets reflected in the plan along with roles and timelines.

Some businesses don’t want to conduct a business process analysis. This leaves the vendor to guess what features would be most compatible with the company’s unique functions. Without a proper business analysis, it is impossible for the software vendor to know how your business operates. If the vendor doesn’t know how your business operates, it’s impossible for them to give advice or suggest recommended changes best for your business needs and how to appropriately tailor the software. Now is not the time to try to save money by skipping this step.

No matter what vendor you use, the cost of implementing new business software is too great for your company not to use it to its fullest advantage. It’s a great time to seek guidance on the best practices for your business operations and the role your software will play in them. More importantly, skipping the analysis leads to mistakes that delay the project, because the implementer will have to backtrack, and that ultimately costs the company more money.

A detailed plan is essential for successful implementation. All projects start with some kind of plan. However, more often than not, the plan is not realistic, detailed, or specific enough. Companies build a high-level plan with broad assumptions or underestimate the amount of business change involved. Despite how obvious this sounds, it remains the most common mistake companies make. A good plan needs to identify all the requirements and the people who are going to work on them. It needs to be at a level of detail where a knowledgeable person can visualize the work, usually in work blocks of a few days. It needs to have a logical sequence of tasks and to leave time in the schedule for work such as fixing bugs found in test cycles. Until you have a good plan, you really do not know when the project will end or how much it will cost.

  4. Wrong team composition

The most common blunder concerns resource projections. Having a solid understanding of the internal and external resources needed to complete the project is critical. For internal resources, understanding the time commitment needed from business users, typically in the Finance, Accounting, or Human Resources departments, is one of the most commonly underestimated areas. During critical phases of the project, it is often necessary to backfill the majority of transactional employees by bringing in temporary resources. This frees up the users of the new system so they have time for implementation and training. For external resources, having an agreement up-front with your consultants and contractors about the specific duration, skills, and quantity of resources needed is critical.

Too much dependence on consultants can make the internal team redundant. Most ERP implementation projects involve consultants, for the expertise, best practices, and additional resources they bring. While their outside experience is definitely helpful for a project, there is a risk that the company can become over-reliant on the consultants. The company needs to maintain control over the key business decisions, hold the consultants accountable, and have an explicit plan to transfer the knowledge from the consultants to the internal employees when the project is winding down.

Bringing in an outside consultant to help bridge the gap between the executives and the software vendors may sound like a great idea until you realize the consultant knows nothing about your business, making them useless when it comes to giving the software vendor the background information needed to improve the implementation. Again, if you want to get what you’re paying for in the implementation, you’re going to want both team members and executives to contribute as much information as needed for a smooth and efficient implementation.

  5. Change Management

Management shouldn’t hurry to start using the tool without adequate training and hand-holding for users. Today’s modern ERP systems are being used by more and more personnel within a company. Beyond the Finance and Accounting departments, modern systems also cover procurement, supply chain functions, compliance, customer relationships, sales, and much more. If the system includes human resources or expense reporting, then essentially all employees use the system. Training hundreds or thousands of users, to the right depth, at just the right time, is no easy task. Leaving training to a small phase at the end of the project makes it very difficult for users to get the training they need to understand the system and have a positive first impression at the rollout.

If ERP systems are the nervous system of a company, then doing an ERP implementation is like brain surgery: only to be attempted if there is a really good reason and not to soon be repeated. Unfortunately, ERP implementation projects often fall victim to some of the same problems of any large, complex project. However, there are some repeatable problems that good planning early in a project can work to avoid.

  6. Document, document, document

Most ‘developers’ avoid documenting and get down to coding directly. Their managers, too, often push their teams to produce results without proper documentation or sign-offs. QA is not merely cosmetic; it brings the two sides onto the same pitch with the same understanding and surfaces new ideas. Lack of understanding hits you when and where it hurts most, delaying the whole process by leaps and bounds and vitiating the environment. Every step should be systematic, participative, and measurable, backed by a prescribed set of templates, checklists and guidelines.

  7. Can’t get it together

Everybody is entitled to their opinion, and when it comes to what software to implement and how it should be implemented, everybody in the organization has one. The finance department wants the software that is known for having stronger accounting features, but the Sales department wants something with better CRM capabilities. Because of this, it is ideal to make sure all the stakeholders in the company are on the same page and understand the overall company goals. That way they can cohesively decide which software is best for the big picture and the bottom line.

It’s also best to have a Project Manager (PM) to make sure everyone stays informed about the project and keeps the team up-to-date on the status which will include internal tasks, vendor tasks and the timeline.

Would you like to learn more about ERP implementations for small or medium businesses? Write to me at atul@nexcen.in or consulting@nexcenglobal.com

Big Data Analytics

Big Data Analytics is fast becoming a reality and a necessity for many sectors. The definition of big data holds the key to understanding big data analysis. According to the Gartner IT Glossary, Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Several factors drive the need for Big Data analytics. The first is the high volume of data: sensor and machine-generated data, networks, social media, and much more. Enterprises are awash with terabytes and, increasingly, petabytes of big data. As infrastructure improves along with storage technology, it has become easier for enterprises to store more data than ever before. The second is variety: much of the data is unstructured and unsystematic. Big data extends beyond structured data such as numbers, dates, and strings to include unstructured data such as text, video, audio, click streams, 3D data, and log files. The more sources that data is collected from, the more variety will be found within data assets. The third is the speed at which huge volumes of data must be processed. The pace at which data streams in from sources such as mobile devices, click streams, high-frequency stock trading, and machine-to-machine processes is massive and continuously fast moving. The faster that pace becomes, the more data can be analyzed for discovering new insights.

Like conventional analytics and business intelligence solutions, big data mining and analytics help uncover hidden patterns, unknown correlations, and other useful business information. However, big data tools can analyze high-volume, high-velocity, and high-variety information assets far better than conventional tools and relational databases, which struggle to capture, manage, and process big data within a tolerable elapsed time and at an acceptable total cost of ownership.

 Big Data Analytics is here now

Big data and big data tools offer many benefits. The main business advantages of big data generally fall into one of three categories: cost savings, competitive advantage, or new business opportunities.

Cost Savings

Big data tools like Hadoop allow businesses to store massive volumes of data at a much cheaper price tag than a traditional database. Companies utilizing big data tools for this benefit typically use Hadoop clusters to augment their current data warehouse, storing long-term data in Hadoop rather than expanding the data warehouse. Data is then moved from Hadoop to the traditional database for production and analysis as needed. Versatile big data tools can also function as multiple tools at once, saving organizations on the cost of needing to purchase more tools for the same tasks.

Competitive Advantage

According to a survey of 540 enterprise decision makers involved in big data purchases by Webopedia’s parent company QuinStreet, about half of all respondents said they were applying big data and analytics to improve customer retention, help with product development, and gain a competitive advantage. One of the major advantages of big data analytics is that it gives businesses access to data that was previously unavailable or difficult to access. With increased access to data sources such as social media streams and clickstream data, businesses can better target their marketing efforts to customers, better predict demand for a certain product, and adapt marketing and advertising messaging in real-time. With these advantages, businesses are able to gain an edge on their competitors and act more quickly and decisively when compared to what rival organizations do. Needless to say, a business that effectively utilizes big data analytics tools will be much better prepared for the future than one that doesn’t understand how important those tools are.

New Business Opportunities

The final benefit of big data analytics tools is the possibility of exploring new business opportunities. Entrepreneurs have taken advantage of big data technology to offer new services in AdTech and MarketingTech. Mature companies can also take advantage of the data they collect to offer add-on services or to create new product segments that offer additional value to their current customers. In addition to those benefits, big data analytics can pinpoint new or potential audiences that have yet to be tapped by the enterprise. Finding whole new customer segments can lead to tremendous new value.

These are just a few of the actionable insights made possible by available big data analytics tools. Whether an organization is looking to boost sales and marketing results, uncover new revenue opportunities, improve customer service, optimize operational efficiency, reduce risk, improve security, or drive other business results, big data insights can help.

Use cases for big data analysis

Big data analytics lends itself well to a large variety of use cases spread across multiple industries. Financial institutions can quickly find that big data analysis is adept at identifying fraud before it becomes widespread, preventing further damage. Governments have turned to big data analytics to increase their security and combat outside cyber threats. The healthcare industry uses big data to improve patient care and discover better ways to manage resources and personnel. Telecommunications companies and others utilize big data analytics to prevent customer churn while also planning the best ways to optimize new and existing wireless networks. Marketers have quite a few ways they can use big data. One involves sentiment analysis, where marketers can collect data on how customers feel about certain products and services by analyzing what consumers post on social media sites like Facebook and Twitter.

The number of use cases is large, and no industry should assume that analytics couldn’t be used in some way to improve its business. That versatility is part of what has made big data so popular. And these are only a few examples. As companies and other organizations become more familiar with the capabilities of big data analytics, more use cases will likely be discovered, adding to big data’s overall value. As with any developing technology, the process may take some time, but widespread use should eventually surface even more benefits and applications.

Top Big Data Tools Overview:

Apache Hadoop

Hadoop is an open source software framework originally developed by Doug Cutting and Mike Cafarella in 2006. It was built specifically to handle very large data sets. Hadoop is made up of two main parts: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is the storage component: it splits files into large blocks and distributes them across nodes. MapReduce is the processing engine: it delivers code to the nodes so that each one processes its share of the data in parallel.
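To make the block-splitting idea concrete, here is a minimal sketch in Python. It is not HDFS code: the function names, the node names, and the round-robin placement are assumptions made for illustration, and real HDFS placement also replicates each block and takes rack layout into account.

```python
from itertools import cycle

def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (a simplification, see above)."""
    placement = {node: [] for node in nodes}
    for index, node in zip(range(len(blocks)), cycle(nodes)):
        placement[node].append(index)
    return placement

# Toy sizes for readability; real HDFS blocks are far larger.
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])
```

Each node ends up holding only a subset of the file's blocks, which is what lets the processing step run on all nodes at once.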

Apache Spark

Apache Spark is a fast-growing data analytics tool. It is an open source framework for cluster computing. Spark is frequently used as an alternative to Hadoop’s MapReduce because it is able to analyze data up to 100 times faster for certain applications. Common use cases for Apache Spark include streaming data, machine learning and interactive analysis.

Apache Hive

Apache Hive is a SQL-on-Hadoop data processing engine. Apache Hive excels at batch processing of ETL jobs and SQL queries. Hive utilizes a query language called HiveQL. HiveQL is based on SQL, but does not strictly follow the SQL-92 standard.

NoSQL Databases

NoSQL databases have grown in popularity. These “Not Only SQL” databases are not bound by traditional schema models, which allows them to collect unstructured datasets. The flexibility of NoSQL databases such as MongoDB, Cassandra and HBase makes them a popular option for big data analytics.

Column-oriented databases

Traditional, row-oriented databases are excellent for online transaction processing with high update speeds, but they fall short on query performance as the data volumes grow and as data becomes more unstructured. Column-oriented databases store data with a focus on columns, instead of rows, allowing for huge data compression and very fast query times. The downside to these databases is that they will generally only allow batch updates, having a much slower update time than traditional models.
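The difference between the two layouts can be sketched in a few lines of Python. The sample records and helper names below are made up for illustration, and run-length encoding is only one of several compression schemes a columnar engine might use.

```python
from itertools import groupby

# Row-oriented layout: one record per entry, all fields together.
rows = [
    {"id": 1, "country": "IN", "amount": 120},
    {"id": 2, "country": "IN", "amount": 80},
    {"id": 3, "country": "US", "amount": 200},
    {"id": 4, "country": "US", "amount": 50},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

columns = to_columns(rows)
total_amount = sum(columns["amount"])  # an aggregate reads a single column only
encoded_country = run_length_encode(columns["country"])
```

Because similar values sit next to each other in a column, they compress well, and an aggregate query never has to read the columns it does not need. The cost shows up on writes: adding one record means touching every column.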

Schema-less databases, or NoSQL databases

There are several database types that fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases, such as read-write consistency, in exchange for scalability and distributed processing.
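As a rough illustration of a schema-less key-value or document store, the sketch below keeps documents of any shape under a key and spreads keys across shards by hash. The class name, the shard count and the hashing choice are assumptions for this example; it does not reflect how any particular NoSQL product is built, and it ignores replication and consistency entirely.

```python
import zlib

class TinyDocumentStore:
    """Toy in-memory store: documents need not share any fields."""

    def __init__(self, shard_count):
        # Each shard stands in for a separate server.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = document

    def get(self, key):
        return self._shard_for(key).get(key)

store = TinyDocumentStore(shard_count=3)
store.put("user:1", {"name": "Asha", "tags": ["prepaid"]})
store.put("event:9", {"type": "click", "x": 10, "y": 20})  # different fields, no schema change
```

Since every key maps to exactly one shard, adding shards spreads both storage and request load, which is the scalability trade the paragraph above describes.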

MapReduce

This is a programming paradigm that allows for massive job execution scalability against thousands of servers or clusters of servers. Any MapReduce implementation consists of two tasks:

  • The “Map” task, where an input dataset is converted into a different set of key/value pairs, or tuples;
  • The “Reduce” task, where several of the outputs of the “Map” task are combined to form a reduced set of tuples (hence the name).
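The two tasks can be sketched with a word count, the customary first MapReduce example. This is a single-process illustration in Python, not Hadoop code: the function names are hypothetical, and the grouping step in the middle (often called the shuffle) is done here with a plain dictionary rather than across servers.

```python
from collections import defaultdict

def map_task(line):
    """Map: turn one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_task(key, values):
    """Reduce: combine the values for one key into a single tuple."""
    return key, sum(values)

def run_job(lines):
    mapped = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(key, values) for key, values in shuffle(mapped).items())

counts = run_job(["big data tools", "big data in the cloud"])
```

In a real cluster each call to `map_task` and `reduce_task` could run on a different server, which is where the scalability comes from.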

Hadoop

Hadoop is by far the most popular implementation of MapReduce, being an entirely open source platform for handling Big Data. It is flexible enough to be able to work with multiple data sources, either aggregating multiple sources of data in order to do large scale processing, or even reading data from a database in order to run processor-intensive machine learning jobs. It has several different applications, but one of the top use cases is for large volumes of constantly changing data, such as location-based data from weather or traffic sensors, web-based or social media data, or machine-to-machine transactional data.

Hive

Hive is a “SQL-like” bridge that allows conventional BI applications to run queries against a Hadoop cluster. It was developed originally by Facebook, but has been made open source for some time now, and it’s a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store. It amplifies the reach of Hadoop, making it more familiar for BI users.

PIG

PIG is another bridge that tries to bring Hadoop closer to the realities of developers and business users, similar to Hive. Unlike Hive, however, PIG uses a procedural scripting language (Pig Latin), rather than a “SQL-like” language, to run queries over data stored on a Hadoop cluster. PIG was developed by Yahoo!, and, just like Hive, has also been made fully open source.

WibiData

WibiData is a combination of web analytics with Hadoop, being built on top of HBase, which is itself a database layer on top of Hadoop. It allows web sites to better explore and work with their user data, enabling real-time responses to user behavior, such as serving personalized content, recommendations and decisions.

Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation of MapReduce, requiring extensive developer knowledge to operate. Between preparing, testing and running jobs, a full cycle can take hours, eliminating the interactivity that users enjoyed with conventional databases.

PLATFORA

PLATFORA is a platform that turns users’ queries into Hadoop jobs automatically, creating an abstraction layer that anyone can use to simplify and organize datasets stored in Hadoop.

Storage Technologies

As the data volumes grow, so does the need for efficient and effective storage techniques. The main evolutions in this space are related to data compression and storage virtualization.

SkyTree

SkyTree is a high-performance machine learning and data analytics platform focused specifically on handling Big Data. Machine learning, in turn, is an essential part of Big Data, since the massive data volumes make manual exploration, or even conventional automated exploration methods, infeasible or too expensive.

Big Data in the cloud

As we can see, most of these technologies are closely associated with the cloud. Most cloud vendors already offer hosted Hadoop clusters that can be scaled on demand according to their users’ needs. Also, many of the products and platforms mentioned are either entirely cloud-based or have cloud versions themselves.

Big Data and cloud computing go hand-in-hand. Cloud computing enables companies of all sizes to get more value from their data than ever before, by enabling fast analytics at a fraction of previous costs. This, in turn, drives companies to acquire and store even more data, creating more need for processing power and driving a virtuous circle.

What is big data in the cloud?

Taking big data to the cloud offers up a number of advantages. Improvements come in the form of better performance, targeted cloud optimizations, more reliability, and greater value. Big data in the cloud gives businesses the type of organizational scale many are searching for. This allows many users, sometimes in the hundreds, to query data while only being overseen by a single administrator. That means little supervision is required.

Big data in the cloud also allows organizations to scale quickly and easily. This scaling is done according to the customer’s workload. If more clusters are needed, the cloud can give them the extra boost. During times of less activity, everything can be scaled down. This added flexibility is particularly valuable for companies that experience varying peak times. Big data in the cloud also takes advantage of the benefits of cloud infrastructure, whether they be from Amazon Web Services, Microsoft Azure, Google Cloud Platform, or others.
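The scale-up and scale-down behavior can be pictured as a simple sizing rule. The sketch below is an assumption-laden illustration in Python, not any provider's autoscaling API: the function name, the per-node capacity and the node limits are all made up.

```python
import math

def desired_nodes(pending_queries, queries_per_node, min_nodes=1, max_nodes=10):
    """Pick a cluster size for the current workload, within fixed bounds."""
    needed = math.ceil(pending_queries / queries_per_node)
    return max(min_nodes, min(max_nodes, needed))

quiet_period = desired_nodes(pending_queries=0, queries_per_node=10)
normal_load = desired_nodes(pending_queries=35, queries_per_node=10)
peak_load = desired_nodes(pending_queries=500, queries_per_node=10)
```

The lower bound keeps the cluster available during quiet periods, while the upper bound caps cost during peaks.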

What are data lakes?

Gathering data from various sources is, of course, only one part of the big data analytics process. All that data needs to be stored somewhere, and that repository is often referred to as a data lake. Data lakes are where data is kept in its raw form, before any organizational structure is used and before any analytics are performed. Data lakes don’t use the traditional structure of files or folders but rather use a flat architecture where each element has its own identifier, making it easy to find when queried. Data lakes are a type of object storage that Hadoop uses, making it an effective way to describe where Hadoop-supported platforms pull their data from. One major benefit of having a data lake is the ability to store massive amounts of data. As big data continues to grow, the need for that near limitless storage capability has grown with it. Data lakes also allow for added processing power while also providing the ability to handle numerous jobs at the same time. These are all capabilities that have been increasingly in demand as more enterprises use big data analytics tools.
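The flat, identifier-based layout can be sketched as follows. This is a toy in-memory model in Python; the class, its methods and the metadata fields are hypothetical and do not correspond to any specific object storage service.

```python
import uuid

class FlatObjectStore:
    """Toy data lake: no folders, every raw object has its own identifier."""

    def __init__(self):
        self._objects = {}

    def put(self, raw, **metadata):
        """Store raw bytes untouched and return the new object's identifier."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = (raw, metadata)
        return object_id

    def get(self, object_id):
        return self._objects[object_id][0]

    def find(self, **criteria):
        """Return identifiers of objects whose metadata matches all criteria."""
        return [
            object_id
            for object_id, (_, metadata) in self._objects.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        ]

lake = FlatObjectStore()
click_id = lake.put(b'{"page": "/home"}', source="clickstream", fmt="json")
tweet_id = lake.put(b"great service!", source="social", fmt="text")
```

Data goes in exactly as it arrived; any structure is applied later, when an analytics job looks objects up by identifier or metadata.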

How does NexCen support big data analytics?

From simple spreadsheets to advanced analytics, marketing solutions and analytics engines, our self-service big data platform integrates with the tools you already use so that you can analyze your data in one place.

Spreadsheets and Analytics Tools: Through ODBC connectors, NexCen customers can connect to Microsoft Excel and tools from leading analytics vendors such as Tableau, Qlik, MicroStrategy, and TIBCO Jaspersoft. In addition, the R statistical programming language can be integrated through ODBC/REST APIs.

Analytics Engines: NexCen offers connectors for massively parallel processing databases such as Vertica as well as relational database engines such as Microsoft SQL Server and the MySQL open source database, and NoSQL databases such as MongoDB.

CRM and Online Marketing Solutions: NexCen also connects to leading CRM and online marketing platforms such as Salesforce.com and online marketing and web analytics solutions such as Omniture and Google Analytics.

NexCen ways of Processing Big Data

NexCen provides a unified interface for the many use cases and workloads that a data-driven organization will face, including ad hoc analysis, predictive analysis, machine learning, streaming and MapReduce, to name a few. Users without software development skills can use the QDS workbench through our SmartQuery interface without knowing how to write a SQL query.

Workbench

The NexCen Workbench enables data scientists and analysts to drive their ad hoc workloads through their processing engines of choice using an easy-to-use SQL query composer or the SmartQuery builder tool.

Custom Applications

Developers can configure their applications to drive various workloads using a number of common language options.

BI & Visualization Systems

Drive QDS workloads using industry-leading, ODBC-compliant BI & Visualization tools, including Tableau, Birst, Qlik, Pentaho and others.

Looking for more information on how to be effective with big data analytics tools? Write to us at consulting@nexcen.in