Microsoft Azure vs. Bit Refinery

In keeping with our “how do we compare” series, today we’ll look at Microsoft Azure. Azure is very similar to Amazon EC2 in that you are provided “instances” that include a preconfigured set of resources such as cores, RAM and storage. Much like Amazon, there are other services such as databases, storage, BizTalk and numerous others. This is very handy if you want to provide a service/product within your company but don’t want the hassle of managing the software (i.e. updates, backups, etc.).

Microsoft launched HDInsight, which provides a Hortonworks-based install along with other components. You purchase “instances” and, instead of traditional local storage, you are instructed to use “Blob” storage. Here is an article [ http://bit.ly/1JhTZHV ] that explains how the storage works and attempts to justify using it over traditional local storage.
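
To give a feel for what that looks like in practice, data on an HDInsight cluster is typically addressed through a wasb:// (Azure Blob storage) URI rather than a plain HDFS path. This is only an illustration; the container and storage account names below are placeholders, not anything from the article:

# listing a directory that lives in Blob storage (placeholder container/account names)
hadoop fs -ls wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/logs
# the equivalent against traditional, locally attached HDFS storage
hadoop fs -ls hdfs:///data/logs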

Here again we have a company that started its hosting business with elasticity in mind: smaller servers that you can quickly scale up and down when needed. This works great for development, QA and production scenarios where you get a spike in traffic, but when it comes to Hadoop, not so much. The costs add up quickly and the performance isn’t there.

We decided to price out a cluster consisting of 3 master nodes and 20 data nodes, along with an edge node and a firewall node. Much like our Amazon EC2 comparison, the results aren’t pretty.

We started by choosing an inferior “A7” instance, which only has 8 cores and 56GB of RAM compared to our base offering of dual hex-core processors and 64GB of RAM. From there, we needed to match our 10TB of storage per data node, so we chose “Page Blobs & Disks” per an Azure sales representative.

As we said, the results are not pretty. We’re not sure why anyone would build a cluster that goes against the very reason Hadoop was conceived:

Massive processing using commodity hardware

but then again, we’re still scratching our heads over the terms “Hadoop on Windows”, “Windows 8” and, of course, “Windows Vista”…

[Image: azure-20node-cluster]

Hadoop as a Service – HaaS

We often get asked if we are a “Hadoop as a Service” (also known as HaaS) company. Although we provide the infrastructure our customers run Hadoop on, we do not provide a point-and-click Hadoop product. In this post, we’ll go over the differences between HaaS and what Bit Refinery provides, which is Big Data IaaS (Infrastructure as a Service).

HaaS companies offer a “fully baked” version of Hadoop. It is usually their own distribution that follows the original Apache Hadoop release closely. Some benefits of HaaS offerings are:

  • Managed Hadoop – No need to hire a sysadmin
  • Ease of use – Built to get started quickly
  • No hardware/infrastructure – Just add/remove servers as you need them. No CapEx
  • Support – Each company has a team of Hadoop experts to help when needed

Bit Refinery believes there is plenty of room out there for all types of Big Data companies. However, we also believe there are lots of companies that want to “own” their own data and not be a victim of what we call “data lock-in.”

Data Lock-In

Data lock-in is when a provider holds a substantial amount of your data and it’s very difficult to get some or all of it back. They impose fees on data transferred out of their clouds, even assuming you could move it all in a reasonable timeframe.

Bit Refinery believes that as we go forward into this Big Data future together, you should own your own data. Period. If you want to retrieve some or all of it, we fully support “data shuttling”: shipping a large storage device (NAS) to one of our data centers, where we’ll hook it right up to your infrastructure. You can even use this method to take periodic snapshots of your data.
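
As a rough sketch of how simple that retrieval is: once the NAS is mounted on an edge node, data can be copied straight out of HDFS onto it with standard Hadoop tooling. The mount point and snapshot name below are hypothetical:

# copy a directory out of HDFS onto a NAS mounted at /mnt/nas (hypothetical paths)
hadoop fs -copyToLocal /db/prod_live /mnt/nas/snapshot-2015-05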

With more and more companies choosing to let one or two very large providers hold the keys to one of their core assets (their data), it’s no wonder people are looking for an alternative. Sure, these providers offer a rich ecosystem of tools, but in the end is it worth having all of your eggs in one basket?

Another aspect where Bit Refinery shines is cost. We know cost isn’t everything and you usually “get what you pay for”, so why is this such a differentiator for us? A few years ago, we looked at why Hadoop was created in the first place, and this one was easy: Hadoop was created to run on commodity hardware, not $20,000 Cisco servers.

True Commodity Hardware

We see companies similar to ours, and even HaaS companies, trying to host Big Data applications on servers that were made for running mission-critical applications, with dual power supplies, gold-plated racks and switch gear. Now, don’t get us wrong, we don’t use Netgear switches and Walmart servers, but we do use commodity equipment that is very stable and reliable. (Good read: Building a computer the Google way.) The same servers we offer for $300/mo are over $1,000/mo with Amazon! (64GB RAM, 10TB of storage and dual 6-core processors)

So it really comes down to flexibility and price. HaaS providers offer a smooth experience when using Hadoop, but you give up flexibility and pay a premium for it. With Bit Refinery, you can run any flavor of Hadoop or any other Big Data application at a much lower cost than any other provider. Plus, you don’t get “DATA LOCK-IN!”

Bzip2 Compression in Cloudera

Following up on our previous post, we wanted to show how easy it is to use bzip2 compression as well. Here is a great article detailing the different compression types. (It’s a bit outdated, e.g. it’s missing ORC.)

-- allow dynamic partitions and enough output files for the insert
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=600;
set hive.exec.max.created.files=250000;
-- compress the job output with bzip2
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
set mapred.output.compression.type=BLOCK;
insert into table log_data_bzip2 partition(year,month,day) select * from log_data where year = '2015' and month = '05' and day = '20';

$ hadoop fs -du -h -s /db/prod_live/log_data_bzip2
9.5 G /db/prod_live/log_data_bzip2
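
If you want to confirm nothing was lost in the rewrite, the same row-count check used in the Snappy post below applies here; the count should match the source partition:

hive> select count(*) from log_data_bzip2 where year = '2015' and month = '05' and day = '20';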

The same table compressed with Snappy was 22 GB, so bzip2 compresses quite a bit more.
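
For reference, the same day’s uncompressed data in the Snappy post below was 75.3 GB, so that works out to roughly a 7.9:1 ratio for bzip2 versus roughly 3.4:1 for Snappy. The usual trade-off applies: bzip2 is considerably more CPU-intensive to compress and decompress than Snappy.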

Snappy Compression in Cloudera

We understand ORC will be the new standard format going forward in Hadoop, but we had a customer that cannot yet upgrade to the newer version of Flume that supports writing directly into an ORC Hive table, so we helped them implement Snappy compression in their Cloudera cluster for now. The results were pretty good, and it will save them money in the long run.

$ hadoop fs -du -h -s /db/prod_live/log_data/year=2015/month=05/day=20

75.3 G  /db/prod_live/log_data/year=2015/month=05/day=20

The current day was taking up 75.3 GB of data, with 474 million rows.

hive> select count(*) from log_data where year = '2015' and month = '05' and day = '20';

474,875,765

We set Hive to create data using Snappy:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=600;
set hive.exec.max.created.files=250000;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapred.output.compression.type=BLOCK;
insert into table log_data_snappy partition(year,month,day) select * from log_data where year = '2015' and month = '05' and day = '20';

$ hadoop fs -du -h -s /db/prod_live/log_data_snappy

22.0 G  /db/prod_live/log_data_snappy

The same data now takes up 22 GB. Pretty good compression for the type of data they are storing.

hive> select count(*) from log_data_snappy where year = '2015' and month = '05' and day = '20';

474,875,765

This query just proves we have the same number of rows after compression.

This presentation is a few years old but still provides some good background: http://www.slideshare.net/ydn/hug-compression-talk

Going forward, ORC will be the standard. More information can be found on the Hortonworks site: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
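
For anyone curious what that eventual move looks like, switching a Hive table to ORC is mostly a matter of declaring the storage format. This is only a minimal sketch; the column list is a placeholder since the real log_data schema isn’t shown in this post:

-- placeholder schema: in practice the columns must match log_data
create table log_data_orc (
  ts string,
  message string
)
partitioned by (year string, month string, day string)
stored as orc;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table log_data_orc partition(year,month,day) select * from log_data where year = '2015' and month = '05' and day = '20';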

Since our customer uses Flume to ingest their data directly into HDFS, they had to change the following two lines in their Flume config:

a1_agent.sinks.k1.hdfs.codeC = snappy
a1_agent.sinks.k1.hdfs.fileType = CompressedStream
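
For context, those two settings live inside the HDFS sink definition. Below is a minimal sketch of how that section of the config might look, reusing the a1_agent/k1 names from the snippet above; the channel name and the HDFS path pattern are placeholder assumptions, not taken from the customer’s actual config:

a1_agent.sinks.k1.type = hdfs
a1_agent.sinks.k1.channel = c1
a1_agent.sinks.k1.hdfs.path = /db/prod_live/log_data/year=%Y/month=%m/day=%d
a1_agent.sinks.k1.hdfs.fileType = CompressedStream
a1_agent.sinks.k1.hdfs.codeC = snappy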

There are no changes needed in Hive to query this data. Hive will see that the files end in .snappy and take care of decompressing them. If you want to read these files from the command line, just use the -text flag:

hadoop fs -text /path/file.snappy
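
For a quick sanity check that the compressed output is readable, the same command can be piped through standard tools (the path here is just the generic example from above):

hadoop fs -text /path/file.snappy | head -n 5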

Hadoop Jobs

We attended Hadoop Summit 2015 as an exhibitor and had a great time meeting folks from around the world and from different companies. Outside the main vendor hall, there was a large bulletin board set up for people to post jobs and the like.

As the days went on, the board got larger and larger, and although there were duplicate posts from the same companies, it was amazing how many jobs were listed up there. It goes to show that Big Data is here to stay and there are some great opportunities out there.