Hadoop as a Service – HaaS

We often get asked if we are a "Hadoop as a Service" (also known as HaaS) company. Although we provide infrastructure that our customers run Hadoop on, we do not provide a point-and-click Hadoop product. In this post, we'll go over the differences between HaaS and what Bit Refinery provides, which is Big Data IaaS (Infrastructure as a Service).

HaaS companies offer a "fully baked" version of Hadoop, usually their own distribution that tracks the original Apache Hadoop release closely. Some benefits of HaaS offerings are:

  • Managed Hadoop – No need to hire a sysadmin
  • Ease of use – Built to get started quickly
  • No hardware/infrastructure – Just add or remove servers as you need them. No CapEx
  • Support – Each company has a team of Hadoop experts to help when needed

Bit Refinery believes there is plenty of room out there for all types of Big Data companies. However, we also believe there are lots of companies that want to "own" their data and not fall victim to what we call "data lock-in."

Data Lock-In

Data lock-in is when a provider holds a substantial amount of your data and it's very difficult to get some or all of it back. These providers charge fees for data transferred out of their clouds, and even then it can be hard to move everything in a reasonable timeframe.

Bit Refinery believes that as we go forward in this Big Data future together, you should own your own data. Period. If you want to retrieve some or all of it, we fully support "data shuttling": ship a large storage device (NAS) to one of our data centers and we'll hook it right up to your infrastructure. You can even use this method to take periodic snapshots of your data.

With more and more companies choosing to let one or two very large providers hold the keys to one of their core assets (their data), it's no wonder people are looking for an alternative. Sure, these providers offer a rich ecosystem of tools, but in the end, is it worth having all of your eggs in one basket?

Another area where Bit Refinery shines is cost. We know cost isn't everything and you usually "get what you pay for," so why is this such a differentiator for us? A few years ago, we looked back at why Hadoop was created in the first place, and the answer was easy: Hadoop was designed to run on commodity hardware, not $20,000 Cisco servers.

True Commodity Hardware

We see companies similar to ours, and even HaaS companies, trying to host Big Data applications on servers built for mission-critical workloads, with dual power supplies, gold-plated racks, and switch gear. Now, don't get us wrong: we don't use Netgear switches and Walmart servers, but we do use commodity equipment that is very stable and reliable (good read: Building a computer the Google way). The same servers we offer for $300/mo are over $1,000/mo with Amazon (64 GB RAM, 10 TB of storage, and dual 6-core processors)!

So, it really comes down to flexibility and price. HaaS providers offer a smooth Hadoop experience, but you give up flexibility and pay a premium for it. With Bit Refinery, you can run any flavor of Hadoop, or any other Big Data application, at a much lower price point than any other provider. Plus, no "data lock-in"!

Bzip2 Compression in Cloudera

Following up on our previous post, we wanted to show how easy it is to use bzip2 compression as well. Here is a great article detailing the different compression types (it's a bit outdated, e.g. it's missing ORC).

-- allow fully dynamic partition inserts and raise the partition/file limits for this job
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=600;
set hive.exec.max.created.files=250000;
-- compress the final job output with the bzip2 codec
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
SET mapred.output.compression.type=BLOCK;
-- rewrite one day of data into the bzip2-compressed table
insert into table log_data_bzip2 partition(year,month,day) select * from log_data where year = "2015" and month = "05" and day = "20";

$ hadoop fs -du -h -s /db/prod_live/log_data_bzip2
9.5 G /db/prod_live/log_data_bzip2

The Snappy-compressed copy of the same table was 22 G, so bzip2 compresses quite a bit more.
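In other words, the bzip2 output is roughly 22.0 / 9.5 ≈ 2.3x smaller than the Snappy output of the same partition.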

Snappy Compression in Cloudera

We understand that ORC will be the new standard in Hadoop going forward, but we had a customer that cannot yet upgrade to the version of Flume that supports writing directly into an ORC-backed Hive table, so we helped them implement Snappy compression in their Cloudera cluster for now. The results were pretty good, and it will save them money in the long run.

$ hadoop fs -du -h -s /db/prod_live/log_data/year=2015/month=05/day=20
75.3 G  /db/prod_live/log_data/year=2015/month=05/day=20

The current day was taking up 75 G of data with 474 million rows.

hive> select count(*) from log_data where year = "2015" and month = "05" and day = "20";
474,875,765

We set Hive to create data using Snappy:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=600;
set hive.exec.max.created.files=250000;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
insert into table log_data_snappy partition(year,month,day) select * from log_data where year = "2015" and month = "05" and day = "20";

$ hadoop fs -du -h -s /db/prod_live/log_data_snappy
22.0 G  /db/prod_live/log_data_snappy

The same data is now taking up 22G. Pretty good compression for the type of data they are using.
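For reference, that works out to roughly 75.3 / 22.0 ≈ 3.4x compression over the original files.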

hive> select count(*) from log_data_snappy where year = "2015" and month = "05" and day = "20";
474,875,765

This query just proves we have the same number of rows after compression.

This presentation is a few years old but still provides some good background: http://www.slideshare.net/ydn/hug-compression-talk

The future will be ORC compression. More information can be found on the Hortonworks site: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

Since our customer uses Flume to ingest their data directly into HDFS, they had to change the following two lines in their Flume config:

a1_agent.sinks.k1.hdfs.codeC = snappy
a1_agent.sinks.k1.hdfs.fileType = CompressedStream
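
For context, here is a rough sketch of how those two lines might sit inside a complete HDFS sink definition. Only the codeC and fileType lines come from the customer's config; the channel name, HDFS path, and roll settings below are illustrative assumptions:

# hypothetical HDFS sink definition (only codeC and fileType are from the post)
a1_agent.sinks.k1.type = hdfs
a1_agent.sinks.k1.channel = c1
a1_agent.sinks.k1.hdfs.path = /db/prod_live/log_data/year=%Y/month=%m/day=%d
a1_agent.sinks.k1.hdfs.fileType = CompressedStream
a1_agent.sinks.k1.hdfs.codeC = snappy
a1_agent.sinks.k1.hdfs.useLocalTimeStamp = true
a1_agent.sinks.k1.hdfs.rollInterval = 300
a1_agent.sinks.k1.hdfs.rollSize = 0
a1_agent.sinks.k1.hdfs.rollCount = 0

With fileType set to CompressedStream, Flume compresses each rolled file with the codec named in hdfs.codeC, which is why the files land in HDFS with a .snappy extension.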

No changes to Hive are needed to query this data. Hive sees that the files end in .snappy and takes care of decompressing them. If you want to read these files from the command line, just use the -text flag:

hadoop fs -text /path/file.snappy
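
For example, to spot-check a few rows (the partition path below is hypothetical, based on the layout shown earlier):

hadoop fs -text /db/prod_live/log_data_snappy/year=2015/month=05/day=20/* | head -n 5

The same -text flag also decompresses the .bz2 files from the bzip2 post, since it chooses a codec based on the file extension.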

Hadoop Jobs

We attended Hadoop Summit 2015 as an exhibitor and had a great time meeting folks from around the world and from many different companies. Outside the main vendor hall, there was a large bulletin board set up for people to post jobs and the like.

As the days went on, the board got larger and larger, and although some companies posted duplicates, it was amazing how many jobs were listed up there. It goes to show that Big Data is here to stay and there are some great opportunities out there.

Hadoop Summit 2015

We attended our first Hadoop Summit as an exhibitor in June and it was amazing. Over 4,000 people attended this year, and many new companies and technologies were introduced.

People who came by and visited us were impressed by how we compare with Amazon, not only because of the extreme pricing difference but also because we offer managed services and Hadoop Jumpstart packages. Although having Bit Refinery handle your infrastructure isn't for every company, it was very apparent that more and more companies are getting out of the data center and infrastructure business and letting the "pros" handle it.

Here are some photos from the event: