Snappy Compression in Cloudera

We understand that ORC will be the new standard format for Hadoop going forward, but we had a customer that cannot yet upgrade to a version of Flume that supports writing directly into an ORC Hive table, so we helped them implement Snappy compression in their Cloudera cluster for now. The results were quite good, and it will save them money in the long run.

$ hadoop fs -du -h -s /db/prod_live/log_data/year=2015/month=05/day=20

75.3 G  /db/prod_live/log_data/year=2015/month=05/day=20

The current day's partition was taking up 75.3 GB of data, holding about 474 million rows.

hive> select count(*) from log_data where year = "2015" and month = "05" and day = "20";

474,875,765

We set Hive to create data using Snappy:

set hive.exec.dynamic.partition.mode=nonstrict;

set hive.exec.max.dynamic.partitions.pernode=600;

set hive.exec.max.created.files=250000;

SET hive.exec.compress.output=true;

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

SET mapred.output.compression.type=BLOCK;

insert into table log_data_snappy partition(year,month,day) select * from log_data where year = "2015" and month = "05" and day = "20";
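
The target table itself is not shown above; it was created beforehand with the same columns and partitioning as the source. A minimal sketch of that step (assuming simply copying the source table's layout is acceptable) would be:

-- hypothetical: clone the schema and partition columns of log_data
CREATE TABLE log_data_snappy LIKE log_data;

With hive.exec.compress.output enabled and the SnappyCodec selected, the INSERT above writes the new partition files compressed, so nothing Snappy-specific is needed in the table definition.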

 

$ hadoop fs -du -h -s /db/prod_live/log_data_snappy

22.0 G  /db/prod_live/log_data_snappy

The same data now takes up 22 GB, roughly a 3.4x reduction. That is pretty good compression for the type of data they are storing.

hive> select count(*) from log_data_snappy where year = "2015" and month = "05" and day = "20";

474,875,765

This query simply confirms that we have the same number of rows after compression.

This presentation is a few years old but still provides some good background: http://www.slideshare.net/ydn/hug-compression-talk

The future will be the ORC format, which brings its own compression. More information can be found on the Hortonworks site: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

Since our customer uses Flume to ingest their data directly into HDFS, they had to change the following two lines in their Flume config:

a1_agent.sinks.k1.hdfs.codeC = snappy

a1_agent.sinks.k1.hdfs.fileType = CompressedStream
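
For context, those two properties live alongside the rest of the HDFS sink definition. A rough sketch of such a sink is below; the channel name, path, and roll settings are illustrative, not taken from the customer's actual config:

# hypothetical HDFS sink writing into the partitioned log_data directory
a1_agent.sinks.k1.type = hdfs
a1_agent.sinks.k1.channel = c1
a1_agent.sinks.k1.hdfs.path = /db/prod_live/log_data/year=%Y/month=%m/day=%d
a1_agent.sinks.k1.hdfs.fileType = CompressedStream
a1_agent.sinks.k1.hdfs.codeC = snappy
a1_agent.sinks.k1.hdfs.rollInterval = 300
a1_agent.sinks.k1.hdfs.rollSize = 0
a1_agent.sinks.k1.hdfs.rollCount = 0

Setting hdfs.fileType to CompressedStream is what tells the sink to run its output through the codec named in hdfs.codeC; with the default DataStream type the output is written uncompressed.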

There are no changes needed in Hive to query this data. Hive will see that the files end in .snappy and will take care of decompressing them. If you want to read these files from the command line, just use the -text flag:

hadoop fs -text /path/file.snappy
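
For a quick look at a large file, you can pipe the output through head so only the first few lines are printed (the path here is just the placeholder from above):

hadoop fs -text /path/file.snappy | head -n 20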