We understand ORC will be the new compression standard going forward in Hadoop, but we had a customer that cannot yet upgrade to the newer version of Flume that supports writing directly into an ORC Hive table, so for now we helped them implement Snappy compression in their Cloudera cluster. The results were pretty good and will save them money in the long run.
$ hadoop fs -du -h -s /db/prod_live/log_data/year=2015/month=05/day=20
75.3 G /db/prod_live/log_data/year=2015/month=05/day=20
The current day was taking up 75 G of data with 474 million rows.
hive> select count(*) from log_data where year = "2015" and month = "05" and day = "20";
We then configured Hive to write the data with Snappy compression and copied it into a new table:
insert into table log_data_snappy partition(year,month,day) select * from log_data where year = "2015" and month = "05" and day = "20";
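For reference, an insert like that needs Snappy output compression and dynamic partitioning enabled in the session first. The following is only a rough sketch of that setup, not their exact configuration; the create table like line in particular is just an illustration of creating the target table with the same layout:

hive> create table log_data_snappy like log_data;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;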
$ hadoop fs -du -h -s /db/prod_live/log_data_snappy
22.0 G /db/prod_live/log_data_snappy
The same data is now taking up 22G. Pretty good compression for the type of data they are using.
hive> select count(*) from log_data_snappy where year = "2015" and month = "05" and day = "20";
This query just confirms we have the same number of rows after compression.
This presentation is a few years old but still provides some good background: http://www.slideshare.net/ydn/hug-compression-talk
The future will be ORC compression. More information can be found on the Hortonworks site: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
Since our customer uses Flume to ingest their data directly into HDFS, they had to change the following two lines in their Flume config:
a1_agent.sinks.k1.hdfs.codeC = snappy
a1_agent.sinks.k1.hdfs.fileType = CompressedStream
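For context, those two lines live inside a normal HDFS sink definition. The snippet below is only a sketch to show where they fit; the agent name (a1_agent), sink name (k1), and HDFS path are assumed from the lines above and from their directory layout, and the surrounding properties are typical settings rather than their exact config:

a1_agent.sinks.k1.type = hdfs
a1_agent.sinks.k1.hdfs.path = /db/prod_live/log_data/year=%Y/month=%m/day=%d
a1_agent.sinks.k1.hdfs.writeFormat = Text
a1_agent.sinks.k1.hdfs.codeC = snappy
a1_agent.sinks.k1.hdfs.fileType = CompressedStream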
There are no changes needed in Hive to query this data. Hive will see that the files end in .snappy and will take care of decompressing them. If you want to read these files from the command line, just use the -text flag:
hadoop fs -text /path/file.snappy
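For example, to spot-check the first few records of a compressed file without streaming the whole thing (the path here is just a placeholder):

hadoop fs -text /path/file.snappy | head -n 5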