Bzip2 Compression in Cloudera

Following up on our previous post, we wanted to show how easy it is to use bzip2 compression as well. Here is a great article detailing the different compression types. (It’s a bit outdated, i.e. it is missing ORC.)

-- Allow every partition column to be resolved dynamically, and raise the limits for a large load
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=600;
SET hive.exec.max.created.files=250000;
-- Compress the query output
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
-- Copy one day of data into the bzip2 table
insert into table log_data_bzip2 partition(year,month,day) select * from log_data where year = '2015' and month = '05' and day = '20';
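
The settings above turn output compression on, but they do not show which codec is selected. As a minimal sketch, assuming bzip2 is not already the default codec on your cluster, you would typically add a line like the one below before running the insert. The property and class names are the standard Hadoop ones; whether you need the line at all depends on your cluster configuration.

```sql
-- Assumption: the cluster default codec is not already bzip2.
-- Newer Hadoop releases name this property
-- mapreduce.output.fileoutputformat.compress.codec.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
```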

$ hadoop fs -du -h -s /db/prod_live/log_data_bzip2
9.5 G /db/prod_live/log_data_bzip2

The same table compressed with Snappy was 22 G, so bzip2 brings it down to less than half that size (9.5 G, roughly 43% of the Snappy figure).