Using Big Data for Good, Not Evil…

At a time when some people think of Big Data as something large companies use to mine customer buying habits and figure out how to market to them better, other folks are putting Big Data to use in other ways.

A great example is this article from Ben Wellington. He took parking ticket data and joined it up with Google Street View to show that lots of these tickets were given out illegally. The person getting the ticket was actually parked LEGALLY.

Providing this level of data to anyone who wants to analyze it is a great thing. With the new tools to capture and analyze large data sets now available (mostly for free), we will continue to see articles like this in the future.

-Bit Refinery


Murder in the Amazon cloud

We have been telling our customers for years to always keep copies of their important documents, files, databases, etc. on a backup system outside of the Bit Refinery infrastructure. You never know what can happen.

This article from InfoWorld explains how a hacker gained access to a company's AWS console and essentially deleted all of their servers, including backups, putting them out of business.

If you are currently with Amazon or one of the “big hosting companies”, contact us so we can advise you on a real disaster recovery strategy.

Although Easter was just last week, we all know you should never have “all your eggs in one basket”…

– Bit Refinery Team

Smokeping Traceroute Alerts

Smokeping is an excellent monitoring tool used to test and alert on latency between servers. It’s an open source tool that has been around for a while.

Here at Bit Refinery, we are often tasked with helping a customer that is encountering latency or disconnects between infrastructure hosted in our environment and their locations. Sometimes the problem is with one of our many upstream providers and sometimes it’s with theirs.

We use Smokeping in addition to numerous other monitoring tools to help provide insight into the health of our network, but oftentimes when things do go bad, it’s hard to determine where the issue is. This is where Smokeping can be used to send an alert when a latency threshold is met. You can use the built-in alerting or call a custom script. We wanted to share a very basic script we use to run an “mtr” command and email the output so our NOC has more details on the issue.

In the /etc/smokeping/config.d/Alerts file, replace the “to” line with this line to call a custom script:

to = |/etc/smokeping/config.d/ 2> /tmp/trace.log

The script was put together very quickly and needs some work, but it works well enough for now:

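#!/bin/bash
# Script to email an mtr report on alert from Smokeping

# Smokeping calls a piped alert program with these positional arguments:
# name-of-alert, target, loss-pattern, rtt-pattern, hostname
alertname="$1"
target="$2"
losspattern="$3"
rtt="$4"
hostname="$5"

# Placeholders (assumed values, adjust for your environment)
email="noc@example.com"
smokename="Smokeping"

# Use a "Clear" subject once loss is back to 0%
if [ "$losspattern" = "loss: 0%" ]; then
    subject="Clear-${smokename}-Alert: ${target} host: ${hostname}"
else
    subject="${smokename}Alert: ${target} - ${hostname}"
fi

# Build the report: header, mtr output, then the alert details
echo "MTR Report for hostname: ${hostname}" > /tmp/mtr.txt
echo "" >> /tmp/mtr.txt
echo "sudo mtr -n --report ${hostname}"
sudo /usr/sbin/mtr -n --report "${hostname}" >> /tmp/mtr.txt

echo "" >> /tmp/mtr.txt
echo "Name of Alert: " $alertname >> /tmp/mtr.txt
echo "Target: " $target >> /tmp/mtr.txt
echo "Loss Pattern: " $losspattern >> /tmp/mtr.txt
echo "RTT Pattern: " $rtt >> /tmp/mtr.txt
echo "Hostname: " $hostname >> /tmp/mtr.txt
echo "" >> /tmp/mtr.txt
echo "Full mtr command is: sudo /usr/sbin/mtr -n --report ${hostname}" >> /tmp/mtr.txt

echo "subject: " $subject
# Only send mail if the report file is not empty
if [ -s /tmp/mtr.txt ]; then
    mailx -s "${subject}" "$email" < /tmp/mtr.txt
fi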
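One piece of the "needs some work" list: the script writes to a fixed /tmp/mtr.txt, so two alerts firing at the same moment could overwrite each other's report. A possible refinement, shown here only as a minimal sketch and not as the script we actually run, is to give each run its own temporary file with mktemp. The helper name new_report_file is made up for illustration.

```shell
#!/bin/sh
# Sketch: create a unique report file per alert instead of a shared /tmp/mtr.txt.
# new_report_file is a hypothetical helper name, not part of Smokeping.
new_report_file() {
    mktemp /tmp/mtr.XXXXXX
}

report=$(new_report_file)
# Remove the temporary file when the script exits
trap 'rm -f "$report"' EXIT

# The rest of the script would write to "$report" instead of /tmp/mtr.txt
echo "MTR Report for hostname: example" > "$report"
```

With this approach, every later redirect and the final mailx call would read from "$report" rather than the fixed path.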

You need mailx installed and the script needs to be owned and executable by the smokeping user. Also, in order to call sudo, you must have the following line in your /etc/sudoers:

smokeping ALL=NOPASSWD:/usr/sbin/mtr

And comment out the “Defaults requiretty” line.
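Before relying on the alert, it can help to confirm the sudoers change took effect. The sketch below is one way to check a sudoers-style file for a still-active "Defaults requiretty" line. The helper name has_requiretty is our own invention for this example, and it only looks at the one file you pass in, not at any included files.

```shell
#!/bin/sh
# Sketch: report whether a sudoers-style file still has an active
# (uncommented) "Defaults requiretty" line.
# has_requiretty is a hypothetical helper name.
has_requiretty() {
    # Match optional leading whitespace, "Defaults", whitespace, "requiretty".
    # Lines commented out with "#" do not match.
    grep -Eq '^[ 	]*Defaults[ 	]+requiretty' "$1"
}
```

Usage would be something like `has_requiretty /etc/sudoers && echo "requiretty is still active"`, run as root. In addition, `visudo -c` checks the sudoers syntax and `sudo -l -U smokeping` lists what the smokeping user is allowed to run.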

That’s about it. Here is an example of what the email looks like:

MTR Report for hostname: xx.xx.xx.xx

HOST: smokeping-host  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.                   0.0%    10    0.3   0.3   0.2   0.4   0.0
  2. x.x.x.x           0.0%    10    0.5   3.3   0.5  11.2   3.7
  3. x.x.x.x           0.0%    10    0.4   2.8   0.4  10.9   3.6
  4. x.x.x.x           0.0%    10    1.0   1.1   1.0   1.1   0.0
  5. x.x.x.x           0.0%    10    2.0   2.0   1.9   2.3   0.1
  6. x.x.x.x           0.0%    10   26.0  26.0  25.9  26.1   0.0
  7. x.x.x.x           0.0%    10   26.0  25.9  25.8  26.0   0.1
  8. x.x.x.x           0.0%    10   24.0  24.1  24.0  24.4   0.2
  9. x.x.x.x           0.0%    10   50.6  49.5  47.7  50.9   1.2
 10. x.x.x.x           0.0%    10   47.7  48.9  47.1  50.8   1.3
 11. x.x.x.x           0.0%    10   47.6  49.0  47.1  51.0   1.4
 12. x.x.x.x           0.0%    10   49.0  49.8  48.9  50.7   0.6
 13. x.x.x.x           0.0%    10   52.9  53.2  52.3  54.4   0.8
 14. x.x.x.x           0.0%    10   57.5  57.0  55.7  58.9   0.9
 15. x.x.x.x           0.0%    10   55.2  53.7  53.3  55.2   0.6
 16. x.x.x.x           0.0%    10   51.9  52.0  51.8  52.4   0.2
 17. x.x.x.x          20.0%    10   59.4  59.9  59.4  61.0   0.7

Name of Alert:  packetloss

Target:  Smokeping Target Name

Loss Pattern:  loss: 30%

RTT Pattern:  rtt: 60ms

Hostname: x.x.x.x

Full mtr command is: sudo /usr/sbin/mtr -n --report x.x.x.x
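When a report like this is long, the hops that matter are the ones showing loss. As a rough sketch (not something the alert script above does), a small awk filter can pull out only the hop lines with non-zero loss. The function name lossy_hops is hypothetical, and the filter assumes the Loss% value is the first field on a hop line that ends in "%".

```shell
#!/bin/sh
# Sketch: print the hops in an mtr report that show non-zero packet loss.
# lossy_hops is a hypothetical helper name; it reads the report on stdin.
lossy_hops() {
    awk '
        # Hop lines start with a hop number followed by a dot, e.g. " 17."
        /^[ \t]*[0-9]+\./ {
            for (i = 1; i <= NF; i++) {
                # Treat the first field ending in "%" as the Loss% column
                if ($i ~ /^[0-9.]+%$/) {
                    if ($i + 0 > 0) print
                    break
                }
            }
        }
    '
}
```

Run against the saved report, `lossy_hops < /tmp/mtr.txt` would print only the hops worth a closer look.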

There is still some work to do on the script but for those looking for a quick solution to add more details to latency alerts, this script really helps us out.

– Bit Refinery Team

Hadoop for Datawarehouse Offloading

We’ve heard a lot lately about using Hadoop for “Data Warehouse Offloading” and how it can save a company money. We wanted to share a success story from a customer of ours who implemented an offloading project that saves them money while opening up a new world of possible analytics.

We cannot share their name, but they are a major retail organization with over 450 stores nationwide. Each store has an IBM iSeries server that provides data to the POS (Point of Sale) systems. Back at corporate, they have a much larger iSeries that stores all the data as well as some internal information such as employee and product data. Their current data warehouse sits on Teradata with Oracle providing some master data management functions.

The diagram below depicts, at a high level, their infrastructure before adding Hadoop. They rely heavily on external data sources for surveys, loyalty programs, and other marketing data.


The IBM iSeries servers act as the main processing unit in the company, with the Teradata warehouse appliance used for reporting and analysis. The main problem the IT department struggled with was the ongoing cost and performance of the Teradata appliance in the face of continued data growth. An analysis was completed on the data in the warehouse, and it was determined that over 80% of the data was stagnant. The data was brought into the warehouse and aggregated into “aggregation” tables, and these tables were the only ones included in standard and ad hoc reporting.


  • Continually rising cost of the Teradata appliance
  • Performance issues with deep analysis queries against years of data
  • Potential data gold mines in the stagnant data in the warehouse

This company had looked at Hadoop before and dismissed it because they didn’t consider themselves a “Big Data” company. They had recently uploaded some data into Google and liked the features and performance but knew this wasn’t a long-term solution as they still wanted control over their data due to security concerns. On the other hand, their data center is aging and there was a push to move their current equipment to a managed data center to reduce cost and increase stability.

They chose a small 4-node Hortonworks cluster hosted by Bit Refinery. This provided them with around 10TB of usable storage, which left room to grow. It also slowed the flow of data being placed on Teradata and gave them immediate financial relief. (An additional TB of Teradata is around $15k.)

The total cost of the cluster was $2,000/mo, and they can add data nodes for $300/mo as they continue to use Hadoop to store their low-level data and push it back up to Teradata, which supplies excellent reporting capabilities.
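To make the pricing concrete, here is a small back-of-the-envelope sketch using only the figures quoted above ($2,000/mo for the cluster, $300/mo for a data node). The number of extra data nodes is a made-up example, the helper name cluster_monthly_cost is hypothetical, and we assume the $300 figure applies per added node.

```shell
#!/bin/sh
# Sketch: monthly and yearly hosting cost for the cluster described above.
# Figures from the text: $2,000/mo base, $300/mo per added data node (assumed per node).
# cluster_monthly_cost is a hypothetical helper name.
cluster_monthly_cost() {
    extra_nodes=$1
    echo $((2000 + 300 * extra_nodes))
}

# Example: the base cluster plus 2 extra data nodes (an invented scenario)
monthly=$(cluster_monthly_cost 2)
echo "Monthly cost with 2 extra data nodes: \$${monthly}"
echo "Yearly cost with 2 extra data nodes: \$$((monthly * 12))"
```

For scale, the same text puts a single additional TB of Teradata at around $15k.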


The new architecture has raw data from external sources, along with the iSeries data, landing directly in Hadoop, where it is aggregated and then sent to Teradata. This allows reports to run quickly in Teradata and deep analysis to be performed on the low-level raw data in Hadoop.
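The land, aggregate, and export flow can be sketched as a short shell script. This is only an illustration under stated assumptions: we are not describing which tools this customer used, so Hive and Sqoop here are stand-ins, and every path, table name, script name, and the JDBC URL are hypothetical. The sketch defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the land -> aggregate -> export flow, under stated assumptions.
# All paths, table names, the .hql file, and the JDBC URL are hypothetical.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

offload_day() {
    day=$1
    # 1. Land the raw extract in HDFS
    run hdfs dfs -put "/staging/pos_${day}.csv" "/data/raw/pos/${day}/"
    # 2. Aggregate the raw rows with Hive (script name is an assumption)
    run hive --hivevar "day=${day}" -f aggregate_sales.hql
    # 3. Push the aggregate to the warehouse with Sqoop
    run sqoop export --connect "jdbc:teradata://warehouse.example.com/DATABASE=sales" \
        --table DAILY_SALES_AGG --export-dir "/data/agg/sales/${day}"
}

offload_day 2015-07-20
```

Setting DRY_RUN=0 would run the commands for real. Sqoop also needs a Teradata connector and JDBC driver installed before the export step can work.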

Since the Hadoop cluster is hosted at Bit Refinery, their internal network at the corporate office was extended directly to their hosted cluster via layer 2. This ensures total control over the data as well as bandwidth usage. Data is encrypted at rest and during transfer, and since the Hadoop data nodes are dedicated to them, security concerns were quickly laid to rest.

We hope this article provides a real-world example of how a company that doesn’t necessarily have a “Big Data” problem can use Hadoop not only to save money on IT infrastructure but also to house all of its low-level data in a cost-effective data store that allows for deep analysis.

Want more information? Here is a good talk from the Hadoop Summit about how a company saved $30 million using this technique.

Have questions for us? Contact us for more information.

Hortonworks 2.3 Release

Hortonworks has been on a tear in the last year. They have made the most noise compared to other Hadoop distributions with new features and community support. With the release of HDP 2.3 today, they are even closer to making Hadoop a mainstream product that both large and small companies can easily adopt.

This post will review some of these new features and how they can help companies adopt this exciting new technology.

Smart Configuration

Hortonworks has combined the last few years of support cases into its own HDP cluster and has come up with many best practices and configurations to help companies run their Hadoop installations much more smoothly. This new feature adds new dialog boxes and easy-to-use sliders to help configure your Hadoop cluster with ease.


YARN Capacity Scheduler

The new Capacity Scheduling tool is a web-based tool with sliders to scale values, which makes configuring this complex scheduler much easier. You can add new queues more easily and see the hierarchy between queues in a graphical manner.
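One rule that makes hand-editing Capacity Scheduler queues fiddly is that the capacities of sibling queues at each level must add up to 100 percent. The sketch below shows that check in isolation. It is not how Ambari implements its sliders, the helper name capacities_ok is hypothetical, and it only handles whole-number percentages.

```shell
#!/bin/sh
# Sketch: check that sibling queue capacities add up to 100 percent.
# capacities_ok is a hypothetical helper; it takes whole-number percentages.
capacities_ok() {
    total=0
    for c in "$@"; do
        total=$((total + c))
    done
    [ "$total" -eq 100 ]
}

# Example: three sibling queues with invented capacities
if capacities_ok 60 30 10; then
    echo "queue capacities sum to 100"
else
    echo "queue capacities do not sum to 100"
fi
```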

Customizable Dashboards

They are finally here. For years, Cloudera Manager had a great dashboard builder and you could customize it any way you like. Now, Ambari is getting closer to mainstream with custom metric widgets. You can create custom KPIs for your cluster helping you keep an eye on the things that need attention over other indicators.


More Ambari Views

The end of Hue is upon us if you are using HDP. Version 2.3 introduces more Ambari “Views”. Views is a plug-in architecture in Ambari that allows third parties to create UI applications inside of Ambari. Eventually there will be a full marketplace with different “apps” that you can install into Ambari and assign to user groups. The HDP 2.3 Sandbox includes a few new Views: Hive, Files, CapacityScheduler, Pig and Tez.


Data-at-Rest Encryption Improvements

This release brings even more improvements to Apache Ranger, with HDFS transparent encryption, management of Kafka and Solr, and multi-tenant YARN queries. Apache Atlas, an Apache incubator project for data governance, is also included in this release. We’ll dedicate a post just to this topic in the coming weeks.

Hortonworks SmartSense™

Finally, a company that is being proactive about its support and all of the hidden intelligence in that department. SmartSense was created by HDP support, which, much like any other software support organization, has a treasure trove of data on customer issues. They created their own HDP cluster and have been doing analytics on this data. The result is a set of common errors, best practices and performance standards based on years of supporting some of the largest Hadoop clusters in the world. It delivers ongoing recommendations on your cluster, offers a long-range cluster resource guide, and helps you with capacity planning.

Go check out HDP 2.3, and take a look at our Bit Refinery hosted HDP 2.3 Sandbox.