Hadoop for Data Warehouse Offloading

We’ve heard a lot lately about using Hadoop for “Data Warehouse Offloading” and how it can save a company money. We wanted to share a success story from one of our customers who implemented an offloading project that saved them money while opening up a new world of possible analytics.

We cannot share their name, but they are a major retail organization with over 450 stores nationwide. Each store has an IBM iSeries server that provides data to the POS (Point of Sale) systems. Back at corporate, a much larger iSeries stores all of that data along with internal information such as employee and product data. Their current data warehouse runs on Teradata, with Oracle providing some master data management functions.

The diagram below depicts their infrastructure at a high level before adding Hadoop. They rely heavily on external data sources for surveys, loyalty programs, and other marketing data.

[Figure 1: Infrastructure before adding Hadoop]

The IBM iSeries servers act as the main processing units in the company, with the Teradata warehouse appliance handling reporting and analysis. The main problems the IT department struggled with were the ongoing cost and performance of the Teradata appliance and their continued data growth. An analysis of the data in the warehouse determined that over 80% of it was stagnant: data was brought into the warehouse, rolled up into “aggregation” tables, and those tables were the only ones used in standard and ad hoc reporting.

Problems

  • Continually rising cost of the Teradata appliance
  • Performance issues with deep analysis queries against years of data
  • Potential data gold mines in the stagnant data in the warehouse

This company had looked at Hadoop before and dismissed it because they didn’t consider themselves a “Big Data” company. They had recently uploaded some data to Google and liked the features and performance, but knew this wasn’t a long-term solution because they still wanted control over their data due to security concerns. At the same time, their data center was aging, and there was a push to move their current equipment to a managed data center to reduce cost and increase stability.

They chose a small 4-node Hortonworks cluster hosted by Bit Refinery. This provided around 10TB of usable storage, with room to grow. It slowed the growth of data being placed on Teradata and gave them immediate financial relief (an additional TB of Teradata capacity runs around $15k).

The total cost of the cluster was $2,000/mo, and they can add data nodes for $300/mo as they continue to use Hadoop to store their low-level data and push aggregates back up to Teradata, which supplies excellent reporting capabilities.
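
To put those numbers in perspective, here is a rough back-of-the-envelope comparison using only the figures quoted above. It ignores Teradata support contracts, compression differences, and migration effort, so treat it as a sketch rather than a full TCO analysis.

```python
# Rough cost comparison using the figures quoted in this article.
# These are list-price estimates, not a full TCO model.

teradata_cost_per_tb = 15_000   # approx. cost of one additional TB of Teradata capacity
cluster_monthly      = 2_000    # 4-node hosted Hortonworks cluster (~10 TB usable)
extra_node_monthly   = 300      # each additional hosted data node

# Holding ~10 TB of low-level data on Teradata instead of Hadoop:
teradata_equivalent = 10 * teradata_cost_per_tb      # ~$150,000 in appliance capacity

# Running the hosted cluster for a full year:
cluster_first_year = 12 * cluster_monthly            # ~$24,000

print(f"Teradata capacity for 10 TB: ${teradata_equivalent:,}")
print(f"Hosted Hadoop cluster, year one: ${cluster_first_year:,}")
```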

[Figure 2: New architecture with Hadoop]

In the new architecture, raw data from external sources and from the iSeries lands directly in Hadoop, where it is aggregated and then sent to Teradata. This allows reports to run quickly in Teradata while deep analysis can be performed on the low-level, raw data in Hadoop.
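
As an illustration of that aggregate-then-export pattern, here is a minimal PySpark sketch. The paths, column names, and Teradata connection details are hypothetical, and the article doesn’t say which tooling the customer actually uses for this step; it simply shows low-level data being rolled up in Hadoop and only the summary being pushed to the warehouse over JDBC.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes Spark is available on the cluster and the Teradata JDBC driver jar
# is on the classpath; paths and table names below are placeholders.
spark = SparkSession.builder.appName("daily_sales_aggregation").getOrCreate()

# Raw POS transactions landed in HDFS by the nightly store feeds (hypothetical path/schema).
raw = spark.read.parquet("hdfs:///data/raw/pos_transactions/")

# Roll the low-level transactions up into the daily summary the warehouse reports on.
daily = (raw
         .groupBy("store_id", "sale_date", "product_id")
         .agg(F.sum("quantity").alias("units_sold"),
              F.sum("net_amount").alias("net_sales")))

# Push only the aggregate up to Teradata over JDBC; the raw detail stays in Hadoop.
(daily.write
      .format("jdbc")
      .option("url", "jdbc:teradata://tdprod/DATABASE=sales_mart")
      .option("dbtable", "daily_store_sales")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .mode("append")
      .save())
```

The key design point is that only the small aggregate crosses back into Teradata; the full transaction detail stays in HDFS, where storage is cheap and remains available for deep, ad hoc analysis.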

Since the Hadoop cluster is hosted at Bit Refinery, the internal network at the corporate office was extended directly to the hosted cluster via a layer 2 connection. This ensures total control over the data as well as bandwidth usage. Data is encrypted at rest and in transit, and since the Hadoop data nodes are dedicated to them, security concerns were quickly laid to rest.

We hope this article provides a real-world example of how a company that doesn’t necessarily have a “Big Data” problem can use Hadoop not only to save money on IT infrastructure, but also to house all of its low-level data in a cost-effective store that allows for deep analysis.

Want more information? Here is a good talk from the Hadoop Summit about how a company saved $30 million using this technique.

Have questions for us? Contact us for more information.