Monday, October 14, 2013

10 successful big data sandbox strategies

Keep these ten strategies in mind when building and managing big data test environments.
Being able to experiment with big data and queries in a safe and secure “sandbox” test environment is important to both IT and business end users as companies get started with big data. However, setting up a big data sandbox is different from establishing a traditional test environment for transactional data and reports. Here are ten key strategies to keep in mind for building and managing big data sandboxes:

1. Data mart or master data repository?

The database administrator needs to decide early on whether test sandboxes will use data directly from the master data repository that production uses, or whether it is better to replicate sections of this data and split them off into separate data marts that are reserved for testing only. The advantage of the full data repository is that testing uses the same data as production, so test results will be more accurate. The disadvantage is that test jobs can contend with production for that data. With the data mart strategy, you don’t risk contention with production data, but the data marts will need to be refreshed periodically to stay reasonably synchronized with production if they are going to closely approximate the production environment.

2. Work out scheduling

Scheduling is one of the most important big data sandbox activities. Good scheduling ensures that sandbox work makes the best use of available resources. It usually achieves this by running a group of smaller jobs concurrently with a longer job, so that the smaller jobs complete while the longer one is still in progress. In this way, resources are allocated to as many jobs as possible. The key to this process is for IT to sit down with the various user areas that are using sandboxes so everyone has an upfront understanding of the schedule, the rationale behind it, and when they can expect their jobs to run.
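To make the idea of running smaller jobs alongside a longer one concrete, here is a minimal sketch. It assumes a single spare lane next to the long job and that IT has a rough duration estimate for each job; the `Job` class, the `backfill` function, and the job names are hypothetical and purely illustrative, not part of any particular scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    minutes: int  # estimated run time (an assumption supplied by IT)


def backfill(jobs):
    """Pick the longest job, then choose shorter jobs that can run one
    after another in a second lane and still finish before it does."""
    if not jobs:
        return None, [], []
    ordered = sorted(jobs, key=lambda j: j.minutes, reverse=True)
    long_job, rest = ordered[0], ordered[1:]
    budget = long_job.minutes  # time available in the second lane
    alongside, deferred = [], []
    for job in sorted(rest, key=lambda j: j.minutes):  # shortest first
        if job.minutes <= budget:
            alongside.append(job)
            budget -= job.minutes
        else:
            deferred.append(job)  # wait for the next scheduling window
    return long_job, alongside, deferred


queue = [Job("model_training", 120), Job("etl_a", 30),
         Job("report_b", 45), Job("adhoc_c", 60)]
long_job, alongside, deferred = backfill(queue)
print(long_job.name, [j.name for j in alongside], [j.name for j in deferred])
```

Shortest-first is only one possible policy. It favors finishing as many jobs as possible, which matches the goal stated above, but a real schedule would also reflect the priorities agreed with the user areas.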

3. Set limits

If months go by without a specific data mart or sandbox being used, business users and IT should have mutually acceptable policies in place for purging these resources so they can be returned to a resource pool and re-provisioned for other activities. The test environment should be managed as effectively as its production counterpart, so that resources stay allocated only while they are actively being used.
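A purge policy of this kind can be reduced to a simple check against a last-used date. The sketch below is only an illustration: the 90-day threshold, the `find_idle_sandboxes` function, and the sandbox names are assumptions, and the actual limit should be whatever business users and IT agree on.

```python
from datetime import date, timedelta


def find_idle_sandboxes(last_used, today, max_idle_days=90):
    """Return the names of sandboxes not used within max_idle_days.

    last_used maps a sandbox name to the date it was last used.
    The 90-day default is an illustrative assumption, not a recommendation.
    """
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, used in last_used.items() if used < cutoff)


last_used = {
    "marketing_mart": date(2013, 3, 1),   # hypothetical sandbox names
    "fraud_sandbox": date(2013, 10, 1),
}
print(find_idle_sandboxes(last_used, today=date(2013, 10, 14)))
```

The function only reports candidates. Whether a flagged sandbox is actually purged should remain a decision made under the agreed policy.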

4. Use clean data

One of the preliminary big data pipeline jobs should be preparing and cleaning data so that it is of reasonable quality for testing, especially if you are using the “data mart” approach. It is a bad habit (dating back to testing for standard reports and transactions) to use data in test regions that is incomplete, inaccurate, or even broken—simply because it was never cleaned up before it was dumped into a test region. Resist this temptation with big data.
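As a rough illustration of what such a preparation job might do, the sketch below trims stray whitespace, drops records that are missing required fields, and removes duplicates. The `clean_records` function and the field names are hypothetical, and real cleaning rules depend entirely on the data involved.

```python
def clean_records(records, required):
    """Return a cleaned copy of records (a list of dicts).

    - strips whitespace from string values
    - drops records where a required field is missing or empty
    - drops duplicates, judged by the values of the required fields
    """
    seen = set()
    cleaned = []
    for rec in records:
        row = {k: v.strip() if isinstance(v, str) else v
               for k, v in rec.items()}
        if any(row.get(field) in (None, "") for field in required):
            continue  # incomplete record
        key = tuple(row[field] for field in required)
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append(row)
    return cleaned


raw = [
    {"id": "1", "amount": " 10 "},
    {"id": "1", "amount": "10"},   # duplicate once whitespace is trimmed
    {"id": "", "amount": "5"},     # missing id
    {"id": "2", "amount": None},   # missing amount
]
print(clean_records(raw, required=("id", "amount")))
```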

5. Monitor resources

Assuming big data resources are centralized in the data center, IT should set resource allowances and monitor sandbox utilization. One area often requiring close attention is the tendency to over-provision resources as more end user departments engage in sandbox activities.
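The two checks described here, usage against each department's allowance and total allowances against what the data center can actually supply, can be sketched as follows. The `utilization_report` function, the department names, and the abstract resource units are assumptions for illustration only.

```python
def utilization_report(allowance, usage, capacity):
    """Compare sandbox usage with allowances and total capacity.

    allowance and usage map a department to resource units
    (for example node-hours or terabytes); capacity is the total
    the data center can supply in the same units.
    """
    over_quota = sorted(
        dept for dept, used in usage.items()
        if used > allowance.get(dept, 0)
    )
    committed = sum(allowance.values())
    return {
        "over_quota": over_quota,                   # departments past their allowance
        "committed": committed,                     # total of all allowances
        "over_provisioned": committed > capacity,   # promised more than exists
    }


report = utilization_report(
    allowance={"marketing": 40, "finance": 40, "ops": 30},
    usage={"marketing": 55, "finance": 10},
    capacity=100,
)
print(report)
```

The `over_provisioned` flag is the one to watch as more departments join: each new allowance can look reasonable on its own while the total quietly exceeds capacity.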

6. Watch for project overlap

At some point, it makes sense to have a corporate “steering committee” for big data that tracks the various sandbox projects going on throughout the company to ensure that there is no overlap and/or duplicated effort.  

7. Consider centralizing compute resources and management in IT

Some companies start out with big data projects in specific departments but quickly learn that they can’t work on big data, do their daily work, and then manage compute resources, too. Ultimately, they move the equipment into the data center for IT to manage. This frees them to focus on the business and ways that big data can bring in value.

8. Use a data team

Even in sandbox experimentation, it’s important to have a team with the requisite big data skills on hand to assist with tasks. Typically, this team consists of a business analyst, a data scientist, and an IT support person who can fine-tune hardware and software resources and coordinate with database specialists.

9. Stay on task with business cases

It’s important to infuse creativity into sandbox activities, but not to the point where you lose sight of the business case you set out to serve.

10. Define what a sandbox is!

Participants from the business side, in particular, might not be familiar with the term “sandbox” or what it implies. Like the childhood sandbox, a big data sandbox is a place to play and experiment freely with big data, but to do it with purpose. Part of this purposeful activity should be abiding by the ground rules of the sandbox, such as when, where, and how to use it, as well as experimenting to derive meaningful results for the business.
