The need for a private, secure and safe area for data discovery within the data warehouse ecosystem is growing rapidly as many companies start investing in and investigating "big data". Business users need space and resources to evaluate new data sources to determine their value to the business and/or explore news way of analyzing existing datasets to extract even more value. These safe areas are most commonly referred to as "Sandboxes" or "Discovery Sandboxes" or "Discovery Zones". If you are not familiar with the term then Forrester Research defines a "sandbox" as:
“data exploration environment where a power user can analyse production […] with near complete freedom to modify data models, enrich data sets and run the analysis whenever necessary, without much dependency on IT and production environment restrictions.” *1
These sandboxes are tremendously useful for business users because they allow them to quickly and informally explore new data sets or new ways of analyzing data without having to go through the formal rigour normally associated with data flowing into the EDW or deploying analytical scripts within the EDW. They provide business users with a high degree of freedom. The real business value is highlighted in a recent article by Ralph Kimball:
In several of the e-commerce enterprises interviewed for this white paper, analytic sandboxes were extremely important, and in some cases hundreds of the sandbox experiments were ongoing simultaneously.
As one interviewee commented “newly discovered patterns have the most disruptive potential, and insights from them lead to the highest returns on investment" *2
So what are they key characteristics of a sandbox? Essentially there are three:
- Used by skilled business analysts and data scientists
- Environment has fewer rules of engagement
- Time boxed
Sandboxes are not really designed to be used by CIOs or CEOs or general BI users. They are designed for business analysts and data scientists who have a strong knowledge of SQL, detailed understanding of the business and the source data that is being evaluated/analyzed. As with many data exploration projects you have to be able to understand the results that come back from a query and be able to determine very quickly if they make sense.
As I stated before, the normal EDW rules of engagement are significantly relaxed within the sandbox and new data flowing into the sandbox is typically disorganised and dirty. Hence the need for strong SQL skills to create simplified but functional data cleaning and transformation scripts with the emphasis being to make new data usable as quickly as possible. Part of the "transformation" process might be to generate new data points derived from existing attributes. A typical example of this is where a data set contains date-of-birth information, which in itself is quite a useful piece of information, that can be transformed to create a new data point of "age". Obviously the business analysts and data scientist need to be reasonably proficient in SQL to create the required transformation steps - it is not a complicated process but it highlights the point that the business community needs to have the necessary skills so that they are self-sufficient.
Most importantly the sandbox environment needs to have a time limit. In the past this is where most companies have gone wrong! Many companies fail to kill off their sandboxes. Instead these environments evolve and flourish into shadow marts and/or data warehouses which end up causing havoc as users can never be sure which system contains the correct data. Today, most enlightened companies enforce a 90-day timer on their sandboxes. Once the 90 day cycle is complete then ownership of the processes and data are either moved over to the EDW team, who can then start to apply the corporate standards to the various objects and scripts, or the environment and all its data is simply dropped.
The only way a business can support the hundreds of live sandbox experiments described in Kimball's recent report (*2) is by enforcing these three key characteristics.
Choosing your deployment model:
Over the years that I have spent working on various data warehouse projects I have seen a wide variety of weird and wonderful deployment models designed to support sandboxing. In very general terms these various deployment models reduce down to one of the following types:
- Desktop sandbox
- Detached sandbox
- Attached sandbox
each one of these deployment models has benefits and advantages as described here:
1. Desktop Sandboxes
Many business users prefer to use their desktop tools, such as spreadsheet packages, because the simple row-column data model gives them a simplified and easily managed view of their data set. However, this approach places a significant processing load on the desktop computer (laptop or PC) and while some vendors offer a way to off-load some of that processing to bespoke middleware servers this obviously means implementing an additional specialised middleware server on dedicated hardware. Otherwise, companies have to invest large amounts of money upgrading their desktop systems with additional memory and solid-state disks.
Creating a new sandbox is just a question of opening a new, fresh worksheet and loading the required data set. Obviously, the size and breadth of the dataset is limited by the resources on the desktop system and complicated calculations can take a considerable time to run with little or no scope for additional optimisation or tuning. Desktop sandbox are, by default, data-silos and completely disconnected from the enterprise data warehouse which makes it very difficult to do any sort of joined-up analysis.
The main advantage of this approach is that power users can easily run what-if models where they redefine their data model to test new "hierarchies", add new dimensions or new attributes. They can even change the data by simply over-typing existing values. Collaboration is a simple process of emailing the spreadsheet model to other users for comments. The overriding assumption here is that users who receive the spreadsheet are actually authorised to view the data! Of course there is nothing to prevent recipients forwarding the data to other users. Therefore, it is fair to say that data security is non-existent.
For DBAs, the biggest problem with this approach is that it offers no integration points into the existing cloud management infrastructure. Therefore, it is difficult for the IT team to monitor the resources being used and make appropriate x-charges. Of course the DBA has no control over the deletion of desktop based sandboxes so there is a tendency for these environments to take on a life of their own with business users using them to create "shadow" production systems that are never decommissioned.
Overall, the deployment of desktop sandboxes is not recommended.
2. Detached Sandboxes
Using a detached, dedicated sandbox platform resolves many of the critical issues related to desktop sandbox platforms most notably the issues relating to: data security and processing scalability. Assuming a relatively robust platform is used to manage the sandboxes then the security profiles implemented in the EDW can be replicated across to the stand-alone platform. This approach still allows users to redefine their data model to test new "hierarchies", add new dimensions or new attributes within what-if models and even change data points but this ability is "granted" by the DBA rather than being automatically taken and enforced by the business user. In terms of sharing results there is no need to distribute data via email and this ensures everyone gets the same consistent view of the results (and by default the original source, should there be a need to work backwards from the results to the source).
Key concerns for business users is the level of latency that occurs from the need to unload and reload not only the required data but also all the supporting technical and business metadata. Unloading, moving and importing large historical data sets can be very time consuming and can require large amounts of resources on the production system - which may or may not be available depending on the timing of the request.
For the DBA issues arise around the need to monitor additional hardware and software services in the data center. For IT this means more costs because additional floor space, network bandwidth, power and cooling may be required. Of course, assuming that the sandbox platform fits into the existing monitoring and control infrastructure then x-charging can be implemented. In this environment the DBA has full control over the deletion of a sandbox so they can prevent the spread of "shadow" production data sets. For important business discoveries, the use of detached sandboxes does provide the IT team with the opportunity to grab the loading and analysis scripts and move them to the production EDW environment. This helps to reduce the amount of time and effort needed to "productionize" discoveries.
While detached sandboxes remove some of the disadvantages of desktop platforms it is still not an ideal way to deliver sandboxes to the business community.
3. Attached Sandboxes
Attached sandboxes resolve all the problems associated with the other two scenarios. Oracle provides a rich set of in-database features that allow business users to work with in-place data, which in effect, removes the issue of data latency. Oracle Database is able to guarantee complete isolation for any changes to dimensions, hierarchies, attributes and/or even individual data points so there is no need to unload, move and then reload data. All the existing data security policies remain in place which means there is no need to replicate security profiles to other systems where there is the inherent risk that something might be missed in the process.
For the DBA, x-charging can be implemented using existing infrastructure management tools. The DBA has full control over the sandbox in terms of resources (storage space, CPU, I/O) and duration. The only concern that is normally raised regarding the use of attached sandboxes is the impact on the existing operational workloads. Fortunately, Oracle Database, in conjunction with our engineered systems, has a very robust workload management framework (see earlier posts on this topic: https://blogs.oracle.com/datawarehousing/tags/Workload_Management). This means that the DBA can allocate sufficient resource to each sandbox while ensuring that the key operational workloads continue to meet their SLAs. Overall, attached sandboxes, within an Oracle Database environment, is a win-win solution: both the DBA and the business community get what they need.
High degree of local control over data
Reduced data scalability
Reduces workload on EDW
Requires additional hardware and software
Upload additional data to virtual partitions Easy to mix new data with corporate data
Requires robust workload management tools
From this list of pros and cons it is easy to see that the "Attached Sandbox" is the best deployment model to use. Fortunately, Oracle Database 12c has a number of new features and improvements to existing features that mean it is the perfect platform for deploying and managing attached sandboxes.
B-O-X-D: the lifecycle of a sandbox
Now we know what type of sandbox we need to deploy (just in case you were not paying attention - attached sandboxes!) to keep our business users happy the next step is to consider the lifecycle of the sandbox along with the tools and features that support each of the key phases. To make things easier I have broken this down into four key DBA-centric phases as shown below:
Over the next four weeks I will cover these four key phases of the sandbox lifecycle and explain which Oracle tools and Oracle Database features are relevant and how they can be used.
*1 Solve the Data Management Conflict Between Business and IT, by Brad Peters - Information Management Newsletters, July 20, 2010
*2 The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics by Ralph Kimball