This is the next part in my ongoing series of posts on how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build phase of our lifecycle.
Now, in this post I am going to focus on the Observe phase. At this stage in the lifecycle we are concerned with managing our sandboxes. Most modern data warehouse environments will be running hundreds of data discovery projects, so it is vital that the DBA can monitor and control the resources that each sandbox consumes by establishing rules that apply both in general terms and specifically to each project.
In most cases, DBAs will set up a sandbox with dedicated resources. However, this approach does not make efficient use of resources, since unused resources cannot be shared with other projects. The key advantage of Oracle Multitenant is its unique approach to resource management. The only realistic way to support thousands of sandboxes, which in today’s analytics-driven environments is entirely possible if not inevitable, is to allocate one chunk of memory and one set of background processes for each container database. This provides much greater utilisation of existing IT resources and greater scalability as multiple pluggable sandboxes are consolidated into the multitenant container database.
Using multitenant we can now expand and reduce our resources as required to match our workloads. In the example below we are running an Oracle RAC environment, with two nodes in the cluster. You can see that only certain PDBs are open on certain nodes of the cluster and this is achieved by opening the corresponding services on these nodes as appropriate. In this way we are partitioning the SGA across the various nodes of the RAC cluster. This allows us to achieve the scalability we need for managing lots of sandboxes. At this stage we have a lot of project teams running large, sophisticated workloads which is causing the system to run close to capacity as represented by the little resource meters.
It would be great if our DBA could add some additional processing power to this environment to handle this increased workload. With 12c what we can do is simply drop another node into the cluster, which allows us to spread the processing of the various sandbox workloads out across the expanded cluster.
Now our little resource meters are showing that the load on the system is a lot more comfortable. This shows that the new multitenant feature integrates really well with RAC. It’s a symbiotic relationship whereby Multitenant makes RAC better and RAC makes Multitenant better.
So now that we can add resources to the cluster, how do we actually manage resources across each of our sandboxes? As a DBA I am sure that you are familiar with the features in Resource Manager that allow you to control system resources: CPU, sessions, parallel execution servers, Exadata I/O. If you need a quick refresher on Resource Manager then check out this presentation by Dan Norris, “Overview of Oracle Resource Manager on Exadata”, and the chapter on resource management in the 12c DBA guide.
With 12c, Resource Manager is now multitenant-aware. Using Resource Manager we can configure policies to control how system resources are shared across the sandboxes/projects. Policies control how resources are utilised across PDBs, creating hard limits that can enforce a “get what you pay for” model, which is an important point when we move forward to the next phase of the lifecycle: X-Charge. Within Resource Manager we have adopted an “industry standard” approach to controlling resources based on two notions: shares, which set the relative amount of a resource that each PDB is guaranteed, and utilisation limits, which cap how much of a resource a PDB can consume.
To help DBAs quickly deploy PDBs with a pre-defined set of shares and utilisation limits there is a “Default” configuration that works, even as PDBs are added or removed. How would this work in practice? Using a simple example this is how we could specify resource plans for the allocation of CPU between three PDBs:
As you can see, there are four shares in total: two for the data warehouse and one each for our two sandboxes. This means that our data warehouse is guaranteed 50% of the CPU whatever else is going on in the other sandboxes (PDBs). Similarly, each of our sandbox projects is guaranteed at least 25%. However, in this case we did not specify settings for maximum utilisation. Therefore, our marketing sandbox could use 100% of the CPU if both the data warehouse and the sales sandbox were idle.
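The share arithmetic described above can be checked with a few lines of Python. This is only a sketch of the calculation, not Resource Manager code: the PDB names, the helper functions and the idea that a newly added sandbox picks up one share are all assumptions made for illustration.

```python
def guaranteed_cpu(shares):
    """Return each PDB's guaranteed CPU percentage from its share count."""
    total = sum(shares.values())
    return {pdb: 100.0 * count / total for pdb, count in shares.items()}


def cpu_ceiling(utilisation_limit=None):
    """Return the maximum CPU percentage; no limit means up to 100%."""
    return 100.0 if utilisation_limit is None else float(utilisation_limit)


# Shares from the example above (the PDB names are illustrative).
plan = {"data_warehouse": 2, "sales_sandbox": 1, "marketing_sandbox": 1}
guarantees = guaranteed_cpu(plan)

# Hypothetical: a fourth PDB is plugged in and is given a single share.
expanded = guaranteed_cpu({**plan, "new_sandbox": 1})

for pdb, pct in guarantees.items():
    print(f"{pdb}: guaranteed {pct:.0f}%, ceiling {cpu_ceiling():.0f}%")
```

With four shares the data warehouse is guaranteed 50% and each sandbox 25%. With five shares the same calculation gives 40% and 20%, which is the kind of rebalancing that happens when a sandbox is added.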
By using the “Default” profile we can simplify the whole process of adding and removing sandboxes/PDBs. As we add and remove sandboxes, the system resources are correctly rebalanced across all the plugged-in sandboxes/PDBs, using the settings specified in the default profile, as shown below.
In this latest post on sandboxing I have examined the “Observe” phase of our BOX’D sandbox lifecycle. With the new multitenant-aware Resource Manager we can configure policies to control how system resources are shared across sandboxes. Using Resource Manager it is possible to configure a policy so that the first tenant in a large, powerful server experiences a realistic share of the resources that will eventually be shared as other tenants are plugged in.
In the next post I will explore the next phase of our sandbox lifecycle, X-Charge, which will cover the metering and chargeback services for pluggable sandboxes.
Data Warehousing and Big Data were at the heart of this year’s OpenWorld conference, featuring in a number of keynotes and a huge number of general sessions. Our hands-on labs were all completely full as people got valuable hands-on time with our most important new features. The key areas at this year’s conference were:
All these topics appeared in the main keynote sessions, including live on-stage demonstrations of how each feature can be used to increase the performance and analytical capability of your data warehouse.
If you want to revisit the most important sessions, or if you simply missed this year’s conference and want to catch up on all the most important topics, then I have put together a book of the highlights from this year’s conference. The booklet is divided into the following sections:
You can download my review in PDF format by clicking here. Hope this proves useful and if I missed anything then let me know.
Since the term big data first appeared in our lexicon of IT and business technology it has been intrinsically linked to the no-SQL, or anything-but-SQL, movement. However, we are now seeing that SQL is experiencing a renaissance. The term “noSQL” has softened to a much more realistic approach - a "not-only-SQL" approach. And now there is an explosion of SQL-based implementations designed to support big data. Leveraging the Hadoop ecosystem, there is: Hive, Stinger, Impala, Shark, Presto and many more. Other NoSQL vendors such as Cassandra are also adopting flavors of SQL. Why is there a growing level of interest in the reemergence of SQL? Probably, a more pertinent question is: did SQL ever really go away? Proponents of SQL often cite the following explanations for the re-emergence of SQL for analysis:
However, despite the virtues of these explanations, they alone do not explain the recent proliferation of SQL implementations. Consider this: how often does the open-source community embrace a technology just because it is the corporate orthodoxy? The answer is: probably not ever. If the open-source community believed that there was a better language for basic data analysis, they would be implementing it. Instead, a huge range of emerging projects, as mentioned earlier, have SQL at their heart. The simple conclusion is that SQL has emerged as the de facto language for big data because, frankly, it is technically superior. Let’s examine the four key reasons for this:
The concept of SQL is underpinned by the relational algebra - a consistent framework for organizing and manipulating sets of data - and the SQL syntax concisely and intuitively expresses this mathematical system.
Most business users, data analysts and even data scientists think about data within the context of a spreadsheet. If you think about a spreadsheet containing a set of customer orders then what do most people do with that spreadsheet? Typically, they might filter the records to look only at the customer orders for a given region. Alternatively, they might hide some columns: maybe the customer address is not needed for a particular piece of analysis, but the customer name and their orders are important data points. Finally, they might add calculations to compute totals and/or perhaps create a cross tabular report.
Within the language of SQL these are common steps: 1) projections (SELECT), 2) filters and joins (WHERE), and 3) aggregations (GROUP BY). These are core operators in SQL. The vast majority of people have found the fundamental SQL query constructs to be a straightforward and readable representation of everyday data analysis operations.
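To make the spreadsheet analogy concrete, here is a small sketch that runs the three steps against an in-memory SQLite database using Python's standard sqlite3 module. SQLite is only a stand-in for whatever SQL engine you use, and the orders table, its columns and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer TEXT, address TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("Acme", "1 Main St", "West", 100.0),
        ("Acme", "1 Main St", "West", 250.0),
        ("Globex", "9 Oak Ave", "West", 75.0),
        ("Initech", "4 Elm Rd", "East", 500.0),
    ],
)

# Projection: keep customer and amount, "hiding" the address column.
# Filter: look only at orders for one region.
# Aggregation: compute a total per customer.
west_totals = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders "
    "WHERE region = 'West' "
    "GROUP BY customer "
    "ORDER BY customer"
).fetchall()

for customer, total in west_totals:
    print(customer, total)
```

Each clause maps onto one of the spreadsheet actions: the column list hides columns, WHERE filters rows and GROUP BY adds the totals.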
When a developer writes a SQL query, he or she simply describes the results that they want. The developer does not have to get into any of the nitty-gritty of describing how to get the results.
This type of approach is often referred to as ‘declarative programming’, and it makes the developer's job easier. Even the simplest SQL query illustrates the benefits of declarative programming:
SELECT day, prcp, temp FROM weather
WHERE city = 'San Francisco' AND prcp > 0.0;
SQL engines may have multiple ways to execute this query (for example, by using an index). Fortunately the developer doesn't need to understand any of the underlying database processing techniques. The developer simply specifies the desired set of data using projections (SELECT) and filters (WHERE).
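One way to see this for yourself is to ask an engine how it plans to run the query before and after an index exists. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in engine; the sample rows and the index name idx_weather_city are assumptions, and the exact plan wording differs between SQLite versions and between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, day TEXT, prcp REAL, temp REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?)",
    [
        ("San Francisco", "2014-09-01", 0.25, 61.0),
        ("San Francisco", "2014-09-02", 0.0, 64.0),
        ("Hayward", "2014-09-01", 0.5, 70.0),
    ],
)

# The query text from the post; it never changes in this example.
QUERY = (
    "SELECT day, prcp, temp FROM weather "
    "WHERE city = 'San Francisco' AND prcp > 0.0"
)


def plan_for(sql):
    """Return the engine's plan description (last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan_for(QUERY)
rows_before = conn.execute(QUERY).fetchall()

# Hypothetical index; only the physical design changes, not the query.
conn.execute("CREATE INDEX idx_weather_city ON weather (city)")

plan_after = plan_for(QUERY)
rows_after = conn.execute(QUERY).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
```

The query returns the same rows both times, but the engine switches from scanning the table to using the index without the developer changing a single word of SQL.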
This is perhaps why SQL has emerged as such an attractive alternative to the MapReduce framework for analyzing HDFS data. MapReduce requires the developer to specify, at each step, how the underlying data is to be processed. For the same “query", the code is longer and more complex in MapReduce. For the vast majority of data analysis requirements, SQL is more than sufficient, and the additional expressiveness of MapReduce introduces complexity without providing significant benefits.
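The difference in effort is easy to sketch. The example below is not Hadoop code: it imitates the map, shuffle and reduce steps in a single Python process, with invented sample rows, and compares the result with the equivalent one-line SQL run through SQLite. The question being answered, total precipitation per city, is also just an illustrative choice.

```python
import sqlite3
from collections import defaultdict

# Illustrative rows: (city, day, prcp, temp).
rows = [
    ("San Francisco", "2014-09-01", 0.25, 61.0),
    ("San Francisco", "2014-09-02", 0.5, 64.0),
    ("Hayward", "2014-09-01", 0.75, 70.0),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for city, _day, prcp, _temp in records:
        yield city, prcp


def shuffle_phase(pairs):
    """Group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values to a single total."""
    return {key: sum(values) for key, values in groups.items()}


mapreduce_totals = reduce_phase(shuffle_phase(map_phase(rows)))

# The same question expressed declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, day TEXT, prcp REAL, temp REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT city, SUM(prcp) FROM weather GROUP BY city").fetchall()
)

print(mapreduce_totals == sql_totals)
```

Three hand-written functions describe how to produce the answer, while the SQL statement only says what the answer should contain.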
The fact that SQL is a declarative language not only shields the developer from the complexities of the underlying query techniques, but also gives the underlying SQL engine a lot of flexibility in how to optimize any given query.
In a lot of programming languages, if the code runs slow, then it's the programmer's fault. For the SQL language, however, if a SQL query runs slow, then it's the SQL engine's fault.
This is where analytic databases really earn their keep – databases can easily innovate ‘under the covers’ to deliver faster performance; parallelization techniques, query transformations, indexing and join algorithms are just a few key areas of database innovation that drive query performance.
SQL provides a robust framework that adapts to new requirements
SQL has stayed relevant over the decades because, even though its core is grounded in universal data processing techniques, the language itself can be extended with new processing techniques and new calculations. Simple time-series calculations, statistical functions, and pattern-matching capabilities have all been added to SQL over the years.
Consider, as a recent example, what many organizations realized as they started to ask queries such as 'how many distinct visitors came to my website last month?' These organizations realized that it is not vital to have a precise answer to this type of query ... an approximate answer (say, within 1%) would be more than sufficient. This requirement has now been quickly delivered by implementing the existing HyperLogLog algorithms within SQL engines for 'approximate count distinct' operations.
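For readers curious about what sits behind an 'approximate count distinct', here is a minimal sketch of the basic HyperLogLog idea in plain Python. It is a textbook-style version written for illustration, not the implementation used by any particular SQL engine; the class name, the choice of SHA-1 as the hash and the register count are all assumptions. With 4,096 registers the expected standard error is roughly 1.04 divided by the square root of the register count, or about 1.6%.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        index = h >> (64 - self.p)  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


# 200,000 page views from 100,000 distinct (made-up) visitor ids.
hll = HyperLogLog(p=12)
for i in range(200_000):
    hll.add(f"visitor-{i % 100_000}")

estimate = hll.count()
print(f"estimated distinct visitors: {estimate:.0f}")
```

The sketch keeps only 4,096 small counters no matter how many visitors it sees, which is why this style of algorithm is attractive for very large data sets.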
More importantly, SQL is a language that is not explicitly tied to a storage model. While some might think of SQL as synonymous with relational databases, many of the new adopters of SQL are built on non-relational data. SQL is well on its way to being a standard language for accessing data stored in JSON and other serialized data structures.
SQL is an immensely popular language today … and if anything its popularity is growing as the language is adopted for new data types and new use cases. The primacy of SQL for big data is not simply a default choice, but a conscious realization that SQL is the best suited language for basic analysis.
PS. Next week, many sessions at this year’s OpenWorld will focus on the power, richness and performance of SQL for sophisticated data analysis including the following:
Monday September 28
Using Analytical SQL to Intelligently Explore Big Data @ 4:00PM Moscone North 131
Joerg Otto - Head of Database Engineering, IDS GmbH
Marty Gubar - Director, Oracle
Keith Laker - Senior Principal Product Manager, Data Warehousing and Big Data, Oracle
YesSQL! A Celebration of SQL and PL/SQL @ 6:00PM Moscone South 103
Tuesday September 29
SQL Is the Best Development Language for Big Data @ 10:45AM Moscone South 104
Enjoy OpenWorld 2014 and if you have time please come and meet the Analytical SQL team in the Moscone South Exhibition Hall. We will be on the Parallel Execution and Advanced SQL Processing demo booth (id 3720).
There's so much to see and learn at Oracle OpenWorld because it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users.
What to expect at OOW 2014 - We will be announcing a wide range of continuous data warehouse innovations in both hardware and software. Join Oracle experts as we dive deep into the latest generation of data warehouse innovations for analyzing enterprise data and diverse big data streams to derive real business value. You will also learn data warehouse best practices and hear from customers consolidating business analysis onto a common scalable platform. Hands-on labs are available for both beginners and experts giving you the chance to try some of these innovative data warehouse technologies first-hand.
To help you get the most from this year’s event I have put together a comprehensive downloadable guide of all the data warehousing and big data activities at @OracleOpenWorld 2014. If you are a smartphone and/or tablet user then check out our amazing web apps (see previous post OpenWorld on your iPad and iPhone - Now Fully Operational!). If you don’t have a tablet or a suitable smartphone or just want a downloadable booklet then this guide contains everything you need to help you get the most from this year’s conference, including the following:
Click here to download Guide in Apple iBook format
Please note that this Apple iBook can be used on any Apple Mac computer or iPad running the iBook application. iPod touch and iPhone users should use the PDF version of this guide.
Click here to download Guide in PDF format
Enjoy @OracleOpenWorld 2014 and if you have time please stop by the Parallel Execution and Analytical SQL demo booth in the demo grounds and say hello.
In one of my recent blog posts I provided links to our OpenWorld data warehouse web app for smartphones and tablets. Now that the OOW team has released the hands-on lab schedule (it is now live on the OpenWorld site) I have updated my smartphone and tablet apps to include the list of hands-on labs on a day-by-day basis (Monday, Tuesday, Wednesday, Thursday). The list of hands-on labs can still be viewed in subject area order (data warehousing and big data) within the app via the “Switch to subject view” link in the top left part of the screen.
I have also added a location map which can be viewed by clicking on the linked text, “View location map”, which is in the top right part of the screen on each application. The location map that is available within both the tablet and smartphone apps is shown below:
If you want to run these updated web apps on your smartphone and/or tablet then you can reuse the existing links that I published on my last blog post. If you missed that post then follow these links:
Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays frames, since scrolling within frames does not work. The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser on Samsung devices.
Please note that I have also published online calendars (via my Google account) which can be viewed via the following blog posts:
If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld and, if you have time, it would be great to see you if you want to stop by at the Parallel Execution and Analytical SQL demo booth.
There are so many exciting hands-on labs at this year’s OpenWorld conference and the schedule builder is now live so you can start booking your seat at these labs. To help you get organized and pick the most useful labs to attend I have published a new shared online calendar that contains all the most important data warehouse and big data hands-on labs at this year’s OpenWorld. The following links will allow you to add this shared calendar to your own calendar application:
Hope this helps you get organised for this year’s incredible conference. If you have any comments then let me know. The online calendar for all the most important data warehousing and big data sessions is available via my previous blog post “Online Calendar for Data Warehousing Sessions at OpenWorld now available”.