Thursday, 19 February 2015

Part 2 - Why SQL is the natural language for data analysis

In the first part of this series of blog posts (Why SQL is the natural language for data analysis) I explained why analytics is important to the business, how the volume and variety of data is continuing to expand, and why that makes it vital that you select the right language for data analysis. Many of us work with SQL every day and take for granted many of its unique features, its power, flexibility and the richness of its analytics. This familiarity with SQL means that sometimes we are a bit slow to prevent some of our projects from investing in other big data languages such as MapReduce, Impala, Spark, Java and many of the new generation of SQL-like open source projects. While many of these new languages are considered "cool", it is very easy to end up building new proprietary data silos or investing in a language that is eventually replaced by another open source project or simply fails to deliver the required analytics.

One of the aims of this series of blog posts is to refresh your memory about why SQL has been so successful in the area of analytics. SQL has four unique features that make it perfect for data analysis:

  • Powerful framework
  • Transparent optimization
  • Continuous evolution
  • Standards based

The aim of this post is to explore the first of these features: the powerful framework.

The reason for collecting data is that it needs to be interrogated: events need to be dissected, customers need to be profiled, product sales need to be evaluated. This interrogation process is typically framed around a set of data rather than a single data point or row of data, and mathematical rules have been created to provide a precise framework for this analytical process. That framework is "relational algebra", first described in 1970 by E.F. Codd while at IBM. It provides well-founded semantics for both modeling and querying a set of data.

Codd's framework established a series of basic principles that govern the construction and execution of a query and these can be summarized as follows:

  1. Projection
  2. Filter
  3. Join
  4. Aggregate

The ideas behind relational algebra (set theory) are not just limited to the points that will be covered in this particular section. These concepts have implications that cascade right across the other unique features. While the elements of this basic framework are constant, the way these elements are implemented within many of the current big data related SQL-like languages differs based on the approach adopted by each of the open source projects/languages.

Let's explore each of the four basic principles, starting with "Projections".

1. Projections

When reviewing the concept of a "set of data" in the context of SQL it is important to remember that each source set of data contains two separate sets: a column set and a row set. When interrogating a set of data the first step is to determine which columns within the set are of interest: these are projections of the total set of columns. When a database system performs a projection, it simply determines which columns are necessary for a given analysis and discards the other columns from that analysis.

The Oracle Database has a comprehensive metadata layer that supports the discovery of columns across a wide range of data sets: relational tables/views, XML documents, JSON documents, spatial objects, image-style objects (BLOBs and CLOBs), semantic networks etc.

The first part of the SELECT clause is used to list the columns of interest within the set of data and this can include all the columns rather than simply a subset. The syntax for selecting all columns within a set, in this case using a table called EMP, is:

SELECT * FROM emp;

The Oracle Database uses its metadata layer to establish the list of all column names associated with the set and then internally expands the query statement to include all the relevant column names.
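For the familiar EMP demo table (assuming its standard eight columns) the statement above is therefore processed as if it had been written with the full column list:

SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno
FROM emp;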

 


Alternatively the query can specify the required columns within the set such as:

SELECT ename, job, hiredate FROM emp;


This level of sophistication (both metadata and automatic statement expansion) is missing in many data manipulation languages, which forces developers to add additional bespoke code to cope with these basic requirements. SQL also supports the concept of extended projections. This is where new data columns are created within the result set using arithmetic operations. For example, taking the columns SAL and COMM it is possible to compute the rate of commission by dividing COMM by SAL and multiplying by 100 to create a new column:

SELECT
  ename,
  sal,
  comm,
  comm/sal*100 AS comm_rate
FROM emp;

Oracle’s Big Data SQL feature extends this metadata model so that it can encompass data stored inside NoSQL databases, JSON documents, and files stored on HDFS. This comprehensive metadata layer makes it very easy for developers to create dynamic SQL statements containing column projections based around a wide range of data sources.
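As a purely illustrative sketch (not taken from the original post), a Hive table can be exposed to Oracle SQL through an external table that uses the Big Data SQL ORACLE_HIVE access driver; the table name, directory and access parameters below are assumptions and will vary with your own cluster configuration:

-- Hypothetical sketch: the table, directory and access parameter values are
-- assumptions and depend on your own Big Data SQL configuration.
CREATE TABLE movie_log (
  click_time  TIMESTAMP,
  cust_id     NUMBER,
  movie_id    NUMBER,
  activity    VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=moviedemo.movie_log)
)
REJECT LIMIT UNLIMITED;

-- Once defined, the external table supports the same column projections as
-- any other table:
SELECT cust_id, movie_id FROM movie_log;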

 

2. Filters

The next stage within the query framework is to specify the rows that are of interest. Since these cannot be identified in the same way as columns, using names, a different approach is used for row-based projections. This approach requires the use of filters – i.e. describing the attributes associated with row sets that are of interest. Within the SQL language row filtering is part of the WHERE clause syntax (these filters are also often called predicates in relational terminology).

The developer can define specific filtering criteria that rows must meet to be included in the result set. For example, using the employee data set it is possible to extend our previous example to limit the set of rows returned by a query to just those associated with the job type 'CLERK':

SELECT
  empno,
  ename, 
  job,
  mgr,
  hiredate,
  sal, 
  comm, 
  deptno
FROM emp
WHERE job = 'CLERK';


 

Applying filters can not only reduce the returned data set based on attributes describing the data – such as the job type above – it can also reduce the returned data set to an absolute or relative subset of the result set.

For business-driven queries this process of limiting rows is essential. A query can be structured to return a specified number or percent of rows starting with the first row after the offset. The offset allows for modification of typical questions, so that the question about highest-paid employees might skip the top ten employees and return only those from eleventh to twentieth place in the salary rankings. In a similar manner it is possible to query the employees data set by salary, skip the top ten employees and then return the top 10% of the remaining employees.
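A minimal sketch of this idea using the 12c row-limiting clause (the offset and row counts are purely illustrative):

SELECT empno, ename, sal
FROM emp
ORDER BY sal DESC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;

A PERCENT variant (FETCH NEXT 10 PERCENT ROWS ONLY) can be combined with the OFFSET clause in a similar way.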

The SQL language supports a wide range of filtering comparison operators to identify specific rows:

Operator                  Description
=, !=, <>                 Tests for equal to, not equal to, not equal to
>, >=, <, <=              Tests for greater than, greater than or equal to, less than, less than or equal to
BETWEEN ... AND ...       Checks for a range between and including two values
LIKE                      Searches for a match in a string, using the wildcard symbols % (zero or more characters) or _ (one character)
IN ( ), NOT IN ( )        Tests for a match, or no match, in a specified list of values
IS NULL, IS NOT NULL      Checks whether a value is null or is not null

 

While it is possible to filter rows based on values, for some types of application-driven queries this process of limiting rows can be extended, for example to return a specified number or percentage of rows starting with the first row after an offset. For application developers this flexibility is extremely useful: it can be used to aid testing or to provide a pagination feature for passing data to a front-end visualization. The Oracle Database supports a number of techniques to help developers meet this requirement (see the sketch after this list), such as:

  • FETCH FIRST N/N-PERCENT
  • RANK
  • ROW_NUMBER()
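For example, a simple pagination sketch built on ROW_NUMBER() might look something like this (the page boundaries are illustrative):

SELECT empno, ename, sal
FROM (
  SELECT
    empno,
    ename,
    sal,
    ROW_NUMBER() OVER (ORDER BY sal DESC) AS rn
  FROM emp
)
WHERE rn BETWEEN 11 AND 20;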

These features can be used within user interfaces to provide the first few rows of a data set for browsing. The SQL language has a very rich set of filtering techniques that are both simple to implement and to amend as requirements evolve over time.

 

3. Joins

Most query operations require the use of at least two data sets and this necessitates a “join” operation. At a simplistic level, a join is used to combine the fields within two or more data sets in a single query, based on common logical relationships between those various data sets. In our simple data example, suppose that there was a department data set in addition to the employee data set. A query might ‘join’ the department and employee data to return a single result combining data from both data sets. There are a number of ways that data sets can be joined:

  • Inner - returns all rows where there is at least one match in both tables
  • Full Outer - returns all rows from both tables, with matching rows from both sides where available; if there is no match, the missing side will contain nulls
  • Outer left - returns all rows from the left table, and matched rows from the right table
  • Outer right - returns all rows from the right table, and matched rows from the left table
  • Cross - returns the Cartesian product of the source sets, i.e. all rows from the left table for each row in the right table

The process of defining and executing a SQL join is simple. The required type of join is specified as part of the FROM clause using the JOIN ... ON syntax:

SELECT
  d.deptno,
  d.dname,
  e.empno,
  e.ename
FROM dept d
INNER JOIN emp e ON (e.deptno = d.deptno);

Note that there is nothing in this SQL query that describes the actual join process for the two data sets.

In many procedural languages the process of joining two sets of data can be complicated due to the need to explicitly code each join structure and join algorithm for each combination of columns across the various data sets. The level of complication builds as additional data sets are added to the query; different join operations are required for combinations of data sets such as taking into account missing values within data sets. The join order of multiple data sets has to be decided on and coded procedurally. It is very easy for the amount and complexity of code to increase dramatically as the number of data sets increases. This makes data processing validation and debugging very challenging.  Once additional requirements are applied, such as aggregating specific data points, the level of code complexity within procedural languages can escalate very quickly. 

For example, when we join the two tables EMP and DEPT there could be departments that contain no employees. Using an inner join, the departments with zero employees are not returned. If a query needs to return a count of the number of employees in every department, including the 'empty' departments containing zero employees, then an outer join is required as shown here:

 

SELECT
  d.deptno,
  COUNT(e.empno)
FROM dept d
LEFT OUTER JOIN emp e ON (e.deptno = d.deptno)
GROUP BY d.deptno
ORDER BY d.deptno;

 


 

Hopefully, if you study the above code samples, it is clear that SQL's join code is easily readable – and easily coded. The developer only specifies the semantic join condition and leaves the processing details – such as the order of the joins – to the SQL engine.

 

4. Aggregate

Aggregation is an important step in the process of analyzing data sets. Most operational, strategic and discovery-led queries rely on summarizing detailed level data. According to a TDWI report ("Data Aggregation - Seven Key Criteria to an Effective Aggregation Solution") up to 90% of all reports contain some level of aggregate information. Therefore, the ability to simply and efficiently aggregate data is a key requirement when selecting a language. If the aggregation process is correctly implemented it can generate significant performance benefits, which creates new opportunities for organizations to boost their overall analysis and reporting capabilities.

The types of aggregation applied to a data set can vary from simple counts to sums to moving averages to statistical analysis such as standard deviations. Therefore, the ability to simply and efficiently aggregate data is a key requirement for any analytical data language. Procedural languages, such as the Java based MapReduce, are more than capable of taking a data set and aggregating it, or reducing it, to provide a summary result set. The approach is adequate for most simple data discovery style queries, i.e. those that include basic counts. However, adding more complex aggregation requirements quickly increases the amount of code required to manage the computations. SQL has a rich set of data aggregation capabilities that make it easy to work on all rows in a set. For example, it is possible to sum all rows within a single column as follows:

SELECT
  SUM(sal) AS total_salary
FROM emp;

It is easy to extend this query to accommodate new requirements such as a count of the number of employees and the average salary:

SELECT
  COUNT(empno) AS no_of_employees,
  SUM(sal) AS total_salary,
  AVG(sal) AS average_salary
FROM emp;

Taking these examples even further, it is a simple step to group rows into specific categories using the GROUP BY clause. The aggregate functions and the GROUP BY clause can be used to group the data and then apply the specific aggregate function(s) to count the number of employees, sum the salary and calculate the average salary in each department within a single query as shown below:

SELECT
  deptno,
  COUNT(empno) AS no_of_employees,
  SUM(sal) AS total_salary,
  AVG(sal) AS average_salary
FROM emp
GROUP BY deptno;

The ANSI SQL:2003 standard (more on this towards the end of this series) extended the process of aggregating data by introducing the concept of analytic functions. These functions divide a data set into groups of rows called partitions, making it possible to calculate aggregates for each partition and for rows within each partition.

Additional keywords define how each analytic function, such as average, sum, min, max etc., is evaluated within each partition. Using the previous example, the statement below creates analytical reporting totals such as: total salary within each department, moving average salary within each department and average salary:

SELECT
  d.deptno,
  d.dname,
  e.ename,
  e.sal AS sal,
  SUM(e.sal) OVER (PARTITION BY e.deptno) AS dept_sal,
  ROUND(AVG(e.sal) OVER (PARTITION BY e.deptno ORDER BY e.empno)) AS moving_avg_sal,
  ROUND(AVG(e.sal) OVER (ORDER BY e.deptno)) AS avg_dept_sal
FROM dept d
LEFT OUTER JOIN emp e ON (e.deptno = d.deptno);

 

Compare the simplicity of the above SQL code for computing a moving average with the equivalent MapReduce code posted by Cloudera on Github as part of the blog post "Simple Moving Average, Secondary Sort, and MapReduce". This "simple" example consists of twelve Java program files to perform the moving average calculation, manage the job parameters and specify the framework for the associated workflow. Making changes to the calculations as business requirements evolve will be a significant challenge given the amount of code within this "simple" project.
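To make the comparison concrete, here is a sketch of a three-row moving average over the EMP data using an explicit window frame (the frame size and ordering column are illustrative choices):

SELECT
  deptno,
  ename,
  hiredate,
  sal,
  AVG(sal) OVER (PARTITION BY deptno
                 ORDER BY hiredate
                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_sal
FROM emp;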

SQL offers an additional way to apply restrictions using columns that are returned as part of the result set. The HAVING clause filters results based on calculations in the SELECT clause and/or aggregations derived from the GROUP BY processing, in much the same way that the WHERE clause filters individual rows, for example:

SELECT
  deptno AS department,
  COUNT(empno) AS no_employees,
  AVG(sal) AS avg_sal,
  SUM(sal) AS tot_sal
FROM emp
GROUP BY deptno
HAVING AVG(sal) > 2500;

 

Using SQL, developers and DBAs can leverage simple, convenient and efficient data aggregation techniques that require significantly less program code compared to using other analytical programming languages. The simplicity provided by SQL makes it easier and faster to construct, manage and maintain application code and incorporate new business requirements.

 

Summary

Hopefully within this blog post I have explained the primary operators that are part of relational algebra (set theory). Given that most of us use SQL every single day it is very easy to forget the power and sophistication that is going on under the covers. It all gets taken for granted until you find yourself working with a different language that uses a completely different framework.

Hopefully, it is clear that SQL offers a simple and efficient way to write queries compared to some of the other newer big data related languages that are emerging through the open source community. This concludes my look at the first of the four key reasons why SQL is one of the most successful languages. In my next post, which will probably appear towards the end of this week, I will explore why SQL's transparent optimization is such a critical feature.

 


Wednesday, 18 February 2015

Why SQL is the natural language for data analysis

Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of "big data environments" in particular (data stored and processed outside of classical relational systems) continues to be a significant challenge for the majority of companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don't have the necessary skills to make use of it.

A report by Accenture and GE (How the Industrial Internet is Changing the Competitive Landscape of Industries) found that 87% of enterprises believe big data analytics will redefine the competitive landscape of their industries within the next three years and 89% believe that companies that fail to adopt a big data analytics strategy in the next year risk losing market share and momentum.

Additionally, a recent Cloudera webcast (Pervasive Analytics Gets Real ) noted that, while all businesses understand that analytics drive value, most organizations leverage only an average of 12% of their enterprise data for analytics. This implies that there is a significant amount of business value and opportunity that is being completely missed.

This type of market analysis highlights the huge analytical challenge that many businesses face today. While many companies are willing to invest in the critical area of big data technology to create new "data reservoirs", for most of them the same level of focus on the analysis of these new data sources is missing. This means that many will struggle to find a meaningful way to analyze and realize the benefits of this vital investment strategy.

Many of the early adopters of big data are managing the analysis of their data reservoirs through the use of specialized programming techniques on the big data ecosystem, such as MapReduce. This is leading to data being locked inside proprietary data silos, making cross-functional cross-data store analysis either extremely difficult or completely impossible. IT teams are struggling to adapt complex data-silo-specific program code to support new and evolving analytical requirements from business users. In many cases these additional requirements force teams to implement yet more data processing languages. Overall, these issues greatly complicate the conversion of big data led discoveries into new business opportunities: driving the development of new products, capturing increased market share and/or launching into completely new markets. Most companies are searching for a single rich, robust, productive, standards driven language that can provide unified access over all their data and drive rich, sophisticated analysis.

Already, many companies are seeing the benefits of using SQL to drive analysis of their big data reservoirs. In fact, SQL is fast becoming the default language for big data analytics. This is because it provides a mature and comprehensive framework for both data access (so projects can avoid creating data silos) and rich data analysis.


The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for any kind of data analysis, including big data, and the benefits that this brings for application developers, DBAs and business users.

Why SQL is so successful

Data processing has seen many changes and significant technological advances over the last forty years. However, there has been one technology, one capability that has endured and evolved: the Structured Query Language or SQL. Many other technologies have come and gone but SQL has been a constant. In fact, SQL has not only been a constant, but it has also improved significantly over time. What is it about SQL that is so compelling? What has made it the most successful language? SQL’s enduring success is the result of a number of important and unique factors:

  • Powerful framework
  • Transparent optimization
  • Continuous evolution
  • Standards based

Over the coming weeks I will explore each of these key points in separate posts and explain what makes SQL such a compelling language for data analysis. So stay tuned…..

 


Wednesday, 28 January 2015

January Edition of Oracle DW and Big Data Magazine

 

Oracle Data Warehouse and Big Data Magazine JANUARY Edition for Customers + Partners


The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. It brings together all the most important announcements and videos taken from our data warehouse and big data Oracle product management blogs, Oracle press releases, videos posted on Oracle Media Network and Oracle Facebook pages. Click here to view the JANUARY Edition


Please share this link http://flip.it/fKOUS with colleagues and via your social media sites.


This magazine is optimized for display on tablets and smartphones using the Flipboard App which is available from the Apple App store and Google Play store





Thursday, 22 January 2015

MATCH_RECOGNIZE and the Optimizer

If you have already been working with the new 12c pattern matching feature you will have probably spotted some new keywords appearing in your explain plans. Essentially there are four new keywords that you need to be aware of:
  • MATCH RECOGNIZE
  • SORT
  • BUFFER
  • DETERMINISTIC FINITE AUTO
The first three bullet points are reasonably obvious (at least I hope they are!) but just in case… the keywords MATCH RECOGNIZE refer to the row source for evaluating the match_recognize clause. The SORT keyword means the row source sorts the data before running it through the state machine to find the matches.
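To see these keywords for yourself you can generate and display a plan in the usual way with EXPLAIN PLAN and DBMS_XPLAN; a minimal sketch, using the stock ticker example discussed below, would look something like this:

EXPLAIN PLAN FOR
SELECT *
FROM ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT DOWN UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);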
 
The last keyword is the most interesting and is linked to the use of “state machine”, as mentioned in the previous sentence. Its appearance or lack of appearance affects the performance of your pattern matching query. The importance of this keyword is based on the way that pattern matching is performed. To search for a pattern containing a specific set of events we build something called a “state-machine”. At this point I will turn to Wikipedia to provide a definition of a state machine:
 
…a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition…
 
 
The classic example of a state machine is a traffic light which moves through a given sequence of events in a set order and always in that order: Red -> Red & Yellow -> Green -> Yellow. The traffic light model can also be viewed as a "deterministic" state machine. This implies that for every state there is exactly one transition for a given input, i.e. it is not possible to have two different transitions leading out of a particular state. With our traffic light state model it is clear that given a red light state there is only one next transition, which is to red & yellow.
 
Let's use our normal stock ticker sample schema that tracks stock market prices for three ticker symbols (a sketch of the table is shown below) and compare two very similar pattern matching queries.
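For reference, the ticker table typically looks something like this; this is a sketch based on the schema used in the Oracle documentation examples, and the sample rows are purely illustrative:

CREATE TABLE ticker (
  symbol  VARCHAR2(10),
  tstamp  DATE,
  price   NUMBER
);

INSERT INTO ticker VALUES ('ACME', DATE '2011-04-01', 12);
INSERT INTO ticker VALUES ('ACME', DATE '2011-04-02', 17);

Here are the two queries: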
 
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   AFTER MATCH SKIP TO LAST UP
   PATTERN (STRT DOWN* UP*)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   AFTER MATCH SKIP TO LAST UP
   PATTERN (STRT DOWN UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;


Match Recognize keywords in Optimizer explain plan


Match Recognize keywords in Optimizer explain plan

 

Note that the key difference between the two SQL statements is the PATTERN clause. The first statement checks for zero or more instances of two different events: 1) where the price in the current row is less than the price in the previous row and 2) where the price in the current row is more than the price in the previous row. The second statement checks for only one instance of each down-up pattern. This difference in the definition of the pattern results in different explain plans, where the plan for the second statement includes the key phrase "DETERMINISTIC FINITE AUTO".

The phrase "DETERMINISTIC FINITE AUTO" means that the state machine we constructed is deterministic and thus, when running the sorted rows through the state machine, we don't do backtracking (I will write a separate blog post on this topic very soon as it is a key concept in pattern matching; for the moment I will simply point you to the Wikipedia page on backtracking, where I personally found the section headed "Description of the method" the most useful). The key benefit of building a "DETERMINISTIC FINITE AUTO" plan is that the execution is more efficient when there is no backtracking.

When we analyze the PATTERN clause and build the corresponding state machine we are able to detect a deterministic finite automaton by checking the state machine. If any state has two or more outgoing transitions then we regard the state machine as non-deterministic; similarly, if any final state is followed by a non-final state, the state machine is regarded as non-deterministic. At the moment we can only detect a few trivial cases such as PATTERN (A B C), PATTERN (A B+), PATTERN (A B*), etc.
 
The first example of these patterns that we can detect is shown above (see the second statement, where we have the STRT DOWN UP pattern) and the other two examples of these types of deterministic patterns are shown below:
 
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT DOWN UP+)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT DOWN UP*)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

Match Recognize keywords in Optimizer explain plan



Match Recognize keywords in Optimizer explain plan

 
For PATTERN (A | B) or PATTERN (A B+ C) we simply regard the state machine as non-deterministic; therefore, the explain plans only contain the keywords MATCH RECOGNIZE (SORT) as shown below:
 
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT | DOWN | UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   PATTERN (STRT DOWN* UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;


 
 
Within the current version of 12c (12.1.2) we are not checking the mutual exclusiveness of the DEFINE predicates when detecting a deterministic state machine; therefore, the execution plan defaults to a MATCH RECOGNIZE (SORT) style plan, where we may or may not have to use backtracking. Obviously, as we continue to develop the MATCH_RECOGNIZE feature we will expand our ability to detect a deterministic state machine, which means we will process your pattern more efficiently.
 
In summary, if you want the most efficient execution plan then try to define your pattern in such a way that we are able to create a deterministic state machine. This assumes, of course, that backtracking is not needed within each partition/data set in order to identify the required pattern (more on this in my next blog post).
  
Hope this information is useful. If you have any questions then feel free to contact me directly (keith.laker@oracle.com).
  


Wednesday, 10 December 2014

Analytical SQL scripts now on Github

After this year's OpenWorld I ran a 1-day workshop on analytical SQL for our Data Warehouse and Big Data Global Leaders customers. This was part of the Global Leaders 'Days with Development' programme. We had a packed room at the Oracle Conference Center and I was very lucky to have Stew Ashton, Technical Architect, BNP Paribas, Dr. Holger Friedrich, CTO at sumIT AG and Joerg Otto, Head of DB Engineering, IDS GmbH co-present with me and explain how they were using analytical SQL in their projects.

 The workshop covered the following topics:

  • Analytic SQL concepts and foundations
  • Analytic functions for reporting
  • Analytic functions for aggregation
  • More advanced and new 12c features: Pattern Matching
  • SQL Model clause

For the workshop I created a comprehensive slide deck (I will post the presentation shortly on our OTN home page) which included code samples and explain plans to highlight the key benefits of using our analytical SQL features and functions. The great news is that I now have a repository for analytical SQL code samples on Github. To kick things off I have posted all the SQL scripts that I created for this workshop, so you can now download and work through a series of use cases that explain how to use window functions, intelligently aggregate data, manipulate rows and columns of data, find patterns and create what-if scenarios. Below is my repository home page where you can download the code:

Github Repository for Analytical SQL

 

So what is Github?

At a simple level, it is an online version control system (and a lot more!) that stores and manages modifications to code script files within a central repository. Its key benefit is that it makes it very easy for developers to work together on a common project. This environment makes it easy to download a new version of code scripts, make changes, and upload the revisions to those files. Everyone can then see these new changes, download them, and contribute. This system is very popular with developers so we have decided to join this community and make our SQL scripts available via this site. It is the ability to "collaborate" which is most important for me.

To help you get started there is a great section on the website called “Set Up Git”. If like me you are using a Mac then GitHub has a Mac client! You can use it without ever touching the command line interface (which can be a little frustrating at times!).

You can contribute too!

It would be great if you could contribute your own scripts to this repository so I can build up a library of scripts for analytical SQL. All you need to do is create an account on Github, search for the analytical SQL repository and then either download the repository as a zip file or use the “Clone in Desktop” option. What I want to do is build up a series of well documented use cases and when we have enough content then I will create industry specific folders to help organize the content.

So my new repository is now live, please have a look and feel free to upload your own scripts that show how you have used analytical SQL to solve specific business problems within your projects. Looking forward to lots of files arriving into this great new repository. 

 


Friday, 5 December 2014

Part 5: X-Charging for Sandboxes

This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX'D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build phase of our lifecycle and Part 4 explored the Observe phase, so we have now arrived at the X-Charge part of our model.

To manage the chargeback process for our sandbox environment we are going to use the new Enterprise Manager 12c Cloud Management pack. For more information visit the EM home page on OTN.

Why charge for providing sandbox services? The simple answer is that placing a price or cost on a service ensures that the resources are used wisely. If a project team incurred zero costs for their database environment then there would be no incentive to evaluate the effectiveness of the data set, and the cost-benefit calculation for the project would be skewed by the lack of real-world cost data. This type of approach is the main reason why sandbox projects evolve over time into "production" data marts. Even if the project is not really delivering on its expected goals there is absolutely no incentive to kill the project and free up resources. Therefore, by not knowing the cost, it is impossible to establish the value.

The benefit of metering and x-charging is that it enables project teams to focus on the real value of their analysis. If all analysis is free then it is almost impossible to quantify the benefits or costs of a particular analysis. Project teams can also use x-charging as a way to adjust their consumption of resources and control their IT costs. It benefits the IT team as it enables them to achieve higher utilisation rates across their servers. Most importantly, the cost-element attached to running a sandbox acts as a strong incentive to finalize and shut down sandboxes, ensuring that they do not morph into uncontrolled marts.

There is a fantastic whitepaper on this topic, which explores the much wider topic of metering and chargeback within a cloud environment which is available on the Enterprise Manager webpage, click here to view the whitepaper.

Overview

Enterprise Manager 12c uses the rich monitoring and configuration data that is collected for Enterprise Manager targets as the basis for a metering and chargeback solution. Enterprise Manager Chargeback provides the administrator with:

  • Assignment of rates to metered resources
  • Management of a cost center hierarchy
  • Assignment of resources to cost centers
  • Usage and charge-back reports

This set of features can be used to implement a chargeback regime for analytical sandboxes. There is a rich set of API’s that allow you to extract metering and charge data so that it can be incorporated into enterprise billing solutions such as Oracle Billing and Revenue Management application.

Setting up a x-charging framework for our analytical sandboxes involves three key stages:

  • Creating chargeback plans for resources and database options
  • Defining users and cost centers to “take” charges
  • Reporting on usage and charges
Let's look at each of these stages in more detail:
 

Step 1: Creating charge plans

A Charge Plan is created by the DBA and it defines the metered resources along with the associated rates. Enterprise Manager Chargeback offers two types of Charge Plan – Universal Charge Plan and Extended Charge Plans.

The Universal Charge Plan is the simplest way to enable chargeback for sandboxes and is probably adequate for the vast majority of projects. It contains just 3 metrics:

  • CPU Usage
  • Memory Allocation
  • Storage Allocation

and the DBA can set the rates for each metric as shown here:

 

Charge Plans

 

Even with this basic profile you can implement quite sophisticated charging models. It is possible to vary the rates used in charge calculations by month/period. Each “period" is known as a “Reporting Cycle”. If rates are modified, the updated rates will be used to re-calculate the charges for all days from the first of the current period onwards.

Some projects may need access to analytical features that are separately licensed database options. For example, if a project needs to build data mining models then it will require the Oracle Advanced Analytics option. Alternatively, supporting semantic analysis or social network analysis requires the use of the Spatial and Graph option. Extended Charge Plans allow the DBA to factor in charging for database options alongside the standard charging metrics of the Universal Charge Plan. For database options it makes sense to make use of the ability to create fixed cost charges to effectively "rent out" each option for each sandbox environment. Of course, if a project suddenly decides it needs access to a specific type of analytical option, such as in-memory, it is simply a case of adding the relevant cross-charge item to the profile for the specific sandbox and the project team can start using that feature right away (assuming the database instance has the correct options pre-installed).

Charge Plans Extended

 

Step 2: Setting up users and cost centres

When administering a self-service analytic sandbox, it is necessary to meter resource consumption for each self-service user. These costs then need to be rolled up to an aggregate level, such as cost centers, to generate a total charge for each department/project-team accessing the sandbox. For ease of administration and chargeback the self-service users can be represented within a Cost Center structure. Each cost center contains a list of "consumers" who have access to the sandbox and, of course, its associated resources. The cost centers can be organized in a hierarchical fashion to support aggregation and drill down within the cost analysis or billing reports. A typical cost center hierarchy within a project might look something like this:

Cost center hierarchy

Step 3: Chargeback Reports

Any chargeback solution will involve reporting so that users can understand how their use of sandbox (storing data, running reports etc) translates to charges. Enterprise Manager provides reports that show both resource usage and charging information. This is broken down into two categories of reports: summary and trending reports.

Summary Reports show information related to charge or resource utilisation broken down by cost center, target type and resource. These reports allow both sandbox owners and business users to drill down and quickly assess and analyse charges in terms of type of target (database instance, host operating environment, virtual machine etc) or cost centers, as shown below.

EM summary report

Trending Reports show metric or charge trends over time and are useful for project teams who want to see how their charges change over time. At an aggregate level the I.T. team can use this information to help them with capacity planning. A report of CPU usage is shown below.

EM trend report

What’s missing?

While this latest version of Enterprise Manager has some great features for managing analytical sandboxes, it would be really useful if the project team could enter a total budget for their sandbox. This budget could then be shown on graphs such as the trending report. It would be useful to know how much of the budget has been spent and how many days/periods of budget remain based on current spending patterns. Of course, once the budget has been used up it would be useful if the sandbox could be locked - this would focus the minds of the project team and ensure that a sandbox does not evolve into a "live" data mart. Which brings us nicely to the next blog post, which will cover the final part of our lifecycle model: ensuring that sandboxes have a "Drop" phase.

If you want more information about how to setup the chargeback plans then there is a great video on the Oracle Learning Library: Oracle Enterprise Manager 12c: Setup and Use Chargeback.

Sunday, 16 November 2014

Oracle Data Warehouse and Big Data Magazine NOVEMBER Edition

 

The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. It brings together all the most important announcements and videos taken from our data warehouse and big data product management blogs, press releases, videos posted on Oracle Media Network and our Facebook pages. Click here to view the November Edition


Please share this link http://flip.it/fKOUS with your colleagues and via your social media sites.


This magazine is optimized for display on tablets and smartphones using the Flipboard App which is available from the Apple App store and Google Play store