
Tuesday, 23 August 2016

Your Essential Online Session and Hands-on Lab Calendars for #oow16

It’s almost time for OpenWorld. Only three weeks to go! With so much to see and learn at Oracle OpenWorld we are doing our best to make sure that everyone gets the most from this year’s conference. Therefore, to help you get prepared and organized, we have created a series of online calendars which list all the must-see data warehousing and big data sessions, labs and key events.

Just look at the agenda below - we have packed this year’s schedule with the very best must-see sessions and must-attend labs by Oracle product managers and key customers.

 

The above agenda is built using Google Calendar and is available for use with other calendar applications that allow you to subscribe to online calendars. To make the process as easy as possible we have created a range of calendars mapped to specific areas of interest such as: cloud, data warehousing, analytics, big data and unstructured data. The following links can be used to access our OpenWorld calendars from your own calendar application:

If you just want to cherry-pick particular topics then select from the following links to focus on your particular areas of interest:

 

Hope these links are useful. Looking forward to seeing you at OpenWorld in September. Have a great conference.

 


Friday, 12 August 2016

Must-See Guide for Data Warehousing and Big Data at oow16


Your Must-See Guide for Data Warehousing and Big Data #OOW16


There's so much to see and learn at Oracle OpenWorld - it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. To help you get the most from this year's event I have prepared a comprehensive guide which lists all the must-see data warehousing and big data sessions and labs along with lots of related content. It is divided into the following chapters:
  • Key Highlights from 2015
  • List of Data Warehouse and Big Data Presenters
  • Must-See Sessions and Labs
  • Details of Appreciation Event
  • Links to DW and Big Data product web pages, blogs, social media sites
  • Information about the #oow16 smartphone app
  • Maps to help you find your way around Moscone
This guide is available in both Apple iBooks and PDF formats. The PDF version will work on any smartphone, tablet or computer and can be downloaded by clicking here.

The iBooks version is available via the Apple iBooks Store and the link is shown on the left-hand side. The iBooks format will open on iPads and Mac computers via the relevant iBooks App. Please refer to the Apple website for more information.
 
   

Friday, 13 May 2016

New pattern matching tutorial on LiveSQL

LiveSQL home page on livesql.oracle.com

If you have always wanted to try our new SQL pattern matching feature, MATCH_RECOGNIZE, but never had access to a Database 12c instance, then you really need to check out our great new LiveSQL playground environment. LiveSQL is a great place to learn about all the new features of Database 12c along with all the existing features from earlier releases.

The new tutorial is called “Log file sessionization analysis with MATCH_RECOGNIZE” and you can view it by clicking here. This tutorial is designed to show how you can run sessionization analysis on application logs, web logs, etc. Using a simplified table of click data you create a sessionization data set which tracks each session, the duration of the session and the number of clicks/events. It’s all fully annotated and there are links to supporting topics where you can get more information. The objective is to introduce you to some of the important keywords and concepts that are part of the MATCH_RECOGNIZE clause.
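To give a flavour of the technique before you open the tutorial, here is a minimal sessionization sketch of my own - the clicks table, its columns and the ten-second timeout are all hypothetical, and the LiveSQL tutorial uses its own fully annotated data set:

SELECT user_id, session_id, start_time, end_time, click_count
FROM clicks
MATCH_RECOGNIZE (
  PARTITION BY user_id ORDER BY click_time
  MEASURES MATCH_NUMBER()    AS session_id,
           FIRST(click_time) AS start_time,
           LAST(click_time)  AS end_time,
           COUNT(*)          AS click_count
  ONE ROW PER MATCH
  -- a session is a first click followed by clicks no more than 10 seconds apart
  -- (assumes click_time is a DATE, so 10/86400 of a day = 10 seconds)
  PATTERN (strt same_session*)
  DEFINE same_session AS (click_time - PREV(click_time) <= 10/86400)
);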

I have put together a quick video that shows how to access my new pattern matching tutorial so simply click on the image below to access the video:

 

LiveSQL video showing how to use the new pattern matching tutorial

 

There are lots of code samples and tutorials for analytical SQL already loaded and available for you to run. Ever wanted to do a mortgage calculation using the SQL Model clause? Well Carsten Czarski, Oracle Database Developer guru, has created a great script that you can run to see how it works - just follow this link: https://livesql.oracle.com/apex/livesql/file/content_CA67VTHEVZZPYG94E5HWG6RWC.html
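If you have never seen the MODEL clause before, here is a toy sketch of my own (nothing like a full mortgage calculation) just to show the general shape of the syntax - the figures and the 10% growth rate are made up:

SELECT yr, sales
FROM   (SELECT 2016 AS yr, 100 AS sales FROM dual)
MODEL
  DIMENSION BY (yr)
  MEASURES (sales)
  RULES (
    -- each rule creates (or updates) one cell in the sales measure
    sales[2017] = sales[2016] * 1.1,
    sales[2018] = sales[2017] * 1.1
  )
ORDER BY yr;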

The AskTom team (Connor and Chris) have uploaded scripts that they have created in response to questions posted on the AskTom forum. For example, here is the AskTom answer for finding the min/max rows within price change groups: https://livesql.oracle.com/apex/livesql/file/content_CK3GKD93H9AE0MJ3QSAQUP97N.html.

I hope this is useful - and don’t forget to share your own scripts and tutorials, as this site is community driven: the more you upload, the more we all learn.

Looking for more Information

Use the tag search to see more information about pattern matching or SQL Analytics or Database 12c.


Monday, 25 April 2016

SQL Pattern Matching Deep Dive - Part 3, greedy vs. reluctant quantifiers

A picture of a greedy pig

Picture courtesy of Pixabay

Welcome to the third post in this deep-dive series on SQL pattern matching using the MATCH_RECOGNIZE feature that is part of Database 12c. 

In the first part of this series we looked at a wide range of topics including ensuring query consistency, how to correctly use predicates and how to manage sorting. In the second part we looked at using the built-in measures to understand how a data set is matched to a pattern. 

In this post I am going to review the concepts of greedy and reluctant quantifiers. I will break this down into a number of areas: 1) an overview of regular expressions, 2) understanding quantifiers, and 3) greedy vs. reluctant quantifiers. The examples in this post use the built-in measures to help show the difference between greedy and reluctant matching. If you are not familiar with the MATCH_NUMBER() function or the CLASSIFIER() function then please take some time to read the second post in this series.

Overview of regular expressions

The MATCH_RECOGNIZE clause has a lot of component parts but probably the most important one is the PATTERN component which consists of regular expressions. But what exactly is a regular expression?

What is a regular expression?

Here is a very basic definition from Wikipedia:

a regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations

Wikipedia: https://en.wikipedia.org/wiki/Regular_expression

Oracle has for quite some time supported regular expressions within the Oracle Database. The 12c Database Development Guide has a section on regular expressions where these are defined as:

A regular expression specifies a search pattern, using metacharacters (which are, or belong to, operators) and character literals

12c Database Development Guide

so you might see something like this in SQL code:

WHERE REGEXP_LIKE(hr.employees.first_name, '^Ste(v|ph)en$')

which uses a regular expression to identify employees with the first name of Steven or Stephen. There are a number of different REGEXP_xxxx functions and these are listed in Table 8-1 of the documentation.
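As a couple of further quick sketches of my own (not taken from the documentation), REGEXP_SUBSTR and REGEXP_REPLACE follow the same general approach:

-- extract everything before the @ sign, and reformat a date string
SELECT REGEXP_SUBSTR('keith.laker@oracle.com', '[^@]+') AS user_part,
       REGEXP_REPLACE('2016-04-25', '-', '/')           AS reformatted
FROM dual;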

In terms of MATCH_RECOGNIZE, Oracle has implemented regular expressions slightly differently. Each regular expression is specified by a set of strings based on two key parts: tokens and quantifiers. A token is a symbol whose definition is given in the DEFINE clause. The quantifier after a token, or group of tokens, specifies how often that token or group is allowed to occur.

First, let’s explore the concept of tokens.

Grouping tokens 

As indicated earlier, tokens can be grouped and managed using various methods. So before we go any further let’s quickly explore the four key concepts relating to tokens: concatenation, grouping, alternation, and permutes. They are amazingly powerful, deeply expressive and perfectly suited to the problem of describing patterns.

1. Concatenation

Used to list two or more items in a pattern that have to be matched in the specified order. Items are concatenated when there is no operator sign between two successive items. For example: PATTERN (A B C)

Our usual pattern definition for w-shapes uses concatenation of four pattern variables:

SELECT symbol, first_x, last_z 
FROM ticker
MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES FIRST(x.tstamp) AS first_x,
          LAST(z.tstamp) AS last_z
 ONE ROW PER MATCH
 PATTERN (X+ Y+ W+ Z+)
 DEFINE X AS (price < PREV(price)),
        Y AS (price > PREV(price)), 
        W AS (price < PREV(price)),
        Z AS (price > PREV(price) AND z.tstamp - FIRST(x.tstamp) <= 7 ))
WHERE symbol = 'OSCORP';

…which can now be rewritten, by concatenating grouped tokens, as follows:

SELECT symbol, first_x, last_y 
FROM ticker
MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES FIRST(x.tstamp) AS first_x,
          LAST(y.tstamp) AS last_y
 ALL ROWS PER MATCH
 PATTERN ((X+ Y+) (X+ Y+) Z)
 DEFINE X AS (price < PREV(price)),
        Y AS (price > PREV(price)), 
        Z AS LAST(y.tstamp) - FIRST(x.tstamp) <= 7 )
WHERE symbol = 'OSCORP';


Initially, I was hoping to reduce the DEFINE statement to just two lines. I did try to rewrite the pattern without the additional “z” variable, by simply adding the AND clause from the first example, which tests for the duration of the W-shape, to the Y pattern definition. Whilst this produced the same result, the test for the overall duration of the pattern, i.e. whether it is less than or equal to 7 days, gets applied to the first (X Y) scenario as well as the second (X Y) scenario, when we actually want to test the duration of the complete W-shape. Therefore, I had to introduce Z to manage the test for the duration of the W-pattern. Hope that is clear!

2. Alternation

Matches a single regular expression from a list of several possible regular expressions. The alternation list is created by placing a vertical bar (|) between each regular expression. Alternatives are preferred in the order they are specified. As an example, PATTERN (A | B | C) attempts to match A first, if A is not matched, it attempts to match B, if B is not matched, it finally attempts to match C.
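As a quick sketch of how alternation might be used with the ticker data from this series (the variable names and definitions below are my own), we could allow the falling leg of a V-shape to include flat days, with down preferred over flat whenever a row could match either, simply because down is listed first:

PATTERN (strt (down | flat)+ up+)
DEFINE down AS (price <= PREV(price)),
       flat AS (price = PREV(price)),
       up   AS (price > PREV(price))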

3. Grouping

Treats a portion of the regular expression as a single unit, enabling you to apply quantifiers to the whole group. A grouping is created using parentheses. For example, PATTERN ((A B){3} C). This attempts to match the group (A B) three times and then seeks one occurrence of C.

4. Permute

The PERMUTE syntax may be used to express a pattern that is a permutation of simpler patterns. Let’s consider a relatively simple example: PATTERN (PERMUTE (X{3}, B C?, D)). This is equivalent to writing the following:

PATTERN ((X{3} B C? D)
        |(X{3} D B C?)
        |(B C? X{3} D)
        |(B C? D X{3})
        |(D X{3} B C?)
        |(D B C? X{3}))

Note that PERMUTE is expanded lexicographically. This is an important point because it means that alternatives are attempted in the order written in the expansion, and that has an impact when we have to start back-tracking (don’t panic, this specific topic will be discussed in a future post) after a complete match fails, because we have to step back through the alternative permutations in the correct order.

So as you can see we have a very rich way of managing how a group of tokens should be applied to a data set: are they AND’ed together or OR’ed, are they grouped together, do different permutations need to be tested against the data set, and so on.

Getting to know quantifiers

Now that we understand tokens, let’s look at how we control how often a token should appear within our pattern. The most common quantifiers used by developers are: 1) the question mark, which indicates zero or one match, 2) the asterisk, which indicates zero or more matches, and 3) the plus sign, which indicates one or more matches.

Of course, Oracle supports a large library of built-in quantifiers, which are sometimes referred to as POSIX basic and extended quantifiers:

*      = 0 or more matches
+      = 1 or more matches
?      = 0 or 1 match
{n}    = exactly n matches
{n,}   = n or more matches
{n,m}  = at least n but not more than m (inclusive) matches
{,m}   = at most m (inclusive) matches

Here are a couple of simple examples of using quantifier operators:

  • PATTERN (A*) - matches 0 or more iterations of A
  • PATTERN (A{3,6}) - matches 3 to 6 iterations of A
  • PATTERN (A{,4}) - matches 0 to 4 iterations of A

Note that at this stage we have not defined what "A" actually means - that is covered in the DEFINE clause. There are also a couple of special quantifiers which are referred to as "anchors". What is an "anchor"?

Special quantifiers - Anchors

These are partition-wide quantifiers. They work slightly differently from other quantifiers because they operate in terms of positions rather than rows. They match a position either at the start or end of a partition, or they can be used in combination to span an entire partition. The two start and end operators are:

  • ^ matches the position before the first row in the partition.
  • $ matches the position after the last row in the partition.

As an example, the PATTERN (^ A+ $) will only result in a match if all rows in a partition satisfy the condition for A, i.e. the resulting match spans the entire partition.
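Here is a minimal sketch using the ticker table from this series - the threshold of 20 is arbitrary and whether any symbol actually produces a match depends entirely on your data:

SELECT symbol, row_count
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES COUNT(*) AS row_count
  ONE ROW PER MATCH
  -- the ^ and $ anchors force the match to span the whole partition,
  -- so a symbol is only returned if every one of its rows maps to A
  PATTERN (^ A+ $)
  DEFINE A AS (price <= 20)
);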

Special quantifiers - excluding a pattern

When using ALL ROWS PER MATCH with either the OMIT EMPTY MATCHES or SHOW EMPTY MATCHES sub-options, rows matching a portion of the PATTERN may be excluded from the row pattern output table. The excluded portion is bracketed between {-  and  -} in the PATTERN clause. 

This example finds the longest period of increasing prices where the price is not less than ten at the start of the period of increasing prices: 

PATTERN ( {- A -} B+ {- C+ -} )
SUBSET S = (A,B)
DEFINE A AS A.Price >= 10,
       B AS B.Price > PREV(B.Price),
       C AS C.Price <= PREV(C.Price)

NOTE this is probably obvious… you cannot use the “ALL ROWS PER MATCH WITH UNMATCHED ROWS” syntax when there are exclusions in the PATTERN clause.

The “exclude” feature acts like a filter within the pattern matching process. It is possible to achieve almost the same result by reversing the definition of A, removing the exclusion of A within the pattern definition and dropping C completely. However, more rows are returned because pattern A is now part of the output, i.e. rows are actually matched to A and returned.

It’s important to remember that the exclude feature only removes rows from the output; they are not excluded from the definitions of union pattern variables, or from the calculations within the DEFINE or MEASURES clauses. Therefore, you can remove rows from the output but still include those rows in calculations that need access to the whole pattern, such as SUM(), AVG(), MAX() or MIN() operations.

One benefit of this is that, assuming you want to remove rows from the output, there is no need to make two passes over the data - once to match the pattern and then again via the WHERE clause to filter the rows. Everything is managed within the MATCH_RECOGNIZE processing.

For example, let’s consider the following code:

SELECT symbol,
       tstamp,
       a_tstamp,
       matchno,
       row_count,
       classfr,
       price,
       TRUNC(avgp,1)
FROM Ticker MATCH_RECOGNIZE (
  PARTITION BY Symbol
  ORDER BY tstamp
  MEASURES FINAL AVG(Price) AS avgp,
           COUNT(*) AS row_count,
           CLASSIFIER() AS classfr,
           MATCH_NUMBER() AS matchno,
           FIRST(a.tstamp) AS a_tstamp
  ALL ROWS PER MATCH
  PATTERN ( {- A -} B+ {- C+ -} )
  SUBSET S = (A,B)
  DEFINE A AS A.Price >= 10,
         B AS B.Price > PREV(B.Price),
         C AS C.Price <= PREV(C.Price)
)
WHERE symbol = 'ACME'
ORDER BY symbol, tstamp;

generates the following output:

Output from exclude syntax to remove rows

If you look at the count of the number of rows (row_count) within each match it does not match the number of rows in the output table. For example, for the first match there are four rows returned but the count of the number of rows actually matched is five! That is because the rows mapped to A have been removed from the output but were included in the calculations defined in the MEASURES clause. Very clever!

Of course you can’t use the exclude syntax if WITH UNMATCHED ROWS is specified for ALL ROWS PER MATCH. In contrast, the exclusion functionality does not affect the AFTER MATCH SKIP TO option: it is valid to SKIP TO a variable that is marked as excluded.

 

Greedy vs. reluctant

Now we have the groundwork in place we can start to explore the concept of greedy vs. reluctant quantifiers. Before we start, a quick warning about using SQL Developer for these examples. At the moment, if you try to include a question mark (?) as part of your pattern then there is an issue because this symbol is used to catch bind parameters. Therefore, if you try to execute a MATCH_RECOGNIZE statement that includes reluctant quantifiers you will get an error message:

Missing IN or OUT parameter at index:: 1

I don’t think there is a fix for this at the moment but the SQL Developer team is aware of the issue and working on a fix. See this thread on OTN. Of course, this issue will also impact any applications that use the JDBC drivers (and ODBC?) because a question mark is taken as a parameter marker. There is a workaround for JDBC connections described in the thread on OTN but it does not appear to be very elegant - at least not to me!

Anyway… pattern quantifiers that attempt to match as many instances as possible of the token to which they are applied are referred to as greedy, i.e. they will gobble up as many rows as possible. This is the default mode of operation.

In contrast, a reluctant quantifier will attempt to match as few instances as possible of the token to which it is applied. To switch a quantifier from greedy to reluctant you simply add a question mark (?) as an additional suffix to the quantifier. For example:

PATTERN (strt down+? up+)
DEFINE
   down AS (price <= PREV(price)),
   up AS (price >= PREV(price))

searches for one occurrence of strt followed by one or more occurrences of down and one or more occurrences of up. BUT - and this is where greedy vs. reluctant comes into play - if a row could be matched to either down or up then the down pattern will defer to the up pattern. If down had a greedy quantifier then the down pattern would win instead. Hope that makes sense; if not then …

A simple example…

I have taken the sample ticker data which you can play with on LiveSQL and made a small change to the data for ticker symbol ACME. I changed the price on Apr-16 to 14 so that there are now three consecutive rows (15-Apr, 16-Apr and 17-Apr) with the same value (14) for price. If we graph the price for ACME it now looks like this:

Updated price data for ACME
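The change itself was a one-liner along these lines - the exact date literal is an assumption based on the April 2011 dates used in the sample ticker data elsewhere in this series:

-- flatten the ACME price on 16-Apr so that 15-Apr, 16-Apr and 17-Apr all show 14
UPDATE ticker
SET    price  = 14
WHERE  symbol = 'ACME'
AND    tstamp = DATE '2011-04-16';
COMMIT;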

 

You can see that we now have a flat spot in our ticker stream data. If we simplify our usual w-shape pattern so that we are just looking for V-shaped patterns, and change the definition of the down and up variables so that ties are possible by using <= and >= in their respective definitions, then we can explore this idea of greedy quantifiers using the following code:

SELECT symbol, tstamp, price, mn, pattern,
first_down, first_price, last_up, last_price
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES MATCH_NUMBER() AS mn,
           CLASSIFIER() as pattern,
           FIRST(strt.tstamp) AS first_down,
           FIRST(strt.price) as first_price,
           LAST(up.tstamp) AS last_up,
           LAST(up.price) as last_price
  ALL ROWS PER MATCH
  PATTERN (strt down+ up+)
  DEFINE
        down AS (price <= PREV(price)),
        up AS (price >= PREV(price))

)
WHERE symbol = 'ACME'
ORDER BY symbol, tstamp;

 

this will generate the following output:

Results from MATCH_RECOGNIZE using greedy quantifiers

 

Note that on April 17 we match the row to the down variable because down is being greedy. It could have been matched to the up variable because the price is actually equal to the previous row but down took precedence.

What if we change down and make it reluctant by using the ? quantifier:

SELECT symbol, tstamp, price, mn, pattern, 
first_down, first_price, last_up, last_price
FROM ticker MATCH_RECOGNIZE ( 
  PARTITION BY symbol ORDER BY tstamp 
  MEASURES MATCH_NUMBER() AS mn,
           CLASSIFIER() as pattern,
           FIRST(strt.tstamp) AS first_down,
           FIRST(strt.price) as first_price,
           LAST(up.tstamp) AS last_up,
           LAST(up.price) as last_price
  ALL ROWS PER MATCH
  PATTERN (strt down+? up+)
  DEFINE 
        down AS (price <= PREV(price)),
        up AS (price >= PREV(price))
)
WHERE symbol = 'ACME'
ORDER BY symbol, tstamp;

 

this will generate the following output:

Pattern matching showing reluctant quantifier

Now you can see that because down is reluctant it matches as few occurrences as possible and where a row could be mapped to either down or up it is up that takes precedence. 

Why is this important?

Obviously this can impact the way that rows are matched to your pattern. Therefore, you need to think carefully about how you are going to manage the situation where rows could be matched to more than one variable - do you have preference for which variable wins? If you have measures that are tied to specific variables then these will be impacted by whether the variable is greedy or reluctant. Obvious examples are where you are performing some sort of calculation such as averages, sums or counts.

Therefore, when you are constructing your pattern please think very carefully about how greedy you want your matching process to be as it processes your data set.

 

What’s next?

In the next post in this series I am going to take a quick diversion to review how we can more precisely manage the output from MATCH_RECOGNIZE. In the last post we briefly looked at the differences between ONE ROW PER MATCH vs. ALL ROWS PER MATCH. However, there are three different permutations of the ALL ROWS syntax and these seem to be causing some confusion. Stay tuned if you want to learn about the key differences between: 1) SHOW EMPTY MATCHES, 2) OMIT EMPTY MATCHES and 3) WITH UNMATCHED ROWS. All will become clear in my next post.

Feel free to contact me if you have an interesting use cases for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com

 

Looking for more Information

Use the tag search to see more information about pattern matching or SQL Analytics or Database 12c.


Thursday, 14 April 2016

Is an approximate answer just plain wrong?

We are starting to see a significant change in the way we analyze data as a result of the growth of interest in big data and the newer concept of the Internet of Things. Ever since databases were first created everyone has been obsessed, quite rightly so, with ensuring queries return the correct answer - i.e. precise, accurate answers. This key requirement is derived from the need to run operational, transactional applications. If we check our bank balance online we want the figure we see to be accurate right down to the last cent, and for good reason. Yet increasingly, both as part of our online and offline experiences, we deal with numbers that are not 100% accurate and somehow we manage to make good use of these approximate answers. Here are a couple of examples of where we are already using approximations: route planning on our smartphones and crowd counting information in newspapers.

How long does it take to get from A to B?

If I need to get to San Francisco airport from our offices in Redwood Shores to catch a flight then I could use any one of a large number of route planners. My personal favorite at the moment is Waze purely and simply because it is relatively new and I like the icons it uses. Punch in my starting point and destination and Waze will offer me three different routes and three different journey times. I can see other users along my route and they will be used to update the details about my route: level of traffic flows and time taken between specific points. 

Waze Map of San Francisco

What this means is that the time shown for each route is an estimate based on the information available at that time. Additional information along each route is provided using colour codes as follows:

  • Dark red line - standstill traffic.
  • Light red line - heavy traffic.
  • Orange line - moderate traffic.
  • Yellow line - light traffic.

This information feeds into the calculation for Estimated Time of Arrival (ETA). Now the critical point here is that the ETA is, as the first letter suggests, an “estimate”. What is interesting is that Waze does not tell me how accurate or how confident it is about each estimate that it has calculated. The question is: do I care? If I select the first route and take 101 Northbound, does it really matter if it takes me 14 mins or 18 mins? I can guarantee you that I build in some slippage time because I frequently travel this route and I sort of know what the traffic flow is like during the day - I can see 101 from the seventh floor of the 400 Building so a quick visual check is relatively easy.

What would I do if Waze returned a time of 6 hours for the 101 Northbound route? First I would look for alternatives and, if they were also showing excessively long journey times, then I might do some web searches to see if there was an incident affecting the traffic. I might check the Caltrain timetable and decide to go by train to the airport instead.

At the end of the day, Waze provides me with a good approximation of how long it will take me to get to SFO from Redwood Shores - it is most definitely not giving a down-to-the-second precise figure for travel time. However, even with an approximation I can plan out the rest of my day up to the point when I need to leave for the airport. If, when I am about to leave, the travel times are looking a little too long then I will look for alternative ways of getting to my destination.

Counting Crowds

This approach of using approximations applies to a whole series of different activities within our daily lives. The Caltrain station shows the expected time of the next train and keeps on updating it right up until the train actually arrives at the station. When news feeds report the number of people at large scale public events, even using video/CCTV surveillance, they always provide rounded numbers: “over 5,000 people attended today’s rally in…”. They don’t report that 5,732 people attended an event. The surveillance images and videos will pick up everyone at the event; the issue is the time it takes to identify and mark each individual person. In the days before big data processing the public authorities would simply use a square that equated to a particular number of people and see how many complete squares could be applied to a specific picture. There is an interesting article on “crowd counting” here: http://www.popularmechanics.com/science/a7121/the-curious-science-of-counting-a-crowd/

Okay, but what has all this got to do with analytics?

Approximations and Analytics

Approximations are actually everywhere yet we rarely use them in business. Analytics is seen as dealing purely with precise, 100% accurate figures, and in most cases you definitely do not have an option. The Securities and Exchange Commission does not allow companies to post approximations of their accounts, even if you could provide “confidence levels” and “error rates” for the data feeding into key reports such as the P&L, cash flow and balance sheet. So are there real-world business scenarios where approximate answers are good enough? In my opinion there are quite a few use cases where approximations can speed up the decision making process:

1) Discovery analytics: data analysts often slice and dice their dataset in their quest for interesting trends, correlations or outliers. If your application falls into this type of explorative analytics, getting an approximate answer within a second is much better than waiting twenty minutes for an exact answer. In fact, research on human-computer interaction has shown that, to keep business users engaged and productive, the response times for queries must be below 10 seconds. In particular, if the user has to wait for the answer to their query for more than a couple of seconds then their level of analytical thinking can be seriously impaired.

2) Market testing: the most common use case for market testing is around serving ads on websites. This is where two variants of a specific ad (each with a group of slightly different attributes such as animations or colour schemes) are served up to visitors during a session. The objective is to measure which version generates a higher conversion rate (i.e. more click-throughs). The analytics requires counting the number of clicks per ad with respect to the number of times each ad was displayed, and using an approximation of the number of click-throughs is perfectly acceptable (see the sketch just after this list). This is similar to the crowd-counting problem where it is not really necessary to report exactly how many people joined a rally or turned up to an event.

3) Root cause analysis: contrary to perceived wisdom, this can in fact be accomplished using approximations. Typically RCA follows a workflow model where results from one query trigger another query, which in turn triggers another related query. Approximations are used to speed up the decision as to whether or not to continue with a specific line of analysis. Of course you need to incorporate the likelihood of edge cases within your thinking process because there is a danger that the edge values will get lost within the general hashing process. This comes back to understanding your data, in much the same way you can evaluate the alternative routes offered up by Waze.
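As promised above, here is a sketch of the market testing calculation - the ad_events table and its columns are hypothetical:

-- approximate unique visitors and unique clickers per ad variant;
-- an estimate is perfectly good enough to decide which variant is winning
SELECT ad_variant,
       APPROX_COUNT_DISTINCT(visitor_id) AS approx_visitors,
       APPROX_COUNT_DISTINCT(CASE WHEN event_type = 'click' THEN visitor_id END) AS approx_clickers
FROM   ad_events
GROUP BY ad_variant;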

There are many other examples of using approximations as part of the analytical process; what is important is to realize that they can be genuinely useful. However, many people will raise issues with using approximate results, such as…

Are approximate functions just a faster way to get the wrong answer?

I have heard this mentioned quite a few times, even by Oracle Ace Directors, following the release of approx_count_distinct in Database 12c (12.1.0.2). The implication is that an approximate answer is factually incorrect. My assumption is that most people (including Ace Directors) assume that the database is not processing the complete data set, and this seems to be a very common misconception. At a recent conference someone asked if you needed to make sure that optimizer statistics were up to date on a table before using approx_count_distinct. For the record: the answer is NO!

Approximate processing does not use sampling. When computing an approximation of the number of distinct values within a data set the database does process every value for the specified column. The mathematical approach that we have used for approx_count_distinct is called HyperLogLog (HLL) - I will let you Google this term if you want more information, but here is a quick summary:

The HyperLogLog algorithm uses a randomization technique. The randomization is achieved by applying a hash function to every value of the column. The algorithm observes the maximal number of trailing zeros among all the hash values. It's easy to see that in random data, a sequence of n zero bits will occur once in every 2^n elements on average. Therefore, if the maximal number of trailing zeros we have seen is n, we estimate the NDV (number of distinct values) to be 2^n.

This is not a great estimator though. At best it can give us a power of two estimate of the NDV. To reduce the large variance, stochastic averaging is used. After we apply the hash function to each value, we use part of the output to split values into one of many buckets. Supposing we want 1024 buckets, we can take the lower 10 bits of the hash value as a bucket number, and use the remainder of the hash value to count trailing 0s. 


Accuracy Guarantee

A natural question that arises is how accurate is an approximate answer?

At this point we need to introduce some additional terminology that you should get comfortable using: confidence interval and error rate. It is these terms that, I suspect, are creating the impression in many people's minds that an approximate answer is an incorrect answer. In reality what we are saying is that the answer lies within a range of values. There is a specific formula for computing the error rate when using HLL, which depends on the number of values being counted and the number of hash buckets used. In very broad terms, during our testing the accuracy of the estimated number of distinct values vs. the actual number of distinct values has been in excess of 97% with 95% confidence. Therefore, we are 95% confident that the approximate result is within +3% and -3% of the actual answer.

If we counted the number of distinct sessions within a weblog for a given period of time and approx_count_distinct returned a value of 36,383, then we can conclude that we are 95% confident that the actual value is between 35,291 and 37,474. Business users who have adopted a six-sigma, data-driven approach and methodology are used to these two concepts. There is an excellent description of the two important concepts on the iSixSigma site: Margin of Error and Confidence Levels Made Simple.
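In SQL terms the weblog example looks something like this - the weblog table and its columns are hypothetical:

-- every row is processed (no sampling) but the distinct count is an estimate
SELECT APPROX_COUNT_DISTINCT(session_id) AS approx_sessions
FROM   weblog
WHERE  log_date BETWEEN DATE '2016-04-01' AND DATE '2016-04-08';

-- the exact equivalent, which is typically slower and hungrier for memory on very large data sets
SELECT COUNT(DISTINCT session_id) AS exact_sessions
FROM   weblog
WHERE  log_date BETWEEN DATE '2016-04-01' AND DATE '2016-04-08';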

Currently for approx_count_distinct we do not allow you to specify an error bound and we do not expose the error estimate. The reason for this is the algorithm that approx_count_distinct uses to compute its values. The function uses the HyperLogLog methodology, which gives a probabilistic error. This means that there is no guarantee that the error is under a certain number. The best we can say is that with x% probability (i.e. at the given confidence level) the estimate is within [actual – ε, actual + ε].

Summary

In reality we all use approximations in our everyday lives: from satnavs to surveys in newspapers and magazines. It’s a relatively small step to accept the same approach in relation to a range of business problems. In the title for this post I posed the question: Is an approximate answer just plain wrong?

The answer is obviously NO! Approximate answers are just different. They definitely have a valuable place within data warehouse and big data projects and as part of a wider business intelligence context. They are a faster, more efficient way to analyze big data. It makes sense to embrace approximate analytics within your projects. Get using approx_count_distinct - you know it makes sense!

 


Tuesday, 12 April 2016

SQL Pattern Matching Deep Dive - Part 2, using MATCH_NUMBER() and CLASSIFIER()

Welcome to the second post in this deep dive series on SQL pattern matching using the new MATCH_RECOGNIZE feature that is part of Database 12c. In the first part of this series we looked at the areas of ensuring query consistency, how to correctly use predicates and how to manage sorting.

In this post I am going to review the two built-in measures that we provide to help you understand how your data set is mapped to the pattern that you have defined. This post breaks down into three areas: 1) a review of the built-in measures, 2) understanding how to control the output (the number of rows returned), and 3) bringing these two topics together with some examples.

 

Looking for w-shaped patterns in ticker data

Built-in measures

We provide two built-in measures to help you understand how your data maps to a pattern: MATCH_NUMBER() and CLASSIFIER(). These two functions are designed to be used within the MEASURES clause; while it is technically possible to use them within the DEFINE clause (more on this below), in the majority of cases the information returned there would not make sense.

To get you started here is a quick definition of each function:

MATCH_NUMBER()

You might have a large number of matches for your pattern inside a given row partition. How do you tell these matches apart? This is done with the MATCH_NUMBER() function. Matches within a row pattern partition are numbered sequentially starting with 1 in the order they are found. This numbering starts at 1 within each row pattern partition, because there is no linked ordering across row pattern partitions.

CLASSIFIER()

Along with knowing which MATCH_NUMBER you are seeing, you may want to know which component of a pattern applies to a specific row. This is done using the CLASSIFIER() function. The classifier of a row is the pattern variable that the row is mapped to by a row pattern match. The CLASSIFIER() function returns a character string whose value is the name of the pattern variable defined within the PATTERN clause.

As noted above, these two functions are normally used within the MEASURES clause. Interestingly, it is possible to use both of them within the DEFINE clause, but it is doubtful, at least in my mind, whether this would ever be useful. However, the following is possible:

DEFINE X AS (MATCH_NUMBER() <= PREV(price)),
       Y AS (price >= PREV(price)),
       W AS (price <= PREV(price)),
       Z AS (price >= PREV(price))

or you could have:

DEFINE X AS (CLASSIFIER() <= 'Z'),
       Y AS (price >= PREV(price)),
       W AS (price <= PREV(price)),
       Z AS (price >= PREV(price))

The SQL code will compile and execute but is it useful? Let me know if you can find a use case for putting these functions within the DEFINE clause.

Controlling the output

The return values from both of the above functions are determined by the type of output defined as part of the MATCH_RECOGNIZE clause. Sometimes you want summary data about the matches within your data set and other times you need the full details. You can control this with the following syntax:

ONE ROW PER MATCH : each match produces one summary row. This is the default output.

ALL ROWS PER MATCH : a match spanning multiple rows will produce one output row for each row in the match. There are three sub options:

  • ALL ROWS PER MATCH SHOW EMPTY MATCHES - note this is the default
  • ALL ROWS PER MATCH OMIT EMPTY MATCHES
  • ALL ROWS PER MATCH WITH UNMATCHED ROWS

The idea of having unmatched rows is largely self-explanatory - or at least I hope it is! The idea of having empty matches might take a little more thinking… Let’s look at a very simple pattern which permits empty matches: PATTERN (A*). This can be matched by zero or more rows that are mapped to A (this is what the quantifier * implies).

An empty match does not map any rows to pattern variables but it does in fact have a starting row. For example, there can be an empty match at the first row of a partition, an empty match at the second row of a partition, and so on. An empty match is actually assigned a sequential match number, based on its position with respect to the starting row in just the same way as any other match.

If we are using the ALL ROWS PER MATCH syntax then we need to cater for two scenarios in relation to handling empty matches:

  • ALL ROWS PER MATCH SHOW EMPTY MATCHES: with this option, any empty match generates a single row in the row pattern output table.
  • ALL ROWS PER MATCH OMIT EMPTY MATCHES: with this option, an empty match is omitted from the row pattern output table. However, it should be noted that this may cause gaps in the sequential match numbering.

Using the ALL ROWS PER MATCH (which defaults to SHOW EMPTY MATCHES), an empty match generates one row in the row pattern output table where the measures will have the following values:

  • CLASSIFIER() function returns null
  • MATCH_NUMBER () function returns the sequential match number of the empty match
  • value of any ordinary row pattern column reference is null
  • value of any aggregate or row pattern navigation operation is computed using an empty set of rows (so any COUNT is 0, and all other aggregates and row pattern navigation operations are null).
  • value of any column corresponding to a column of the row pattern input table is the same as the corresponding column in the starting row of the empty match.
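To make this concrete, here is a minimal sketch against the ticker table used throughout this series (the pattern and definition are my own): A* permits empty matches, so every match attempt produces output, with a null classifier and a COUNT(*) of 0 whenever the match is empty.

SELECT symbol, tstamp, matchno, classfr, row_count
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES MATCH_NUMBER() AS matchno,
           CLASSIFIER()   AS classfr,
           COUNT(*)       AS row_count
  ALL ROWS PER MATCH SHOW EMPTY MATCHES
  -- A* matches zero or more consecutive rises in price
  PATTERN (A*)
  DEFINE A AS (price > PREV(price))
);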

 

Bringing it all together

First let’s start by using our normal stock ticker data set and searching for W-shaped patterns. Here is the code:

SELECT symbol, tstamp, price, first_x, last_z
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(x.tstamp) AS first_x,
           LAST(z.tstamp) AS last_z
  ALL ROWS PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price <= PREV(price)),
         Y AS (price >= PREV(price)),
         W AS (price <= PREV(price)),
         Z AS (price >= PREV(price)))
WHERE symbol='OSCORP';

…which returns the following output:

W-Shapes in ticker stream

For the symbol OSCORP we have matched two W-shapes: the first starts on 02-APR-11 and ends on 08-APR-11 and the second starts on 09-APR-11 and ends on 16-APR-11. Now let’s introduce our built-in measures and see how our data set is being matched to our pattern. To do this we add the built-in functions to our MEASURE clause as shown here:

SELECT symbol,
  tstamp,
  price,
  match_number,
  classifier,
  first_x,
  last_z
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(x.tstamp) AS first_x,
           LAST(z.tstamp) AS last_z,
           MATCH_NUMBER() AS match_number,
           CLASSIFIER() AS classifier
  ALL ROWS PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price <= PREV(price)),
         Y AS (price >= PREV(price)),
         W AS (price <= PREV(price)),
         Z AS (price >= PREV(price)));

…which returns the following output:

 

W-Shape pattern diagnostics for all symbols

Note that I have removed the WHERE clause so that you can see how the MATCH_NUMBER() sequencing is handled across partition boundaries - i.e. you should note that the match numbering restarts at 1 within each symbol-based partition. For symbol ‘ACME’ we have matched one W-shaped pattern and the same applies to ‘GLOBEX’.

Within the OSCORP partition we matched two W-shapes. 

W-Shape pattern with diagnostics

If we change the output to include the unmatched rows it becomes a lot easier to check the pattern against the whole data set:

SELECT symbol, 
  tstamp,
  price, 
  match_number,
  classifier,
  first_x, 
  last_z 
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(x.tstamp) AS first_x,
           LAST(z.tstamp) AS last_z,
           MATCH_NUMBER() AS match_number,
           CLASSIFIER() AS classifier
  ALL ROWS PER MATCH WITH UNMATCHED ROWS
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price <= PREV(price)),
         Y AS (price >= PREV(price)), 
         W AS (price <= PREV(price)),
         Z AS (price >= PREV(price)));

…which returns the following output:

W-shape including unmatched rows

 

What happens if we switch to returning summary information:

SELECT symbol,
  tstamp,
  first_price,
  last_price,
  match_number,
  classifier,
  first_x,
  last_z
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(tstamp) as tstamp,
           FIRST(x.tstamp) AS first_x,
           LAST(z.tstamp) AS last_z,
           MATCH_NUMBER() AS match_number,
           CLASSIFIER() AS classifier,
           FIRST(x.price) as first_price,
           LAST(z.price) as last_price
  ONE ROW PER MATCH
  PATTERN (X+ Y+ W+ Z+)
  DEFINE X AS (price <= PREV(price)),
         Y AS (price >= PREV(price)),
         W AS (price <= PREV(price)),
         Z AS (price >= PREV(price)));

 

…which returns the following output:

W-shape pattern with one row per match

You should notice that with ONE ROW PER MATCH the CLASSIFIER() function now only returns the variable mapped to the final row of each match, which in this case is the pattern variable “Z”. Therefore, when using the CLASSIFIER() function to check how a pattern is being applied you should use one of the ALL ROWS PER MATCH options, which allows you to see which rows are assigned to which pattern variable. The real benefit of this will become obvious when we look at greedy vs. reluctant quantifiers, which is coming up in a future post.

Note that even with the ONE ROW PER MATCH syntax the MATCH_NUMBER() function continues to operate as before. 

 

Summary

In this post we have looked at how to use the two built-in measures MATCH_NUMBER() and CLASSIFIER() within the MEASURES clause. You can also use them in the DEFINE clause if needed.

We have explored how the output from these measures changes as we switch from returning summary rows compared to detailed rows: ALL ROWS PER MATCH vs. ONE ROW PER MATCH. In the next post I will explore the concept of reluctant vs. greedy quantifiers.

Feel free to contact me if you have an interesting use cases for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.

 


Friday, 1 April 2016

Are you a scuba diving DW/Big Data DBA? We are recruiting right now!

Is your dream to work in Hawaii? Are you an OCP 12c qualified data warehouse/big data DBA? Do you have an up to date scuba diving certificate? Want to join an elite team of Oracle DBAs? As part of an exciting new project we are looking for OCP Database 12c qualified DBAs who are also fully certified scuba divers. Before you get too excited please understand that this new role will require relocation to Hawaii. It is not possible to do this new role remotely from home or from a remote swimming pool.

To give you some background… we are following the pioneering lead of the gaming industry, which makes extensive use of water-cooling to speed up processing on graphics cards.

Radeon 9600XT Heatpipe passive cooling

image courtesy of wikipedia 

The use of water-cooling has allowed these modified cards to run significantly faster boosting performance by incredible levels - 50-200x. Oracle has taken this idea to completely new levels with the latest generation of our engineered systems.

Our engineers, scuba DBAs and support teams have made an important performance breakthrough by developing a completely waterproof patchset for Database 12c - patchset 12.3.0.1.2. This has allowed us to completely submerge our engineered systems in salt water to achieve amazing levels of cooling. The resulting incredible boost in performance is way beyond the levels experienced by gamers experimenting with their water-cooled graphics cards. We have worked with a key petabyte-scale client to test different locations, different salinity levels and water temperatures, and we have found that the waters around Hawaii give us the biggest performance boost. Our client’s PB-scale data warehouse performance tests have seen a 500x increase in performance and that’s before we implemented the new water-based InfiniBand drivers, which are an even bigger game-changer!

Below is a picture showing the expansion work that is currently going on at our new offshore data center. As you can see we are really busy adding new engineered system racks into the water aisles within the data center pool.


image courtesy of wikipedia

As we move forward with this exciting project we urgently need to grow our elite team of scuba DBAs at the underwater Hawaii data center. If you are a scuba DBA or know a scuba DBA who wants to move to Hawaii then contact me immediately (april.fools@oracle.com).

Obviously if you are already working with underwater computer systems (on our engineered systems or another hardware vendor’s platform) then we will put you at the top of our list of potential candidates. A relocation package to help with moving to Hawaii may be available for the right candidates, but you will be expected to bring your own scuba equipment, waterproof keyboard and mouse, and waterproof mobile phone. The first round of interviews will be held during OpenWorld in the lake at our Redwood Shores offices and the selection process will be very wet, very salty and held at realistic depths.

Don’t delay, get your CV to me (april.fools@oracle.com) right now. Hawaii is calling!
