How to intelligently aggregate approximations

The growth of low-cost storage platforms has allowed many companies to actively seeking out new external data sets and combine them with internal historical data that goes back over a very long time frame. Therefore, as both the type of data and the volume of data continue to grow the challenge for many businesses is how to process this every expanding pool of data and at the same time, make timely decisions based on all the available data.

(Image above courtesy of http://taneszkozok.hu/)

In previous posts I have discussed whether an approximate answer is just plain wrong and whether approximate answers really are the best way to analyze big data. As with the vast majority of data analysis at some point there is going to be a need to aggregate a data set to get a higher level view across various dimensions. When working with results from approximate queries, dealing with aggregations can get a little complicated because it is not possible to “reuse” an aggregated approximate result as a basis for aggregating the results to an even higher level across the various dimensions of the original query. To obtain a valid approximate result set requires a query to rescan the source data and compute the required analysis for the given combination of levels. Just because I have a result set that contains a count of the number of unique products sold this week at the county level does not mean that I can simply reuse that result set to determine the number of distinct products sold this week at the state level. In many cases you cannot just rollup aggregations of aggregations..

Cannot aggregate data from existing approximate result sets

Cannot aggregate data from existing approximate result sets

Until now….

With Database 12c Release 2 we have introduced a series of new functions to deal with this specific issue - the need to create reusable aggregated results that can be “rolled-up” to higher aggregate levels. So at long last you can now intelligently aggregate approximations! Here is how e do it….Essentially there are three parts that provide the “intelligence”:

APPROX_xxxxxx_DETAIL
APPROX_xxxxxx_AGG
TO_APPROX_xxxxxx

Here is a quick overview of each function:

APPROX_xxxxxx_DETAIL
This function takes a numeric expression and builds summary result set containing results for all dimensions in GROUP BY clause. The output from this function is a column containing BLOB data. As with other approximate functions the results can be deterministic or non-deterministic depending on your requirements.

APPROX_xxxxxx_AGG
This function builds a higher level summary table based on results from the _DETAIL function. This means that it is not necessary to re-query base fact table in order to derive new aggregates results. As with the _DETAIL function, the results are returned as a blob.

TO_APPROX_xxxxxx
Returns results from the _AGG and _DETAIL functions in user readable format.

Three new functions for managing approximate aggregations

A worked example

Let’s build a working example using the sample SH schema. Our product marketing team wants to know within each year the approximate number of unique customers within each product category. Thinking ahead we know that once they have this result set we expect them to do further analysis such as drilling on the time and product dimension levels to get deeper insight. The best solution is to build a reusable aggregate approximate result set using the new functions in 12.2.

SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

this returns my result set as a blob as shown below and this blob contains the various tuple combinations from my GROUP BY clause. As a result I can reuse this result set to answer new questions based around higher levels of aggregation.

Aggregated results from query contained within BLOB column

Aggregated results from query contained within BLOB column

and the explain plan for this query shows the new sort keywords (GROUP BY APPROX) which tells us that approximate processing has been used as part of this query.
Explain plan showing new sort keywords

If we want to convert the BLOB data into a readable format we can transform it by using the TO_APPROX_xxx function as follows:

SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_DETAIL(s.cust_id)) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

this creates the following results
shows use of TO_APPROX function to transform blob data

alternatively we coult create a table using the above query and then simply pass the BLOB column directly into the TO_APPROX function as follows:

CREATE TABLE agg_cd AS
SELECT
t.calendar_year,
t.calendar_quarter_number AS qtr,
p.prod_category_desc AS category_desc,
p.prod_subcategory_desc AS subcategory_desc,
APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) as acd_agg
FROM sales s, products p, times t
WHERE s.time_id= t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

using this table we can simplify our query to return the approximate number of distinct customers directly from the above table:

SELECT
calendar_year,
qtr,
category_desc,
subcategory_desc,
TO_APPROX_COUNT_DISTINCT(acd_agg)
FROM agg_cd
ORDER BY calendar_year, qtr, category_desc, subcategory_desc;

which returns the same results as before - as you would expect!

using TO_APPROX function directly against BLOB colum

using the aggregated table as our source we can now change the levels that we wish to calculate without having to go back to the original source table and again scan all the rows. However, to extract the new aggregations we need to introduce the third function APPROX_COUNT_DISTINCT_AGG to our query and wrap this within the TO_APPROX_COUNT_DISTINCT function to see the results:

SELECT
calendar_year,
subcategory_desc,
TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(acd_agg))
FROM agg_cd
GROUP BY calendar_year, subcategory_desc
ORDER BY calendar_year, subcategory_desc;

will return the following results based only on the new combination of levels included in the GROUP BY clause:

Using higher level aggregations

Summary

This post has reviewed the three new functions that we have introduced in Database 12c Release 2 that allow you to reuse aggregated approximate result sets:

APPROX_xxxxxx_DETAIL
APPROX_xxxxxx_AGG
TO_APPROX_xxxxxx

Database 12c Release 2 makes it possible to intelligently aggregate approximations. In the next post I will explore how you can combine approximate processing with existing query rewrite functionality so you can have intelligent approximate query rewrite.

Technorati Tags: Analytics, Database 12c, Oracle Database 12c, SQL, SQL Analytics

Search This Blog

SQL for Data Warehousing and Analytics