Wednesday, 15 April 2015

Managing overflows in LISTAGG

IMG 0196 This is an interesting problem that has come up a few times in discussions (and I think it has been mentioned on the SQL forums as well).  When using LISTAGG on very large data sets you can sometimes create a list that is too long and consequently get an ORA-01489: result of string concatenation is too long  error.

Obviously, it is possible to determine in advance if the error is going to occur using some bespoke PL/SQL code and then take appropriate action within your application to manage the rows that contain stings that exceed the VARCHAR2 limit.

Many customers have implemented workarounds to overcome the ORA-01489 error, however, this has mostly involved the use of complex code which has impacted performance.

Wouldn’t it be great if there was a simple yet elegant way to resolve this issue? Actually there is and we can use a few of the most recent analytical SQL functions. If you are using Database 12c you can make use of the MATCH_RECOGNIZE function to effectively create chunks of strings that do not exceed the VARCHAR2 limit.

If you are using an earlier version of the Oracle Database then we can make use of the powerful SQL Model clause that was introduced in Database 10g.

In fact the beauty of the workarounds outlined in this blog post is that you can define your own maximum size for the string returned by LISTAGG so you have complete control over the size/length of the result set that is returned.

For example, let’s assume that we have the following statement (to keep things relatively simple let’s use the EMP table in the schema SCOTT)

SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp
GROUP BY deptno;

this statement will generate the following output:

DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING;MILLER
20 SMITH;JONES;SCOTT;ADAMS;FORD
30 ALLEN;WARD;MARTIN;BLAKE;TURNER;JAMES

 Let’s assume that the above statement does not run and that we have a limit of 15 characters that can be returned by each row in our LISTAGG function. We can use the Database 12c SQL pattern matching function, MATCH_RECOGNIZE, to return a list of values that does not exceed 15 characters. First step is to wrap the processing in a view so that we can then get data from this view to feed our LISTAGG function. Here is the view that contains the MATCH_RECOGNIZE clause:

CREATE OR REPLACE VIEW emp_mr AS
SELECT * FROM emp MATCH_RECOGNIZE(
   PARTITION BY deptno ORDER BY empno
   MEASURES match_number() AS mno
   ALL ROWS PER MATCH
   AFTER MATCH SKIP PAST LAST ROW
   PATTERN (S B*)
   DEFINE B AS LENGTHB(S.ename) + SUM(LENGTHB(B.ename)) + LENGTHB(‘;’) < = 15
);

You might well ask: why don’t we put the LISTAGG function inside the measure clause? At the moment it is not possible to include analytical functions such as LISTAGG in the measure clause. Therefore, we have put the LISTAGG function in a separate SQL statement:

SELECT
 deptno,
 LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp_mr
GROUP BY deptno, mno;

The above code will produce the following output:

DEPTNO     NAMELIST
---------- ----------------------------------------
10         CLARK;KING
10         MILLER
20         SMITH;JONES
20         SCOTT;ADAMSbr
20         FORD
30         ALLEN;WARD
30         MARTIN;BLAKE
30         TURNER;JAMES

Using 12c MATCH_RECOGNIZE, we are able to get all the values by sending just one query. Now not everyone who is hitting this ORA-01489 error is running 12c, so what do you do if you are running an earlier version of the database and need an efficient fix for this error? Prior to 12.1, the overflow can be handled using SQL Model clause. This approach is a little more complicated. First we create a view that returns the row number:

CREATE OR REPLACE VIEW emp_sqm_view AS
SELECT
  emp.*,
  row_number() OVER (PARTITION BY deptno ORDER BY empno) as rn
FROM emp;

the above statement generates the following output with the row number added for each employee within each department. Thinking about this in Excel terms, we will use this view to feed our SQL Model clause, therefore, the view will allow us to create a “worksheet” for each department and then list the employees for each department as a row on that “worksheet”:

EMPNO ENAME  JOB       MGR   HIREDARE  SAL  COMM DEPTNO  RN
===== ====== ========= ===== ========= ==== ==== ======= ==
7782 CLARK  MANAGER    7839  09-JUN-81 2450      10      1
7839 KING   PRESIDENT        17-NOV-81 5000      10      2
7934 MILLER CLERK      7782  23-JAN-82 1300      10      3
7369 SMITH  CLERK      7902  17-DEC-80 800       20      1
7566 JONES  MANAGER    7839  02-APR-81 2975      20      2
7788 SCOTT  ANALYST    7566  19-APR-87 3000      20      3
7876 ADAMS  CLERK      7788  23-MAY-87 1100      20      4
7902 FORD   ANALYST    7566  03-DEC-81 3000      20      5
7499 ALLEN  SALESMAN   7698  20-FEB-81 1600  300 30      1
7521 WARD   SALESMAN   7698  22-FEB-81 1250  500 30      2
7654 MARTIN SALESMAN   7698  28-SEP-81 1250 1400 30      3
7698 BLAKE  MANAGER    7839  01-MAY-81 2850      30      4
7844 TURNER SALESMAN   7698  08-SEP-81 1500    0 30      5
7900 JAMES  CLERK      7698  03-DEC-81  950      30      6


Now the next step is to calculate the running total of the length of the text string of employee names within each department. The SQL Model clause allows us to calculate the running total and reset the total if it exceeds the number of characters that we have set as the limit - in this case 15 characters. This code is wrapped in the first of the views that we are going to build:

CREATE OR REPLACE VIEW emp_model_1 AS (
SELECT *
FROM emp_sqm_view
  MODEL
   PARTITION BY (deptno)
   DIMENSION BY (rn)
   MEASURES (empno, ename, LENGTHB(ename) len, 0 cum_len)
   RULES (
    cum_len[1] = LEN[1] + LENGTH(';'),
    cum_len[rn > 1] = CASE WHEN cum_len[cv()-1] + LEN[cv()] + LENGTHB(‘;’) <= 15
                      THEN cum_len[cv()-1] + LEN[cv()] + LENGTHB(';')
                      ELSE LEN[cv()] + LENGTH(';')
    END))


Next we need an id that we can use to group together the list of names that fit within our boundary of 15 characters. Again we will use the SQL Model clause to compute the break points and increment the counter within each department:

CREATE OR REPLACE VIEW emp_model_2 AS
SELECT *
FROM emp_model_1
MODEL PARTITION BY (dept no) DIMENSION BY (rn)
   MEASURES (empno, ename, len, cum_len, 0 sub_id)
   RULES (
     sub_id[1] = 1,
     sub_id[rn > 1] = CASE WHEN cum_len[cv()] = len[cv()] + LENGTHB(‘;’)
                           THEN sub_id[cv() - 1] + 1
                           ELSE sub_id[cv() - 1]
     END));


 now we have the following output:

DEPTNO RN EMPNO ENAME  LEN CUM_LEN SUB_ID
====== == ===== ====== === ======= ======
    10  2  7839 KING     4      11      1
    10  1  7782 CLARK    5       6      1
    10  3  7934 MILLER   6       7      2
    20  1  7369 SMITH    5       6      1
    20  5  7902 FORD     4       5      3
    20  4  7876 ADAMS    5      12      2
    20  3  7788 SCOTT    5       6      2
    20  2  7566 JONES    5      12      1
    30  1  7499 ALLEN    5       6      1
    30  2  7521 WARD     4      11      1
    30  3  7654 MARTIN   6       7      2
    30  6  7900 JAMES    5      13      3
    30  5  7844 TURNER   6       7      3
    30  4  7698 BLAKE    5      13      2


we can now use the column headed sub_id along with the department number to create GROUP BY clause which will finally generate our list of employees within each department where each row in the column namelist will contain a maximum of 15 characters:

SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp_model_2
group by deptno, sub_id;

The above query creates the following output which matches the final output we achieved using the MATCH_RECOGNIZE clause:

DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING
10 MILLE
20 SMITH;JONES
20 SCOTT;ADAMS
20 FORD
30 ALLEN;WARD
30 MARTIN;BLAKE
£0 TURNER;JAMES


If you are using the LISTAGG function and sometimes getting an ORA-01489: result of string concatenation is too long  error then hopefully the above workarounds are useful.

The code samples shown above were created by Lei Sheng who is a Principal Member of Technical Staff, Database Development. Lei is worked on many of the really cool analytical SQL features including the SQL Model clause and the new MATCH_RECOGNIZE clause. So big thanks to Lei for sharing these workarounds.

NEW UPDATE - January 2016

I have been working on some 12.2 projects and had to revisit the SQL Model code. While working on a different problem I spotted that the results from the model clause, while giving the right outcome, were not quite correct. It seems that the computed number of characters was actually incorrect. The model was adding the ';' character too many times so the final count of the string length was actually too high.

Therefore, I have modified the code to correctly compute the full length of the string taking into account the inclusion of only one instance of the ';' character. Here is the amended code for the first view emp_model_1

CREATE OR REPLACE VIEW emp_model_1 AS (
SELECT *
FROM emp_sqm_view
MODEL
  PARTITION BY (deptno)
  DIMENSION BY (rn)
  MEASURES (empno, ename, LENGTHB(ename) len, 0 cum_len)
  RULES (
      cum_len[1] = LEN[1],
      cum_len[rn > 1] = CASE WHEN cum_len[cv()-1] + LEN[cv()] + LEN[cv()+1] + LENGTHB(';') > 15
              AND cum_len[cv()-1] + LEN[cv()] + LENGTHB(';')
          THEN cum_len[cv()-1] + len[cv()] + LENGTH(';')
        WHEN cum_len[cv()-1] + LEN[cv()] + LENGTHB(';')           THEN cum_len[cv()-1] + len[cv()]
          ELSE len[cv()]
END));

I have used bold text to highlight the change that I have mode to the code. This now generates the following ouptut:

DEPTNO     RN         EMPNO      ENAME      LEN        CUM_LEN
---------- ---------- ---------- ---------- ---------- ----------
        30          1       7499      ALLEN          5          5
        30          2       7521      WARD           4         10
        30          3       7654      MARTIN         6          6
        30          4       7698      BLAKE          5         12
        30          5       7844      TURNER         6          6
        30          6       7900      JAMES          5         11
        20          1       7369      SMITH          5          5
        20          2       7566      JONES          5         11
        20          3       7788      SCOTT          5          5
        20          4       7876      ADAMS          5         10
        20          5       7902      FORD           4         14
        10          1       7782      CLARK          5          5
        10          2       7839      KING           4         10
        10          3       7934      MILLER         6          6

There is a minor change to the second view emp_model_2 as follows, again changes highlighted in bold

CREATE OR REPLACE VIEW emp_model_2 AS (
SELECT *
FROM emp_model_1
MODEL
  PARTITION BY (deptno)
  DIMENSION BY (rn)
  MEASURES (empno, ename, len, cum_len, 0 sub_id)
  RULES (
      sub_id[1] = 1,
      sub_id[rn > 1] = CASE WHEN cum_len[cv()] = len[cv()]
          THEN THEN sub_id[cv()-1] + 1
           ELSE sub_id[cv()-1]

END));

The final output is the same but I feel a bit happier that the numbers used to split up the list are now accurate and there is no double counting of the ';' character.

Technorati Tags: , , , , , ,