Usually we use a sequence to generate unique identifiers for surrogate keys. A sequence is simply a database object that returns a number every time you call it. Sadly, today I faced a limitation in the sequence implementation of Greenplum. In the process of finding a solution, I found a way to generate surrogate keys faster than with a sequence (another way is in this blog).

Let's take the query below:

create temp sequence seq; -- temp makes it session-only
with raw_data as (
	select 'CEO' as title, 'CEO' as boss
	union all
	select 'CFO' as title, 'CEO' as boss
	union all
	select 'CTO' as title, 'CEO' as boss
),
with_surrogate as (
	select nextval('seq') as title_key,
		title,
		boss
	from raw_data
),
self_join as (
	select c.title_key, c.title, boss.title_key as boss_title_key
	from with_surrogate c
	inner join with_surrogate boss on boss.title = c.boss
)
select *
from self_join

In PostgreSQL you obtain something like the table below (the sequence assigns 1, 2, 3 in row order, so the CEO's key, 1, is everyone's boss key):

 title_key | title | boss_title_key
-----------+-------+----------------
         1 | CEO   |              1
         2 | CFO   |              1
         3 | CTO   |              1

Instead of having a string as a key, you now have an int, which is better for dimension lookups and helpful for implementing a Slowly Changing Dimension.

In Greenplum 4.2 you get a different, wrong result: the boss of every position is key 2, a value that doesn't even appear in the title_key column.

It works that way, I suppose, because of lazy evaluation of the with_surrogate subquery: nextval('seq') is re-evaluated for each reference to the CTE in the self join, so the two sides of the join see different keys. Anyway, here is the solution: use a window function (or rownum in Oracle):

with raw_data as (
    select 'CEO' as title, 'CEO' as boss
    union all
    select 'CFO' as title, 'CEO' as boss
    union all
    select 'CTO' as title, 'CEO' as boss
),
with_surrogate as (
    select row_number() over (order by title) 
        + (select coalesce(max(title_key),0) from dim_title) as title_key,
        title,
        boss
    from raw_data
),
self_join as (
    select c.title_key, c.title, boss.title_key as boss_title_key
    from with_surrogate c
    inner join with_surrogate boss on boss.title = c.boss
)
select *
from self_join
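
In a real load, that final select would feed an insert into the dimension. Here is a minimal sketch; the dim_title schema is my assumption, not from the post:

```sql
-- Sketch: load the keyed rows into the dimension.
-- Assumes dim_title(title_key int, title text, boss_title_key int).
insert into dim_title (title_key, title, boss_title_key)
with raw_data as (
    select 'CEO' as title, 'CEO' as boss
    union all
    select 'CFO' as title, 'CEO' as boss
),
with_surrogate as (
    -- offset by the current max so new keys never collide with existing ones
    select row_number() over (order by title)
        + (select coalesce(max(title_key), 0) from dim_title) as title_key,
        title,
        boss
    from raw_data
),
self_join as (
    select c.title_key, c.title, boss.title_key as boss_title_key
    from with_surrogate c
    inner join with_surrogate boss on boss.title = c.boss
)
select title_key, title, boss_title_key
from self_join;
```
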

row_number() will give you sequential identifiers for the query result (1, 2, 3, …), and the select max on the destination table gives you the offset needed to avoid collisions with already existing keys. Notice that I use order by title here. You only need it because of Greenplum's lazy evaluation; you can use row_number() over () if there is no self join before materialization.

row_number() over () is a free operation in Greenplum (and, I believe, in many DBMSs), and the select max is really fast when you have compressed columnar storage; it is also executed only once. In my benchmark, creating 5 million new keys from a fact table that already held 5 million rows took 13 seconds (2 seconds of which were spent computing the max), instead of 91 seconds with the sequence.

If the max query takes too long, you can use a table to keep track of the max, at the cost of a more complex mix. You can't use concurrent insertions with this technique, but in ETL you control the insertions, so it shouldn't be a problem.
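
The tracking-table variant could look like the sketch below. The etl_max_keys table and its columns are my invention, not part of the original post; adapt the names to your ETL framework:

```sql
-- Sketch: keep the current max key in a small tracking table
-- instead of scanning the dimension on every load.
-- Table and column names are hypothetical.
create table etl_max_keys (table_name text, max_key bigint);
insert into etl_max_keys values ('dim_title', 0);

-- Inside the load, read the offset from the tracking table:
--   select row_number() over (order by title)
--       + (select max_key from etl_max_keys
--          where table_name = 'dim_title') as title_key, ...

-- After the load, bump the stored value by the batch size,
-- which your ETL job already knows (3 rows in this example):
update etl_max_keys
set max_key = max_key + 3
where table_name = 'dim_title';
```
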

