Data manipulation is a big part of any data mining process. Some authors claim it can take 80% of a data mining project, and I can only agree. If the data comes from a data warehouse, it can be much faster; if you have to dig into (and understand) operational systems, or add some external data, the work takes even more time. It is therefore of the greatest importance to be efficient at data manipulation. Currently I use two ways to do this task, depending on the situation: big SQL queries or an ETL tool.
Big SQL Queries
Usually, to extract data from a database, you create an SQL query and export the result set to a CSV (comma-separated values) or Excel file. Depending on your knowledge of Excel and SQL, you may then do some post-treatment in Excel. I find it more convenient to do everything in the query, so that the dataset can be reproduced at any time. I have found a few SQL tips very useful for big analyses.
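As a sketch of the export step: in SQL*Plus (12.2 and later), a result set can be spooled directly as CSV. The file name and columns below are illustrative, not part of a real project.

```sql
-- Illustrative only: spool a query result to a CSV file from SQL*Plus 12.2+
set markup csv on
spool customers.csv
select CLIE_CLIENT_ID, CLIE_NAME from CLIENT;
spool off
```

Older clients can achieve the same with concatenated columns and separators, but the CSV markup mode avoids hand-formatting.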
Common Table Expressions (CTE)
Using subqueries (and sub-sub-…-subqueries) makes a query nearly impossible to understand (at least for me). The WITH keyword keeps things simple, in a more functional or procedural fashion, without any performance hit: the query is still evaluated as set operations.
with customers as (
  /* Do some cleanup in clients */
  select CLIE_CLIENT_ID as id,
         lower(CLIE_NAME) as name,
         decode(CLIE_TYPE, 'I', 'Individual',
                           'P', 'Professional',
                                'Unknown') as category,
         CLIE_DISTRICT as district,
         case when CLIE_DT_BEGIN > sysdate - 365 then 'New'
              else 'Old' end as age
  from CLIENT
  where CLIE_DISABLED = 'F'
),
location as (
  /* Find some geographical attributes */
  select DIST_DISTRICT_ID as district,
         DIST_POPULATION as district_pop,
         COUN_POPULATION as country_pop
  from DISTRICT
  join COUNTRY on DIST_COUNTRY_ID = COUN_COUNTRY_ID
)
/* Merge all */
select id, name, category, age, district_pop, country_pop
from customers
natural join location;
A query is easy to understand when it is broken into small, separate parts, which is exactly what WITH does.
Analytic Functions
Analytic (window) aggregations have been part of SQL since SQL:2003 (so support varies depending on your DBMS). They let you extend each row of a table with aggregated values. Please find a good description here. For a time series, you can add the moving average at each point (which is sometimes more predictive than the point value itself). For each customer transaction, you can add the sum of the amounts of all transactions of that customer.
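The two examples above can be sketched with window functions; the table and column names (SALES, TRX) are illustrative, not from the schema used earlier.

```sql
/* Illustrative schema: SALES(day, amount), TRX(client_id, trx_date, amount) */

-- Moving average over the current row and the 6 preceding rows
select day,
       amount,
       avg(amount) over (order by day
                         rows between 6 preceding and current row) as mov_avg_7
from SALES;

-- Each transaction extended with the customer's total amount
select client_id,
       trx_date,
       amount,
       sum(amount) over (partition by client_id) as client_total
from TRX;
```

Note that, unlike GROUP BY, the window clause keeps every input row and only adds a column.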
Join Strategies
Analytical queries are not the same as transactional ones: they usually need to join many rows together. When joining two tables, an RDBMS generally reads each row of one table, say A (or the subset selected by the WHERE clause), and looks it up in the second table, say B, using an index. This lookup strategy requires many random reads and is iterative, which is exactly what we want to avoid in a database. Its complexity is a·log(b), where a is the number of rows that need to be read from A and b is the total number of rows in B. Oracle can also perform MERGE joins and HASH joins; I will briefly explain how they work.
A MERGE join takes the subset of A (after the WHERE clause) and the subset of B, sorts both by the join key, then merges them. If the join is an equijoin (using only '='), a single pass over each sorted subset is enough to complete the join. The sort is the costly part, so you need to restrict the subsets of A and B to the minimum, or let Oracle use an ordered index instead of sorting.
A HASH join is like the lookup strategy, except that it first builds a hash map from the subset of B, so each probe is very fast (constant time on average). Building the hash map is roughly linear in b', the number of rows in the subset of B. If b' is a lot smaller than b, it is very efficient.
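To see which join method Oracle actually picks for a query, EXPLAIN PLAN is the standard tool. A sketch, reusing the columns from the CTE example above (the query itself is illustrative):

```sql
explain plan for
select c.CLIE_NAME, d.DIST_POPULATION
from CLIENT c
join DISTRICT d on c.CLIE_DISTRICT = d.DIST_DISTRICT_ID;

-- Display the chosen plan: look for NESTED LOOPS, MERGE JOIN or HASH JOIN steps
select * from table(DBMS_XPLAN.DISPLAY);
```

If the optimizer's choice looks wrong, hints such as /*+ USE_HASH(d) */ or /*+ USE_MERGE(d) */ can force a particular method, but up-to-date statistics usually make that unnecessary.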
You can read here about query optimization in Oracle. I do not discuss bitmap indexes, which are a great help but require DBA action. The ANALYZE TABLE operation is of course very useful too: it gathers the statistics that allow Oracle to find the best join strategy for you.
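Gathering those statistics is a one-liner; the legacy syntax is shown alongside the DBMS_STATS package that Oracle recommends in its place (run from SQL*Plus; the table name is from the example above):

```sql
-- Legacy syntax mentioned in the text
analyze table CLIENT compute statistics;

-- Recommended replacement since Oracle 8i
exec DBMS_STATS.GATHER_TABLE_STATS(ownname => user, tabname => 'CLIENT');
```

With fresh statistics, the optimizer can estimate subset sizes correctly and choose between nested loops, merge and hash joins on its own.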