PS: This post is quite old now and isn't relevant anymore. MySQL 8.0 introduced hash joins, which makes it much more suitable for a data warehouse setting.

 

I often use a database not only to store data but also to do some processing before mining and analysis. I use MySQL as the back-end. I have also considered other free software alternatives, but I feel uncomfortable with Firebird, and PostgreSQL lacks analytical features like WITH ROLLUP.

MySQL is an atypical DBMS in the sense that it offers many different storage engines for the data. Here is a short description of each.

MyISAM

MyISAM is the historical MySQL storage engine. It supports neither transactions nor foreign keys. Its big weakness is its lack of a data cache: it never keeps rows in memory in case you access them again. This defect is partially compensated by the operating system's page cache.
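
The storage engine is chosen per table with the ENGINE clause at creation time. A minimal sketch, using an illustrative fact table (column names are assumptions):

```sql
-- Hypothetical fact table stored with MyISAM: no transactions,
-- no foreign keys, and no engine-level row cache.
CREATE TABLE fact (
    clientId INT NOT NULL,
    dateId   INT NOT NULL,
    qty      INT NOT NULL
) ENGINE=MyISAM;
```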

InnoDB

InnoDB is the transactional engine of MySQL. It stores rows clustered in the primary index, so lookups by primary key are quite fast. It also has its own data cache (the buffer pool).

Memory

Memory stores data only in memory. All content (except the table structure) is lost when the database shuts down, so you have to repopulate these tables at each MySQL launch. It is very fast but limited by the amount of memory you allocate.
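
Repopulation at startup can be automated with the server's `--init-file` option, which replays a SQL script after every restart. A sketch, where the table names are assumptions:

```sql
-- Contents of the script pointed to by --init-file:
-- the MEMORY table is empty after a restart, so refill it
-- from a persistent on-disk copy.
CREATE TABLE IF NOT EXISTS client_mem LIKE client;
ALTER TABLE client_mem ENGINE=MEMORY;
INSERT INTO client_mem SELECT * FROM client;
```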

Archive

Archive is the solution for read-only (and infrequently read) data. Rows are compressed on the fly to reduce storage cost, so I/O operations are reduced but more CPU is used (for decompression). There is currently no support for indexes; every query implies a full linear scan of the table.

I've modeled a star schema with two dimensions of 100k rows each and a fact table of 10M rows. The footprint of the fact table depends on the storage engine used:

Engine   Footprint
InnoDB   416 MB
MyISAM   204 MB (+184 MB of indexes)
Archive  24 MB
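
These figures can be read from the standard `information_schema.TABLES` view; the schema name below is a placeholder:

```sql
-- Data and index footprint per table, in MB.
SELECT table_name,
       engine,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = 'dwh';  -- hypothetical schema name
```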

Archive does a great job of reducing the footprint. InnoDB and MyISAM take around the same space.

Now, let's look at the time taken by a query that extracts some statistics from the dataset.

SELECT d.yeard, c.country, c.city, SUM(qty)
FROM fact f
JOIN client c ON f.clientId = c.id
JOIN dated d ON f.dateId = d.id
GROUP BY yeard, country, city

/* Result. */
+-------+-----------+-----------+----------+
| yeard | country   | city      | sum(qty) |
+-------+-----------+-----------+----------+
|  1981 | Allemagne | Berlin    |  3161889 |
|  1981 | Allemagne | Bohn      |  3468275 |
|  1981 | Allemagne | Frankfort |  3467403 |
|  1981 | France    | Lille     |  2244612 |
|  1981 | France    | Lyon      |  2243573 |
|  1981 | France    | Paris     |  2652841 |
|  1981 | UK        | Liverpool |  3979256 |
|  1981 | UK        | Londres   |  4283599 |
+-------+-----------+-----------+----------+
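
Since WITH ROLLUP was one reason for choosing MySQL in the first place, note that the same query can also produce subtotals simply by appending the modifier. A sketch:

```sql
-- WITH ROLLUP adds subtotal rows (shown with NULL in the
-- rolled-up columns) per (yeard, country), per yeard,
-- and a grand total.
SELECT d.yeard, c.country, c.city, SUM(qty)
FROM fact f
JOIN client c ON f.clientId = c.id
JOIN dated d ON f.dateId = d.id
GROUP BY yeard, country, city WITH ROLLUP;
```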

We run the following command while the query executes in order to disable the effect of the OS page cache:

watch "echo 3 > /proc/sys/vm/drop_caches"

Results depending on the storage engines used:

Engine for fact  Engine for client  Engine for date  Query time
MyISAM           MyISAM             MyISAM           64s
InnoDB           InnoDB             InnoDB           33s
Archive          InnoDB             InnoDB           28s
InnoDB           Memory             Memory           28s
MyISAM           InnoDB             InnoDB           23s
Archive          Memory             Memory           21s
MyISAM           Memory             Memory           19s

As you can see, results greatly depend on the engines you use. The best configuration I found is MyISAM for the fact table: it's very fast at sequential reads. For dimension tables, InnoDB is a better choice. Dimension tables are used for look-ups and can fit in the allocated memory, so InnoDB's data cache does a great job.
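
That mixed configuration can be set up by converting existing tables in place; a sketch, assuming the table names used above:

```sql
-- Sequential-scan-friendly engine for the big fact table...
ALTER TABLE fact   ENGINE=MyISAM;
-- ...and a cached, index-organized engine for the look-up dimensions.
ALTER TABLE client ENGINE=InnoDB;
ALTER TABLE dated  ENGINE=InnoDB;
```

Note that ALTER TABLE ... ENGINE rewrites the whole table, so converting the 10M-row fact table takes a while.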

