PS: This post is quite old now and isn't relevant anymore. MySQL 8.0 introduced hash joins, which makes it much more suitable for a data warehouse setting.

 

I often use a database not only to store data but also to do some processing before mining and analysis. I use MySQL as the back-end. I have also considered other free software alternatives, but I feel uncomfortable with Firebird, and PostgreSQL lacks analytical features like WITH ROLLUP.

MySQL is an atypical DBMS in the sense that it offers many different storage engines for the data. Here is a short description of each.

MyISAM

MyISAM is the historical MySQL storage engine. It supports neither transactions nor foreign keys. Its big weakness is its lack of a data cache: it never keeps rows in memory in case you access them again. This defect is partially compensated by the operating system's page cache.
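
The storage engine is chosen per table with the ENGINE clause at creation time. A minimal sketch, using an illustrative fact table (column names are assumptions):

```sql
-- Hypothetical fact table stored with MyISAM: no transactions,
-- no foreign keys, and no engine-level row cache.
CREATE TABLE fact (
    clientId INT NOT NULL,
    dateId   INT NOT NULL,
    qty      INT NOT NULL
) ENGINE=MyISAM;
```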

InnoDB

InnoDB is the transactional engine of MySQL. It stores rows clustered in the primary index, so lookups by primary key are quite fast. It also has its own data cache (the buffer pool).

Memory

Memory stores data only in memory. All content (except the table structure) is lost when the database shuts down, so you have to repopulate these tables at each MySQL launch. It is very fast but limited by the amount of memory you allocate.
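
Repopulation at startup can be automated with the server's `--init-file` option, which replays a SQL script after every restart. A sketch, where the table names are assumptions:

```sql
-- Contents of the script pointed to by --init-file:
-- the MEMORY table is empty after a restart, so refill it
-- from a persistent on-disk copy.
CREATE TABLE IF NOT EXISTS client_mem LIKE client;
ALTER TABLE client_mem ENGINE=MEMORY;
INSERT INTO client_mem SELECT * FROM client;
```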

Archive

Archive is the solution for read-only (and infrequently read) data. Rows are compressed on the fly to reduce storage cost, so I/O operations are reduced but more CPU is used (for decompression). There is currently no support for indexes; every query implies a full linear scan of the table.

I've modeled a star schema with two dimensions of 100k rows each and a fact table of 10M rows. The footprint of the fact table depends on the storage engine used:

Engine   Footprint
InnoDB   416 MB
MyISAM   204 MB (+184 MB of indexes)
Archive  24 MB
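
These figures can be read from the standard `information_schema.TABLES` view; the schema name below is a placeholder:

```sql
-- Data and index footprint per table, in MB.
SELECT table_name,
       engine,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = 'dwh';  -- hypothetical schema name
```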

Archive does a great job of reducing the footprint. InnoDB and MyISAM take around the same space.

Now, let's look at the time taken by a query that extracts some statistics from the dataset.

SELECT d.yeard, c.country, c.city, SUM(qty)
FROM fact f
JOIN client c ON f.clientId = c.id
JOIN dated d ON f.dateId = d.id
GROUP BY yeard, country, city

/* Result. */
+-------+-----------+-----------+----------+
| yeard | country   | city      | sum(qty) |
+-------+-----------+-----------+----------+
|  1981 | Allemagne | Berlin    |  3161889 |
|  1981 | Allemagne | Bohn      |  3468275 |
|  1981 | Allemagne | Frankfort |  3467403 |
|  1981 | France    | Lille     |  2244612 |
|  1981 | France    | Lyon      |  2243573 |
|  1981 | France    | Paris     |  2652841 |
|  1981 | UK        | Liverpool |  3979256 |
|  1981 | UK        | Londres   |  4283599 |
+-------+-----------+-----------+----------+
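
Since WITH ROLLUP was one reason for choosing MySQL in the first place, note that the same query can also produce subtotals simply by appending the modifier. A sketch:

```sql
-- WITH ROLLUP adds subtotal rows (shown with NULL in the
-- rolled-up columns) per (yeard, country), per yeard,
-- and a grand total.
SELECT d.yeard, c.country, c.city, SUM(qty)
FROM fact f
JOIN client c ON f.clientId = c.id
JOIN dated d ON f.dateId = d.id
GROUP BY yeard, country, city WITH ROLLUP;
```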

We run the following command while the query executes in order to disable the effect of the OS page cache:

watch "echo 3 > /proc/sys/vm/drop_caches"

Results depending on the storage engines used:

Engine for fact  Engine for client  Engine for date  Query time
MyISAM           MyISAM             MyISAM           64s
InnoDB           InnoDB             InnoDB           33s
Archive          InnoDB             InnoDB           28s
InnoDB           Memory             Memory           28s
MyISAM           InnoDB             InnoDB           23s
Archive          Memory             Memory           21s
MyISAM           Memory             Memory           19s

As you can see, results greatly depend on the engines you use. The best configuration I found is MyISAM for the fact table: it's very fast at sequential reads. For dimension tables, InnoDB is a better choice. Dimension tables are used for look-ups and can fit in the allocated memory, so InnoDB's data cache does a great job.
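
That mixed configuration can be set up by converting existing tables in place; a sketch, assuming the table names used above:

```sql
-- Sequential-scan-friendly engine for the big fact table...
ALTER TABLE fact   ENGINE=MyISAM;
-- ...and a cached, index-organized engine for the look-up dimensions.
ALTER TABLE client ENGINE=InnoDB;
ALTER TABLE dated  ENGINE=InnoDB;
```

Note that ALTER TABLE ... ENGINE rewrites the whole table, so converting the 10M-row fact table takes a while.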

