To be an efficient data scientist a proper toolset is paramount. As there are many solutions, it is difficult to select the good one. Here we will explore PostgreSQL as a database for data science.
Pro and cons
Pro: Rich SQL
PostgreSQL is old (first release in 1995) and its motto is “The World’s Most Advanced Open Source Relational Database”. No surprise that it supports a lot of SQL syntax. Let’s focus on 3 features:
- Common table expressions : fully supported to allow big queries to be more easily understood.
- Windows functions : Ability to use other rows data to compute a row level aggregate. NULLs handling is somewhat perfectible but overall it’s great.
- Unstructured data handling : PostgreSQL will be more than happy to work on XML, HStore and mainly JSON (with a text-based version and a binary one). That means that it can play nicely with NoSQL data if your dev architecture is based on MongDB.
It is also possible to develop stored functions in Java and Python (among others) to leverage the data locality and enhance your SQL queries.
Pro: Query Parallelism
Starting with PostgreSQL 10, the database is now able to leverage all the cores that modern processor have. That’s quite important as in a data science workload, usually, there is one big query running and it would be sad to leave most cores idle. We are talking about a 1x-8x performance increase (depending on the number of cores and the query).
Sadly, I doesn’t seems to work on Windows.
Partitioning is in PostgreSQL since a long time. Sadly, it was always quite difficult to use (using table inheritance and constraints). But now it’s far easier to manage thanks to declarative partitioning. You can expect a x1-x10 speedup depending on your query.
For instance, each year can be in a separate partition and if your queries only cover the current years, all previous data will be avoided by partition pruning.
Some labor is still needed so it should be used only when there is a painful performance issue.
Pro: Strong adoption
PostgreSQL is the most well-known database so most data science tools work well with it. The ODBC and JDBC (connection for most tools) are proven and work well.
By default, you can assume that everything is compatible with PostgreSQL.
Cons : No compression
Data compression is not only about having space on the hard drive. Sure, having not enough space can limit some studies. But it’s more about performance. Hard drive being usually the bottleneck, compression means less weight on it and faster performances (even if it put more load on the CPU, the tradeoff is generally good). Moreover, if the database can keep the data compressed in memory, you can store more data in this fast access cache, again more performance.
PostgreSQL doesn’t provide anything in this area which will limit data ingestion velocity.
Cons: No columnar tables
In most (all?) analytics databases, tables are not stored per rows but per columns. It also helps the compression as data of a same columns are more likely to be similar.
PostgreSQL doesn’t provide anything in this area which will limit data ingestion velocity. Reading a column from a 100 columns table will need all the table to be parsed so don’t have a 100 columns table. While counter-intuitive, it will usually be more efficient to break it in two or more tables.
Cons: No machine learning included
While most database for data science have machine learning included to some degree (Vertica is very strong in this area), PostgreSQL don’t come with anything. Sure, there is MADlib, but it’s not for Windows and lack some basics algorithm like random forests.
When using PostgreSQL, you will more than likely end up using an external machine learning library like Scikit-learn, Weka or anything else.
Being a generalist database, it will not guess that you have an analytics setting. The most obvious spot are memory settings. shared_buffer, work_mem and temp_mem should both be set at 1GB (maximum on Windows, should be 25% or RAM for the first one and 10% of RAM for the others). Notice that temp_mem is used as size for a sort buffer. A query can use many temp buffers so having a big size can quickly consumme all your memory. effective_cache_size is a hint on how many available memory will be left for the system (which will cache file blocks).
Below is my usual setting for a 16GB computer.
shared_buffers = 1GB work_mem = 1GB temp_mem = 1GB effective_cache_size = 8GB # Or total RAM / 2
The query optimizer relies on tables statistics to find the fastest execution plan. It is therefore paramount for those statistics to be right. It doesn’t seem to be working right after a bulk insert. Therefore, it is sometime useful to run it manually like below:
vacuum analyze my_table;
PostgreSQL is still one of the major choices a data scientist can make to store data. It is well suited to be installed on a computer running Windows. It will be good for datasets up to 100GB. After that, the lack of compression and columns storage will be too painful and there will be better alternatives. It is a good low-cost and easy to use analytics choice up to the Serie A.
The migration to more advanced systems is facilitated as many are derived from PostgreSQL (like Amazon Redshift and Greemplum).
While you are here, if your PostgreSQL is a data warehouse you might want to check Data Brewery. It’s an ETL tool to supercharge your data warehouse.
Let's stay in touch with the newsletter