A new data mining contest is available here. The domain is medical, and there are two tasks. The first is to predict whether a given patient will be transferred to another hospital. The second is to predict whether the patient will die (the medical domain definitely lacks fun). For each task, we must give every patient a score so that patients can be ranked from the most probable to the least. The dataset presents several challenges. In this post, I share my personal ideas for handling them.


Each patient is represented by a sequence of visits (the previous visits and the current one), so a given patient has many lines in the file. The sequences do not have a fixed length, so we can’t just concatenate everything onto one line.
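One simple way around the variable sequence length is to summarize each patient's previous visits with a few aggregates and keep the current visit as is. The sketch below is only an illustration: the column names (patient_id, visit_no, length_of_stay) and the values are made up, not taken from the contest file.

```python
from collections import defaultdict

# Hypothetical visit records, one row per (patient, visit).
# Column names and values are illustrative assumptions.
visits = [
    {"patient_id": "A", "visit_no": 1, "length_of_stay": 3},
    {"patient_id": "A", "visit_no": 2, "length_of_stay": 5},
    {"patient_id": "B", "visit_no": 1, "length_of_stay": 2},
    {"patient_id": "A", "visit_no": 3, "length_of_stay": 1},
]


def summarize_sequences(rows):
    """Turn variable-length visit sequences into one fixed-width row per patient."""
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["patient_id"]].append(row)

    table = {}
    for pid, seq in by_patient.items():
        seq.sort(key=lambda r: r["visit_no"])
        # The last visit is treated as the current one; the rest are history.
        previous, current = seq[:-1], seq[-1]
        stays = [r["length_of_stay"] for r in previous]
        table[pid] = {
            "n_previous_visits": len(previous),
            "mean_previous_stay": sum(stays) / len(stays) if stays else 0.0,
            "max_previous_stay": max(stays, default=0),
            "current_stay": current["length_of_stay"],
        }
    return table


patient_table = summarize_sequences(visits)
```

Every patient ends up with the same four columns whatever the number of visits, at the price of losing the order of the previous visits.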

Set-valued attributes

There are also some set-valued attributes (attributes whose value is a set). In the data file, such a set is represented by the columns Other-Dx-Code-1, Other-Dx-Code-2, …, with Other-Dx-Code-9 often missing. There are also Principal-Dx-Code and Admit-Dx-Code, which I see as part of the same set.
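An attribute whose value is a set of codes can be flattened into one 0/1 column per code, so the column position of a diagnosis no longer matters. This is a minimal sketch: the codes D1, D2, D3 are invented, and the upper bound of 9 on the Other-Dx-Code columns is an assumption.

```python
def diagnosis_set(row, n_other=9):
    """Collect all diagnosis codes of one visit into a single set.

    n_other=9 is an assumed upper bound on the Other-Dx-Code-* columns.
    Missing or empty columns are skipped.
    """
    cols = ["Principal-Dx-Code", "Admit-Dx-Code"]
    cols += [f"Other-Dx-Code-{i}" for i in range(1, n_other + 1)]
    return {row[c] for c in cols if row.get(c)}


def multi_hot(rows):
    """Encode each visit's diagnosis set as a 0/1 vector over all seen codes."""
    sets = [diagnosis_set(r) for r in rows]
    vocab = sorted(set().union(*sets))
    matrix = [[1 if code in s else 0 for code in vocab] for s in sets]
    return vocab, matrix


# Invented diagnosis codes, for illustration only.
rows = [
    {"Principal-Dx-Code": "D1", "Admit-Dx-Code": "D1",
     "Other-Dx-Code-1": "D2", "Other-Dx-Code-2": ""},
    {"Principal-Dx-Code": "D3", "Admit-Dx-Code": "D2"},
]

vocab, matrix = multi_hot(rows)
```

Merging the principal and admit codes into the set loses the distinction between them; they could also be kept as separate columns next to the 0/1 columns.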

Hierarchical attributes

Some attributes form a hierarchy. For instance, Hospital-ID and Region-ID are two levels of a geographical hierarchy. I don’t know how a hierarchy can be used in data mining (that is, in a cleverer way than as standard attributes). It could be interesting for generalization purposes and for reducing overfitting.
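One possible use of such a hierarchy, given here only as an illustration and not as something tried on the contest data, is to shrink a noisy per-hospital statistic toward the statistic of its region. The target column "transferred", the weight prior_weight and the toy rows are all assumptions.

```python
from collections import defaultdict


def smoothed_rates(rows, child, parent, target, prior_weight=10.0):
    """Estimate the target rate per child, shrunk toward its parent's rate.

    A child with few rows stays close to the parent rate; a child with
    many rows is dominated by its own data. prior_weight (an assumed
    tuning knob) acts like a number of pseudo-observations.
    """
    child_n, child_pos = defaultdict(int), defaultdict(int)
    parent_n, parent_pos = defaultdict(int), defaultdict(int)
    parent_of = {}
    for r in rows:
        c, p, y = r[child], r[parent], r[target]
        child_n[c] += 1
        child_pos[c] += y
        parent_n[p] += 1
        parent_pos[p] += y
        parent_of[c] = p

    rates = {}
    for c, n in child_n.items():
        p = parent_of[c]
        parent_rate = parent_pos[p] / parent_n[p]
        rates[c] = (child_pos[c] + prior_weight * parent_rate) / (n + prior_weight)
    return rates


# Toy data: H1 has a single (positive) row, H2 has nine negative rows.
rows = [{"Hospital-ID": "H1", "Region-ID": "R1", "transferred": 1}]
rows += [{"Hospital-ID": "H2", "Region-ID": "R1", "transferred": 0}] * 9

rates = smoothed_rates(rows, "Hospital-ID", "Region-ID", "transferred")
```

Here the raw rate of H1 would be 1.0 from a single row, while the smoothed value is pulled toward the regional rate of 0.1. Because the feature is built from the target, it should be computed on training data only.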

It’s relational

These three problems have their relational nature in common. I think it’s madness to use the data directly as a single table; we need to formalize the problem better first. Then we could construct a single table using two features of relational data mining, selection graphs and aggregation (either manually or automatically). Notice that the last link is a paper by the contest organiser, Claudia Perlich, so I don’t think I can be too wrong. I don’t know if it’s the best way, but if I do something it will clearly be in this direction.
