When it comes to data mining, the tool you use is very important. It seems that people use many software packages (see How many software packages is too much?). I’m currently using three tools: Weka, R and Microsoft Excel. When I have to, I also program my own tools. Here is why I need all of them.

Weka

Weka is my favorite, as I come from the machine learning field. It bundles most of the prediction algorithms I need (clustering is a weaker area). I use it almost exclusively through code (and sometimes through the Explorer interface). Thus, each experiment is a Java program.

The big advantage is that it’s written in Java. Thus you can rely on Eclipse for editing (auto-completion, automatic error fixes, …). If a behaviour doesn’t suit me, I can change it in no time. It’s also easy to integrate your results into a real application (which is often in Java too). When your data mining is quite complex, a Java program using Weka is the best way.

R

I’m not a big fan of R. I see it as a big bazaar (chaos?). Nevertheless, it’s quite useful, as it features many graphing tools. You can get a lot of insight into your data using R.
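As a small illustration of those graphing tools, here is a minimal sketch using only base R graphics on the built-in iris dataset. The choice of plots and the titles are my own assumptions, not a recommended recipe.

```r
data(iris)

# Scatterplot matrix of the four measurements, coloured by species
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19,
      main = "Iris measurements by species")

# Distribution of one measurement for each species
boxplot(Petal.Length ~ Species, data = iris,
        ylab = "Petal length (cm)")
```

Two lines of plotting code are often enough to see which variables separate the classes.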

Sometimes it’s easy to do something, like the following, which builds a decision tree on the iris dataset.

> library(rpart)
> data(iris)
> tree <- rpart(Species ~ ., data = iris)
> tree
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)
    6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087) *

Sometimes you lose a lot of time because of some esoteric behaviour. It’s true that Weka needs a more verbose program to do the same thing. R is good for advanced exploration and some statistics.

Excel

Excel is not a data mining tool, but it is really useful in a data mining process. Pivot tables can be very handy. For example, I export my results to CSV and create a pivot table in Excel to find out where my predictions fail, to actually understand the limits of a model. I often prefer Excel to R for making simple graphs or checking some correlations. Excel is very good for simple exploration.
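To make that workflow concrete, here is a rough sketch in R, assuming the rpart tree on iris as the model and predictions made on the training data purely for illustration. The file name predictions.csv and the column names actual and predicted are hypothetical. It writes the results to a CSV that a spreadsheet can open, then computes the same actual-versus-predicted pivot directly.

```r
library(rpart)  # recommended package shipped with standard R installs

data(iris)
tree <- rpart(Species ~ ., data = iris)

# One row per example: the true class and the class the tree predicts
results <- data.frame(
  actual    = iris$Species,
  predicted = predict(tree, iris, type = "class")
)

# Export for a spreadsheet pivot table (hypothetical file name)
csv_path <- file.path(tempdir(), "predictions.csv")
write.csv(results, csv_path, row.names = FALSE)

# The equivalent pivot: rows are actual classes, columns are predicted classes
pivot <- table(actual = results$actual, predicted = results$predicted)
print(pivot)

# Off-diagonal cells are the errors, i.e. where the model fails
errors <- sum(pivot) - sum(diag(pivot))
```

The off-diagonal cells show which classes get confused with each other, which is the same thing the pivot table shows in the spreadsheet.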

The OpenOffice equivalent of Excel is far behind. There is progress, but it’s just not enough. Calc (as it is called) sometimes crashes while making a pivot table. It’s also slow as hell.

Conclusion

All three tools share one limit: they can’t handle massive datasets (at least not easily). Each is good in a different area. But sometimes I have to write my own program or use another library, for example to use genetic algorithms (try geal for that). Maybe you use other good tools?

