When it comes to data mining, the tool you use matters a great deal. People seem to use many software packages (see How many software packages is too much?). I currently use three tools: Weka, R and Microsoft Excel. When I have to, I also write my own tools. Here is why I need all of them.
Weka
Weka is my favorite, as I come from the machine learning field. It bundles most of the prediction algorithms I need (clustering is a weaker part). I use it almost exclusively through its API rather than the Explorer interface, so each experiment is a Java program.
The big advantage is that it is written in Java, so you can rely on Eclipse for editing (auto-completion, automatic error correction, and so on). If a behaviour doesn't please me, I can change it in no time. It is also easy to integrate your results with a real application (which is often written in Java too). When your data mining task is fairly complex, a Java program using Weka is the best way to go.
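A minimal sketch of what such an experiment looks like, assuming the Weka jar is on the classpath and an ARFF file named iris.arff sits in the working directory (both the file name and the choice of J48 are my assumptions; the post does not give a concrete program):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaExperiment {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute as the class
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build a C4.5-style decision tree and evaluate it by 10-fold cross-validation
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

Because everything is plain Java, you can step through the classifier in Eclipse or subclass it when its behaviour needs changing.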
R
I’m not a big fan of R. I see it as a big bazaar (chaos?). Nevertheless, it’s quite useful because of its many graphing tools. You can gain a lot of insight into your data using R.
Sometimes a task is easy, like the following, which builds a decision tree on the iris dataset:
> library(rpart)
> data(iris)
> tree <- rpart(Species ~ ., data = iris)
> tree
n= 150

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
    6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
Sometimes you lose a lot of time because of some esoteric behaviour. It’s true that Weka needs a more verbose program to do the same thing. R is good for advanced exploration and statistics.
Excel
Excel is not a data mining tool, but it is really useful in a data mining process. Pivot tables in particular are interesting: for example, I export my results to CSV and build a pivot table in Excel to find out where my predictions fail, to actually understand the limits of a model. I often prefer making a simple graph or checking some correlations in Excel rather than in R. Excel is very good for simple exploration.
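The kind of error breakdown a pivot table gives can also be sketched in plain Java. This is only an illustration of the idea, counting misclassifications per (actual, predicted) pair; the two-column row layout is my assumption, since the post only mentions exporting results to CSV:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ErrorPivot {
    /** Count occurrences of each (actual, predicted) pair, like a pivot table would. */
    public static Map<String, Integer> confusionCounts(List<String[]> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[0] + " -> " + row[1];  // actual -> predicted
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[]{"versicolor", "versicolor"},
            new String[]{"versicolor", "virginica"},
            new String[]{"virginica", "virginica"});
        // Cells where actual != predicted show where the model fails
        System.out.println(confusionCounts(rows));
    }
}
```

In practice the interactive drag-and-drop of an Excel pivot table is faster for this kind of exploration, which is the point of the section above.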
The OpenOffice equivalent of Excel is far behind it. There is progress, but it’s just not enough. Sometimes Calc (its name) crashes while building a pivot table. It’s also slow as hell.
Conclusion
All three tools share a limit: they can’t handle massive datasets (at least not easily). They are all good in different areas. But sometimes I have to write my own program or use another library, for example to use genetic algorithms (try geal for that). Do you use other good tools?
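As an illustration of the kind of thing that falls outside these tools, here is a minimal genetic algorithm sketch in plain Java. The OneMax fitness function (count of 1 bits) is an arbitrary toy example of mine, not anything taken from geal:

```java
import java.util.Arrays;
import java.util.Random;

public class OneMaxGA {
    static final Random RNG = new Random(42);

    /** Fitness: number of 1 bits (the classic OneMax toy problem). */
    static int fitness(boolean[] genome) {
        int f = 0;
        for (boolean bit : genome) if (bit) f++;
        return f;
    }

    /** Flip each bit independently with the given probability. */
    static boolean[] mutate(boolean[] genome, double rate) {
        boolean[] child = genome.clone();
        for (int i = 0; i < child.length; i++)
            if (RNG.nextDouble() < rate) child[i] = !child[i];
        return child;
    }

    /** One-point crossover: copy a, then overwrite the tail with b's genes. */
    static boolean[] crossover(boolean[] a, boolean[] b) {
        int cut = RNG.nextInt(a.length);
        boolean[] child = a.clone();
        System.arraycopy(b, cut, child, cut, a.length - cut);
        return child;
    }

    /** Pick the fitter of two random individuals (tournament of size 2). */
    static boolean[] select(boolean[][] pop) {
        boolean[] a = pop[RNG.nextInt(pop.length)], b = pop[RNG.nextInt(pop.length)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    /** Evolve a random population and return the best genome of the last generation. */
    static boolean[] evolve(int popSize, int genomeLen, int generations) {
        boolean[][] pop = new boolean[popSize][genomeLen];
        for (boolean[] g : pop)
            for (int i = 0; i < genomeLen; i++) g[i] = RNG.nextBoolean();
        for (int gen = 0; gen < generations; gen++) {
            boolean[][] next = new boolean[popSize][];
            for (int i = 0; i < popSize; i++)
                next[i] = mutate(crossover(select(pop), select(pop)), 1.0 / genomeLen);
            pop = next;
        }
        return Arrays.stream(pop).max((x, y) -> fitness(x) - fitness(y)).get();
    }

    public static void main(String[] args) {
        boolean[] best = evolve(30, 20, 50);
        System.out.println("best fitness = " + fitness(best));
    }
}
```

A real application would swap in a problem-specific fitness function and a dedicated library, but the loop above is the whole idea: select, recombine, mutate, repeat.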
July 20, 2009 at 20:31
I use MATLAB quite a bit and have been experimenting recently with EDM (“Enterprise Data Miner”) and RIK (“Rule Induction Kit”) from http://www.data-miner.com (only US$25 each!).
July 23, 2009 at 00:12
I wasn’t aware that MATLAB was used for real data mining studies. A point to dig into. Your blog is a very good start.
September 23, 2009 at 12:11
Hi,
What I see as the biggest challenge in DM is data preparation. That’s not a task for Weka, nor for Excel, and I’m not familiar with R.
SAS Base does a great job there. So does SPSS Modeler. An SPSS Statistics trial is available at http://www.spss.com.
Java is cool. But you are wasting time with programming. A data miner has more important tasks to do than generating tons of code.
Doesn’t (s)he?
September 24, 2009 at 13:27
@dodo
I disagree for two reasons:
– I cannot count the algorithms I can give you an understandable description of in 1 minute, but when it comes to a real data analysis you will meet special cases where you have to know EXACTLY how the algorithm is implemented. That is the reason I could never work with non-open-source programs.
– If you are not able to write code (at least to change the behavior of existing algorithms or create new ones), you restrict yourself to using only what’s available. Are you sure your data mining environment is prepared for every possible data analysis problem?
@tools: You forgot RapidMiner (formerly YALE), which does an excellent job of handling large datasets and data preparation (its key focus). It is free, it is open source and it is written in Java.
@this blog: This is my first visit and I really like it. Design, content … sweet and technical, that is how I like it.
October 29, 2009 at 00:33
@Steffen I haven’t had the chance to use RapidMiner yet. It could of course be interesting, as it has become more popular than Weka.
@dodo data preparation is of course a big topic. I use Talend for that job; it is very efficient and easy (and you can add Java to it). Most ETL packages will work.
October 31, 2013 at 10:14
Hello Sébastien Derivaux,
I do agree with you, these tools are really useful in a data mining process. Consider Weka: it is a collection of machine learning algorithms for solving real-world data mining problems. R, also called GNU S, is a strongly functional language and environment for statistically exploring data sets, and it produces many graphical displays. And yes, Excel is not a data mining tool, but it is a very useful and important tool in a data mining process.