The last book I read was Collective Intelligence in Action by Satnam Alag (published by Manning). It covers data mining from a Web 2.0 perspective. Data is generated by users in many forms (ratings, tags, blogs, web pages, …). Such data is not well defined: a user can create a new tag like gloupy without telling you what it means. There are also text mining issues, such as how to understand the meaning of a sentence.
The book is divided into three parts. The first (half of the book) describes the data and, more specifically, how to collect it (web crawling, blog trackers). The second part is about exploiting the data, i.e. data mining (clustering and prediction). There is also a chapter on converting text into tokens. The last part gives example applications, such as building an intelligent search engine or a recommendation engine (with an interesting discussion of the Amazon, Google News and Netflix solutions).
Being based on Java code, the book relies on libraries such as Nutch for web crawling, Lucene for text handling and Weka for data mining. I think there is too much Java code in the book. It gets boring, and you easily end up skipping pages. For instance, the book implements k-means with self-made code, Weka code and JDM (a Java data mining API) code. Seeing the same thing three times seems rather pointless.
Nevertheless, I found this book very interesting and a very good introduction to web mining, an area I know little about.