Sometimes (well, most of the time) your favorite data mining methods and the most obvious attributes are not good enough. What to do then? A common idea is to try every other model your software provides and/or add every attribute you can think of, whatever its relation to your problem. In this post, I will try to lay out a kind of “how to” for this case.
Step 1 : What is my model?
If your model is a neural network, it’s quite hard to get any insight into how it works by looking at the weights or activation functions. How could you improve something you don’t understand?
Step 1.1 : Is there a human version, please?
There are two models which are easy to understand : decision trees (for classification) and linear regression (for regression). It costs nothing to try them. If they give results close to the initial model, they can be a good approximation and capture most of the initial model’s logic.
If the results of such a simple model are too far from your model, you could consider using ensembles of simple models learned on different subsets of the learning set.
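As a rough sketch of the idea in step 1.1, the snippet below trains a one-split decision tree (a "stump") on the outputs of an opaque model rather than on the true labels, then measures how often the two agree. Everything here is an assumption made for illustration: `black_box` is a made-up stand-in for the model you cannot read, the data is random, and a real study would use a full decision tree learner instead of this tiny stump.

```python
import random

def black_box(row):
    # Hypothetical opaque model: stands in for a neural network we cannot read.
    return 0.9 * row[0] + 0.1 * row[1] * row[2] > 0.5

def fit_stump(rows, labels):
    """Find the one-split rule 'attribute j > t' that best mimics the labels.

    Returns (attribute index, threshold, agreement rate).
    """
    best = (None, None, -1.0)
    for j in range(len(rows[0])):
        # Try every observed value of attribute j as a candidate threshold.
        for t in sorted({row[j] for row in rows}):
            hits = sum((row[j] > t) == label for row, label in zip(rows, labels))
            agreement = hits / len(rows)
            if agreement > best[2]:
                best = (j, t, agreement)
    return best

rng = random.Random(0)
rows = [[rng.random() for _ in range(3)] for _ in range(200)]
# The stump learns from the model's outputs, not from the ground truth.
model_outputs = [black_box(row) for row in rows]

attribute, threshold, fidelity = fit_stump(rows, model_outputs)
print(f"rule: attribute {attribute} > {threshold:.2f}, fidelity {fidelity:.0%}")
```

A high agreement rate suggests the readable rule captures most of what the opaque model does; a low one means the simple model should not be trusted as an explanation.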
Step 1.2 : Does it use the attributes I provide?
Once you have these simpler models, you gain insight into which attributes are used and how. In a linear regression you could look at p-values to see which attributes have an impact. Be aware that p-values can be misleading when attributes are collinear.
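One simple way to spot collinear attributes before trusting p-values is to look at pairwise correlations. The sketch below does not compute p-values; it only flags pairs of attributes whose Pearson correlation is very high. The attribute names, the synthetic data and the 0.9 threshold are all arbitrary assumptions for the example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

rng = random.Random(1)
price = [rng.gauss(100, 10) for _ in range(300)]
columns = {
    "price": price,
    # Nearly a rescaled copy of "price": the kind of attribute that confuses p-values.
    "price_with_tax": [p * 1.2 + rng.gauss(0, 0.5) for p in price],
    "store_size": [rng.gauss(50, 5) for _ in range(300)],
}

flagged = collinear_pairs(columns)
for a, b, r in flagged:
    print(f"{a} and {b} are nearly collinear (r = {r:.3f})")
```

Pairwise correlation misses cases where one attribute is a combination of several others, so treat an empty result as a hint, not a proof.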
Step 2 : Where and why is my model failing?
Maybe your model works fine half of the time, but really fails on some cases. Try to find where the model makes its biggest mistakes.
Using the simpler version from step 1, you should be able to run the model by hand on these failing cases and see why it doesn’t work.
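A possible way to find where the model fails is to group the prediction errors by some attribute and rank the groups by mean absolute error. This is only a sketch: the `region` attribute and the handful of records are invented for the example, and any other grouping (a binned numeric attribute, a customer type) would work the same way.

```python
from collections import defaultdict

def worst_segments(records, key):
    """Rank the values of `key` by mean absolute error, worst first."""
    totals = defaultdict(lambda: [0.0, 0])  # segment -> [sum of abs errors, count]
    for rec in records:
        bucket = totals[rec[key]]
        bucket[0] += abs(rec["actual"] - rec["predicted"])
        bucket[1] += 1
    mae = {seg: total / count for seg, (total, count) in totals.items()}
    return sorted(mae.items(), key=lambda item: item[1], reverse=True)

# Hypothetical predictions from the model under study.
records = [
    {"region": "north", "actual": 10.0, "predicted": 10.5},
    {"region": "north", "actual": 12.0, "predicted": 11.5},
    {"region": "south", "actual": 20.0, "predicted": 12.0},
    {"region": "south", "actual": 18.0, "predicted": 25.0},
    {"region": "east", "actual": 15.0, "predicted": 14.0},
]

ranking = worst_segments(records, "region")
for segment, error in ranking:
    print(f"{segment}: mean absolute error {error:.2f}")
```

The segments at the top of the ranking are the ones worth walking through by hand with the simpler model.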
Step 3 : What can I do?
At the time of writing I see two main actions that could be used to improve the model : adding more attributes and segmenting the problem. There is another which doesn’t directly improve the model but improves the results; I call it cheating (from a machine learning perspective).
An idea for a new attribute can arise when finding where the model fails. But in general, you can’t get this attribute. For instance, if you need the oil price three months ahead, it would be quite challenging to find. And if you were able to predict it, you would get rich enough to forget the initial problem and drink mojitos on a beach all day long.
Segmenting the problem means building a separate model for each subset of your population which behaves differently. It’s like making a decision tree to choose which model to use; if the underlying models are already decision trees, doing it manually would generally be useless.
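To illustrate segmentation, the sketch below fits one straight line on the whole population and then one line per segment, and compares the squared error. The two segments and their opposite trends are invented for the example; the point is only that a single linear model cannot follow both.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one attribute: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def sse(xs, ys, slope, intercept):
    """Sum of squared errors of the line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = list(range(10))
# Hypothetical segments that behave differently: one rises, the other falls.
segments = {
    "a": (xs, [2.0 * x for x in xs]),
    "b": (xs, [10.0 - x for x in xs]),
}

# One model for everybody.
all_x = [x for seg_x, _ in segments.values() for x in seg_x]
all_y = [y for _, seg_y in segments.values() for y in seg_y]
global_sse = sse(all_x, all_y, *fit_line(all_x, all_y))

# One model per segment.
segmented_sse = sum(sse(sx, sy, *fit_line(sx, sy)) for sx, sy in segments.values())

print(f"single model SSE: {global_sse:.1f}, per-segment SSE: {segmented_sse:.1f}")
```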
Cheating can be used when you know something about your problem but your model can’t express it. Maybe it’s a multiplicative factor. Of course a linear regression can’t do the trick on its own. Thus, you could use a filter approach (pre-processing and post-processing) to apply it.
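Here is a minimal sketch of the filter approach, assuming the multiplicative factor is known for every case (say, a promotion multiplier; this is an invented example). The target is divided by the factor before fitting the linear regression (pre-processing) and the prediction is multiplied by it afterwards (post-processing).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one attribute: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_with_known_factor(xs, ys, factors):
    # Pre-processing: remove the known multiplicative effect from the target.
    adjusted = [y / f for y, f in zip(ys, factors)]
    return fit_line(xs, adjusted)

def predict_with_known_factor(x, factor, slope, intercept):
    # Post-processing: put the known multiplicative effect back.
    return (slope * x + intercept) * factor

# Hypothetical data: a linear trend multiplied by 1.5 on every even period.
xs = list(range(1, 13))
factors = [1.5 if x % 2 == 0 else 1.0 for x in xs]
ys = [(3.0 * x + 2.0) * f for x, f in zip(xs, factors)]

slope, intercept = fit_with_known_factor(xs, ys, factors)
prediction = predict_with_known_factor(14, 1.5, slope, intercept)
print(f"slope {slope:.2f}, intercept {intercept:.2f}, prediction {prediction:.1f}")
```

The regression itself never sees the factor, which is why this counts as cheating from a machine learning point of view: the knowledge lives outside the model.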
Step 4 : What if it does not work?
When you are here, I see only one solution: use genetic programming, give it access to all the data you have and every possible mathematical, logical or other function, and wait. You could look at GenIQ pages for a description. 99.9% of the time I think it will not work. But if you have an unused computer you can run it and come back some days, weeks or months later. In the meantime you could do something else. It’s the computer at work.
Another idea?
This only presents the process I’m currently using. If you have another idea, please feel free to leave a comment.
July 29, 2009 at 18:23
Sometimes one needs to know when to give up, too. Recall that the purpose of empirical modeling is to approximate an underlying behavior, not to force it. If the available data are insufficient to the task, then that is what the data miner should report.
July 30, 2009 at 23:17
Thanks for your comment. You are of course right, but I don’t know of any statistical test which tells you that you can’t do significantly better. Maybe thinking of an unexpected attribute could bring new insights into the business.
It’s not an easy decision to say stop. It will always be too soon or too late.
If you have many problems to solve, a breadth-first strategy should be better than a depth-first one. Find a first solution for each, then look for something better.
September 23, 2009 at 12:03
“… use genetic programming … You could look at GenIQ pages for a description. 99.9% of the time I think it will not work”
Good point! Just what would the people behind GenIQ say 🙂