Now it’s time to create some clusters from our Twitter data. In this post we focus only on biographical tags, and we use the classic k-means algorithm to find significant clusters. At least we hope so.
Previously in this series:
- Description and how to grab profiles
- First statistics about locations
- How to reduce the number of tokens
We are looking for 15 clusters. Why this number? It’s a choice I made after some experimentation: more than 15 clusters becomes tedious to inspect, and fewer would merge groups we want to keep separate.
Our dataset has tags as columns, with 1 if the tag is used and 0 if not. K-means works with a distance, for instance Euclidean or cosine-based. Text mining usually uses the cosine distance and normalizes the dataset so that the Euclidean norm of each instance equals one (thus similarity lies between 0 and 1). This is often done to be fair to instances with only a few tags: without normalization, a hypothetical instance carrying every tag would be the most similar instance to every other one. I don’t think normalization is a good idea in this case: each description has very few tags while the total number of tags is high. Nevertheless, this choice could be discussed.
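The post doesn’t show the preprocessing code, so here is a minimal sketch of the two steps described above, using NumPy and entirely made-up toy profiles: building the binary tag matrix, and the optional row normalization to unit Euclidean norm (which I chose to skip):

```python
import numpy as np

# Hypothetical toy data: each profile is the set of tags found in its bio.
profiles = [
    {"music", "guitar"},
    {"news", "daily"},
    {"music", "rock", "indie"},
]
vocab = sorted(set().union(*profiles))

# Binary matrix: 1 if the profile uses the tag, 0 otherwise.
X = np.array([[1 if t in p else 0 for t in vocab] for p in profiles],
             dtype=float)

# Optional cosine-style normalization: give each row unit Euclidean norm,
# so that Euclidean distance between rows behaves like cosine dissimilarity.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
```

With this encoding the unnormalized matrix is what gets fed to k-means in my case.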
Here we have a segmentation. Have we won? Not yet, I think. K-means is an old algorithm and not a very good one. One important issue is called stability: if you run k-means 10 times, changing the initial cluster centers each time, you obtain 10 different results. Which is the best? Here we use a kind of cluster-similarity criterion: the best segmentation is the one whose clusters are the most dissimilar from each other.
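The selection procedure above can be sketched as follows. This is not the post’s actual code; it assumes scikit-learn and SciPy, uses random toy data, and scores each run by the minimum pairwise distance between centroids (one simple way to say "clusters are very dissimilar"):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

# Toy binary tag matrix: 200 users, 20 tags, sparse 0/1 entries.
rng = np.random.default_rng(0)
X = (rng.random((200, 20)) < 0.1).astype(float)

best_score, best_model = -np.inf, None
for seed in range(10):
    # One k-means run per seed, each with different initial centers.
    km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
    # Score the run by how far apart its clusters are:
    # the minimum pairwise centroid distance.
    score = pdist(km.cluster_centers_).min()
    if score > best_score:
        best_score, best_model = score, km

labels = best_model.labels_
```

Other scores are possible (silhouette, inertia); the point is simply to keep the most separated of the 10 segmentations.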
The result is here. For each cluster, you get the percentage of the dataset it represents and a tag cloud of its important tags. The bigger the tag, the higher its frequency in the cluster. The darker the tag, the more representative it is, i.e. it is used much more often in this cluster than in the overall population.
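The "representativeness" used for the tag-cloud darkness can be computed as a lift: the tag’s frequency inside the cluster divided by its frequency in the whole population. A small illustrative sketch with invented data:

```python
import numpy as np

# Toy binary tag matrix (rows: users, cols: tags) and cluster labels.
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
])
labels = np.array([0, 0, 1, 1])
tags = ["music", "news", "tech"]

overall = X.mean(axis=0)  # tag frequency in the whole population
for c in np.unique(labels):
    in_cluster = X[labels == c].mean(axis=0)  # frequency inside the cluster
    lift = in_cluster / overall               # > 1 means over-represented
    for tag, f, l in zip(tags, in_cluster, lift):
        print(f"cluster {c}: {tag} freq={f:.2f} lift={l:.2f}")
```

Here "music" has lift 2 in cluster 0: every member uses it, versus half of the population, so it would be drawn darkest.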
Cluster 5 represents 11% of the dataset, but I can’t find a better description than ‘people’: it contains common people.
Cluster 12 contains people more interested in music, hence the tags band, guitar, indie, music, musician, production, rock and soul, which are quite important for this cluster (high frequency in the cluster, and much higher than in the whole population).
Cluster 13 is the news and media cluster: breaking, channel, daily, information, latest, news, research.
Cluster 1 contains important people, hence their need to say that theirs is an official account or page. They provide feeds so you can keep up with them. It is the most stable cluster: whatever segmentation you make, you get it.
Cluster 3 is more business-oriented: things like advertising, marketing and management. Look here if you need a guru, an expert or a consultant.
Cluster 8 is more difficult; I call them online addicts. They like technology and social media.
Cluster 6 is people from the culture industry: journalist, actor, comedian, columnist, singer, presenter, thinker and writer.
The other clusters are not meaningful (I think). Maybe we should dig deeper into them? This segmentation also omits the computer science group and the university members group (which you can extract with a different segmentation). Well, clustering is an ill-posed problem, as we said: it only gives you one vision of the truth.
June 20, 2009 at 19:58
Is it only the information from the profiles, or also the last tweets?
I thought you used the percentage of presence of each token for each user and not just a boolean value… I must have misunderstood.
June 21, 2009 at 11:18
No, just profile information.
In general you would use the number of occurrences of each token, but it’s irrelevant in this case (the text field is too small), so it’s binary here (0 and 1 in the computation process). I’ve tried both methods; it doesn’t make a significant difference.
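To illustrate why the two encodings barely differ on short bios, here is a tiny sketch with an invented bio: in a text this small, almost every token occurs once, so counts and presence/absence are nearly the same vector:

```python
from collections import Counter

# Hypothetical short bio.
bio = "music lover music producer"
tokens = bio.split()

counts = Counter(tokens)            # occurrence counts per token
binary = {t: 1 for t in counts}     # presence/absence encoding

# Only "music" differs between the two encodings; every other
# token has count 1 either way.
```

On a 160-character bio field this difference is rarely enough to move a cluster boundary.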