Recently I discussed how to get some data from Twitter. So far, I have downloaded 6859 profiles. Here I will give some information about them. Of course, it’s only a very small subset of the whole Twitter community.
First, the location field. Here are the 20 most frequent locations:
+-------------------+-------+-----------+
| location          | count | share (%) |
+-------------------+-------+-----------+
| (empty)           |  1787 |   26.0534 |
| london            |   327 |    4.7675 |
| los angeles       |   159 |    2.3181 |
| los angeles ca    |   113 |    1.6475 |
| uk                |    67 |    0.9768 |
| new york          |    65 |    0.9477 |
| london uk         |    55 |    0.8019 |
| usa               |    53 |    0.7727 |
| washington dc     |    47 |    0.6852 |
| new york ny       |    44 |    0.6415 |
| california        |    44 |    0.6415 |
| san francisco ca  |    40 |    0.5832 |
| canada            |    31 |    0.4520 |
| everywhere        |    31 |    0.4520 |
| nyc               |    31 |    0.4520 |
| san francisco     |    30 |    0.4374 |
| chicago           |    28 |    0.4082 |
| la                |    27 |    0.3936 |
| new york city     |    26 |    0.3791 |
| manchester        |    23 |    0.3353 |
+-------------------+-------+-----------+
A quarter of the users don’t use the location field. The same real location can appear under many different field values: Los Angeles, for example, shows up as los angeles, los angeles ca, la, … Grouping such synonyms, I found that 6.25% of the declared locations are Los Angeles, 9.56% London and 4.69% New York. These figures are slightly overestimated, since some locations called london are not London in the UK, for instance, but such cases are relatively few. It would be interesting to try to extract an OLAP dimension from such data, at least (country, state, city).
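The grouping of synonyms can be sketched as follows. This is a minimal Python sketch, not the procedure actually used for the figures above: the synonym map is an assumption limited to the variants visible in the top-20 table, so the shares it produces are lower than the 6.25%, 9.56% and 4.69% quoted, which presumably include rarer spellings.

```python
from collections import Counter

TOTAL_PROFILES = 6859  # number of downloaded profiles

# Counts copied from the top-20 table (only the rows used here).
RAW_COUNTS = {
    "": 1787,
    "london": 327,
    "los angeles": 159,
    "los angeles ca": 113,
    "new york": 65,
    "london uk": 55,
    "new york ny": 44,
    "nyc": 31,
    "la": 27,
    "new york city": 26,
}

# Assumed synonym map (hypothetical, restricted to the table's variants).
SYNONYMS = {
    "los angeles": "los angeles",
    "los angeles ca": "los angeles",
    "la": "los angeles",
    "london": "london",
    "london uk": "london",
    "new york": "new york",
    "new york ny": "new york",
    "nyc": "new york",
    "new york city": "new york",
}


def canonical_shares(raw_counts, synonyms, total_profiles):
    """Return the share (%) of each canonical city among declared locations."""
    # Declared locations = all profiles minus those with an empty field.
    declared = total_profiles - raw_counts.get("", 0)
    merged = Counter()
    for location, count in raw_counts.items():
        city = synonyms.get(location)
        if city is not None:
            merged[city] += count
    return {city: 100.0 * count / declared for city, count in merged.items()}


shares = canonical_shares(RAW_COUNTS, SYNONYMS, TOTAL_PROFILES)
for city, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{city}: {share:.2f}%")
```

A lookup table like this cannot tell London in the UK from another town called london, which is the overestimation mentioned above.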
Next, I want to see how unrepresentative my Twitter subset is of the whole Twitter database. I know that with my procedure the probability of a profile being selected grows linearly with its number of followers. Barring inconsistencies in Twitter’s data, the total number of “following” links equals the total number of “follower” links, because each link counts once on each side: if a follows b, then b is followed by a. Over the whole population, the average number of followers therefore equals the average number of following.
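A tiny illustration of that counting argument, as a Python sketch. The user names and links are made up for the example; only the equality of the two averages matters.

```python
from collections import Counter

# Hypothetical follow links: (a, b) means "a follows b".
links = [("ann", "bob"), ("cat", "bob"), ("dan", "bob"), ("bob", "ann")]

following = Counter(a for a, _ in links)  # how many accounts each user follows
followers = Counter(b for _, b in links)  # how many followers each user has
users = set(following) | set(followers)

# Every link adds one to each total, so the two means are always equal.
mean_following = sum(following.values()) / len(users)
mean_followers = sum(followers.values()) / len(users)
print(mean_following, mean_followers)
```

Individual users can be very unbalanced (bob has three followers and follows one account), but the population averages cannot differ.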
In my subset, the average number of followers is around 12,000 and the average number of following is 1,500. That is 8 times as many followers as following on average, very far from the real population, where the two averages should be equal. Thus my subset can hardly be used to make inferences about the whole population.
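The effect of selecting profiles with a probability proportional to their number of followers can be reproduced on synthetic data. The sketch below is only an illustration under assumptions: the follower counts are drawn from an arbitrary heavy-tailed distribution (not from Twitter), and the crawl is idealised as independent draws weighted by follower count.

```python
import random


def size_biased_mean(values):
    """Expected value when each item is drawn with probability
    proportional to its own value: E[X^2] / E[X]."""
    return sum(v * v for v in values) / sum(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: heavy-tailed follower counts, all >= 1.
population = [int(rng.paretovariate(1.5)) for _ in range(10_000)]
population_mean = sum(population) / len(population)

# Idealised crawl: draw profiles with weight = number of followers.
sample = rng.choices(population, weights=population, k=2_000)
sample_mean = sum(sample) / len(sample)

print("population mean :", population_mean)
print("sample mean     :", sample_mean)
print("size-biased mean:", size_biased_mean(population))
```

The sample mean is pulled towards the few accounts with very many followers, which is the same direction as the gap observed in the crawled subset.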
The overall correlation between these two attributes is 0.34. That is less than I expected, but I suspect this correlation depends heavily on the type of user (and currently we don’t know the type of each user).
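For reference, here is a minimal sketch of how such a correlation can be computed, assuming the figure is a Pearson coefficient (the post does not say which coefficient was used). The two lists are made-up values, not the crawled data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if one of the sequences is constant.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical follower / following counts for five profiles.
followers = [10, 200, 3000, 50, 12000]
following = [20, 150, 400, 60, 1500]
print(pearson(followers, following))
```

Computing the same coefficient separately for each type of user, once the types are known, would show whether the overall value hides very different behaviours.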
June 6, 2009 at 21:01
I think the way you crawl the profiles is too biased; you should consider an alternative strategy instead of picking a seed and following the followers.
Here is a simple idea: Twitter provides a tool named Twitter Search. You could run queries and use the profiles that appear in the results. Of course, the results of one query will be people sharing at least one interest. However, if you manage to make the queries different enough, you could get very different profiles.
Of course you’ll have to decide which token or set of tokens to search for. You could use a dictionary if you want to orient your crawl, or you could extract tokens from a news website such as http://news.google.com/.
You could also combine both strategies: run searches to initialize various seeds, and then crawl the followers.
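The combined strategy could be sketched roughly like this in Python. Nothing here calls Twitter: `search_profiles` and `get_followers` are hypothetical callables that the caller would have to implement, and the two dictionaries are fake data used only to make the sketch runnable.

```python
from collections import deque


def crawl(tokens, search_profiles, get_followers, max_profiles):
    """Collect up to max_profiles user ids: seed from one search per
    token, then follow followers breadth-first."""
    seen = set()
    queue = deque()

    def visit(user):
        # Record a user once, as long as the budget is not exhausted.
        if user not in seen and len(seen) < max_profiles:
            seen.add(user)
            queue.append(user)

    for token in tokens:
        for user in search_profiles(token):
            visit(user)

    while queue and len(seen) < max_profiles:
        for follower in get_followers(queue.popleft()):
            visit(follower)
    return seen


# Fake data standing in for the search tool and the follower lists.
FAKE_SEARCH = {"news": ["u1", "u2"], "music": ["u3"]}
FAKE_FOLLOWERS = {"u1": ["u4"], "u2": ["u1", "u5"], "u3": [], "u4": [], "u5": []}

profiles = crawl(
    ["news", "music"],
    lambda token: FAKE_SEARCH.get(token, []),
    lambda user: FAKE_FOLLOWERS.get(user, []),
    max_profiles=10,
)
print(sorted(profiles))
```

The more varied the tokens, the more varied the seeds, which is the point of using the search tool in the first place.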
June 6, 2009 at 21:01
forgot the link : http://search.twitter.com/
June 7, 2009 at 19:44
That’s an idea I have considered. Nevertheless, using the search tool would bias the dataset toward people who are more active than the others. It would also be a little more complex to crawl. Lastly, it would double the number of requests to Twitter.
I think that whatever the bias is, it will only change the size of the clusters, not their existence. I have some ideas to correct the bias later.