Recently I discussed how to get some data from Twitter. So far, I have downloaded 6859 profiles. Here I will give some information about them. Of course, this is only a very small subset of the whole Twitter community.

First, the location field. Here are the 20 most frequent locations (the first row is the empty location):

+-------------------+-------+---------+
| location          | count | proba(%)|
+-------------------+-------+---------+
|                   |  1787 | 26.0534 |
| london            |   327 |  4.7675 |
| los angeles       |   159 |  2.3181 |
| los angeles ca    |   113 |  1.6475 |
| uk                |    67 |  0.9768 |
| new york          |    65 |  0.9477 |
| london uk         |    55 |  0.8019 |
| usa               |    53 |  0.7727 |
| washington dc     |    47 |  0.6852 |
| new york ny       |    44 |  0.6415 |
| california        |    44 |  0.6415 |
| san francisco ca  |    40 |  0.5832 |
| canada            |    31 |  0.4520 |
| everywhere        |    31 |  0.4520 |
| nyc               |    31 |  0.4520 |
| san francisco     |    30 |  0.4374 |
| chicago           |    28 |  0.4082 |
| la                |    27 |  0.3936 |
| new york city     |    26 |  0.3791 |
| manchester        |    23 |  0.3353 |
+-------------------+-------+---------+

A quarter of the users do not fill in the location field. The same real location can appear under many different field values: Los Angeles, for example, shows up as los angeles, los angeles ca, la, … Grouping such synonyms, I found that 6.25% of the declared locations are Los Angeles, 9.56% are London and 4.69% are New York. These figures are slightly overestimated, since some places called london are not London in the UK, for instance, but such cases are relatively few. It would be interesting to try to extract an OLAP dimension from such data, with at least the levels (country, state, city).
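The synonym grouping can be sketched in a few lines of Python. This is only an illustration: the counts are copied from the top-20 table above, the `SYNONYMS` mapping and the `city_shares` helper are hypothetical (not the list actually used for the figures quoted), and the shares are assumed to be taken over profiles with a non-empty location.

```python
from collections import Counter

# Counts copied from the top-20 table (normalized location -> number of profiles).
top_locations = {
    "": 1787, "london": 327, "los angeles": 159, "los angeles ca": 113,
    "uk": 67, "new york": 65, "london uk": 55, "usa": 53,
    "washington dc": 47, "new york ny": 44, "california": 44,
    "san francisco ca": 40, "canada": 31, "everywhere": 31, "nyc": 31,
    "san francisco": 30, "chicago": 28, "la": 27, "new york city": 26,
    "manchester": 23,
}

TOTAL_PROFILES = 6859

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "los angeles": "Los Angeles",
    "los angeles ca": "Los Angeles",
    "la": "Los Angeles",
    "london": "London",
    "london uk": "London",
    "new york": "New York",
    "new york ny": "New York",
    "nyc": "New York",
    "new york city": "New York",
}


def city_shares(counts, synonyms, total):
    """Return the share (in %) of each canonical city among declared locations."""
    # Assumption: "declared" means every profile with a non-empty location.
    declared = total - counts.get("", 0)
    merged = Counter()
    for raw, n in counts.items():
        city = synonyms.get(raw)
        if city is not None:
            merged[city] += n
    return {city: 100.0 * n / declared for city, n in merged.items()}


shares = city_shares(top_locations, SYNONYMS, TOTAL_PROFILES)
for city, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{city}: {pct:.2f}%")
```

Because this sketch only merges spellings that appear in the top 20, it gives lower shares than the figures quoted above, which presumably also include rarer spellings from the long tail.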

Next, I want to see how unrepresentative my Twitter subset is of the whole Twitter database. I know that, with my procedure, the probability that a profile is selected grows linearly with the number of followers it has. If there is no trouble with Twitter, the total number of following links equals the total number of follower links, because every link is counted once on each side: if a follows b, then b is followed by a.
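Both points can be checked on a toy graph. The sketch below uses a hypothetical five-user edge list and assumes the selection probability is exactly proportional to the follower count, which is a simplification of the actual procedure.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy graph: (a, b) means "a follows b".
edges = [
    ("ann", "star"), ("bob", "star"), ("cat", "star"), ("dan", "star"),
    ("star", "ann"), ("bob", "ann"), ("cat", "bob"),
]

users = sorted({u for edge in edges for u in edge})
following = Counter(a for a, _ in edges)  # accounts each user follows
followers = Counter(b for _, b in edges)  # accounts following each user

# Every edge is counted once as a "following" and once as a "follower".
total_following = sum(following[u] for u in users)
total_followers = sum(followers[u] for u in users)
assert total_following == total_followers == len(edges)

# Over the whole population the two averages are therefore identical.
mean_followers = Fraction(total_followers, len(users))

# Assumed sampling scheme: P(select u) is proportional to followers[u].
weights = {u: Fraction(followers[u], total_followers) for u in users}
biased_mean_followers = sum(weights[u] * followers[u] for u in users)
biased_mean_following = sum(weights[u] * following[u] for u in users)

print("population mean (followers = following):", float(mean_followers))
print("expected followers in the biased sample:", float(biased_mean_followers))
print("expected following in the biased sample:", float(biased_mean_following))
```

Even on this tiny graph, the expected follower count under the biased selection is well above the population mean, while the expected following count is not. This is the same kind of gap as the one observed in the downloaded subset.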

In my subset, the average number of followers is around 12,000 and the average number of following is 1,500. That is 8 times as many followers as following on average. This is very far from the real population, where the two averages must be equal because the two totals are equal. Thus my subset can hardly be used to make inferences about the whole population.

The overall correlation between these two attributes is 0.34. This is less than I would expect, but I suspect that this correlation depends strongly on the type of user (and currently we don't know the type of each user).
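For reference, here is a minimal sketch of how such a correlation can be computed, assuming it is the usual Pearson coefficient. The `pearson` helper, the two user types and all the numbers are made up for illustration. They are not taken from the downloaded profiles, whose user types are unknown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (followers, following) pairs for two made-up user types.
regular = [(10, 12), (20, 18), (30, 33), (40, 41)]
celebrity = [(10000, 50), (20000, 20), (30000, 40)]


def split(pairs):
    """Turn a list of (followers, following) pairs into two lists."""
    followers = [p[0] for p in pairs]
    following = [p[1] for p in pairs]
    return followers, following


r_regular = pearson(*split(regular))
r_celebrity = pearson(*split(celebrity))
r_pooled = pearson(*split(regular + celebrity))

print(f"regular users:   {r_regular:.2f}")
print(f"celebrity users: {r_celebrity:.2f}")
print(f"all users:       {r_pooled:.2f}")
```

With data like this, the coefficient computed within one type can differ a lot from the pooled one, which is the effect suspected above. Checking it on the real subset would first require labelling each profile with a type.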

