Twitter is a famous social website. It works like a blog, but limits each message to 140 characters. It is therefore called micro-blogging, and the short format encourages much more frequent updates about every thought you might have. Can we do something with such terse data?
I’m only at the beginning of this project. I have set up a basic crawling infrastructure in order to extract a dataset from Twitter and mine it.
The extracted data have six attributes: user name, location, followers count, following count, biography (a small "who am I" field) and the concatenation of the user's last messages. Below is an example of a profile, from a public figure named Richard Bacon. In this example, you can see how messy this information is. The location is quite unclear (GPS coordinates). The biography is quite short (but very clear in this case). And the content is … confusing.
id: 1351
name: richardpbacon
location: iphone 51.511682 0.224661
nbFollowing: 72
nbFollowers: 360574
bio: minor celebrity bbc radio fivelive presenter
content: yep she tweeted sunday her tweet alone theyd have
run monday news 10 asking susan boyle backlash she overrated
sounds like someone team listened 5live way work sounds like
someone news 10 team (...)
Actually, the content field displayed above has already been processed. I used Lucene to tokenize and clean the text. Below is the text before and after applying Lucene, turning free-form text into tokens.
before: News at 10 asking, is there a Susan Boyle backlash / is she overrated? Sounds like someone on the team listened to 5live on the way to work.
after: news 10 asking susan boyle backlash she overrated sounds like someone team listened 5live way work
As you can see, there are still a lot of meaningless tokens like 5live.
I have run a quick segmentation (not much data, not a good algorithm, not much cleaning) on the biography tokens only. Nevertheless, with 25 clusters, things start to emerge. For instance, one cluster has a high relative frequency of tokens like university, engineering, computer, student, science, studying, school: this is a students cluster (3% of my dataset). There is also a cluster for official accounts of public figures (twitter, page, official, feed), some geek clusters (one for Mac or Linux users, one for open source software developers, another for web developers), a cluster of company accounts (tokens like company, services, production, advertising, leading) and a photographers cluster (photography, make-up, light, photo, traveler).
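For the curious, this kind of segmentation can be sketched as a plain k-means over term-frequency vectors built from the biography tokens. This is a minimal, self-contained illustration, not the framework code used here: the `BioClusterer` class, the toy vocabulary and the deterministic farthest-point initialization are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Minimal k-means sketch over bag-of-words vectors (illustrative only).
public class BioClusterer {

    // Turn whitespace-tokenized biographies into term-frequency vectors
    // over a fixed vocabulary; out-of-vocabulary tokens are ignored.
    public static double[][] vectorize(String[] bios, List<String> vocab) {
        double[][] v = new double[bios.length][vocab.size()];
        for (int i = 0; i < bios.length; i++)
            for (String tok : bios[i].split("\\s+")) {
                int j = vocab.indexOf(tok);
                if (j >= 0) v[i][j]++;
            }
        return v;
    }

    // Squared Euclidean distance between two vectors.
    private static double sq(double[] a, double[] b) {
        double d = 0;
        for (int j = 0; j < a.length; j++) d += (a[j] - b[j]) * (a[j] - b[j]);
        return d;
    }

    // Plain k-means with deterministic farthest-point initialization.
    // Returns the cluster index assigned to each point.
    public static int[] cluster(double[][] points, int k, int iters) {
        int n = points.length, d = points[0].length;
        double[][] centers = new double[k][];
        centers[0] = points[0].clone();
        for (int c = 1; c < k; c++) { // next center: farthest remaining point
            int far = 0;
            double farD = -1;
            for (int i = 0; i < n; i++) {
                double dmin = Double.MAX_VALUE;
                for (int cc = 0; cc < c; cc++)
                    dmin = Math.min(dmin, sq(points[i], centers[cc]));
                if (dmin > farD) { farD = dmin; far = i; }
            }
            centers[c] = points[far].clone();
        }
        int[] assign = new int[n];
        for (int it = 0; it < iters; it++) {
            for (int i = 0; i < n; i++) { // assignment step
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (sq(points[i], centers[c]) < sq(points[i], centers[best]))
                        best = c;
                assign[i] = best;
            }
            double[][] sums = new double[k][d]; // update step
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < d; j++) sums[assign[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++)
                        centers[c][j] = sums[c][j] / counts[c];
        }
        return assign;
    }

    public static void main(String[] args) {
        String[] bios = {
            "university student computer science student",
            "engineering student university school",
            "photography light photo traveler",
            "photo photography makeup"
        };
        List<String> vocab = Arrays.asList("university", "student", "science",
            "engineering", "school", "photography", "photo", "light");
        // Prints one cluster index per biography; the two student-like bios
        // end up together, as do the two photographer-like ones.
        System.out.println(Arrays.toString(cluster(vectorize(bios, vocab), 2, 10)));
    }
}
```

On real data you would of course use many more tokens, weight them (tf-idf rather than raw counts) and pick k by trial; Weka's SimpleKMeans does the same job out of the box.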
More work has to be done, but these first insights are encouraging.
June 4, 2009 at 10:08
can you make the source available for download plz ?
June 4, 2009 at 14:46
Here is some code. I use my own data mining framework, but it should be easy to port it to Weka.
package esotech.experiments.twitter;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.URL;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.jdom.Attribute;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.Namespace;
import org.jdom.filter.Filter;
import org.jdom.input.SAXBuilder;

import esotech.data.Dataset;
import esotech.data.attribute.AttributeTypeNumeric;
import esotech.data.attribute.AttributeTypeString;
import esotech.data.dataset.DatasetJdbc;
import esotech.data.dataset.DatasetStructure;
import esotech.data.exceptions.DataException;

public class TwitterExtractor {

public static final String TWITTER_BASE = "http://twitter.com/";

private Connection connection = null;
private String rootAccount = "RedheadRhapsody";
private StandardAnalyzer analyzer = new StandardAnalyzer();
private Set<String> already = new TreeSet<String>();
private int amount = 10;
public Dataset extract() {
DatasetStructure structure = new DatasetStructure();
structure.addAttribute("name", new AttributeTypeString());
// structure.addAttribute("pseudo", new AttributeTypeString());
structure.addAttribute("location", new AttributeTypeString());
structure.addAttribute("nbFollowing", new AttributeTypeNumeric());
structure.addAttribute("nbFollowers", new AttributeTypeNumeric());
structure.addAttribute("bio", new AttributeTypeString());
// structure.addAttribute("bioTokens", new AttributeTypeString());
structure.addAttribute("content", new AttributeTypeString());
// structure.addAttribute("contentTokens", new AttributeTypeString());
DatasetJdbc dataset = new DatasetJdbc("twitter", structure, connection, false);
Queue<String> queue = new LinkedList<String>();
queue.add(rootAccount);
int count = amount;
while(queue.size() > 0 && count > 0) {
String user = queue.poll();
System.out.println(queue.size() + " - " + count + " - " + user);
if(parseUser(user, dataset, queue)) {
count--;
}
// don't push twitter too far.
/* try {
Thread.sleep(36000);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
*/
}
//System.out.println(dataset);
return dataset;
}
private boolean parseUser(String user, Dataset dataset, Queue<String> queue) {
URL url;
try {
url = new URL(TWITTER_BASE + user);
InputStream in = url.openStream();
SAXBuilder sxb = new SAXBuilder();
org.jdom.Document document = null;
// Building the XML tree as a org.jdom.Document
document = sxb.build(new InputStreamReader(in));
String location = getLocation(document);
String bio = getBio(document);
String content = getContent(document);
int nbFollowing = getFollowingCount(document);
int nbFollowers = getFollowersCount(document);
dataset.setCurrentInstanceCellString(0, user);
dataset.setCurrentInstanceCellString(1, location);
dataset.setCurrentInstanceCellDouble(2, nbFollowing);
dataset.setCurrentInstanceCellDouble(3, nbFollowers);
dataset.setCurrentInstanceCellString(4, bio);
dataset.setCurrentInstanceCellString(5, content);
dataset.addCurrentInstance();
queueConnection(document, queue, already);
} catch(DataException e) {
throw new RuntimeException(e);
} catch(org.jdom.input.JDOMParseException e) {
return false;
} catch (JDOMException e) {
return false;
} catch (IOException e) {
return false;
}
return true;
}
private String getContent(Document document) {
Iterator it = document.getDescendants(new Filter() {
public boolean matches(Object obj) {
if(!(obj instanceof Element))
return false;
Element e = (Element)obj;
if(!e.getName().equals("span"))
return false;
Attribute a = e.getAttribute("class");
if(a == null)
return false;
if(!a.getValue().equals("entry-content"))
return false;
return true;
}
});
StringBuffer sb = new StringBuffer();
while(it.hasNext()) {
Element e = (Element)it.next();
sb.append(e.getText());
if(it.hasNext()) {
sb.append(" ");
}
}
String result = tokenizeString(sb.toString());
if(result.length() < 1500)
return result;
else return result.substring(0, 1500);
}
private void queueConnection(Document document, Queue<String> queue, Set<String> already) {
Iterator it = document.getDescendants(new Filter() {
public boolean matches(Object obj) {
if(!(obj instanceof Element))
return false;
Element e = (Element)obj;
if(!e.getName().equals("span"))
return false;
Attribute a = e.getAttribute("class");
if(a == null)
return false;
if(!a.getValue().equals("vcard"))
return false;
return true;
}
});
while(it.hasNext() && queue.size() < amount) {
Element e = (Element)it.next();
Element ec = e.getChild("a", Namespace.getNamespace("http://www.w3.org/1999/xhtml"));
String userCon = ec.getAttributeValue("href").substring(1);
if(already.contains(userCon))
continue;
queue.add(userCon);
already.add(userCon);
}
}
private int getFollowingCount(Document document) {
Iterator it = document.getDescendants(new Filter() {
private static final long serialVersionUID = 3237010317631492835L;
public boolean matches(Object obj) {
if(!(obj instanceof Element))
return false;
Element e = (Element)obj;
if(!e.getName().equals("span"))
return false;
Attribute a = e.getAttribute("id");
if(a == null)
return false;
if(!a.getValue().equals("following_count"))
return false;
return true;
}
});
Element e = (Element)it.next();
return Integer.parseInt(e.getText().trim().replaceAll(",", ""));
}
private int getFollowersCount(Document document) {
Iterator it = document.getDescendants(new Filter() {
public boolean matches(Object obj) {
if(!(obj instanceof Element))
return false;
Element e = (Element)obj;
if(!e.getName().equals("span"))
return false;
Attribute a = e.getAttribute("id");
if(a == null)
return false;
if(!a.getValue().equals("follower_count"))
return false;
return true;
}
});
Element e = (Element)it.next();
return Integer.parseInt(e.getText().trim().replaceAll(",", ""));
}
private String getBio(Document document) {
Iterator it = document.getDescendants(new Filter() {
public boolean matches(Object obj) {
if(!(obj instanceof Element))
return false;
Element e = (Element)obj;
if(!e.getName().equals("span"))
return false;
Attribute a = e.getAttribute("class");
if(a == null)
return false;
if(!a.getValue().equals("bio"))
return false;
return true;
}
});
try {
Element e = (Element)it.next();
return tokenizeString(e.getText().substring(2));
} catch (java.util.NoSuchElementException e) {
return "";
}
}
private String getLocation(Document document) {
Iterator it = document.getDescendants(new Filter() {
public boolean matches(Object obj) {
if(!(obj instanceof Element))
return false;
Element e = (Element)obj;
if(!e.getName().equals("span"))
return false;
Attribute a = e.getAttribute("class");
if(a == null)
return false;
if(!a.getValue().equals("adr"))
return false;
return true;
}
});
try {
Element e = (Element)it.next();
return tokenizeString(e.getText());
} catch (java.util.NoSuchElementException e) {
return "";
}
}
public void setRootAccount(String rootAccount) {
this.rootAccount = rootAccount;
}
public String getRootAccount() {
return rootAccount;
}
private String tokenizeString(String input) {
StringBuilder sb = new StringBuilder();
TokenStream stream = analyzer.tokenStream("t", new StringReader(input));
Token t = new Token();
try {
while(stream.next(t) != null) {
sb.append(t.term().replaceAll("'", ""));
sb.append(" ");
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return sb.toString();
}
public void setAmount(int amount) {
this.amount = amount;
}
public int getAmount() {
return amount;
}
public Connection getConnection() {
return connection;
}
public void setConnection(Connection connection) {
this.connection = connection;
}
public void setAlready(Set<String> already) {
this.already = already;
}
}