Big Data analysis with Hadoop, Pig and Stack Exchange data dump

On Nov 12, 2012Posted In Cloud Computing,Code Snippets,Hadoop and related,Java Script By admin8 Comments

Big Data analysis with Hadoop, Pig and Stack Exchange data dump

Finally I have finished my latest pet project. I was learning Pig and Hadoop but most importantly I was trying to learn how to finish things. As with most of my pet projects, I start small hoping I will finish that, get smth useful and have something to write about, but usually I never stop adding things and finally I rarely finish anything :). Not finishing your pet projects its not that bad, after all as you always learn something anyway, but this time was different. This time I had finished it and I’m hungry for more :). Well, maybe I’m not completely finished with it but at least I have something to show :).

So, as I mentioned this project was about learning Apache Hadoop and Pig.
To learn how to handle ‘big data’ you need a piece of big data from somewhere, there are some popular database dumps and other form of data you can pick up from, just search stackoverflow for ‘big data examples’ and for sure you’ll find something interesting, I have chosen the data from the Stackexchange itself (including Stackoverflow) as I was interested about the most scored questions and most scored users on some of the SE websites.

I have created my first (longer than 3 lines) Pig script and started playing with it using the Mathematica Stack Exchange data dump as an example, the reason for it is that its actually quite small compared to some other sites, especially StackOverflow (80% of all analyzed data).
Because the SE data is available in the form of XML files (after unpacking) I had to create my custom Pig Data Loader (XML Loader from piggybank didn’t work for me).

@Override
public Tuple getNext() throws IOException {
Tuple tuple = null;
List values = new ArrayList();
try {
boolean notDone = reader.nextKeyValue();
if (!notDone) {
return null;
}
Text value = (Text) reader.getCurrentValue();
if(value != null) {
try{
InputStream is = new ByteArrayInputStream(value.toString().getBytes("UTF-8"));
XMLInputFactory inFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlReader = inFactory.createXMLStreamReader(is);
try {
while (xmlReader.hasNext()) {
if (xmlReader.next() == XMLStreamConstants.START_ELEMENT) {
if (xmlReader.getLocalName().equals("row")) {
parseListItem(values, xmlReader);
}
}
}
} finally {
xmlReader.close();
}
}catch(Exception ex){
ex.printStackTrace();
}
tuple = tupleFactory.newTuple(values);
}

} catch (InterruptedException e) {
// add more information to the runtime exception condition.
int errCode = 6018;
String errMsg = "Error while reading input";
throw new ExecException(errMsg, errCode,
PigException.REMOTE_ENVIRONMENT, e);
}

return tuple;
}

The loader simply reads the data line by line and passes that data into ‘parseListItem’ method. The ‘parseListItem’ method is abstract, so I have created two implementations of this loader, one for Posts and one for Users as I wanted to parse just those two XML files for each site.
The method parsing each line of XML file looks like the following:

@Override
public void parseListItem(List values, XMLStreamReader xmlReader) throws ParseException {
String id = xmlReader.getAttributeValue(null, "Id");
String parentId = xmlReader.getAttributeValue(null, "ParentId");
String postTypeId = xmlReader.getAttributeValue(null, "PostTypeId");
......
//all other fields
......
values.add(id);
values.add(parentId);
values.add(postTypeId);
.......
//same for other fields
.......
}

This way we create a values for the tuple we want to return reading each line of the XML file

tuple = tupleFactory.newTuple(values);

Once we have the data loaded (for Posts and Users XML files) we can play around it and ask questions.
Besides some simple stuff like TOTALs, AVERAGEs, GROUPing Posts by Questions, Answers, Favorited Questions, Answered, Not Answered, Most Scored, Least Scored etc I wanted to find out – how long do you have to usually wait to get the first answer (doesn’t matter if it was the correct one or not) on each site.
For this to work I had to create my very first UDF (user defined function) in Pig.

public class FIRST_ANSWER_AWAIT_TIME extends EvalFunc<Tuple> {

private TupleFactory tupleFactory;

public FIRST_ANSWER_AWAIT_TIME(){
tupleFactory = TupleFactory.getInstance();
}

@Override
public Tuple exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try {
List values = new ArrayList();
String startDateStr = null;
String firstAnswerDateStr = null;

String questionId = (String)input.get(0);
DataBag questionCreationDate = (DataBag)input.get(1);
DataBag answersCreationDates = (DataBag)input.get(2);

Tuple qCreationDate = questionCreationDate.iterator().next();

if(qCreationDate != null){
startDateStr = (String)qCreationDate.get(0);
}

firstAnswerDateStr = getFirstAnswerDate(answersCreationDates);
long timeInSeconds = getTimeInSeconds(startDateStr,firstAnswerDateStr);
values.add(questionId);
values.add(timeInSeconds);
return tupleFactory.newTuple(values);
} catch (Exception e) {
throw new IOException("Caught exception processing input row ", e);
}
}

private String getFirstAnswerDate(DataBag answersCreationDates) throws ExecException {
List<String> dates = new ArrayList<String>();
if(answersCreationDates != null){
Iterator iterator = answersCreationDates.iterator();
while(iterator.hasNext()){
Tuple answerDate = (Tuple) iterator.next();
if(answerDate != null){
dates.add((String)answerDate.get(0));
}
}
Collections.sort(dates);
}
if(dates.size() > 0){
return dates.get(0);
}
return null;
}
private long getTimeInSeconds(final String startDateStr, final String endDateStr) throws ParseException {
if(startDateStr == null || endDateStr == null){
return 0L;
}
Date dateStart = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS", Locale.ENGLISH).parse(startDateStr);
Date dateEnd = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS", Locale.ENGLISH).parse(endDateStr);
long timeStart = dateStart.getTime();
long timeEnd = dateEnd.getTime();
long seconds = (timeEnd - timeStart)/1000;
return seconds;
}
}

This UDF takes tuple with 3 values – question id, creation date and first answer related to this question creation date. To connect answers with their questions I had to use Pig ‘COGROUP’ construct.

The whole apache pig script is available at github, as it takes almost 200 lines (including comments) I will not post it here.

For all the sites available from Stackexchange Data Dump I had to use some Amazon Web Services. I have created 80GB EBS volume and mounted it on a 64bit Ubuntu launched especially for this purpose of downloading and unpacking the data. Once the data was unpacked from the 7zip format I had transferred it to S3 bucket for later processing on Amazon EMR.
The pig script takes the data location as the argument as well as where it should store the results, with Amazon EMR and S3 this is dead simple as you just specify the path in a form of ‘s3://bucketname/your_data_path’.
I had some problems picking up the right instance types for the Core Hadoop nodes but after some test runs I had picked simple ‘large’ instances and modified the hadoop settings for memory intensive applications.

Thats all in terms of big data processing. But what one can do with a bunch of strangely named result files – they are boring, contains ids and nothing sexy in some simple text files.
As usual I decided to expand my pet project a bit and present the form nicely (at least nicely for me – I’m not a web designer and UX guru).
For this purpose I have collected a number of jquery plugins, jquery itself, rickshaw js library for charting solution and created a java program which will take all the data created for each one of 37 stackexchange sites available and throw out nicely looking html file where I can see the results.
For this I have used a velocity templating engine and the results you can see here.

Some screenshots

After the html files are generated this project becomes a static website where you can browse each SE site statistics and navigate to the original questions/answers or user pages on SE network.
I have uploaded generated files to Amazon S3. They are available to browse at: http://stackexchagecharts.s3-website-us-east-1.amazonaws.com/index.html

Comments, especially regarding Pig script optimizations or other ideas like what other questions could be asked against SE Data dump are welcome in the comments section.

Enjoy!

8 responses on “Big Data analysis with Hadoop, Pig and Stack Exchange data dump”

Alexander Kosenkov November 13, 2012 at 1:44 pm

This is amazing! Thank you for sharing such a great work of art =)

One of the very few things I’d change: I’d show questions and answers on the same bar chart. Currently it’s easy to compare different days of week or different times of day for each group.

But it’s hard to answer quite a natural question – how do questions and answers compare? Are times distributed equally? Is there a constant lag? What is the best time to ask the question?
admin November 13, 2012 at 2:32 pm

First of all thx a lot for the suggestions. You can compare the answers and questions (per month basis) on the ‘Overview’ tab for each page. By looking at ‘Answers’ tab you can find out on which days of the week and on which hours would be best to ask questions I believe ;). Otherwise valid points, if ver. 2 will ever come up I will take that into consideration 🙂
Sam S November 13, 2012 at 5:43 pm

Whoa, you’re not a UX expert? 🙂 The UI looks really amazing – did you do that yourself or did you use a software to render/generate it? And all in HTML5 too! 🙂
Ravi November 14, 2012 at 4:03 am

Excellent Analysis, Thanks for sharing. UI looks to good to visualize, I can compare it with the standards of Hue 🙂
admin November 14, 2012 at 10:29 am

I’ve created an index page and a single page template for all the 37 Stackexchange pages, then I wrote a small java utility to parse the results and generate the html files for me automatically, design I did myself but I mainly used freely available jquery plugins and other js libraries. For charts I have used mainly ‘Rickshaw’ lib (http://code.shutterstock.com/rickshaw/) and ‘jqBarGraph’ (http://workshop.rs/jqbargraph/), for sliding panels I have used ‘slidorion’ (http://www.slidorion.com) and of course modified a bit everything to suit my needs.
homepage July 19, 2013 at 8:14 pm

I’m amazed, I have to admit. Rarely do I encounter a blog that’s
equally educative and engaging, and let me tell you, you have hit the nail
on the head. The issue is something too few men and women are speaking intelligently about.
I’m very happy I stumbled across this during my hunt for something relating to this.
SEO March 28, 2014 at 10:56 am

I like thhe helpful info you provide in your articles. I’ll bookmark your blog and check again here regularly.
I am quite sure I will learn lts of new stuff right here!

Bestt off luck for tthe next!
Vijaya March 25, 2017 at 9:41 am

To read each line and converting them is a big overhead especially if you have a big dataset. How do we deal with that? I am currently working with SO data sets and this the problem I have.