AIB Datahack

22 November 2018

Samir winning his award at AIB Datahack

Prelude

An overwhelming sense of anxiety woke me up at 3am. I was ill-prepared for the day ahead; it had been a couple of months since I had last dabbled in data science - a once cutting-edge instrument, now dulled and rusted. Regardless, I stuffed my mouth with ready-made pasta and my mind with the latest machine learning articles before bolting off to catch the 5:30 bus to Dublin, where my teammate was patiently waiting for me.

We arrived at AIB’s new headquarters an hour late, due to difficulties with the bus along the way. As we entered, warm welcomes from the staff sheltered us from the cold wintry winds; a few recognised us from last year’s triumphant victory. As we settled down, word had gotten around, and the pressure was mounting to meet people's high expectations and perform this year. Sure, we’ll do what we did last year: clean the data, apply XGBoost, tweak the hyper-parameters, and voilà, another win under our belt, I thought to myself as I tried to calm my nerves.

The Challenge

Language Detection and Sentiment Analysis. A dagger of fear struck me: I had only dealt with numeric data before, and had willingly avoided text and natural language processing because I did not want to deal with its messiness. How do I find the outliers? How do I account for spelling mistakes and bad grammar in the data? Emojis, once a delightful decoration, morphed into a non-language-conforming, multifaceted monster. This was not going to be easy.

There were two tasks in the competition. First, language detection: we were given a dataset of tweets tagged with their language, and had to train a model that would classify unseen tweets into the given languages: English, French, German, Italian, Portuguese, Spanish, and Japanese. Second, sentiment analysis: similarly, we were given a dataset of tweets labelled as positive (1) or negative (0), and had to train a model to determine whether each tweet in a new, unlabelled set was positive or negative. We decided to take a task each: my teammate would work on the language detection and I would work on the sentiment analysis.

Clean-Cut And Ready For Action

My time as an intern has taught me that cleaning is the most important start to any successful endeavour. We began by removing the user tags (@USER), as they did not provide any useful information for either task. Then we looked at the hashtags. We removed the octothorps, but we were unsure about deleting the hashtags altogether: they seemed to be almost entirely English, which would throw off our language detection, and some tweets were composed of nothing but hashtags; on the other hand, they would be quite useful for the sentiment analysis (#sad, #confused). We eventually decided to keep the hashtag text and let the model deal with it. Finally, we ran a word frequency aggregator to find the most used words and their number of occurrences. The usual suspects such as "a", "and", and "the" were there, but amidst the sea of data the question masters had also sprinkled ‘Website2018’ all over the dataset, at an alarmingly high frequency, to throw off our models' accuracy; this was swiftly dealt with. We now had clean data that glinted proudly in the sun. The only problem was that we did not know what to do with it.
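For the curious, here is a minimal sketch of that clean-up pass, assuming the tweets sit in a plain Python list; the example tweets and function name are illustrative, not our actual competition code:

    import re
    from collections import Counter

    def clean_tweet(text):
        # Drop user mentions entirely; they carried no signal for either task.
        text = re.sub(r"@\w+", "", text)
        # Strip the octothorp but keep the hashtag word (#sad -> sad).
        text = re.sub(r"#(\w+)", r"\1", text)
        # Remove the planted 'Website2018' token that skewed the frequencies.
        text = text.replace("Website2018", "")
        # Collapse any leftover runs of whitespace.
        return re.sub(r"\s+", " ", text).strip()

    tweets = ["@USER loving the new office #happy Website2018",
              "@USER quelle journée #fatigué"]
    cleaned = [clean_tweet(t) for t in tweets]

    # A rough word-frequency aggregator to spot suspiciously common tokens.
    word_counts = Counter(w.lower() for t in cleaned for w in t.split())
    print(word_counts.most_common(10))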

Did Not Appreciate The Sentiment

The naive approach to sentiment analysis. Initially, I separated the labelled tweets into two sets of words, one per sentiment, then created a third set from the intersection of the two. I removed the intersection from the two initial sets so that I now had three sets of words: words used only in tweets with positive sentiment, words used only in tweets with negative sentiment, and agnostic words that were neither positive nor negative. I did some further cleaning and removed the words with the highest frequencies: "a", "the", "and", "it", etc. I then filtered all the tweets through these sets; if a given tweet contained only words from the positive set it was positive, and vice versa. Now to deal with the agnostic set. XGBoost to the rescue! I separated all the words in the corpus of tweets and gave each its own column, with each row representing a tweet, a 0 or 1 in each column depending on whether the tweet contained that word, and a final column holding the sentiment. Then, I imported Scikit-Learn and began the training. The result of all this hard work... utterly useless predictions that did only slightly better than random guessing. Back to square one.
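Roughly, the word-set bookkeeping and the hand-off to XGBoost looked like the sketch below; the example tweets, labels, and stop-word list are placeholders rather than the real competition data:

    import pandas as pd
    from xgboost import XGBClassifier

    positive_tweets = ["loving this weather", "great day out"]
    negative_tweets = ["awful traffic today", "terrible day at work"]
    stop_words = {"a", "the", "and", "it", "this", "at", "out"}

    def words_of(tweets):
        return {w.lower() for t in tweets for w in t.split()} - stop_words

    positive_words = words_of(positive_tweets)
    negative_words = words_of(negative_tweets)

    # Words that appear in both classes carry no clear signal on their own.
    agnostic_words = positive_words & negative_words
    positive_only = positive_words - agnostic_words
    negative_only = negative_words - agnostic_words

    def naive_label(tweet):
        words = {w.lower() for w in tweet.split()} - stop_words
        if words and words <= positive_only:
            return 1      # only positive-only words: call it positive
        if words and words <= negative_only:
            return 0      # only negative-only words: call it negative
        return None       # ambiguous: hand it over to XGBoost

    # Bag-of-words encoding: one column per word, one row per tweet.
    vocab = sorted(positive_words | negative_words)
    all_tweets = positive_tweets + negative_tweets
    X = pd.DataFrame([{w: int(w in t.lower().split()) for w in vocab}
                      for t in all_tweets])
    y = [1] * len(positive_tweets) + [0] * len(negative_tweets)

    model = XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(X, y)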

Like A Babel Fish Out Of Water

The naive approach to language detection. My teammate decided to separate out all the tweets that used any of the Japanese character set and classify them straight away - easy. The tweets that used the Latin character set were then handled with a frequentist approach: first creating a corpus for each language, then keeping track of the frequency of each word, and building a model from this. All that was left to do was to feed in the tweets; the model would compute a probability score for each of the languages it was tested against and simply choose the highest. This worked fairly well, achieving about 70% accuracy, which was still a far cry from the 90%s we were seeing on the interim evaluation leaderboard.
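Here is a toy version of that frequentist scorer, assuming the training tweets are already grouped by language; the Unicode ranges used to spot Japanese and the small floor probability for unseen words are my own simplifications:

    from collections import Counter

    def is_japanese(text):
        # Hiragana, katakana, and the common CJK ideograph range.
        return any('\u3040' <= ch <= '\u30ff' or '\u4e00' <= ch <= '\u9fff'
                   for ch in text)

    def build_model(tweets_by_language):
        # e.g. {"en": ["the cat sat", ...], "fr": ["le chat", ...], ...}
        model = {}
        for lang, tweets in tweets_by_language.items():
            counts = Counter(w.lower() for t in tweets for w in t.split())
            total = sum(counts.values())
            model[lang] = {w: c / total for w, c in counts.items()}
        return model

    def detect(tweet, model, floor=1e-6):
        if is_japanese(tweet):
            return "ja"
        scores = {}
        for lang, freqs in model.items():
            score = 1.0
            for w in tweet.lower().split():
                # Unseen words get a tiny floor probability instead of zero.
                score *= freqs.get(w, floor)
            scores[lang] = score
        return max(scores, key=scores.get)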

Hunger Is Of The Mind

Lunchtime. A sense of hopelessness loomed as we munched on our burritos, kindly provided by AIB, our minds filled with static as the electricity of excitement echoed through the competition floor from the leading teams. We struggled even to put our solutions up for judgement, thanks to a mistake with the formatting and not having write access to the Google Drive folder, but after consulting with the helpers we made our much-anticipated debut on the leaderboard. Last place... 5 hours left. Following our disastrous display, several of the staff approached us and asked about our methodology, trying to instil hope and showering us with encouragement. It worked: a flood of urgency enveloped us, we removed all traces of our previous attempts, and we feverishly scoured the internet for inspiration.

Naivety Made Me Smarter

The Naive Bayes approach to language detection and sentiment analysis. Following our research, we decided to use a Naive Bayes classifier for both tasks. It did essentially the same thing as our previous solutions, keeping track of word frequencies per category to predict whether a piece of text had positive or negative sentiment, or which language it belonged to - however, it was much more mature and smarter about its decisions.

We also found a library to do the stemming/tokenisation of the texts for us. It was the missing piece we needed, and pivotal to our turn of fortune, as up to this point different variations of the same word were being treated as different words; in keeping with the sentimental theme, a good example would be: love, loving, lovingly, loved, lover, and lovely. Stemming allowed us to group words with the same roots together, which made our classifier much more robust against typos and bad grammar. We lovingly fed the data we had cleaned earlier into our new classifier, hoping the detox had worked. The fans on my laptop whirred up and hummed with activity, training on the data... no errors yet. After five gruelling minutes of anticipation, we breathed a sigh of relief as the program produced an output of predictions, which we then handed to the judges to check.
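I no longer remember exactly which stemming library we pulled in, but the overall wiring was along these lines; this sketch uses NLTK's SnowballStemmer with a scikit-learn Naive Bayes pipeline as a stand-in, and the training tweets are invented examples:

    from nltk.stem.snowball import SnowballStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    stemmer = SnowballStemmer("english")

    def stem_tokens(text):
        # Collapse loving/loved/lovely onto the shared stem before counting.
        return [stemmer.stem(tok) for tok in text.lower().split()]

    classifier = make_pipeline(
        CountVectorizer(tokenizer=stem_tokens, token_pattern=None),
        MultinomialNB(),
    )

    train_tweets = ["loving this so much", "loved every minute",
                    "hated the long wait", "terrible and awful service"]
    train_labels = [1, 1, 0, 0]

    classifier.fit(train_tweets, train_labels)
    print(classifier.predict(["what a lovely day", "i hated it"]))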

The leaderboard had been updated! We bubbled up to 7th place, but dread yet again blanketed us: we did not know how else to improve our algorithm. With only an hour left in the competition, the judges had cruelly decided to freeze the leaderboard to increase the tension, so we would no longer get feedback on our changes.

Datum Ex Machina

We had run out of ideas for algorithmic improvements that would not result in overfitting. After some heavy thought and a bit of soul searching, I suggested increasing the training dataset: bigger data, better results, as the old adage goes. Quickly scouring the web, I found dictionaries for each of the languages, along with surrounding corpora, and fed those into the training data; to our surprise, the language detection leapt to above 90% accuracy. I asked the judges if this was allowable, and they approved.
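The extra material had to be shaped into tweet-sized chunks so it could sit alongside the real training tweets; something like the following, with entirely hypothetical file names standing in for the downloaded word lists:

    # Hypothetical file names: one plain-text word list per Latin-script language.
    extra_sources = {"en": "english_words.txt", "fr": "french_words.txt",
                     "de": "german_words.txt", "it": "italian_words.txt",
                     "pt": "portuguese_words.txt", "es": "spanish_words.txt"}

    extra_texts, extra_langs = [], []
    for lang, path in extra_sources.items():
        with open(path, encoding="utf-8") as f:
            words = [line.strip() for line in f if line.strip()]
        # Chunk each word list into tweet-sized pseudo-documents.
        for i in range(0, len(words), 20):
            extra_texts.append(" ".join(words[i:i + 20]))
            extra_langs.append(lang)

    # extra_texts and extra_langs are then appended to the competition
    # tweets and labels before retraining the language classifier.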

Upon further scouring, like a rat in the sewers of the internet, I found a dataset on Kaggle with 1.6 million tweets, their sentiments labelled positive, neutral, or negative; this was akin to Indiana Jones discovering the golden idol. Alas, it also proved to be a double-edged sword: with such a large dataset to ingest, my laptop froze with fright (and from a lack of memory) as I tried to open the CSV file.

Pythonista On A Data Diet

Determined to win, I quickly wrote up a Python script that would ingest the data using the Pandas library, ignore the tweets with the neutral tag, format the dataset, and take a random sample of 10,000 tweets - this was much more manageable. I fed that into the model, and as expected the accuracy improved. 100,000 tweets: my laptop whirred angrily but dutifully produced a model, with yet more improvement in accuracy. 500,000 tweets: my laptop was not happy, a chorus of protests emitting from the fans; still, it produced a model with even greater accuracy. The improvements were starting to plateau, but improvements nonetheless. At this point my laptop and I were having a heated argument, so to cool things down I went over to the drinks chiller and brought back 4 cans of soft drinks, which I made into a bed for the laptop to lie on. 900,000 tweets: my laptop obligingly produced a model, which resulted in the best predictions yet at 76% accuracy, a 7-point improvement on what we had submitted to the judges earlier. With 15 minutes left in the competition, the training was taking its toll; it took over 10 minutes to crunch through 900,000 tweets. I was worried that adding more data would mean not finishing in time; regardless, I continued to torment my laptop. 1.6 million tweets: I could do nothing now except watch the countdown clock, each second sending a pulse of adrenaline through my veins. Finally, a minute before submission, my laptop diligently produced a model. But wait, something was not right: adding more data had decreased our accuracy to 75%. Decision time: should I use the model with the higher accuracy or the one that used more data? In this moment of indecision a familiar mantra began to play in my head: bigger data, better results; bigger data, better results; bigger data, better results. That’s why I chose the bigger data.
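The script itself was only a few lines of Pandas. This sketch assumes hypothetical column names for the Kaggle file, and the sample size is the dial we kept turning up:

    import pandas as pd

    # Column names here are assumptions; the real Kaggle CSV has its own layout.
    df = pd.read_csv("kaggle_tweets.csv", usecols=["sentiment", "text"])

    # Drop the neutral tweets and map the rest to the competition's 1/0 labels.
    df = df[df["sentiment"] != "neutral"]
    df["label"] = (df["sentiment"] == "positive").astype(int)

    # The dial we kept turning: 10,000 -> 100,000 -> 500,000 -> 900,000 -> all.
    sample = df.sample(n=10_000, random_state=42)
    extra_tweets = sample["text"].tolist()
    extra_labels = sample["label"].tolist()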

We Need To Use Our Brains

Whilst I was busy torturing my laptop, my teammate had the bright idea of manually curating the language detection predictions that had less than 30% confidence, which would have improved our accuracy by a large margin; however, we ran out of time before we could implement it.
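We never got as far as writing it, but with a scikit-learn-style classifier like the one sketched earlier it might have looked something like this, with classifier and unlabelled_tweets standing in for the fitted language model and the test set:

    import numpy as np

    # classifier is the fitted language pipeline; unlabelled_tweets the test set.
    probabilities = classifier.predict_proba(unlabelled_tweets)
    confidence = probabilities.max(axis=1)
    predictions = classifier.classes_[probabilities.argmax(axis=1)]

    # Flag anything the model is less than 30% sure about for a human pass.
    for i in np.where(confidence < 0.30)[0]:
        print(f"tweet {i}: predicted {predictions[i]} "
              f"at {confidence[i]:.0%} confidence; review by hand")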

Bottoms Up

Still unsure whether our valiant efforts had improved our standing, we waited as the judges ran through everyone’s submissions. This took quite some time, so to ease our pain we removed ourselves from reality by playing Nidhogg. Regardless of our position on the leaderboard, we were happy with what we had learnt that day; that alone was worth it for us. "Import Answers, from Queen’s University Belfast," I heard one of the judges announce: we had placed second. Overwhelmed with happiness at this unexpected result, we hurried up to the front to collect our big cheques, and a round of applause released all the anxiety I had been enduring. We shook hands with the winners and took photos, and after a brief talk with each of the judges, it was over. We later learnt that we had been separated from the winning team only at the third significant figure, and that we had the best sentiment analysis scores. I was still in shock at having started from the bottom and ended up here. We cheerfully made our way to the city centre on the Luas (Dublin's tram system) and got a coach back to Belfast. To celebrate, I collapsed onto my bed and drifted into sleep. I had been awake for 20 hours.

Article By

Samir Thapa

Intern Engineer