
A blackkiwi which tells your mood.

Blackkiwi is a Django-powered website which uses machine learning to tell if you wrote happy or sad things in your latest Facebook statuses. And it tends to be wayyy positive. But we will get there…

The genesis of blackkiwi

Two main ingredients came together in the genesis of blackkiwi.

The first was curiosity about natural language processing and text classification techniques. In particular, I wanted to write some classifiers to see how well they performed, and I also wanted to try and test something new.

But I needed some kind of application. This is usually a good trick in programming in general: if you build towards something, it is always easier to stay motivated and actually get it done, instead of giving up the hobby and ending up playing World of Tanks on the PlayStation :).

The second ingredient was to test the release process via GitLab, using automatic pushes via CI to a server. As a stack I wanted to use a classic Dokku setup, which I’m very happy with, because it basically brings the nice and easy Heroku/Gondor style of deployment to your own server.

Last but not least, I wanted to test natural language processing because I wanted to do something about my Facebook feed. Lately, maybe due to all the political happenings like Brexit, Trump and the migration crisis, I have seen an increase in posts from people being extremely racist, hate-filled and extremely violent. This together with total nonsense and anti-scientific claims.

The classic way would be to try to have a conversation and explain that these positions are unacceptable and also dangerous for the whole community, but this usually ends up in a fight with the trolls and, TBH, I don’t think it is a winnable fight.

However, I thought it could be a good idea to get something going where you can take your Facebook statuses and see where you are basically landing. Are your statements close to those of, for example, a racist individual, or are you closer to intelligent and inspiring characters?

Of course this is quite complicated to build, but I decided that I had to start somewhere, so I settled on an application able to tell if you were happy or sad on Facebook to start.

Blackkiwi: how does it work?

Conceptually there are three main parts:

  1. the kiwi goes to Facebook to get the user’s mood after being authorized (it is a good kiwi)
  2. the kiwi works very hard to try to understand if you were happy or not, and it writes it down
  3. the kiwi then draws these moods on a plot, to show your mood as a time series.

It’s a pretty clever and hardworking kiwi, our own. I’m not sure what its name should be. Feel free to propose one in the comments, if you like.

The computation stack: the classifiers

Two problems needed to be solved here:

  1. we needed a way to connect to Facebook and get the moods out in some form, so we could feed them to the classifiers
  2. we had to build, train and then load the classifiers

The first part of the job was quite a new adventure. I had never used the Facebook Graph API or created an app on that platform before, so there was a little bit of learning. After several experiments I settled on facebook-sdk, a nice piece of software which does most of the job.

For example, our collector class looks like this:

# -*- coding: utf-8 -*-
import logging
import argparse

import facebook
import requests

# create logger
logger = logging.getLogger(__name__)


class FBCollector(object):
    def __init__(self, access_token, user):
        self.graph = facebook.GraphAPI(access_token)
        self.profile = self.graph.get_object(user)
        logger.debug("Collector initialized")

    def collect_all_messages(self, required_length=50):
        """Collect the data from Facebook

        Returns a list of dictionary.
        Each item is of the form:

            {'message': '<message text here>',
             'created_time': '2016-11-12T22:59:25+0000',
             'id': '10153812625140426_10153855125395426'}

        The `id` is a facebook `id` and it is always the same.

        :return: collected_data, a list of dictionary with keys: `message`, `created_time` and `id`
        """
        logger.debug("Message collection start.")
        collected_data = []
        request = self.graph.get_connections(self.profile['id'], 'posts')
        while len(collected_data) < required_length:
            try:
                data = request['data']
                # keep only the posts that actually carry a text message
                collected_data.extend(post for post in data if 'message' in post)
                # going next page
                logger.debug("Collected so far: {0} messages. Going to next page...".format(len(collected_data)))
                request = requests.get(request['paging']['next']).json()
            except KeyError:
                logger.debug("No more pages. Collection finished.")
                # When there are no more pages (['paging']['next']), break from the
                # loop and end the script.
                break

        return collected_data


if __name__ == "__main__":
    logger.setLevel(logging.DEBUG)
    # create console handler and set level to debug
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    # create formatter
    formatter = logging.Formatter('%(asctime)s|%(name)s:%(lineno)d|%(levelname)s - %(message)s')
    # add formatter to ch
    ch.setFormatter(formatter)
    # add ch to logger
    logger.addHandler(ch)

    parser = argparse.ArgumentParser(description='Collect the public posts of a Facebook user.')
    parser.add_argument('access_token', help='You need a temporary access token. Get one from')
    parser.add_argument('--user', help="user with public message you want to parse", default="BillGates")
    args = parser.parse_args()
    fb_collector = FBCollector(args.access_token, args.user)
    messages = fb_collector.collect_all_messages()
    logger.info("Collected corpus with {0} messages".format(len(messages)))

As you can see, you need a token to collect the messages. This token is obtained from the profile of the Facebook user, who lets you collect his/her statuses. Note that you need permissions to do this for real, and your app needs to be approved by Facebook. However, you can get the messages of a public user, like Bill Gates in the example, and get them out in a nicely organized list of dictionaries.

So we have a way to connect to Facebook and, given we have the right token ™, we can get the status updates out. We’ve got to classify them now…
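
The pagination logic in the collector can be isolated from the network calls, which makes it easier to test. The sketch below is only an illustration: `collect_messages` and `fetch_page` are hypothetical names, not part of facebook-sdk, and the fake pages only mimic the `data` / `paging.next` layout used by the collector.

```python
def collect_messages(first_page, fetch_page, required_length=50):
    """Follow 'paging.next' links until enough messages are collected.

    `fetch_page` is a hypothetical callable that takes a URL and returns the
    decoded JSON page (for instance, lambda url: requests.get(url).json()).
    """
    collected = []
    page = first_page
    while page is not None and len(collected) < required_length:
        # Posts without text (photos, shares, ...) have no 'message' key.
        collected.extend(p for p in page.get('data', []) if 'message' in p)
        next_url = page.get('paging', {}).get('next')
        page = fetch_page(next_url) if next_url else None
    return collected[:required_length]


# Two fake pages standing in for Graph API responses.
FAKE_PAGES = {
    'page2': {'data': [{'id': '3', 'message': 'third'}]},
}
FIRST_PAGE = {
    'data': [{'id': '1', 'message': 'first'}, {'id': '2'}],
    'paging': {'next': 'page2'},
}

messages = collect_messages(FIRST_PAGE, FAKE_PAGES.get)
print([m['message'] for m in messages])
```

Passing the page fetcher in as an argument means the same loop can run against real HTTP calls or against canned dictionaries like the ones here.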

May the 4th has passed

The classifiers bit is quite complex. First we need to find a corpus, then we need to create the classifiers and train them. Then we save them, so we can load them up later and use them.

We built the classifiers using the nice NLTK library, together with scikit-learn. All the classifiers perform pretty similarly, so I decided to go for a voted classifier, which decides if the text is positive or negative using the majority consensus. Instead of using pickle to save them, we are using dill, ’cause it plays well with classes.
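
To make the voting idea concrete, here is a minimal sketch of a majority-vote wrapper. It is not the blackkiwi `VoteClassifier`: the class name `MajorityVote` and the stub classifiers are made up for the example, and the only assumption about the wrapped classifiers is that each exposes a `classify(features)` method, as NLTK classifiers do.

```python
from collections import Counter


class MajorityVote(object):
    """Combine several classifiers by majority vote (illustrative only)."""

    def __init__(self, *classifiers):
        self.classifiers = classifiers

    def classify_with_confidence(self, features):
        votes = [c.classify(features) for c in self.classifiers]
        label, count = Counter(votes).most_common(1)[0]
        # Confidence is the share of classifiers agreeing with the winner.
        return label, count / float(len(votes))


class StubClassifier(object):
    """Stand-in for a trained classifier: always answers the same label."""

    def __init__(self, label):
        self.label = label

    def classify(self, features):
        return self.label


voter = MajorityVote(StubClassifier('pos'), StubClassifier('pos'), StubClassifier('neg'))
label, confidence = voter.classify_with_confidence({'good': True})
print(label, round(confidence, 2))
```

With two possible labels, using an odd number of classifiers guarantees that the vote never ends in a tie.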

Once they have been trained, we can load them up and use them. This is the loading function:

def load_classifier(self):
    naive_bayes_classifier = dill.load(open(self.naive_classifier_filename, "rb"))
    MNB_classifier = dill.load(open(self.multinomialNB_filename, "rb"))
    BernoulliNB_classifier = dill.load(open(self.bernoulli_filename, "rb"))
    LogisticRegression_classifier = dill.load(open(self.logistic_regression_filename, "rb"))
    SGDClassifier_classifier = dill.load(open(self.sgd_filename, "rb"))
    LinearSVC_classifier = dill.load(open(self.linear_svc_filename, "rb"))
    NuSVC_classifier = dill.load(open(self.nu_svc_filename, "rb"))

    # all the loaded classifiers take part in the vote
    voted_classifier = VoteClassifier(naive_bayes_classifier,
                                      MNB_classifier,
                                      BernoulliNB_classifier,
                                      LogisticRegression_classifier,
                                      SGDClassifier_classifier,
                                      LinearSVC_classifier,
                                      NuSVC_classifier)
    self.voted_classifier = voted_classifier
    self.word_features = dill.load(open(self.word_features_filename, "rb"))
    logger.info("Classifiers loaded and ready to use.")

and the analyzer API looks like this:

analyzer = Analyzer()
classified, confidence = analyzer.analyze_text("today is a good day! :)")
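
Behind `analyze_text`, the text has to be turned into the same kind of features the classifiers were trained on. A common NLTK-style approach, assumed here and not taken from the blackkiwi code, is a bag-of-words dictionary built from `word_features`. The `find_features` helper and the tiny word list are hypothetical, and the tokenizer is a plain regular expression to keep the sketch free of dependencies.

```python
import re


def find_features(text, word_features):
    """Map a text to {word: True/False} for every known feature word."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {word: (word in tokens) for word in word_features}


# Hypothetical feature words; the real list comes from the training corpus.
word_features = ['good', 'bad', 'day', 'sad']

features = find_features("today is a good day! :)", word_features)
print(features)
```

A dictionary like this is what each classifier would receive before the vote takes place.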

The computation stack: the web


Yep. Django. Always. 🙂

These are the installed apps in the blackkiwi project:

    # our stuff
    'moody', # we are first so our templates get picked first instead of allauth

All the integration with Facebook is happily handled by django-allauth, which works pretty well, and I suggest you take a look at it.

For example, in this case I wanted to override the templates already provided by django-allauth, so I have put our app moody before allauth. That way our own templates get found and picked up by the template loaders before the ones allauth provides.

That way, once the user authorizes us, we can pick the right ™ token, collect his/her messages, and then score them with the classifiers.

Then we plot them on the site using D3.js, as you can see here.

The deploy is done using GitLab, with a testing/staging/production system, using GitLab CI. But we leave this for another post, ’cause this one is way too long anyway.
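
As an illustration of that ordering, a settings fragment could look like the sketch below. The django-allauth app labels are the ones from its documentation, but the exact list used in blackkiwi is an assumption; the only point is that `moody` comes before the allauth entries.

```python
# settings.py (fragment, illustrative; not the actual blackkiwi settings)
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'django.contrib.sites',  # used by allauth

    # our stuff
    'moody',  # listed before allauth so its templates are found first

    'allauth',
    'allauth.account',
    'allauth.socialaccount',
    'allauth.socialaccount.providers.facebook',
]

# The app_directories template loader walks INSTALLED_APPS in order,
# so the first app providing a given template path is the one rendered.
```

With this layout, a file such as `moody/templates/account/login.html` would shadow the login template shipped with allauth.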
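
Before D3.js can draw anything, the scored messages have to become a time series. The following sketch shows one possible shape for that data. The `to_timeseries` function, the `pos`/`neg` labels and the signed score are assumptions for the example, while the `created_time` format is the one shown in the collector docstring.

```python
import json
from datetime import datetime


def to_timeseries(scored_messages):
    """Turn (created_time, label, confidence) tuples into plottable points.

    Positive moods map to +confidence, negative ones to -confidence.
    """
    points = []
    for created_time, label, confidence in scored_messages:
        when = datetime.strptime(created_time, '%Y-%m-%dT%H:%M:%S%z')
        sign = 1 if label == 'pos' else -1
        points.append({'date': when.date().isoformat(), 'mood': sign * confidence})
    # Oldest first, which is what a line chart expects.
    return sorted(points, key=lambda point: point['date'])


scored = [
    ('2016-11-12T22:59:25+0000', 'pos', 1.0),
    ('2016-11-10T08:15:00+0000', 'neg', 0.5),
]
series = to_timeseries(scored)
print(json.dumps(series))
```

A JSON list like this can be served by a Django view and loaded on the page by the plotting script.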

Have fun!