Author: mattions

Machine learning empowered blackkiwi


A blackkiwi which tells your mood.

Blackkiwi is a Django-powered website which uses machine learning to tell whether you wrote happy or sad things in your latest Facebook statuses. It tends to be wayyy positive, but we will get there…

The genesis of blackkiwi

There were a few main ingredients that combined into the genesis of blackkiwi.

The first was my curiosity about natural language processing and classification techniques. In particular I wanted to write some classifiers to see how well they performed, and I also wanted to try and test something new.

But I needed some kind of application. This is usually a good trick in programming in general: if you build towards something, it is always easier to stay motivated and actually get it done, instead of giving up the hobby and ending up playing World of Tanks on the PlayStation :).

The second ingredient was to test the release process via GitLab, using automatic pushes via CI to a server. As a stack I wanted to use a classic dokku setup, which I’m very happy with, because it basically brings a nice and easy Heroku/Gondor-style deployment to your own server.

Last but not least, I wanted to do something about my Facebook feed. Lately, owing maybe to all the political happenings like Brexit, Trump and the migration crisis, I have seen an increase of posts from people being extremely racist, hate-filled and extremely violent, together with total nonsense and antiscientific claims.

The classic way would be to try to have a conversation, and try to explain that these positions are unacceptable and also dangerous for the whole community, but this usually ends up in a fight with the trolls, and TBH, I don’t think it is a winnable fight.

However, I thought it could be a good idea to get something going where you could take the Facebook statuses and see where you were basically landing: were your statements close to, for example, a racist individual, or were you closer to intelligent and inspiring characters?

Of course this is quite complicated to build, but I decided that I had to start somewhere, so to start I settled on an application able to tell whether you were happy or sad on Facebook.

Blackkiwi: how does it work?

Conceptually, there are three main parts:

  1. the kiwi goes to Facebook to get the user’s statuses after being authorized (it is a good kiwi)
  2. the kiwi works very hard to try to understand if you were happy or not, and it writes it down
  3. the kiwi then draws these moods on a plot, to show your mood in a time-series fashion.

It’s a pretty clever and hardworking kiwi, our own. I’m not sure what its name should be; feel free to propose one in the comments, if you like.

The computation stack: the classifiers

Two problems needed to be solved here:

  1. we needed a way to connect to Facebook and get the moods out in some form, so we could feed them to the classifiers
  2. we had to build, train and then load the classifiers

The first part of the job was quite a new adventure. I had never used the Facebook Graph API or created an app on that platform before, so there was a little bit of learning. At the end of several experiments I settled on facebook-sdk, a nice piece of software which does most of the job.

For example, our collector class looks like this:

# -*- coding: utf-8 -*-
import logging
import argparse

import facebook
import requests

# create logger
logger = logging.getLogger(__name__)


class FBCollector(object):
    def __init__(self, access_token, user):
        self.graph = facebook.GraphAPI(access_token)
        self.profile = self.graph.get_object(user)
        logger.debug("Collector initialized")

    def collect_all_messages(self, required_length=50):
        """Collect the data from Facebook

        Returns a list of dictionaries.
        Each item is of the form:

        ```
        {'message': '<message text here>',
         'created_time': '2016-11-12T22:59:25+0000',
         'id': '10153812625140426_10153855125395426'}
        ```

        The `id` is a facebook `id` and it is always the same.

        :return: collected_data, a list of dictionaries with keys: `message`, `created_time` and `id`
        """
        logger.debug("Message collection start.")
        collected_data = []
        request = self.graph.get_connections(self.profile['id'], 'posts')

        while len(collected_data) < required_length:
            try:
                data = request['data']
                collected_data.extend(data)
                # going to the next page
                logger.debug("Collected so far: {0} messages. Going to next page...".format(len(collected_data)))
                request = requests.get(request['paging']['next']).json()
            except KeyError:
                logger.debug("No more pages. Collection finished.")
                # When there are no more pages (['paging']['next']), break from the
                # loop and end the script.
                break

        return collected_data


if __name__ == "__main__":
    logger.setLevel(logging.DEBUG)
    # create console handler and set level to debug
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    # create formatter
    formatter = logging.Formatter('%(asctime)s|%(name)s:%(lineno)d|%(levelname)s - %(message)s')
    # add formatter to ch
    ch.setFormatter(formatter)
    # add ch to logger
    logger.addHandler(ch)

    parser = argparse.ArgumentParser(description='Collect public Facebook messages for a given user.')
    parser.add_argument('access_token', help='You need a temporary access token. Get one from https://developers.facebook.com/tools/explorer/')
    parser.add_argument('--user', help="user with public messages you want to parse", default="BillGates")
    args = parser.parse_args()
    fb_collector = FBCollector(args.access_token, args.user)
    messages = fb_collector.collect_all_messages()
    logger.info("Collected corpus with {0} messages".format(len(messages)))

As you can see, you need a token to collect the messages. This token is obtained from the profile of the Facebook user, and it lets you collect his/her statuses. Note that you need permissions to do this for real, and your app needs to be approved by Facebook; however, you can get the messages of a public user, like Bill Gates in the example, and extract them into a nicely organized list of dictionaries.

So we have a way to connect to Facebook, and given we have the right token™, we can get the status updates out. We’ve got to classify them now…

May the 4th has passed

The classifiers bit is quite complex. First we need to find a corpus, then create the classifiers, then train them, and then save them, so that we can load them up and use them.

We built the classifiers using the nice NLTK library, together with scikit-learn. All the classifiers perform pretty similarly, and I decided to go for a voted classifier, which decides whether the text is positive or negative using majority consensus. Instead of using pickle to save them, we are using dill, ’cause it plays well with classes.
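The training code itself is not shown here, but a minimal sketch of the majority-vote idea, with a toy training set and illustrative filenames (the real blackkiwi code may differ), could look like this:

# -*- coding: utf-8 -*-
# A minimal sketch of the majority-vote idea described above, with a toy
# training set; names and filenames are illustrative, not the real blackkiwi code.
from statistics import mode

import dill
from nltk import NaiveBayesClassifier
from nltk.classify import ClassifierI


class VoteClassifier(ClassifierI):
    """Wrap several trained classifiers and let them vote on the label."""

    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        # majority consensus over the wrapped classifiers
        votes = [c.classify(features) for c in self._classifiers]
        return mode(votes)

    def confidence(self, features):
        # fraction of classifiers agreeing with the winning label
        votes = [c.classify(features) for c in self._classifiers]
        return votes.count(mode(votes)) / len(votes)


# toy training set of (features, label) pairs, just to make the sketch runnable
training_set = [
    ({'contains(good)': True}, 'pos'),
    ({'contains(bad)': True}, 'neg'),
]

naive_bayes_classifier = NaiveBayesClassifier.train(training_set)

# dill instead of pickle, so custom classes round-trip cleanly
with open("naive_bayes.pickle", "wb") as f:
    dill.dump(naive_bayes_classifier, f)

The scikit-learn classifiers can be trained the same way through NLTK’s SklearnClassifier wrapper, so they all expose the same classify() interface to the voter.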

Once they have been trained, we can load them up and use them. This is the loading function:

def load_classifier(self):
    naive_bayes_classifier = dill.load(open(self.naive_classifier_filename, "rb"))
    MNB_classifier = dill.load(open(self.multinomialNB_filename, "rb"))
    BernoulliNB_classifier = dill.load(open(self.bernoulli_filename, "rb"))
    LogisticRegression_classifier = dill.load(open(self.logistic_regression_filename, "rb"))
    SGDClassifier_classifier = dill.load(open(self.sgd_filename, "rb"))
    LinearSVC_classifier = dill.load(open(self.linear_svc_filename, "rb"))
    NuSVC_classifier = dill.load(open(self.nu_svc_filename, "rb"))

    voted_classifier = VoteClassifier(naive_bayes_classifier,
                              LinearSVC_classifier,
                              SGDClassifier_classifier,
                              MNB_classifier,
                              BernoulliNB_classifier,
                              LogisticRegression_classifier,
                              NuSVC_classifier)
    self.voted_classifier = voted_classifier
    self.word_features = dill.load(open(self.word_features_filename, "rb"))
    logger.info("Classifiers loaded and ready to use.")

and the analyzer API looks like this:

analyzer = Analyzer()
classified, confidence = analyzer.analyze_text("today is a good day! :)")
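The internals of analyze_text are not shown in the post; a plausible sketch, assuming the same feature-extraction shape used at training time and the word_features loaded above, would be:

from nltk.tokenize import word_tokenize

def analyze_text(self, text):
    """A sketch of analyze_text: returns a (label, confidence) tuple."""
    words = set(word_tokenize(text.lower()))
    # one boolean feature per known word, mirroring the training features
    features = {word: (word in words) for word in self.word_features}
    label = self.voted_classifier.classify(features)
    confidence = self.voted_classifier.confidence(features)
    return label, confidence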

The computation stack: the web

django meme

Yep. Django. Always. 🙂

These are the installed apps in the blackkiwi project:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    
    'django.contrib.sites',
    
    # our stuff
    'moody', # we are first so our templates get picked first instead of allauth
    'contact',
    
    'allauth',
    'allauth.account',
    'allauth.socialaccount',
    'allauth.socialaccount.providers.facebook',
    'bootstrapform',
    'pipeline',
    
]

All the integration with Facebook is happily handled by django-allauth, which works pretty well, and I suggest you take a look.

For example, in this case I wanted to override the templates already provided by django-allauth, so I put our app moody before allauth; that way our own templates get found and picked up by the template loaders before the ones allauth provides (for instance, a template at moody/templates/account/login.html would shadow allauth’s login template).

That way, once the user authorizes us, we can pick the right™ token, collect his/her messages, and then score them with the classifiers.
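With django-allauth, the Facebook access token ends up stored in its SocialToken model, so fetching it for the logged-in user is a small query. A sketch (the helper name is mine, not blackkiwi’s):

from allauth.socialaccount.models import SocialToken

def get_facebook_token(user):
    """Return the Facebook access token allauth stored for this user."""
    token = SocialToken.objects.get(account__user=user,
                                    account__provider='facebook')
    return token.token

That token is exactly what the FBCollector above expects as access_token.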

Then we plot them on the site using D3.js, like you can see here.

The deployment is done using GitLab, with a testing/staging/production system, using GitLab CI. But we’ll leave this for another post, ’cause this one is way too long anyway.

Have fun!

Bitcoin Surpasses $1,000 Mark and Stays Firm Despite Volatility

 


BTC over $1,000 ATM. Graph from CEX.io

This is the very first guest post for this blog, written by Mary Ann Callahan.

Mary is an expert on Bitcoin-related topics; she works as a journalist at CEX.io, a cryptocurrency exchange, writing articles related to blockchain security, bitcoin purchase guides and bitcoin regulations in different countries.

She sent me an email asking if my readers could be interested in bitcoin matters, and I said that was totally a possibility, given that I’ve written about bitcoin myself. After a few days, she came back to me with a very nice article, which I’m very happy to present to you. Happy reading.

Bitcoin’s impressive rally in the second half of 2016

The start of the year has been very exciting for bitcoin investors. On January 1st the cryptocurrency surpassed the $1,000 mark for the first time in three years and inched close to its 2013 all-time high of $1,183.59, according to CEX.io.

The aggressive rally that lasted from September of last year to the beginning of January saw the price of bitcoin increase from $600 to over $1,000. Bitcoin’s impressive rally in the second half of 2016 was due to an accumulation of factors, including a weakening Chinese yuan, increased demand in emerging markets, most notably India, and general bitcoin-positive sentiment in the media.

As China is the biggest market for bitcoin, with over 90% of bitcoin trading volumes occurring on China’s three biggest exchanges BTCC, OKCoin, and Huobi, bitcoin has developed a negative correlation with the value of the Yuan. When the Yuan weakens, bitcoin strengthens as Chinese investors move money into the digital currency and out of their local currency, and vice versa when the Yuan strengthens.

Another big driver of bitcoin’s rally has been the bitcoin boom in India. Ever since Prime Minister Narendra Modi prohibited the circulation of high-denomination rupee notes and announced that he wants India to become a cashless society, demand for bitcoin has soared. Indian bitcoin exchanges Unocoin, Zebpay, and Coinsecure have experienced a strong surge in user sign-ups and a jump in bitcoin trading volumes since India’s surprise currency reform in early November.

The bitcoin rally was further fuelled by bitcoin-positive news in the media, which has been largely focussed on the hype around its underlying technology, the blockchain. Commercial industries across the board have acknowledged that they can benefit from adopting the distributed ledger technology to reduce costs and inefficiencies as the blockchain provides an excellent system to securely store and transfer data of any kind. The blockchain-positive media coverage has also helped to improve bitcoin’s tarnished reputation and has shone a better light on the cryptocurrency.

Volatility came back thanks to Chinese regulators in early January

After an uninterrupted sharp four-month rally, bitcoin investors received a harsh reminder of how volatile the cryptocurrency actually is when the price of bitcoin dropped from its three-year high by around 25% within a week, after news emerged that the Chinese regulator wanted to take a closer look at China’s largest bitcoin exchanges and issued a public warning to Chinese citizens about the risks of investing in bitcoin. From January 6th to January 12th, the price of bitcoin dropped from $1,153.86 to $768.63, according to BitcoinAverage.

The strengthening of the Yuan and concerns about the Chinese regulator imposing strict bitcoin-unfriendly rules on the three largest bitcoin exchanges in the world caused investors to sell their coins and proceed with more caution. However, it didn’t take long for the price of bitcoin to recover and trade back above the $900 mark.

Some bitcoin experts drew a comparison between the rally of 2013 and the recent rally in bitcoin.

However, the big difference between then and now is that the bitcoin ecosystem has grown much stronger and bigger and there is much more faith in the future of the cryptocurrency. This has helped the price of bitcoin to stay firm despite the negative headlines out of China.

As regulators of the major bitcoin-relevant economies have so far taken a positive or neutral stance towards bitcoin, there is no reason to expect bitcoin’s value to stop appreciating anytime soon. Demand for the digital currency, both as an investment and as a transactional currency, is increasing around the world, and so is the infrastructure supporting it.

Unless China drastically changes its view on cryptocurrencies or imposes harsh restrictions on its exchanges, the price of bitcoin should continue to increase in 2017.

All prices were taken from BitcoinAverage.com.

Using pip to check outdated packages

It is always difficult to know if you have the latest stable packages installed in your requirements, and with the fast pace of releases, due either to new features or security fixes, it’s very easy to fall behind.

Luckily, recent pip versions have the --outdated option for pip list, which makes it very easy to snipe these packages.

It would be good to have something that finds the outdated packages, installs the new stable versions and then updates the requirements.

To that end, I’ve written this little gist:
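A minimal version of the idea, assuming pip >= 9 (which understands the --format=json output, though not necessarily what the gist itself does), might look like this:

# -*- coding: utf-8 -*-
"""A minimal take on the idea: list the outdated packages via pip
and print the old and new versions."""
import json
import subprocess

# ask pip for the outdated packages in machine-readable form
output = subprocess.check_output(
    ["pip", "list", "--outdated", "--format=json"])

for package in json.loads(output.decode()):
    print("{name}: {version} -> {latest_version}".format(**package))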

It would be very nice to have this run as one of the pre-checks before a pull request, which would give us the ability to know whether we have the latest packages in the code or not. On that note, I’m pretty excited by a tool called beefore, which I found listening to Talk Python To Me, a podcast which in turn I recommend.

2017 is prime, which is kinda cool

#xmas #holidays in full swing #bella #bologna

A photo posted by Myrto Kostadima (@myrtulina)


Picture taken from the top of the Asinelli tower in Bologna, on a short two-day visit when we went back to Italy for Xmas.

So I was really waiting for the classic Annual Review, which WordPress used to send out every year, to write a bit about 2016 (for example, this is the one about 2015), but this year nothing arrived in my mailbox.

Digging into the problem, it seems that WordPress (the company behind it, technically speaking) was actually building these reports by hand, which made it an extremely resource-intensive process. You can find more info in this thread. The gist is, it was too costly, so they decided not to do it this year, and maybe restart with a more automated system next year. I guess a big thank you is in order for the annual reviews that were available in the past, then.

But fear not! I decided to look at the stats myself, and while the virtual fireworks will not be available, we can still have a look at my 2016 posting activity. I published a whopping 5 posts in 2016. To be honest I thought I had published fewer, so it was a good surprise. They have a peculiar distribution: 2 in January, 1 in February, 1 in March and 1 in December. I guess several factors concurred to this pattern, most importantly that I got busy in the middle of the year, with work and with my wedding as well :D. This blog had 46,223 views and 37,954 visitors in 2016. Pretty happy with these numbers TBH, given that this is a super niche blog covering a very sparse array of topics that happen to interest me, so readers never know what they are gonna find here.

Interestingly, two posts written in 2016 got quite a bit of traction, in particular one about my Apollo laptop, and another one on how to run WebEx on Linux. In detail, we had 7,825 visits for the WebEx one, and 1,930 for the Apollo post. So far, my best post is What do I do when my Pull Request does not merge automatically in master?, with 17,025 visits, sporting 10,000 more visits than the best 2016 post.

The bar cart post didn’t do too badly, with its honest 23 visits :P. Given that it was published on the 5th of December, it had less time to build up, so that is kind of expected. I’m confident it will in time, maybe.

Well then, this was just a quick home-made annual review for 2016, let’s see what is gonna happen in 2017. Once again, Happy New Year!

Home-built bar cart

Bar cart: from the left

Bar cart from the right

When we moved into our new house, we decided to build a bar cart, because we had always wanted one but never had the space or the time to do it.

After looking online, we found this post about people hacking an IKEA laptop table into a pretty sweet-looking bar cart.

So we went through the same process and tried to replicate the design. It was pretty easy, because most of the components were exactly the same (basically all the IKEA stuff maps one to one), with the main difference being the source of the wood.

In the original post this was sourced from Home Depot; here in the UK we got it from Homebase. In case you are wondering, we got two 50 cm pine wooden slats, which we cut to measure and painted.

We are pretty happy with the result, and we finally have a nice place to store and keep all our drinks 🙂

Pyenv install using shared library

A photo posted by Michele Mattioni (@mattions)

Random nice picture, not related to the post. You’re welcome 🙂

I used to have only virtualenvs. Then I moved to using only conda. Then I was in the position of having to use either one or the other, and I have happily switched to pyenv as a way to manage both conda and virtualenv Python environments. You can always pick the interpreter version, Python 2.7 or 3.4.

I have just noticed that my IPython notebook couldn’t access the shared sqlite and readline libraries, which is bad, ’cause my history was not saved, and the readline support makes everything a little bit more enjoyable.

After 2 minutes of googling, I found the solution:

$ env PYTHON_CONFIGURE_OPTS="--enable-shared" pyenv install 2.7.10
$ pyenv global 2.7.10

and you are sorted.
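A quick sanity check, which is my own habit rather than part of the original fix, is to start the freshly installed interpreter and import the modules that depend on those libraries:

# if the interpreter was built against the shared libraries, both imports
# succeed and the notebook history comes back
import readline
import sqlite3

print(sqlite3.sqlite_version)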

I have found the solution on stackoverflow.

How to get WebEx running on an Ubuntu smoothly

It being 2016, there are a lot of ways to get a video link going between people.

While Skype, Viber, WhatsApp or anything else can open a video connection and can be used between friends, in the business world the options are a little bit more limited.

One of the options that is on par with the times is Google Hangouts, and if your company has Google Apps, you can set up nice meetings, directly attached to your calendar invitations. It’s awesome and I like it a lot. My choice.

However, in the biz space old habits die hard, so people stick to things like GoToMeeting, which is not too bad, or the worst thing ever supported on Linux, WebEx.

Running WebEx on Linux is a nightmare, to put it mildly. WebEx is a Java application, but they made sure that you can only run the 32-bit version, and you can launch the applet only from a 32-bit Firefox installation. As I said, they may have their own reasons, but honestly I don’t really get it, and I think it is super crazy.

After battling with it for at least 4 hours, I found a reproducible way to get it going.

Here are the steps:

1) Install 32-bit Firefox and some libraries for a nice appearance.

sudo apt-get install firefox:i386 libcanberra-gtk-module:i386 gtk2-engines-murrine:i386 libxtst6:i386

2) Download the JRE from Oracle:

This is the link where you can pick the JRE; get the tar package, not the RPM: http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html

3) Create a dedicated dir and untar it there

mkdir ~/32bit
cd ~/32bit
tar xvf ~/Downloads/jre-8u73-linux-i586.tar.gz

4) Create the plugins directory for Firefox

mkdir ~/.mozilla/plugins

5) Link the plugin

ln -vs ~/32bit/jre1.8.0_73/lib/i386/libnpjp2.so ~/.mozilla/plugins/

Now you are all set!

In my test I was able to share the screen, to use the audio from the computer, and everything was working ok.

Good luck, and honestly, if you can, avoid WebEx.

Apollo, a laptop with Linux pre-installed from Entroware

Apollo running latest Ubuntu

TL;DR: Get it.

I have been a Linux user for a long time, so long that I still know the difference between GNU/Linux and Linux.

My first distribution, right off the bat, was Gentoo, where I was compiling kernels and everything else just to have a working system, having very little idea of what I was doing.

This usually meant having to fight with drivers, search for solutions on forums (sometimes very well hidden ones) and be a quite technical person.

I have to say I learned a lot during this time, and Linux actually made me interested again in computers and informatics in general.


A long time ago, in a galaxy far, far away, recompiling the kernel was normal, even if you didn’t know what the kernel was!

Time passed and several revolutions happened. First, the Ubuntu distribution was founded, and I think it really helped bring Linux closer to the masses. Of course one of the good ideas was to use Debian as a base, but I think the time spent on bringing a coherent user interface to the general public was what Ubuntu was striving for. Bug number one was closed a long time ago, mostly due to the rise of portable computing and the part Android has played in it, although I still think Ubuntu played a big part.

Second, a big shift in the laptop environment also materialized. Dell was one of the first big retail names to provide a laptop with Linux, in particular the XPS 13, which was always a good computer. Dell offered Ubuntu pre-installed directly from the factory, which meant no Windows licence fee, and that was the choice I made: I got one of the old XPS models and had a good experience with it. The motherboard suffered quite a few hiccups, but all in all the laptop did its job valiantly.

The new 13-inch XPS looks pretty good, and project Sputnik is entirely devoted to making sure this laptop runs Ubuntu and other distributions properly. While the XPS 13 is a terrific laptop, two main problems didn’t let me pick it. First: the position of the video camera. I get the small-bezel idea, and packing a 13-inch display into what is usually the body of an 11-inch one is great for portability; however, the angle of the camera is desperately bad.

Basically, if you have a video call with somebody, they see the inside of your nose instead of your face. If video calls are a day-to-day experience for you, it cannot be an option.

The second problem is the screen. While the colours are amazingly brilliant and the resolution so high that you need a lens to see the icons, the major problem is that in most of the high-end configurations only the glossy option is available.

A glossy display reflects all the light, so even a little sun ray hitting the screen will turn it into a mirror, with the clear result of making it less usable. Basically you can’t see what is going on, and that is bad.

Xkcd, laptop hell

With laptops, either you don’t care, or you get extremely opinionated.

So that brings me to the Apollo by Entroware.

Apollo 13 inch, sleek and nice.

A very nice in-depth review has been done by the Crocoduck here, so I suggest you read it there. Here I’m gonna put down my general impressions.

Apollo laptop impressions

When you power up the laptop via the dedicated power button integrated in the keyboard, you are greeted by the Ubuntu installer. Partitioning and most of the stuff is already done for you; all you’ve got to do is pick the username and the timezone.

After that you are greeted by a standard Ubuntu installation. Everything works out of the box, in particular:

  • Wifi can just be used via Network Manager
  • USB ports work: I even bought a USB 3.0 to Ethernet dongle, and it was just plug and play
  • All the Fn keys (backlit keyboard, Bluetooth, screen brightness and so forth) do work
  • Suspend works out of the box, without any problems. I’ve noticed that the wifi sometimes does not come back properly, but it is easily fixed by restarting network-manager: systemctl restart network-manager.service

Specs: Skylake i7 CPU, 500 GB SSD, 8 GB RAM, with a 1920×1080 display (which I run at 1600×900 to actually have bigger text), weighing a bit less than 1.5 kg. Everything for £824, which is an honest price I think.

The keyboard is very comfortable and nice. The keys have a nice feeling and it’s not tiring to write on it. The touchpad is ok, the tap works great and the sensitivity looks good. Clicking is doable; however, it’s one of those integrated touchpads, so it will never be as good as a normal touchpad with physical buttons.

So if you are in the market for a sleek portable laptop running Linux, I totally suggest checking out the Entroware laptops.

 

Pills of FOG

 

We’re out of the FOG! The Festival of Genomics concluded just yesterday, and it was a blast.

I was at the SevenBridges booth, showing researchers how it is possible to actually create reproducible pipelines using open technologies, like the CWL and Docker.

The festival was very good; I totally suggest you read the posts by Nick on the company blog (day 1, day 2).

I’m very pleased to say that researchers and scientists are really looking for a solution to encode complex pipelines and make sure they can be replicated. Replicating scientific experiments is always a good idea, and having a way to describe them programmatically, so they can be given to a computer directly, is a big step forward.

If you ever wrote a pipeline, you know that things can get messy. In the best-case scenario you have a bash script wrapper that calls some executable with some parameters, and it may take some arguments.

If it is very bad, you may have some custom Perl (or Python) scripts that call some custom bash scripts that rely on some hard-coded paths, which then launch executables that can run only with certain versions of software on a certain type of cluster, with certain compile options.

And, unbelievable as it sounds, the second option is very common, and the amount of custom software and scripts involved is very high.

However, it does not matter how complicated your pipeline is, how obscure the programs you use are, or how many ad-hoc scripts you are using: you can wrap them all, express them and share them using the CWL, hinging on custom Docker images.

For example, take a look at this classic bwa+gatk pipeline for calling variants (you may have to create a user on the platform to see it; do not worry, it is free). Even with more than 20 steps, all the software, parameters and computation environments can be tracked and, most importantly, reproduced.

Any software can be ported to this language and expressed; the only requirement is that you can run it in a Linux environment, so you can dockerize it.

Have a go; we may get past these cowboy days, and start to reproduce results down to the byte!

2015 in review

2016

Happy New Year!

New Year’s Eve is upon us once more, and this is a good time to do the classic yearly review.

First of all, this is a success story. Last year I decided to write more, and I actually managed to do it. We went from only one post in 2014 to a grand total of 19 posts this year. Not bad at all.

A very quick rundown of the stats: my classic workhorse, the pull request rescue post, is still going strong and it is the point of entry from Google to this blog. This year a new entry has come along: speed up list intersection in python. It proved to be quite popular and it is standing its ground even though it is quite new. A lot of other posts this year have been relatively popular, like moving the blog to dokku and some dokku advice as well.

Lots of things have happened, both in my personal and work life. It’s a great time of change, and new adventures are going to start very soon. As usual this blog will remain mostly about scientific and work-related subjects, but I expect to write more about bioinformatics and docker in the future.

Last but not least, this is the generated annual report for the 2015.

2016 is looking very exciting, dense and very packed. I hope I can still write the odd post, but as usual we will see next year.

In the mean time, Happy New Year!

P.S.: Yep, Santa Claus and the snow will go away after the holidays, do not panic.