2013 in review

With the year wrapping up, the usual report from WordPress is ready to be published.

So I’ll use the occasion to write a few words.

A lot of things have happened this year and I have learned quite a few tricks, but I didn’t have the time to blog about them.

I’ve spent most of my year working with data, in a new company where the team is very strong and the work is fun. Mostly doing machine learning on big data at high performance. Challenging but fun.

I didn’t have the time to blog as much as in the past, but a few posts have been written about IPython notebooks, Django and other topics.

I’ll see if I manage to write a bit more about data science and what we do next year, at least to some extent.

So let me wrap it up, wishing all my readers a happy new year and good luck!

To enjoy the 2013 annual report click here.

IPython notebook and some statistical distributions

Bernoulli distribution

I had wanted to play with the IPython notebook for quite a while, but I wanted to do it with something interesting and useful.

The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document

- from the docs

This means that you can document your process and/or exploration using Markdown, which is then beautifully rendered in HTML, and also have Python code executed, with the resulting graphs embedded in the document, where they stay.

I think it’s a very valuable tool, in particular when you are doing exploratory work, because the process of discovery can be documented and written down, and it is a great way to write interactive tutorials.

For example, in this notebook, I’ve plotted the Probability Density Function of several statistical distributions, to get an idea of how they are shaped and which one to pick as a base when creating a new Bayesian model.
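To give a flavour of what such a notebook cell computes, here is a minimal sketch using only the standard library. It is not the code from the notebook: the function names and the `tabulate` helper are assumptions made for illustration, and the Bernoulli case is strictly a probability mass function, since the distribution is discrete.

```python
import math


def bernoulli_pmf(k, p):
    """Probability mass of outcome k (0 or 1) for a Bernoulli(p) variable."""
    if k not in (0, 1):
        return 0.0
    return p if k == 1 else 1.0 - p


def exponential_pdf(x, rate):
    """Density of an Exponential(rate) variable at x (zero for x < 0)."""
    if x < 0:
        return 0.0
    return rate * math.exp(-rate * x)


def tabulate(func, xs, **params):
    """Evaluate func on each x; these (x, y) pairs are what a plot would draw."""
    return [(x, func(x, **params)) for x in xs]


if __name__ == "__main__":
    xs = [i * 0.5 for i in range(7)]  # 0.0, 0.5, ..., 3.0
    for x, y in tabulate(exponential_pdf, xs, rate=1.5):
        print("x=%.1f  pdf=%.4f" % (x, y))
```

In a notebook, the pairs returned by `tabulate` would be handed to a plotting call so that the curve is rendered right below the cell.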

You can see what it looks like on nbviewer.

exponential distribution

Google Cloud free trial coming to an end

google_cloud_pic

I received an e-mail yesterday that the Google Cloud free trial period is coming to an end.

This means that from the 1st of June onwards, every instance has to be paid for, starting with the smallest, D0.

Loquacius was running on Google App Engine, and it was a test to see how the new Cloud SQL behaved with a classic Django website. Given that this was just a test, I’ve decided to switch it off.

I’ve downloaded the fixtures of the blog (just three entries to test the blog) and switched the database off, disabling the billing and deleting the D0 instance.

The code is still available on GitHub, but unfortunately the blog engine will no longer run live on Google App Engine.

You can still run it on your own machine, or have a look at how it was in this blog post.

Ggplot2 graph style with matplotlib

ggplot2 is an amazing plotting library, available for R, for creating stunning graphs. ggplot2 takes a different approach from the classic libraries: instead of offering a classic line/points approach, it lets you combine these elements (example), which is a similar route to the one taken by D3.js. If you are using the scientific Python stack (matplotlib, numpy, scipy, ipython), you have the very good matplotlib to plot all your graphs.

For example, a bunch of sine and cosine curves generated by the following code:
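The original snippet is not reproduced on this page, so what follows is only a minimal sketch of the kind of code that draws such curves, not the author's script. The function names, the number of curves and the output file name are all assumptions. The matplotlib import is kept inside `plot_curves` so that the data generation can be used on its own.

```python
import math


def sine_cosine_curves(n_points=200, n_curves=4):
    """Return x values plus a list of (label, y values) for shifted sin/cos curves."""
    xs = [2 * math.pi * i / (n_points - 1) for i in range(n_points)]
    curves = []
    for k in range(n_curves):
        shift = k * math.pi / n_curves
        curves.append(("sin, shift %d" % k, [math.sin(x + shift) for x in xs]))
        curves.append(("cos, shift %d" % k, [math.cos(x + shift) for x in xs]))
    return xs, curves


def plot_curves(path="sin_cos.png"):
    """Draw the curves with matplotlib's current default style and save a PNG."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    xs, curves = sine_cosine_curves()
    for label, ys in curves:
        plt.plot(xs, ys, label=label)
    plt.legend()
    plt.savefig(path)
```

Because the style comes from whichever matplotlibrc is active, the same `plot_curves()` call produces either look without any change to the code.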

look like this:

classic_matplotlib

Instead, if we set up a ggplot2 style, the graph looks like this:

matplotlib_ggplot2_style

You may prefer one or the other. Anyway, if you like the last one, just download this matplotlibrc and save it as ~/.matplotlib/matplotlibrc, and all your graphs will have that style as default.

The matplotlibrc has been inspired by this post; I’ve just updated it with the latest matplotlibrc from matplotlib version 1.2.1.

Have fun!

Edit: Bonus plot, code in the gist.

exp_and_log

GeoFestival powering the E-luminatefestival website

So in two weeks (working like maniacs) we managed to create and set up a pretty cool system to provide a platform for festivals whose events are spread across multiple places at different times.

Right now the platform powers the first happening of the E-LuminateFestival. It is built using Django (no, not this Django, this Django) and uses Leaflet for the map integration, with geographical data from OpenStreetMap. It’s mobile friendly thanks to Bootstrap, with some help from us as well.

The events can be submitted only by participants who are approved by the administrator of the site. The code is available under GPL, but if you want us to set it up and tailor it for your event, just give us a shout.

What do I do when my Pull Request does not merge automatically in master?

GitHub makes it very easy to collaborate with people; however, sometimes it’s a bit complicated to understand how to use Pull Requests, and in particular how to make sure that the feature branch can be merged into master in a fast-forward way.

So let’s see how we can go from this (or the famous “can’t be merged automatically”)

Pull Request cannot be merged

 

to this: (Or yeah, this looks good)

PullRequestCanbeMerged

 

Why this happens

The problem is that some files have been changed both in master and in your branch, and they’re going in different directions. The content of the file in master is different from the one in your feature_branch, and git does not know which one to pick, or how to integrate them.

To solve this, you need to:
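The fast-forward condition mentioned earlier can be pictured with a toy commit graph. This is only an illustrative model, not how git is implemented: the `history` dictionary, the commit names and both helper functions are assumptions. A branch can be fast-forwarded onto another only when its tip is already part of the other branch's history.

```python
def ancestors(parents, commit):
    """Return the set containing commit and every commit reachable through its parents."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def can_fast_forward(parents, target_tip, source_tip):
    """True when target_tip is already part of source_tip's history."""
    return target_tip in ancestors(parents, source_tip)


# Hypothetical history: A is the common base, B was added on master,
# C on the feature branch, and M is the merge of master into the feature branch.
history = {"A": [], "B": ["A"], "C": ["A"], "M": ["C", "B"]}

print(can_fast_forward(history, "B", "C"))  # diverged: B is not in C's history
print(can_fast_forward(history, "B", "M"))  # after merging master into the feature branch
```

The first call prints `False` and the second `True`: once master has been merged into the feature branch, master's tip is in the branch's history and the Pull Request can go in cleanly.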

  1. Get the latest upstream/master
  2. Switch to your master
  3. Merge the latest upstream master into your master (never develop in master, always develop in a feature branch)
  4. Switch to your feature branch
  5. Merge master into your feature branch
  6. Solve all the conflicts: this is where you decide how to integrate the conflicting files, and this can be done only by you, because you know what you did, you can figure out what happened in master, and you can pick the best way to integrate them
  7. Commit all the changes, after all the conflicts are solved
  8. Push your feature branch to your origin: the Pull Request will update automatically

Talk is cheap, show me the commands (cit. adapted)

If you haven’t already added the upstream to your repo, have a read of this.

1. Fetching upstream
git fetch upstream
2. Go to your master. Never develop here.
git checkout master
3. Bringing your master up to speed with latest upstream master
git merge upstream/master
4. Go to the branch you are developing
git checkout my_feature_branch
5. It will not be fast forward
git merge master
6. Solve the conflicts. Get a decent three-way visual diff editor. I like Meld
git mergetool
7. Commit all the changes. Write an intelligible commit message
git commit -m "Decent commit message"
8. This will push the branch up on your repo.
git push origin my_feature_branch

Hope it helps.

 

Say hello to Loqu4cius

loqu4cius

Loqu4cius is a lightweight blog engine based on Django (not this Django) that runs on Google App Engine and uses Cloud SQL as its backend, which is, as Google puts it, MySQL in the cloud.

A bit of history

Google App Engine can run scalable apps. Until now it was possible to use Django on it, given that Python was one of the two supported languages; however, the back end was Bigtable, which is not compatible with the classic RDBMS used by Django.

This made it impossible to use lookups that span relationships and the like, so the only usable bits of Django were the templates and the URL router, but not the models…

Django-nonrel to the rescue.

A project called django-nonrel came to the rescue, creating a compatibility layer between the NoSQL backend and the classic Django ORM. Most of the lookups spanning relationships worked; however, some of the joins, like many-to-many, were not available.

Fast forward to our time

Fast forward to today: Google has made a classic RDBMS available, with the possibility of using all the ORM goodies, including third-party Django apps that speed up development through reuse.

So now Google-cloud to the rescue.

To check it out, I’ve come up with Loqu4cius.

It features a tag cloud that makes it 2.0, is based on Twitter Bootstrap, which I’ve styled with some colors and fonts (directly from Google Fonts), and has a search bar and the ability to enter rich text using CKEditor. The comments are integrated using Disqus, which is the way to go right now.

The code is on GitHub with a quick readme; for any questions, the comments are here :).

Some thoughts about the development

Google App Engine comes with some limitations, but with the possibility of adding third-party libraries it is possible to reuse a lot of the Django apps already available. (Let’s agree on terminology: app –> a single application that does one thing, for example it manages the tags, project –> a collection of all the apps and related files that runs the entire site.)

My strategy is to create a virtualenv and then copy all the necessary modules into the lib folder. This gives me the ability to install a package with

pip install package_name

and all the dependencies very easily. After that it’s a matter of using the apps and making them work nicely.
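The copy step could be automated along these lines. This is a hedged sketch rather than the script used for Loqu4cius: the function name `vendor_packages`, its arguments and the directory layout are assumptions, and it only handles plain package directories and single-file modules.

```python
import os
import shutil


def vendor_packages(site_packages, lib_dir, names):
    """Copy the named packages (directories or single-file modules) into lib_dir."""
    if not os.path.isdir(lib_dir):
        os.makedirs(lib_dir)
    copied = []
    for name in names:
        package_dir = os.path.join(site_packages, name)
        module_file = os.path.join(site_packages, name + ".py")
        if os.path.isdir(package_dir):
            target = os.path.join(lib_dir, name)
            if os.path.isdir(target):
                shutil.rmtree(target)  # replace a stale copy
            shutil.copytree(package_dir, target)
            copied.append(name)
        elif os.path.isfile(module_file):
            shutil.copy(module_file, lib_dir)
            copied.append(name + ".py")
        else:
            raise ValueError("%s not found in %s" % (name, site_packages))
    return copied
```

It would be called with the virtualenv's site-packages directory, the project's lib folder and the list of packages installed through pip.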

CSS writing

I like to use less to write CSS, but I don’t want client-side compilation of the less files, and I want to serve only CSS in production, so I use two helpers to get the job done.

First, I use a Python script that finds all the less files and compiles them into CSS by calling the lessc compiler.
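A script of that kind might look roughly like this. It is a sketch under assumptions, not the actual build_less.py from the repository: the function names and the mirrored less/css directory layout are made up for illustration, and it relies on `lessc` being on the PATH and accepting a source file followed by a destination file.

```python
import os
import subprocess


def find_less_files(root):
    """Yield the path of every .less file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.endswith(".less"):
                yield os.path.join(dirpath, filename)


def css_path_for(less_path, less_root, css_root):
    """Map static/less/a/b.less to static/css/a/b.css (mirrored layout)."""
    relative = os.path.relpath(less_path, less_root)
    return os.path.join(css_root, os.path.splitext(relative)[0] + ".css")


def build_all(less_root, css_root, compiler="lessc"):
    """Compile every .less file under less_root; return the CSS paths written."""
    written = []
    for less_path in find_less_files(less_root):
        css_path = css_path_for(less_path, less_root, css_root)
        css_dir = os.path.dirname(css_path)
        if not os.path.isdir(css_dir):
            os.makedirs(css_dir)
        # lessc takes the source file and the destination file as arguments
        subprocess.check_call([compiler, less_path, css_path])
        written.append(css_path)
    return written
```

Keeping the path mapping in its own function makes it easy to compile a single file as well, which is what a file watcher needs.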

However, I don’t want to call the script myself every time I write a new bit of a less file, so I use watchdog to call the script every time a less file gets saved.

It would be nice to have a tool that can launch both the development server and this script in one go, and it’s actually doable. It’s called honcho and it accepts a classic Procfile.

For example, for Loqu4cius this is the Procfile.dev:

web: ./serve.sh
less: watchmedo shell-command --patterns="*.less" --command='./scripts/build_less.py "${watch_src_path}" ' static/less/

Launching it with honcho -f Procfile.dev start starts the development server and recompiles and moves the files to the collectstatic folder as required, all in one go, so you can focus on just developing.

Last but not least, I’ve created a quick release script, called release_site.py, which:

  • increases the app.yaml version of the site
  • performs the syncdb in production
  • uploads the site using appcfg.py
  • commits the modified app.yaml to the repo
  • tags the repo with the version number

so you can always know which commit refers to which version on Google App Engine.

Figuring out how to set up the environment for streamlined development took me a bunch of days, and I’m eager to hear about other solutions to the same problems!
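The version bump at the start of that list can be sketched as follows. This is not the real release_site.py: it assumes the version in app.yaml is a plain integer, and the helper names, the sample file content and the `v<number>` tag scheme are hypothetical.

```python
import re

# Matches a line such as "version: 7" and captures the integer separately.
VERSION_LINE = re.compile(r"^(version:\s*)(\d+)\s*$", re.MULTILINE)


def bump_version(app_yaml_text):
    """Return (new_text, new_version) with the integer version raised by one."""
    match = VERSION_LINE.search(app_yaml_text)
    if match is None:
        raise ValueError("no integer 'version:' line found")
    new_version = int(match.group(2)) + 1
    new_text = (app_yaml_text[:match.start(2)]
                + str(new_version)
                + app_yaml_text[match.end(2):])
    return new_text, new_version


def release_tag(version):
    """Name of the git tag recorded for a deployed version (hypothetical scheme)."""
    return "v%d" % version


# Hypothetical app.yaml content, used only to show the behaviour.
sample = "application: loqu4cius\nversion: 7\nruntime: python27\n"
new_text, new_version = bump_version(sample)
print(new_version, release_tag(new_version))
```

Only the number itself is rewritten, so the rest of app.yaml, including comments and spacing, stays exactly as it was before the commit and tag steps.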