— SevenBridgesGenomics (@SBGenomics) January 20, 2016
We’re out of the FOG! The Festival of Genomics concluded yesterday, and it was a blast.
I’m very pleased to say that researchers and scientists are really looking for a way to encode complex pipelines and make sure they can be replicated. Replicating scientific experiments is always a good idea, and having a way to describe them programmatically, so they can be handed directly to a computer, is a big step forward.
If you have ever written a pipeline, you know that things can get messy. In the best-case scenario you have a bash wrapper script that calls some executable with some parameters, and it may take some arguments.
In the worst case, you may have some custom Perl (or Python) scripts that call custom bash scripts that rely on hard-coded paths, which in turn launch executables that run only with a certain version of some software, on a certain type of cluster, with certain compile options.
And, unbelievable as it sounds, the second scenario is very common, and the number of custom programs and scripts involved is very high.
However, no matter how complicated your pipeline is, how obscure the programs you use are, or how many ad hoc scripts you rely on, you can wrap all of them, express them, and share them using the CWL, building on custom Docker images.
For example, take a look at this classic bwa+gatk pipeline for calling variants (you may have to create a user account on the platform to see it; do not worry, it is free). Even with more than 20 steps, all the software, parameters, and computational environment can be tracked and, most importantly, reproduced.
Any software can be ported to this language; the only requirement is that it runs in a Linux environment, so that you can dockerize it.
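To make the idea of wrapping a tool a bit more concrete, here is a minimal sketch in Python. It builds a CWL CommandLineTool description as a plain dictionary and prints it as JSON, which CWL accepts alongside YAML. Everything specific in it is an assumption for illustration: the helper name `make_tool`, the image name `example/bwa:latest`, the input and output names, and the `cwlVersion` value. It also leaves out details a real bwa wrapper would need, such as the index files that sit next to the reference, so treat it as a starting point to check against the CWL specification rather than a validated tool.

```python
import json


def make_tool(base_command, docker_image, inputs, stdout_name):
    """Build a CWL CommandLineTool description as a plain dict.

    `inputs` is an ordered list of (name, cwl_type) pairs; each one becomes
    a positional command-line argument, in the order given.
    """
    return {
        # Assumed version string; use whatever your CWL runner supports.
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        # The Docker image pins the software and its environment.
        "requirements": {"DockerRequirement": {"dockerPull": docker_image}},
        "inputs": {
            name: {"type": cwl_type, "inputBinding": {"position": position}}
            for position, (name, cwl_type) in enumerate(inputs, start=1)
        },
        # Capture whatever the tool writes to standard output as a file.
        "stdout": stdout_name,
        "outputs": {"alignment": {"type": "stdout"}},
    }


tool = make_tool(
    base_command=["bwa", "mem"],
    docker_image="example/bwa:latest",  # hypothetical image name
    inputs=[("reference", "File"), ("reads", "File")],
    stdout_name="aligned.sam",
)

if __name__ == "__main__":
    print(json.dumps(tool, indent=2))
```

The point of the sketch is that the command, its arguments, and the image it runs in all live in one shareable document, instead of being scattered across wrapper scripts and hard-coded paths.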
Have a go. We may get past these cowboy days and start to reproduce results down to the byte!