Sunday, June 26, 2016

The Up-And-Coming Bioinformatics Language: A First Look At Julia

Programming is a dynamic field that shifts from one language to another over the years. A classic example is the shift to Perl, which then gave way to Python. The R language has also exploded in recent years, and all of these languages are used heavily in bioinformatics. Instead of focusing on the current state of bioinformatics, I want to focus this post on where we could be going in the future. More specifically, I want to discuss an up-and-coming programming language named Julia, which has potential for use in bioinformatics.


Julia is a new language that first appeared in 2012 and has been gaining attention ever since. The creators have focused on creating an efficient and fast language that is also relatively easy to use. Because people are talking more about it each day, and because I think it shows exceptional promise, I wanted to try it out for myself.

The Benchmarking

I was a little bummed when I saw that their homepage benchmarks did not include Perl, my go-to language for a lot of the data munging associated with bioinformatics. Perl is also lightning fast for a scripting language, which makes it handy. I decided I would familiarize myself with the Julia language by setting up some basic benchmarking.

To get a feel for Julia's speed, I decided to recreate a Perl script that I use to calculate the median length of sequences in a fasta file. I downloaded Julia from the Julia website, installed it on my computer, and rewrote the Perl script in Julia. In total this took me about 1-1.5 hours, which highlights the ease of writing in Julia. It really took no time at all before I was writing a decent Julia script. I had never used the language before, but it is familiar to any Python or R user.
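For readers who have not seen this calculation before, here is a minimal Python sketch of the same task. It is not the Perl or Julia script from the benchmark (those live in the GitHub repository); the function names and the tiny example input are made up for illustration, and it assumes a plain FASTA layout where each record starts with a `>` header line.

```python
import io
from statistics import median


def sequence_lengths(handle):
    """Yield the length of each record in a FASTA-formatted text stream."""
    length = None  # stays None until the first header line is seen
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if length is not None:
                yield length  # finish the previous record
            length = 0
        elif length is not None:
            length += len(line)  # a sequence may wrap across several lines
    if length is not None:
        yield length  # finish the last record


def median_sequence_length(handle):
    """Return the median record length, or None if there are no records."""
    lengths = list(sequence_lengths(handle))
    return median(lengths) if lengths else None


# Tiny made-up input (not data from the benchmark): lengths are 6, 2 and 8.
example = io.StringIO(">seq1\nACGT\nAC\n>seq2\nAC\n>seq3\nACGTACGT\n")
print(median_sequence_length(example))
```

For a real file you would pass an open file handle instead of the `io.StringIO` example. Collecting all lengths before taking the median keeps the sketch short, at the cost of holding one integer per sequence in memory.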

Once I had the two scripts, I ran them on the same example fasta file and compared the execution time required for both. I got the following results.
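As a rough illustration of this kind of whole-process timing, here is a small Python sketch that measures the wall-clock time of an external command. This is an assumption-laden stand-in, not the harness I used: the helper name `time_command` is hypothetical, and the example command simply starts Python and exits so that the sketch runs anywhere. Timing from outside the process like this counts everything, including interpreter startup.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Return the fastest wall-clock time (seconds) over several runs of argv."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in command: start the Python interpreter and exit immediately.
# For a real comparison, argv would name the Perl or Julia script and the
# fasta file; those exact command lines are not shown in this post.
startup_only = time_command([sys.executable, "-c", "pass"])
print(f"interpreter startup: {startup_only:.3f} s")
```

Taking the minimum over a few repeats is one common way to reduce noise from other processes on the machine; reporting the mean or median would be an equally reasonable choice.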

Figure: Comparison of Perl and Julia run times for calculating the median sequence length in increasingly large fasta files. Code is available on the JuliaPerlBenchmark GitHub page.

So the Perl script clearly ran faster than the Julia script, and both increased in time at about the same rate as I added sequences. So what can we say from these results? I would conclude that although Julia is fast, it still can't beat Perl for parsing data and making quick calculations. Of course this comes with the caveat that I have very little experience writing in Julia and could have written it poorly (I did try to make it efficient to give it a fair chance though). I also only tested the two on relatively small files, and the results may be different for very large files. Regardless, I still think this is informative.

Check out the associated data and code on the JuliaPerlBenchmark GitHub page.

Julia Pros


  • After spending some time with the Julia language, I really liked the familiarity of the syntax and data structures. Anybody with exposure to Python, R, or any similar high-level scripting/programming language will easily pick up Julia in an hour or two.
  • I like that Julia seems to be a bit of a hybrid between R and Python. It seems like it could be really good for bioinformatics by allowing easy data formatting, analysis, and presentation in one cohesive and fast language environment.
  • Although it was a little slower than Perl for parsing sequencing data files, Julia is still a fast language and I think this will draw more and more bioinformaticians to use it.
  • Finally, Julia allows for easy integration with C, which I think will help with future development.


Benchmarking results provided on the Julia homepage.

Julia Cons


  • Although I like Julia, there are certainly some problems that will prevent me from switching over right now. The biggest issue is that it simply does not have the support and infrastructure that a language like Python or R has. Julia is still up-and-coming and the community is not at the same level as the R, Python, or Perl communities. I expect it will pick up in the coming years, but for now it just makes sense (for me) to work in the more developed communities of R, Python, and Perl.
  • Although Julia is fast, it still can't beat my simple and fast Perl scripting. Until it beats Perl performance in data formatting and management, I honestly won't have a strong incentive to make the move over to Julia-heavy scripting.


Final Thoughts

Julia is a promising and exciting new programming language that I think we will hear more about in the next few years. The community is small and there is less support compared to Python and R, but that could (and probably will) change over time. The general feeling I got for Julia was that it was a combination of Python and R that offered me the best of each in one language. That, in addition to the speed advantages over R and Python, could allow Julia to replace Python and R as major programming languages in the near future. I really do think it is reasonable to expect Julia to be the bioinformatics language of choice in the next ten to fifteen years. Ultimately, though, only time will tell.

Any thoughts, comments, or concerns? Any bugs in my code or errors in my interpretations? Let me know in the comments below. You are also always welcome to reach out on Twitter or by email. I always love to hear from Prophage readers.

Update

I have been getting incredible feedback on this blog post and I wanted to update the readers with what I have learned, and how the data has improved. Thanks to the readers in the comments below, as well as on the GitHub repository, we have addressed two issues with the benchmark.

  1. The script I wrote needed to be written more efficiently. Ismael rewrote the script to run more efficiently, and also provided a solid explanation of what they did.
  2. As you can see in the comments, the problem with this test is that Julia takes time to start and compile the code. The time required to get started is considerably greater for Julia, which is the biggest reason Perl appears to perform better. Given this information, you might predict that Julia could outperform Perl on larger files, where the startup time becomes negligible. I quickly increased the size of my file to about 500MB (from 30MB) and reran the benchmark. Wouldn't you know it, Julia begins to outperform Perl at larger file sizes, which is awesome. The updated results are below.
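The startup argument can be made concrete with a simple linear cost model: total time is a fixed startup cost plus a constant cost per sequence. The Python sketch below is only an illustration of that reasoning. The function names are hypothetical, the 0.2 s startup figure echoes the roughly 200 ms constant mentioned in the comments, and the per-sequence costs are made-up numbers, not measurements from my benchmark.

```python
def total_time(startup_s, per_seq_s, n_seqs):
    """Linear cost model: fixed startup plus a constant cost per sequence."""
    return startup_s + per_seq_s * n_seqs


def break_even(startup_a, per_seq_a, startup_b, per_seq_b):
    """Return the sequence count where A and B take equal time.

    Returns None when the two lines never cross for a positive count.
    """
    if per_seq_a == per_seq_b:
        return None  # parallel lines: the startup gap never closes
    n = (startup_b - startup_a) / (per_seq_a - per_seq_b)
    return n if n > 0 else None


# Made-up illustrative numbers (NOT measured values):
# "quick start" tool: no startup cost, 2 microseconds per sequence
# "slow start" tool: 0.2 s startup, 1 microsecond per sequence
n = break_even(0.0, 2e-6, 0.2, 1e-6)
print(f"break-even at about {round(n):,} sequences")
print(f"time at break-even: {total_time(0.2, 1e-6, n):.3f} s")
```

With these made-up numbers the lines cross at about 200,000 sequences. Below that point the quick-starting tool wins, and above it the tool with the higher startup cost but lower per-sequence cost pulls ahead, which is the pattern the larger file revealed.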

Figure: Updated comparison of Perl and Julia run times for calculating the median sequence length in increasingly large fasta files, using a larger file than in the figure above. Code is available on the JuliaPerlBenchmark GitHub page.


So what can we take away from this? It turns out that while Julia takes longer to start up, it is blazing fast and actually outperforms Perl on larger but still reasonable files. With this new and more correct knowledge, I am happy to say that I am even more excited about Julia and think that it has a place in bioinformatics. Speed is a big thing for me, so I can see incorporating this into my own work.

I finally want to thank all of the readers who contributed to this blog post. I love that people were able to help make this little piece of data accurate and fair, and I feel like we all benefitted from the improved results. Thank you so much and please feel free to continue commenting.


5 comments:

  1. Hi, I'm one of the maintainers and coders over at BioJulia. Our flagship package is called Bio.jl, and in that package a lot of effort has been invested in our consistent IO interface. For files like FASTA that are 'regular', we create Ragel specifications for file formats, and this results in automatically generated parsers that in our experience are very fast. We also have sequence types that are more efficient than strings. If you're interested, we'd love to know how your benchmark does with what we have created, and how it compares to the string-processing-based Julia script you created.

  2. Hey Ben, thanks for the information! I definitely want to try this out. When I get a chance I'll try implementing it in my benchmarking repo on GitHub and follow up with you on how it looks. :)

    https://github.com/Microbiology/JuliaPerlBenchmark

  3. Hi, this is an interesting post, thanks. Thought I'd just mention something about your benchmarks. At the minute you use the bash `time` command to measure execution time. This makes sense for Perl, which has minimal start up time, hence the plots above going to 0 time with 0 sequences.

    However for Julia, what this benchmark really measures is the time for starting up Julia, and also compiling the code. This is why a similar linear trend is seen, plus a large constant factor of ~200 ms.

    If you wanted to measure the speed of the algorithm execution once everything is loaded and compiled, you might be better off using the internal commands in Julia, e.g. the `@time` macro.

  4. Yeah, this is a very good point! I was originally thinking about the benchmark in terms of how long "everything" takes to run, which is why I was fine with `time`. But this is an important consideration for thinking about scaling the benchmark to larger files. I bet that once this is run on files large enough that the startup time becomes negligible, the story will be different.

    Thanks for the input! :)

  5. I went back and reran the benchmark on larger files and Julia indeed outperforms Perl at larger file sizes. Thanks for the comment and helping the rest of us avoid making unfair conclusions on incomplete data! :) I also updated the blog content with this new information.
