Many virus genomes are circular, like the Olympic rings.
When analyzing virus metagenomic data, we often find it helpful to identify contigs that represent complete circular genomes [1,2]. In addition to offering biological information, this is used as a quality control technique to evaluate whether the sequencing efforts were robust enough to allow for complete genome assembly. This approach has the advantage of reference independence because it does not require aligning reads to a reference genome to evaluate sequence completion.
Because a circular genome has no start or end, an assembled contig that covers the whole genome keeps going past its starting point, so the beginning of the contig is repeated at its end. This trait is used to detect circular genomes: the contig is "closed" by identifying the repeated sequence. This can be done by aligning the contig nucleotide sequence to itself to "close the circle", as represented in the figure below.
A linear contig representing a circular virus can be closed by detecting sequence similarity at each end.
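To make the idea concrete, here is a minimal Python sketch of the "closing" step. It is not the ccontigs.jl implementation: it swaps the pairwise alignment for a plain exact string comparison between the start and end of the contig, so it will miss overlaps that contain sequencing errors. The function names, the toy sequence, the `min_overlap` default, and the choice to cap the overlap at half the contig length are all assumptions made for illustration.

```python
def find_end_overlap(contig: str, min_overlap: int = 10) -> int:
    """Return the length of the longest exact match between the start and
    the end of the contig, or 0 if no match of at least min_overlap exists.

    Assumption: the overlap is capped at half the contig length so the
    prefix and suffix being compared never overlap each other.
    """
    seq = contig.upper()
    max_overlap = len(seq) // 2
    # Try the longest candidate first so the first hit is the longest overlap.
    for k in range(max_overlap, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def close_circle(contig: str, min_overlap: int = 10):
    """Trim the repeated end from a contig that closes on itself.

    Returns the trimmed sequence, or None if the ends do not overlap.
    """
    k = find_end_overlap(contig, min_overlap)
    if k == 0:
        return None
    return contig[:-k]


# Toy example: a made-up 30-base "genome" whose assembly ran 12 bases
# past the starting point.
genome = "ATGCGTACGTTAGCCTAGGCTAACGTTAGC"
contig = genome + genome[:12]

print(find_end_overlap(contig))        # 12
print(close_circle(contig) == genome)  # True
print(close_circle(genome))            # None (ends do not overlap)
```

Exact matching keeps the sketch short and dependency-free. On real assemblies an alignment-based comparison that tolerates a few mismatches is likely the safer choice.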
Because this approach is primarily implemented as "custom in-house scripts", it is hard to find good, freely available implementations without hunting them down from their authors. In the interest of adding to the valuable open-source virome analysis resources available online, I wrote a script that detects circular virus contigs. The script is titled ccontigs.jl and is available on the GitHub ccontigs repository. See the documentation there for details.
The first question you might be asking yourself is what the ".jl" file extension means. What language is that? The .jl extension means that this script was written in Julia, a programming language that I am liking more and more for bioinformatics. I originally tried writing the program in Python using some BioPython tools, but I found the pairwise alignment tool was quite slow and memory intensive. I had good experiences with Julia before, especially with regards to performance, so I rewrote the script in Julia and tried it out. The Julia script drastically outperformed the Python version, so I stuck with it. The downside is that you need to install Julia on your computer/server, but this is pretty easy with instructions found here.
As I was validating the script I noticed an important caveat to this approach that I hadn't really seen mentioned in the literature. Some linear virus genomes actually contain a repeat of the beginning of their genome again at the end of the genome (e.g. Staphylococcus phage MSA6). This means that a sequence similarity approach would "close" the contig as a circle even though it's a linear genome. Is this a problem? The answer depends on your question.
If you are claiming a contig truly represents a completed circular genome, this "closing" method is insufficient on its own and will need to be supplemented with a different approach. The method does, however, work well as a QC measure of whether the sequencing effort was deep enough to capture complete genomes from the virome. Even if the genome is linear, "closing" it still provides strong evidence that you sequenced enough to cover the entire genome.
Moving forward from this post, we now have an efficient, open-source tool for detecting circular contigs. We are also aware of the caveat that some linear genomes may be mis-annotated as representing circular genomes, but the impact of this caveat varies with the experimental question being asked.
As always, please leave any questions, comments, or concerns in the comment section below, or reach out through Twitter or email. I am always happy to get feedback and help out other virome researchers.
And yes, I know the metaphor in the first figure is a bit of a stretch. :)
2. Manrique, P., Bolduc, B., Walk, S., van der Oost, J., de Vos, W., & Young, M. (2016). Healthy human gut phageome. Proceedings of the National Academy of Sciences, 113(37), 10400-10405. DOI: 10.1073/pnas.1601060113