Sunday, August 24, 2014

A Microbiome Analysis Toolkit and "Block Fasta" Formatting

I write a lot of scripts in my day-to-day sequence analysis of microbiome data.  Many of them are project specific, but some could be useful to others in their own sequence analysis projects.  A while back I posted a script for formatting QIIME output files for input into the LEfSe analysis toolkit, and I now think it is worth sharing more.  I have therefore turned the "Lefse formatting" repository into a more general "microbiome sequence analysis toolkit" repository, which seems like a good place to periodically add scripts for others to use.  To get the new repo started, I added a script that removes "block fasta" formatting from fasta sequence files.  It's relatively simple, but I think it's also pretty useful.

As I point out in the README, sequence fasta files are sometimes reported in "block fasta" format, meaning each sequence is wrapped across multiple lines instead of sitting on a single line.  An example is as follows:

$test_block.fasta

>Sequence_1
TATGCTGAGTCAGTCTGCAGTCAGTACGTCAGTCAGTCAT
TGCAGTCATTGACGGTCAGACGACTGCAGTCATCAGTA
>Sequence_2
CAGCAGTCAGTCATCATGACGTCAGTCAGTCAGTCAGTCA
GTCAGTCAGTCAGACGCA
>Sequence_3
GACGTCAGTACTGCAGTCAGACGTCATCGTCAGTCAGTCA
GTCATATACTCAGCGTCTATGACCGCAGTCAGTC

This makes the sequences easier for humans to read, but it can complicate downstream analyses, especially scripts and one-liners that assume each sequence sits on exactly one line.  The simple script remove_block_fasta_format.pl removes the newlines within each sequence to generate a fasta sequence file like the following:

$test_no_block.fasta

>Sequence_1
TATGCTGAGTCAGTCTGCAGTCAGTACGTCAGTCAGTCATTGCAGTCATTGACGGTCAGACGACTGCAGTCATCAGTA
>Sequence_2
CAGCAGTCAGTCATCATGACGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGACGCA
>Sequence_3
GACGTCAGTACTGCAGTCAGACGTCATCGTCAGTCAGTCAGTCATATACTCAGCGTCTATGACCGCAGTCAGTC
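For anyone curious about the logic, here is a minimal Python sketch of the same idea.  To be clear, this is not the Perl script from the repository, and the function name unwrap_fasta is just a name I made up for illustration.  It assumes header lines start with ">" and that every other non-empty line belongs to the sequence of the most recent header.

```python
def unwrap_fasta(lines):
    """Yield fasta lines with each record's sequence joined onto one line.

    Hypothetical helper for illustration; not the repository's Perl script.
    """
    seq_parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header: emit the previous record's sequence first.
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        else:
            seq_parts.append(line)
    # Emit the sequence of the final record.
    if seq_parts:
        yield "".join(seq_parts)


# The first two records from the block fasta example above.
block = """>Sequence_1
TATGCTGAGTCAGTCTGCAGTCAGTACGTCAGTCAGTCAT
TGCAGTCATTGACGGTCAGACGACTGCAGTCATCAGTA
>Sequence_2
CAGCAGTCAGTCATCATGACGTCAGTCAGTCAGTCAGTCA
GTCAGTCAGTCAGACGCA
"""

unwrapped = list(unwrap_fasta(block.splitlines()))
print("\n".join(unwrapped))
```

Because the function takes any iterable of lines, an open file handle can be passed in directly, so a large fasta file never has to be loaded into memory all at once.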

To run this script, call it with perl and write the input and output file names after the script name, for example: perl remove_block_fasta_format.pl test_block.fasta test_no_block.fasta.  The script works with DNA and RNA sequences as well as amino acid (protein) sequences.  And of course, a big shoutout to Qi Zhang in our lab, who helped me clean this script up and get it running more efficiently.

And that's all there is to it.  The script itself is pretty simple, but now it's available so you don't have to write it yourself.  It is written in Perl, which handles this kind of line-by-line text processing quickly.  Feel free to check out the repository on GitHub.  And of course, feel free to leave a comment below or shoot me an email.

Happy coding, friends!



*Code formatting done using 'Format My Source Code For Blogging'.
*Image modified from this source
