Thursday, September 18, 2014

Microbiome Analysis: Average Sequence Lengths and Looping with xargs

Sometimes I want to easily calculate the average sequence length of our collection of sequences in either a fasta or fastq file.  To address this, I wrote up a couple of small perl scripts to quickly calculate the median sequence length of a fasta or fastq file.  You can find these perl scripts on GitHub in my "Microbiome_sequence_analysis_toolkit" repository.  The nice thing about this script is that it returns the median and file name to the standard output, which makes it easier to loop across many files and collect the results into a single summary file.

As an example, I would use xargs (in bash on a Linux OS) to loop through some fasta files in a directory and collect the resulting average sequence lengths into a single summary file.  To be more specific, I would first use the ls command to return all of the fasta file names as a list.

$ ls .

file1.fa
file2.fa
file3.fa

You can use the bash pipe to carry that list of file names to the next command.  You can use xargs to pass each file name from the list (one after another) to the perl script.

$ ls . | xargs -I {} perl calculate_fasta_median_length.pl {}

102    file1.fa
114    file2.fa
307    file3.fa

Now you have a summary list of the average sequence lengths for each file in the directory.  If you want to save the list as a summary file, just pass the list (in standard output) to a file.

$ ls . | xargs -I {} perl calculate_fasta_median_length.pl {} > ./summary_file.txt

And that's all there is to it.  You can use these perl scripts to get the average sequence length of a fasta or fastq file, and you can use it in a loop to get a summary of a group of files.  Get the scripts here.  And as always, please feel free to contact me in the comments below, or feel free to shoot me an email.

Happy coding friends!



*Code formatting done using 'Format My Source Code For Blogging'.


No comments:

Post a Comment