Monday, September 9, 2013

Quickly Combining Fasta Identification and Sequence Lines

When analyzing high-throughput DNA sequencing results, I sometimes want to select sequences with certain names from my fasta file (fasta is a standard sequencing file format).  There are a lot of ways to accomplish this, but here I want to outline one I found fast and easy.

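A fasta file of the kind used here alternates between an identification line, which begins with '>' and names the sequence, and a single line holding the sequence itself.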

My goal is often to use grep to pull out sequences that match certain sequence IDs.  To do this I need each sequence name and its sequence to be on the same line.  I found one cool, fast solution (here; see 'examples'), which uses the paste command as follows:
paste -s -d '\t\n' myfasta.fa > output_file.txt

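Here the -s flag tells paste to join all the lines of the file one after another, and the -d flag lists the characters that will replace each newline character (\n) in turn. paste cycles through that list, so it replaces the first \n with \t, the second with \n, the third with \t again, and so on, thereby merging every pair of lines.  In the resulting file, each sequence ID and its sequence sit on one line, separated by a tab.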
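
As a minimal sketch of what this does, the following builds a tiny made-up fasta file and runs the same paste command on it. The file names, IDs, and sequences are hypothetical, and the sketch assumes every sequence sits on a single line.

```shell
# Build a small, made-up fasta file: two records, one sequence line each.
printf '>seq1\nACGT\n>seq2\nGGCC\n' > demo.fa

# -s joins all lines of the file one after another; -d '\t\n' alternates
# the joining character between a tab and a newline.
paste -s -d '\t\n' demo.fa > demo_joined.txt

# Each record is now a single line: ID, tab, sequence.
cat demo_joined.txt
```

If a sequence were wrapped over several lines, the tab/newline alternation would fall out of step with the records, so this trick is best kept for files with exactly two lines per record.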

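Now you can use grep or whatever to keep only the sequences you want by selecting the lines from 'output_file.txt' that match your sequence IDs.  To get the fasta back to normal, simply use tr (translate) to replace every \t with \n in whichever file holds your selected lines (shown here with 'output_file.txt').  This assumes the ID lines contain no tabs of their own.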
tr "\t" "\n" < output_file.txt > final_fasta.fa
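
The whole round trip can also be sketched as a single pipeline. This is only an illustration under assumptions: the file names and IDs are made up, and the pattern is anchored on the ID followed by a tab so that 'seq1' does not also pick up 'seq10' or match text inside a sequence.

```shell
# Made-up input: three records, one sequence line each.
printf '>seq1\nACGT\n>seq10\nGGCC\n>seq2\nTTAA\n' > demo.fa

# A literal tab character, so the pattern can require "ID, then tab".
tab=$(printf '\t')

# Join each ID with its sequence, keep only the wanted record,
# then split the kept lines back into fasta layout.
paste -s -d '\t\n' demo.fa \
  | grep "^>seq1${tab}" \
  | tr '\t' '\n' > demo_selected.fa

cat demo_selected.fa
```

Piping the three steps together avoids the intermediate text file, though keeping that file around can be handy if you plan to run several different grep queries against it.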

UPDATE (2013-09-21)
A couple of days ago, a friend of mine who reads the blog told me about another quick way to use grep on fasta files.  If your goal is to use grep to pull out sequences with a certain title, use the following:
grep -A 1 'word_of_interest' myfasta.fa > output_file.fa

The -A flag tells grep to print each line that matches your query along with a number of lines after it, and that number is given right after -A.  Because you only want the one line following the matching title (that line being the corresponding sequence), you specify 1.
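
One thing to watch for: some grep implementations (GNU grep among them) print a '--' line between groups of matches that are not next to each other in the file, and that line is not valid fasta. A minimal sketch of one way to handle it, using made-up file names and IDs, is to filter those lines out afterwards:

```shell
# Made-up input: the two 'sampleA' records are not adjacent.
printf '>sampleA_1\nACGT\n>sampleB_1\nGGCC\n>sampleA_2\nTTAA\n' > demo.fa

# Pull each matching title plus the one line after it, then drop any
# '--' group-separator lines so the result is still plain fasta.
grep -A 1 'sampleA' demo.fa | grep -v '^--$' > demo_sampleA.fa

cat demo_sampleA.fa
```

GNU grep also accepts a --no-group-separator option that suppresses these lines directly. The sketch does not rely on it, since other grep implementations may not support it.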

*Code formatting done using 'Format My Source Code For Blogging'.
