Example screenshot of open reading frame annotation within the Geneious program.
When you predict open reading frames with a Glimmer workflow, you get several output files in different formats. Two of the main outputs are the '.predict' file and the '.detail' file. The '.detail' file includes a lot of information about all of the candidate open reading frames, while the '.predict' file only includes the final open reading frame predictions. A '.predict' file may contain predictions from one or many genomes or genomic segments. The file itself can be broken into six parts (a header line plus five columns), which I outline below.
- Header: The first line is the genome identification, which is the same ID that the sequence had in the fasta file. Under the header is a set of five columns.
- Column 1: The name (ID) of the predicted open reading frame.
- Column 2: The sequence base number of the first base in the open reading frame. In other words, the starting location.
- Column 3: The sequence base number of the final base in the open reading frame (last base of the stop codon). In other words, the ending location.
- Column 4: The reading frame, written as a strand sign (+ or -) followed by the frame number (for example, +3 or -2).
- Column 5: The 'per-base raw' score of the predicted open reading frame.
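To make the column layout concrete, here is a minimal Perl sketch that splits one ORF line into named fields. The subroutine name `parse_predict_line` and the hash keys are my own illustrative choices, not part of Glimmer, and the sketch assumes the fields are separated by whitespace (the same assumption the conversion script later in this post makes when it collapses whitespace into tabs).

```perl
use strict;
use warnings;

# Parse one ORF line from a .predict file into named fields.
# Returns a hash reference, or undef if the line is not an ORF line.
# (Hypothetical helper for illustration; not part of Glimmer.)
sub parse_predict_line {
    my ($line) = @_;
    chomp $line;
    # Header lines start with '>' and carry the sequence ID, not an ORF.
    return if $line =~ /^>/;
    # Fields are whitespace-delimited: ID, start, end, frame, score.
    my ($id, $start, $end, $frame, $score) = split ' ', $line;
    return unless defined $score;
    # Column 4 packs the strand sign and the frame number together, e.g. "+3".
    my ($strand, $frame_num) = $frame =~ /^([+-])([123])$/
        or return;
    return {
        id     => $id,
        start  => $start,
        end    => $end,
        strand => $strand,
        frame  => $frame_num,
        score  => $score,
    };
}

my $orf = parse_predict_line("orf00004 5600 4083 -3 2.97\n");
print "$orf->{id} runs $orf->{start}..$orf->{end} on the $orf->{strand} strand, frame $orf->{frame}\n";
```

Splitting column 4 into a strand and a frame number up front is convenient because the '.gff3' format stores those two pieces in separate columns.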
An example of a '.predict' file is found below. This file contains various predicted open reading frames for three different genomic sequence fragments.
>1_
orf00001 522 689 +3 0.85
orf00004 5600 4083 -3 2.97
orf00006 8925 9050 +3 0.11
orf00007 10514 9444 -3 2.95
orf00008 10836 10961 +3 2.96
orf00016 16597 15113 -2 2.97
>2_
orf00001 4684 94 +2 2.91
orf00002 353 207 -3 1.19
orf00003 464 194 +2 2.92
orf00004 33 27 -3 1.11
>3_
orf00004 5600 4083 -3 2.97
orf00006 8925 9050 +3 0.11
orf00007 10514 9444 -3 2.95
orf00008 10836 10961 +3 2.96
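Because one '.predict' file can hold several sequence blocks, it helps to see which header IDs are present before picking one to convert. The following is a small Perl sketch, assuming each block starts with a '>' line as in the example above; the subroutine name `orfs_by_sequence` and the shortened sample text are illustrative only.

```perl
use strict;
use warnings;

# Group ORF names in .predict-formatted text by the header line they follow.
# Returns the header IDs in file order plus a hash of ID => list of ORF names.
# (Hypothetical helper for illustration.)
sub orfs_by_sequence {
    my ($text) = @_;
    my (@order, %orfs, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^>(\S+)/) {
            # A '>' line starts a new sequence block.
            $current = $1;
            push @order, $current unless exists $orfs{$current};
            $orfs{$current} ||= [];
        }
        elsif (defined $current and $line =~ /^(\S+)\s/) {
            # Any other non-empty line is an ORF in the current block.
            push @{ $orfs{$current} }, $1;
        }
    }
    return (\@order, \%orfs);
}

# A shortened copy of the example file above.
my $predict_text = <<'END';
>1_
orf00001 522 689 +3 0.85
orf00004 5600 4083 -3 2.97
>2_
orf00001 4684 94 +2 2.91
orf00002 353 207 -3 1.19
orf00003 464 194 +2 2.92
END

my ($order, $orfs) = orfs_by_sequence($predict_text);
printf "%s\t%d ORFs\n", $_, scalar @{ $orfs->{$_} } for @$order;
```

Note that ORF names such as orf00001 repeat across blocks, so an ORF is only uniquely identified together with its sequence header.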
This file format is fine for many downstream applications, but it won't get you very far in visualizing the ORFs on your sequences in programs such as Geneious or the Integrative Genomics Viewer (IGV). A great standard file format for visualizing predicted open reading frames on genomic sequences is the '.gff3' file format. It includes most of the same information as the '.predict' file, but it is formatted differently. Because it is a standard format, it will play nicely with more programs, and more people will be familiar with it.
The '.gff3' (or Generic Feature Format Version 3) file format is a common and robust format used to describe genomic features (primarily genes, predicted open reading frames, etc.). Like the '.predict' format, '.gff3' is broken into columns, but the columns are tab-delimited, there are nine of them, and there is no header line per sequence. The nine columns are described below.
- Column 1: (Seqid) The genome or genomic fragment that the predicted open reading frame belongs to.
- Column 2: (Source) The name of the algorithm, program, or workflow that generated the open reading frame.
- Column 3: (Type) The type of the feature described by the line, which is a gene in our case.
- Column 4: (Start) The sequence base number of the first base in the open reading frame. In other words, the starting location.
- Column 5: (End) The sequence base number of the final base in the open reading frame (last base of the stop codon). In other words, the ending location.
- Column 6: (Score) The meaning of the score is loosely defined, but it is often an e-value or another score associated with the feature. In my case, I used this column to hold the name of the predicted open reading frame, which worked well in my own downstream analyses. The specification expects a number here, though, so for stricter parsers you can use the 'per-base raw' score instead.
- Column 7: (Strand) A plus or minus to identify which strand the open reading frame is a part of.
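- Column 8: (Phase) In this post, this number holds the reading frame (1, 2, or 3) reported by Glimmer. Note that the GFF3 specification defines phase differently: for CDS features it is 0, 1, or 2, the number of bases to skip to reach the first base of the next codon.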
- Column 9: (Attributes) A list of more attributes associated with the open reading frame feature.
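The nine-column mapping above can be sketched in Perl as a single `join` on tabs. This follows the layout used in this post (ORF name in the score column, Glimmer frame number in the phase column); the subroutine name `gff3_line` and its named arguments are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Assemble one nine-column GFF3 line from parsed .predict fields.
# Layout follows this post: the ORF name goes in the score column.
# (Hypothetical helper for illustration.)
sub gff3_line {
    my (%f) = @_;
    return join("\t",
        $f{seqid},     # 1 Seqid
        'GLIMMER',     # 2 Source
        'gene',        # 3 Type
        $f{start},     # 4 Start
        $f{end},       # 5 End
        $f{id},        # 6 Score (ORF name here; $f{score} is the alternative)
        $f{strand},    # 7 Strand
        $f{frame},     # 8 Phase (the Glimmer frame number in this post)
        "ID=$f{id}; NOTE: Glimmer ORF prediction;",    # 9 Attributes
    ) . "\n";
}

print gff3_line(
    seqid  => 2,
    id     => 'orf00002',
    start  => 353,
    end    => 207,
    strand => '-',
    frame  => 3,
    score  => 1.19,
);
```

Swapping `$f{id}` for `$f{score}` in column 6 gives the variant that stores the 'per-base raw' score instead of the ORF name.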
An example of a '.gff3' file can be found below. This was generated using my perl script (described below) and is derived from the '.predict' file above. Also note that this file is only for the open reading frames associated with sequence 2, and I made the text smaller so the entire line fits on the page.
2 GLIMMER gene 4684 94 orf00001 + 2 ID=orf00001; NOTE: Glimmer ORF prediction;
2 GLIMMER gene 353 207 orf00002 - 3 ID=orf00002; NOTE: Glimmer ORF prediction;
2 GLIMMER gene 464 194 orf00003 + 2 ID=orf00003; NOTE: Glimmer ORF prediction;
2 GLIMMER gene 33 27 orf00004 - 3 ID=orf00004; NOTE: Glimmer ORF prediction;
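One caveat about the lines above: they keep Glimmer's coordinate order, so the start can be larger than the end. The GFF3 specification asks for start to be less than or equal to end, with the orientation carried by the strand column, and some viewers may be strict about this. Below is a minimal Perl sketch of a post-processing step that reorders columns 4 and 5. The subroutine name `order_coordinates` is hypothetical, and the sketch assumes a reversed pair only reflects orientation; it does not handle a gene that wraps around the origin of a circular sequence.

```perl
use strict;
use warnings;

# Return a copy of a GFF3 line with columns 4 and 5 ordered so start <= end.
# Assumption: a start larger than the end only reflects orientation, not a
# gene wrapping around the origin of a circular sequence.
# (Hypothetical helper for illustration.)
sub order_coordinates {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps any trailing empty columns intact.
    my @col = split /\t/, $line, -1;
    # Swap start (index 3) and end (index 4) when they are reversed.
    @col[3, 4] = @col[4, 3] if $col[3] > $col[4];
    return join("\t", @col) . "\n";
}

print order_coordinates(
    "2\tGLIMMER\tgene\t353\t207\torf00002\t-\t3\tID=orf00002; NOTE: Glimmer ORF prediction;\n"
);
```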
To end this post, I am including the Perl script I used for the conversion from '.predict' to '.gff3'. I am also storing it with the rest of my microbiome analysis tools on my GitHub account. To use the script, call it with perl, followed by the input file, the sequence ID you want to extract information for, and the output file.
Using the script:
perl GlimmerPredict2Gff3.pl ~/TestIn.predict 2 ~/TestOut.gff3
#!/usr/local/bin/perl -w
# GlimmerPredict2Gff3.pl
# Geoffrey Hannigan
# Elizabeth Grice Lab
# University of Pennsylvania
# This script will take in a .predict file from glimmer and will convert it to gff3 format.

# Set use
use strict;
use warnings;

# Set files to scalar variables
my $usage = "Usage: perl $0 <INFILE> <CONTIGID> <OUTFILE>";
my $infile = shift or die $usage;
my $contigid = shift or die $usage;
my $outfile = shift or die $usage;
open(IN, "<$infile") || die "Unable to open $infile: $!";
open(OUT, ">$outfile") || die "Unable to write to $outfile: $!";

# Confirm the contig identification
print "Contig ID is $contigid.\n";

# Store flag value as zero
my $flag = 0;

while(my $line = <IN>) {
    # Once you hit the contig block of ORF interest, append to flag and get going!
    if ($flag==0) {
        # Anchor the match at the line start so that ID 2 does not also match >12_
        if ($line =~ /^>\Q$contigid\E_/) {
            ++$flag;
            next;
        } else {
            next;
        }
    }
    # Now that the flag is appended, deal with the ORF lines for the contig of interest
    if ($flag==1) {
        if ($line =~ /\>/) {
            # Once you hit the end of the ORFs of interest, by hitting the next contig identifier, append the flag.
            ++$flag;
            next;
        } else {
            chomp $line;
            $line =~ s/\s+/\t/g;
            # Only write output when the ORF line parses, so blank lines do not leave partial records
            if ($line =~ /^(\S+)\t(\S+)\t(\S+)\t(\S)(\S)\t(\S+)/) {
                print OUT "$contigid\tGLIMMER\tgene\t";
                print OUT "$2\t$3\t$1\t$4\t$5\tID=$1; NOTE: Glimmer ORF prediction;\n";
            }
        }
    }
    # Once the flag is appended for the last time, kill the loop. We are done here.
    if ($flag==2) {
        last;
    }
}

# Close out files and print completion note to STDOUT
close(IN);
close(OUT);
print "Fin.\n";