Thursday, February 13, 2014

Microbial Biomarker Discovery and How to Properly Format Your Data (Lefse)

Biomarker discovery is a big part of medical research.  A biomarker is a clinical signal, like the presence of a gene in your genome or colonization of your lungs with a certain bacterial species, whose presence or absence indicates a disease state or predicts an increased risk for developing a disease state.  These play important roles in medicine because they can allow for disease diagnosis (ie. the physician can test for the disease biomarker) or provide a prediction of whether the patient is at a higher risk for developing a condition (ie. a gene that puts a patient at higher risk for developing diabetes).  There are some good programs out there for biomarker discovery, but one I particularly like is Lefse.

Lefse is an online program that was developed by the Huttenhower lab at Harvard.  Briefly, Lefse performs biomarker discovery by taking in categorized relative abundance features, such as microbial relative abundances which are categorized as diseased or non-diseased, performs some statistical analyses, and determines which features are statistically significant between categories and are therefore potential biomarkers.  This program is especially nice because it allows you to quickly visualize the biomarker potential of your dataset.  Read more about it in their paper, which I included in the works cited.

Lefse is overall easy to use, except for the formatting of the input data.  I definitely had a few issues when I was first trying to use the relative abundance tables I yielded after running some Qiime analyses.  To make my life easier, I wrote up a python script to quickly change my Qiime alpha diversity tables to a format that Lefse will read.  Additionally, to hopefully make some of the readers' lives easier, I am posting that script here and on GitHub.  Either download it from my Github account (Qiime_to_lefse_format), or copy it from here, paste and save it into a file named, and run it as 'python'.
#Geoffrey Hannigan
#University of Pennsylvania
#Laboratory of Elizabeth Grice
#Created: 2014-01-17
#NOTE: This is a script for formatting Qiime alpha diversity files and mapping files for use in lefse.

import os, sys, argparse

#I want this script to be easily usable like Qiime scripts so I am going to define arguments that can be entered from the cmd line

parser = argparse.ArgumentParser(description='Use this script for formatting relative abundance files, taken directly from Qiime and used with mapping files, for Lefse analysis.  Ultimately this will merge your relative abundance table to your mapping file, replace every case of ; with |, and transpose the file so that it works in Lefse.  WARNING that this script uses tmp folders like tmp1.txt so please make sure that your local directory does not include any files with this type of name.  Please also NOTE that you will likely have many columns of meta-data in your mapping file, and Lefse does not work well with these.  At this point you are going to have to delete the rows of meta-data that you do not want to use, so that it can be entered into Lefse.')

parser.add_argument('-i', '--input', metavar='file_name', type=argparse.FileType('r'), help='input relative abundance table from Qiime')
parser.add_argument('-m', '--mapfile', metavar='file_name', type=argparse.FileType('r'), help='mapping file')
parser.add_argument('-o', '--output', metavar='file_name', type=argparse.FileType('w'), help='output relative abundance table for Lefse')

args = parser.parse_args()

#Search for semicolon and replace with pipe, also replace empty spaces with underscores
f2 = open('./tmp1.txt', 'w')
for line in args.input:
 space_replace = line.replace(' ', '_')
 f2.write(space_replace.replace(';', '|'))


#Transpose the resulting file from above
f1 = open('./tmp2.txt', 'w')
with open('./tmp1.txt') as f:
 lis=[x.split() for x in f]

for x in zip(*lis):
 for y in x:


#Merge the mapping file information to the transposed relative abundance table based on having the same sample IDs
# This gives me a dictionary with all of the ID numbers correlating to the remainder of the row values.  See here for the reference I used to get started:
f_result = open('./tmp3.txt', 'w')
with open('./tmp2.txt', 'rb') as f1:
 f1_data = dict(line.split('\t',1) for line in f1)

for line in args.mapfile:
 key = row[0]
 map = f1_data[key]
 map2 = map.split('\t')
 map3 = row + map2


#Transpose the file back again once the mapping file has been merged to the relative abundance table
with open('./tmp3.txt') as f:
 lis=[x.split() for x in f]

for x in zip(*lis):
 for y in x:

#Have the operating system remove the tmp.txt files that were created while the script was running above.

This is nothing too fancy, but it makes life easier.  Basically it will take your relative abundance table straight from Qiime, make a few substitutions, and merge it with your mapping file (so your metadata is included in the Lefse input).  All you have to do is make sure you remove the rows and columns you are not going to use in your Lefse analysis because it does not deal well with extra information.  After this you should be ready to import your data table into Lefse.

As always, my goal is to provide you with some cool resources and further reading.  If you want to work on this code, improve it, or whatever, find it on my Github at Qiime_to_lefse_format.  If you want to chat about this stuff, ask questions, or whatever, please reach out and contact me.  I am new to Python and learning, but I would love to learn from you or pass on what information I have.  I know more about the biology and medicine, so I would love to talk about that too.


Nicola Segata, Jacques Izard, Levi Waldron, Dirk Gevers, Larisa Miropolsky, Wendy S Garrett, & Curtis Huttenhower (2011). Metagenomic biomarker discovery and explanation Genome Biol DOI: 10.1186/gb-2011-12-6-r60

*Code formatting:
*Stock photo from source


  1. Hi Geoffrey,

    Thank you for this super helpful post, and for the code! I was wondering if you could clarify what QIIME output table this script uses as input. This post and the readme file for your script both say to use an 'alpha diversity' table, but these tables generally don't have relative abundance of OTUs per sample. I think maybe you mean an OTU table created by the pick_otus command, which have relative abundance of OTUs per sample. These are in BIOM format and I'm not sure if your script is designed for that. The other option is one of the .txt files created by the summarize_taxa command, but these have sample ids in the top row and taxa identifiers in the first column, which contradicts the instructions in your readme file.

    Any clarification would be very helpful! I posted this here as a comment in case anyone else has similar questions.


  2. Hi Nastassia,

    Thanks for reading the blog and reaching out. It looks like the problem is some typos in the README. You are right that this should be relative abundance and not alpha diversity.

    BIOM format cannot be used with this script. The input should be a tab delimited relative abundance table from summarize_taxa. Having the sampleIDs as the first row and the taxa as the first column should be used, so I will have to fix the README.

    Have you tried running the .txt tab delimited file from summarize_taxa, as you mentioned, with taxa in the first column, and the sample IDs on top? Let me know if that works.

    Thanks again for letting me know about these issues. I will fix the README, and will also include an example input file to make this clearer.

  3. Hi Greg,

    Thanks for the quick response! I tried the following command, where the input file is a tab-delimited .txt output file from the summarize_taxa command and the mapping file has been edited to include only SampleID and Metadata columns:

    python $PATH/ -i otu_table_mc2_w_tax_L2.txt -m BZA3_map_for_Lefse.txt -o lefse/table_for_lefse.txt

    When I tried this, I got the following error:
    Traceback (most recent call last):
    File "$PATH/", line 49, in
    map = f1_data[key]
    KeyError: 'SampleID'

    Maybe it doesn't know how to correlate the columns in the mapping file with the samples in the tab-delimited file? Not really sure. I'm afraid I don't really know how to code so I can't go through your script manually.


  4. Hi Nastassia,

    No problem, glad to help! :) My first question is whether you included an output file as well, which needs a -o flag. Sorry about this not being in the README, but it will show up if you type:

    python -h

    My first guess from this error would be that the sample ID column names are different in the mapping file and the relative abundance file. These columns should have the same name.

    Would you mind sending me an example subset of your two input files (like the first five rows of the first five columns of each file) so that I can try trouble shooting it on my end? You can email them to me if you would like at

    Thanks for your patience! :)

  5. Just wanted to update the comment thread for other readers. We were able to fix the issues, and the script is updated on GitHub to reflect those fixes. Everything is working now better than ever. :)

  6. For disease research, biomarker generally refers to a biochemical indicator for objective measurement and evaluation of certain characteristics of general physiological or pathological or therapeutic procedure. The biological process of the body can be measured through biomarker testing.
    It is of great importance for the research of biomarkers.