Prophage

We Are Moving!

2017-11-18T19:47:00.000-05:00

After a couple of awesome years and over 100,000 views, Prophage is moving to a new site. Check out the first post on the new site here. Follow the link below for the new site.

New Prophage Blog Site

Improving Your Skill Set: Tips for Learning New Programming Languages

2017-06-17T09:49:00.000-04:00

I spend a considerable amount of my time with scientists who are starting to learn to code (either IRL or online), often in the hopes that it will open future research and career doors. One of the major barriers I have found in my own experience, as well as observed with others, is learning and implementing new programming languages after already learning one. If we are honest, learning a programming language is very challenging but also incredibly rewarding. So do we really want to go through that again with a new language, and will the new language open as many analytical doors as the first? This week I want to discuss the process of learning new programming languages, and offer you some tips on learning a new language, if that is something that interests you.

Why Learn a New Programming Language?

Once you know how to program with a language like Python or R, why bother learning other languages? This is a logical question, and one answer is that different languages offer very different strengths and weaknesses. Some languages are faster than others, some use memory more responsibly, some are just easier to read and write (we're looking at you, Perl), and some have stronger communities to support your applications (e.g. many scientific applications are supported by the R community). Being familiar with, if not proficient in, multiple languages lets you take advantage of the strengths of each and apply the tools that best fit the job.

Another reason for learning new languages is the fact that most languages are not around forever, and are eventually replaced. This makes "keeping up" a very valuable skill. Yet another good reason for learning new languages is that the exposure to different structures allows you to think about all of your code and data in new and challenging ways. This is a great exercise for becoming a better all-around scientist.

Don't Sell Yourself Short

Through various conversations and my own experiences, I have learned that it is really easy for us to sell ourselves short. We see a new and unfamiliar language and think, "I can't learn how to use that; I barely learned the language I know now." The fact is that the first language is the hardest, and you will be surprised how much faster you will pick up new languages. So we really need to "go for it" and check out the new languages we think will be helpful. The worst thing we can do is disqualify ourselves before giving it a try.

Skills For Learning New Languages (In 30 Minutes)

Last week I gave a presentation to my lab about this topic. The main goal was to illustrate how each of us can learn new programming languages faster and more effectively than we give ourselves credit for. To this end, I proposed that we break into small groups (although you can do this by yourself as well) and solve the following simplified Fizz Buzz test.

Write a program that prints the numbers 1 to 100. Print the word "FIZZ" next to every number divisible by 7, and the word "BUZZ" by every number that is not divisible by 7.

The trick here is that we solved the problem using Julia, a language that nobody else has used in our lab (if you want to do this yourself, pick a different language if you are familiar with Julia). I mentioned to the group that, when I start familiarizing myself with a new language to solve a specific task, I begin by breaking the problem up into smaller, "Google-able" problems. These can then be solved and put together to form the overall code. Smaller problems within our Fizz Buzz test include:

How do I print words and numbers?
How do I create a sequence of numbers (1, 2, 3, ...)?
How do I loop tasks?
How do I perform conditional evaluations?

After going through this, we broke up into groups and everyone was able to solve the problem within 30 minutes. When we came back together, we discussed our solutions while reflecting on the following points:

How did we break up the tasks?
What was the most challenging part of the problem?
What does the solution look like (show the code to the rest of the lab)?
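
To make the last point concrete, here is a minimal sketch of what one solution could look like. It is written in Bash rather than Julia, so it is not what our groups produced, and the function name fizz_buzz is just an illustrative choice. Each "Google-able" sub-problem (printing, number ranges, loops, conditionals) maps onto one or two lines.

```shell
# Simplified Fizz Buzz: print 1..N with FIZZ next to multiples of 7
# and BUZZ next to everything else.
fizz_buzz () {
  local last="${1:-100}"   # upper bound; defaults to 100
  local n
  # Numeric range + loop
  for (( n = 1; n <= last; n++ )); do
    # Conditional evaluation: is n divisible by 7?
    if (( n % 7 == 0 )); then
      # Printing a number and a word together
      echo "${n} FIZZ"
    else
      echo "${n} BUZZ"
    fi
  done
}

fizz_buzz 100
```

Once the four small pieces work on their own, putting them together is mostly a matter of nesting them in the right order.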

And there you have it. After about a 45-minute meeting, everyone in the room was able to solve an analytical task using a totally new language. I definitely encourage you to also give it a try. If you run into problems, or have more questions about how you can use this approach as a teaching tool, reach out in the comments below, on Twitter, or by email (my contact info is on the right side of the page). As always, please also reach out if you have any other questions, comments, or concerns. I always love hearing from people.

I also wanted to include a note about the state of this blog. You may have noticed that posting has become less frequent lately. I have a lot of other projects going on right now, which are awesome but also mean I don't have a ton of time to devote to blogging. I look forward to getting back into a more regular routine, but for now the posts are going to be more spread out. Thanks for reading!

A Primer on Downloading Sequencing Data from MG-RAST & the SRA

2017-05-08T20:07:00.000-04:00

One of the best sets of resources we have for bioinformatics, and especially microbiome research, is the extensive and freely available collection of DNA sequence archives. For the past few years, most studies have archived (and in most cases have been required to archive) their relevant sequence datasets so that they are freely available to the public and other researchers. This is becoming an increasingly valuable resource for data mining and meta-analyses now that we have about a decade of archiving behind us. Just as these datasets can be highly valuable research tools, they can also be particularly difficult resources to download and prepare for analysis. I have been meaning to get to this for a while, so this week I want to go through an introduction to downloading these datasets. My goal is to equip you to easily get the sequence sets onto your own computer and start your own analysis.

The Sequence Read Archive (SRA)

One of the largest (if not the largest) sequence dataset archives available to the public is the United States National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). This sequence archive has years of DNA sequencing studies readily available, but getting the reads can be a little bit of a challenge. They do have instructions (and other tools for downloading) in their documentation, but to make things easier, we will go through it here while including some custom scripts that you can use.

An easy way to get SRA datasets using command line tools is downloading the data from their FTP site (no worries if you don't know what that is; it's just a site to download data from). As long as you are downloading a small-ish dataset, the wget tool works great. A nice subroutine you can use is as follows.

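DownloadFromSRA () {
 # Usage: DownloadFromSRA <SRA project accession>
 # Note: the variable Output is expected to be set beforehand
 # (it names the subdirectory under ./data that receives the files)
 line="${1}"
 echo Processing SRA Accession Number "${line}"
 mkdir -p ./data/"${Output}"/"${line}"
 # The FTP site nests accessions under their first 3 and first 6 characters
 shorterLine=${line:0:3}
 shortLine=${line:0:6}
 echo Looking for "${shorterLine}" with "${shortLine}"
 # Recursively download the contents of the accession's FTP directory
 wget -r --no-parent -A "*" "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/${shorterLine}/${shortLine}/${line}/"
 # Move the .sra files into place and remove the mirrored directory tree
 mv ./ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/"${shorterLine}"/"${shortLine}"/"${line}"/*/*.sra ./data/"${Output}"/"${line}"
 rm -r ./ftp-trace.ncbi.nih.gov
}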

export -f DownloadFromSRA

If you copy and paste this into your command line (Linux/Mac), you can just type the subroutine name "DownloadFromSRA", followed by the project ID that you want to use, and it will download all of the samples for you. If you are using a Mac, be sure to install wget using something like Homebrew (which I highly suggest for downloading tools in general). The files you get will be in the SRA format, so you have to remember to convert them to fastq format using their custom tools (the NCBI SRA Toolkit).
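
As a rough sketch of the two less obvious steps, the block below shows how the FTP directory is assembled from an accession, and one way the conversion to fastq might look. The function names sra_ftp_dir and convert_sra_dir and the accession SRP000001 are placeholders made up for illustration. The conversion assumes that fastq-dump from the SRA Toolkit is installed and on your PATH; the function is only defined here, not run.

```shell
# Build the FTP directory for an accession the same way DownloadFromSRA does:
# the first 3 characters, then the first 6 characters, then the full accession.
sra_ftp_dir () {
  local acc="${1}"
  local base="ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra"
  printf '%s/%s/%s/%s/\n' "${base}" "${acc:0:3}" "${acc:0:6}" "${acc}"
}

# Convert every .sra file in a directory to fastq.
# Assumes fastq-dump (SRA Toolkit) is installed.
convert_sra_dir () {
  local dir="${1}"
  local f
  for f in "${dir}"/*.sra; do
    # Skip the loop body if the glob matched nothing
    [ -e "${f}" ] || continue
    # --split-files writes paired reads to separate fastq files
    fastq-dump --split-files --outdir "${dir}" "${f}"
  done
}

# Placeholder accession, used only to show the path layout
sra_ftp_dir SRP000001
```

After a download finishes, something like "convert_sra_dir ./data/MyProject/SRP000001" (again a made-up path) would leave the fastq files next to the original .sra files.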

You don't have to be a superhero hacker to get DNA data from public archives.

The Metagenomics RAST Server (MG-RAST)

Although used less than the SRA, the Metagenomics RAST Server (MG-RAST) is another one of the major archives available for free public use. Although MG-RAST is a nice sequence repository, it is unfortunately more difficult to use than the SRA (for downloading sequences at least). Downloading MG-RAST data with command line tools is honestly complicated at first, and the key details are sort of hidden in the documentation. Again, to make things easier, we can use some custom scripts to make things happen.

The trick to getting the MG-RAST sequence files using a project ID is that you have to first download the project metadata, and then use the parsed metadata information to download the actual files (this is done in the while loop below). The actual URL to use with their API is also kind of confusing, but once you get it you are ready to go.

DownloadFromMGRAST () {
 line="${1}"
 echo Processing MG-RAST Accession Number "${line}"
 mkdir -p ./data/"${line}"
 # Download the raw information for the metagenomic run from MG-RAST
 wget -O ./data/"${line}"/tmpout.txt "http://api.metagenomics.anl.gov/1/project/mgp${line}?verbosity=full"
 # Parse the raw metagenome information for individual sample IDs
 sed 's/metagenome_id\"\:\"/\nmgm/g' ./data/"${line}"/tmpout.txt \
  | sed 's/\".*//' \
  | grep mgm \
  > ./data/"${line}"/SampleIDs.tsv
 # Get rid of the raw metagenome information now that we are done with it
 rm ./data/"${line}"/tmpout.txt
 # Now loop through all of the accession numbers from the metagenome library
 while read -r acc; do
  echo Loading MG-RAST Sample ID "${acc}"
  # file=050.1 means the raw input that the author meant to archive
  wget -O ./data/"${line}"/"${acc}".fa "http://api.metagenomics.anl.gov/1/download/${acc}?file=050.1"
 done < ./data/"${line}"/SampleIDs.tsv
 # Get rid of the sample list file
 rm ./data/"${line}"/SampleIDs.tsv
}

export -f DownloadFromMGRAST

These files will be in the fasta format instead of the sra format you get from the SRA. Also note that this uses GNU sed, which is not installed on Mac computers by default (Mac has a different version of sed. I know, it's kind of annoying). So if you are running this on a Mac, make sure you install GNU sed, again using Homebrew.

To give it a try, copy and paste this subroutine into your command line, and then write the project ID, like below.

DownloadFromMGRAST 4843
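
If installing GNU sed is not an option, the parsing step could be sketched with grep -o and a simpler sed expression that both the GNU and the Mac versions accept. This is an alternative to the sed pipeline in the subroutine above, not the way the subroutine itself does it. The function name parse_mgrast_ids is hypothetical, and the metadata string is a made-up, heavily trimmed stand-in for what the API returns.

```shell
# Pull sample IDs out of MG-RAST project metadata read from standard input.
parse_mgrast_ids () {
  # grep -o prints only the matching part, one match per line,
  # e.g.  metagenome_id":"1111111.3
  grep -o 'metagenome_id":"[^"]*' \
    | sed 's/.*"/mgm/'   # replace everything up to the last quote with the mgm prefix
}

# Made-up example metadata, for illustration only
sample='{"id":"mgp0000","metagenomes":[{"metagenome_id":"1111111.3","name":"a"},{"metagenome_id":"2222222.3","name":"b"}]}'
printf '%s\n' "${sample}" | parse_mgrast_ids
```

The output is one ID per line with the mgm prefix added, which is the format the download loop expects to read from SampleIDs.tsv.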

Conclusions

So there you have it. A very brief introduction to downloading SRA and MG-RAST datasets, with an emphasis on providing you the tools to do it yourself. Go ahead and give it a try, and let me know how it works. If you run into problems, or have any other questions, comments, or concerns, feel free to reach out!

Finally, thanks for reading! If you are a frequent reader, you might have noticed that my posts have been less frequent lately. I apologize for that. This has been an eventful year, which is great in general but bad for keeping up with the blog. As usual, it means I have some other exciting projects going on, and I am excited to share those experiences on here later. So for now the posts will be less frequent, but I look forward to getting back in a more frequent writing groove in the near future.

Publication Alert: High Nucleotide Resolution Study of the Skin Virome

2017-04-08T09:53:00.000-04:00

We identified diversity-generating retroelements as a potential mechanism driving targeted genomic diversity.

A few weeks ago some colleagues and I published a new manuscript looking at the diversity of the human skin virome. In our previous work, we evaluated the diversity of viruses on the skin. Other groups have looked at virus diversity at other body sites including the gut, lungs, and oral cavity. Our new paper focused on the diversity within viruses on the skin. It provided initial insight into the genomic variability associated with major viruses in the skin virome. In other words, it was a "high resolution" study of the virome.

One of the highlights of the manuscript was identifying numerous hyper-variable loci within the skin virus genomes that we investigated. We did this using a SNP geometric distribution approach instead of a sliding window because it allowed us to establish regions that were more variable than would be expected by random chance, and did not require us to arbitrarily establish a window size for the loci. The loci identified using this approach were associated with stronger evolutionary pressure than their adjacent regions, suggesting they are functionally important. We followed up on this, but I will let you get the details from the manuscript.

A methodological highlight was the validation of our findings using an existing dataset from a different lab. We performed our analytical workflow a second time using another skin metagenomic dataset from a different skin microbiome lab. Even though the second dataset did not undergo virus purification, we were still able to pull out enough viral reads to perform our targeted analysis. We replicated our findings in the second dataset, thereby supporting the strength and biological ubiquity of our findings. My hope is that more groups will perform this type of validation in their future studies, especially since there is so much archived data just waiting to be utilized.

The challenge with this study was writing the analytical tools that I needed to answer our questions. In the end, I built a lot of tools that allowed me to answer evolutionary and functional virome questions. I think these are pretty easy to use, and made them freely available on GitHub. If you are interested in performing similar evolutionary analyses on your virome datasets, check the code out here and let me know if you have any questions.

In the end, I think this is a pretty cool study and I really enjoyed working on it. If this summary sounds interesting, I suggest you check out the paper. It is freely available online and easy to download.

As always, if you have any questions, comments, or concerns, please let me know in the comments section below, shoot me an email, or find me on Twitter (links are to the right). I always love to hear from readers!

References

Hannigan GD, Zheng Q, Meisel JS, Minot SS, Bushman FD, & Grice EA (2017). Evolutionary and functional implications of hypervariable loci within the skin virome. PeerJ, 5. PMID: 28194314

Correlations In Random Genomic Data: A Simple Biology Pitfall

2017-03-12T14:17:00.000-04:00

Wow, it has been a long time since we have had a post on here! As always, that means other projects are in the works and the blog has taken a bit of a back seat, but we are back and ready to talk science. This week I want to cover a topic I have been meaning to get to for a while. If you are a frequent reader, you know that every now and then I like to go over some basic statistics topics that cause confusion among biologists, as well as scientists in other fields. This week I want to cover a common statistical pitfall, with the hopes that it will prevent readers from making simple mistakes. The topic for this post will be obtaining statistically significant correlations from random gene expression data.

To kick things off, let's imagine I have a gene expression dataset. More specifically, I have expression data for gene 1 and gene 2, as well as a housekeeping gene (these genes are usually used for experimental controls). Ultimately I want to compare expression of gene 1 and gene 2 in 50 different people, with my hypothesis being that the expression of these genes is positively correlated (when gene 1 is highly expressed, gene 2 is also highly expressed).

We could answer this type of question with real data, but what if we use random data? In R (an awesome statistical programming language), I can generate two random sets of 50 numbers, representing the expression of gene 1 and gene 2 among 50 different people. Please note that for consistency, I set the random seed so that the code always returns the same result. Also note that the visualization is done using log values to make it clearer, but the correlations are calculated on the raw values, which are not log transformed.

# Load the ggplot2 library
library(ggplot2)
# Set the random seed
set.seed(1234)
# Create two sets of 50 random numbers
x <- sample(1:10000, 50)
y <- sample(1:10000, 50)
# Put them together in a data frame
df <- data.frame(gene1 = x, gene2 = y)
# Plot the results
qplot(log10(gene1), log10(gene2), data = df) + theme_classic()
cor.test(df$gene1, df$gene2)

The resulting correlation had a p-value = 0.59 and a correlation (r) of 0.077 (using Pearson correlation coefficient). We therefore generated a set of random expression values for genes 1 and 2, and when we plotted them against each other, we got a random distribution of points. This set of gene expressions was not correlated.

But if this was actually an experimental result, we might think "of course there is no correlation, we forgot to correct our results with our housekeeping gene control". For readers unfamiliar with this practice, gene expression data can sometimes be skewed by differences in loading the machine, preparing samples, etc. This can mean some samples look more abundant due to experimental variability instead of biological variability. To correct for this, we can use "housekeeping genes", which are genes that we expect to be expressed about the same amount in each person.

To correct our data for experimental variability, we may divide each of our gene 1 and gene 2 expression values by the housekeeping gene value from that sample. Therefore, if sample 1 had twice as much material loaded as sample 2, we would divide it by twice as much housekeeping gene, and the result would be values over approximately the same housekeeping gene expression. In our example, we can make this correction using a third set of randomly generated numbers.

# Building off of the previous code
# Create random housekeeping gene expression data
z <- sample(1:10000, 50)
df2 <- data.frame(gene1 = x/z, gene2 = y/z)
qplot(log10(gene1), log10(gene2), data = df2) +
  theme_classic() +
  geom_smooth(method = lm, se = FALSE)
cor.test(df2$gene1, df2$gene2)

With this correction using more random data, we would expect to see another lack of correlation. On the contrary, the resulting correlation had a p-value = 3.4e-9 and a correlation (r) of 0.72. Even though we were using entirely random data, we obtained a much higher and statistically significant correlation between expression of gene 1 and 2. Here we can see that we did something terribly wrong to get this result, but if we had done this on a biological dataset, we might think that we had found a great result and may even push to publish it.

So what did we do wrong? Ultimately it was that "correction" that hurt us. When we divided each person's gene 1 and gene 2 expression value by the same "housekeeping" value for that person, we were introducing a common transformation within each sample that made gene 1 and 2 expression more similar to each other, within each person. This is why it is problematic when we apply functions (such as division) to each sample individually and then perform a correlation of those samples. This is also why we have to be careful in our correlation analyses and think carefully about how we are dealing with our data, and what correlations we might be introducing by mistake.
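
To see why the shared divisor matters, here is a short derivation. It assumes that the three values are mutually independent, that gene 1 and gene 2 have positive means, and that the housekeeping value is not constant. The symbols X, Y, and Z are shorthand introduced here for the gene 1, gene 2, and housekeeping values; they do not appear in the code above.

```latex
\begin{align}
\operatorname{Cov}\!\left(\frac{X}{Z}, \frac{Y}{Z}\right)
  &= E\!\left[\frac{XY}{Z^{2}}\right]
     - E\!\left[\frac{X}{Z}\right] E\!\left[\frac{Y}{Z}\right] \\
  &= E[X]\,E[Y]\,E\!\left[\frac{1}{Z^{2}}\right]
     - E[X]\,E[Y]\left(E\!\left[\frac{1}{Z}\right]\right)^{2} \\
  &= E[X]\,E[Y]\,\operatorname{Var}\!\left(\frac{1}{Z}\right) > 0
\end{align}

% On the log scale used for the plots, the shared term is even easier to see:
\begin{align}
\operatorname{Cov}\bigl(\log X - \log Z,\; \log Y - \log Z\bigr)
  &= \operatorname{Var}(\log Z)
\end{align}
```

So the covariance of the two ratios is positive even though X and Y are unrelated, and it grows with the variability of the housekeeping value. On the log scale, if all three values come from the same distribution, the expected correlation of the log ratios works out to 1/2. The correlation test in this post was run on the raw ratios rather than the logs, so its exact value will differ, but the shared term that drives it is the same.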

Of course this is just a simple example and the principle could apply to many scenarios. The main point here is to outline a potential way we might skew our results without knowing it. Overall I hope this summary was informative and will help in thinking about analyses in future experiments.

Any questions, comments, or concerns? Always feel welcome to reach out in the comment section below, or reach out to me on Twitter (my Twitter link is on the right).

A Model for Phage Communication and the Implications for the Human Microbiome

2017-01-28T19:52:00.000-05:00

The research group prepared two types of media to test phage infection efficacy.

Well, we took a bit of a break these past couple of weeks, but we are back for the new year! Welcome to the Prophage blog 2017! The year has actually been off to a good start, with a lot of interesting papers being published this January. This week I want to kick things off by covering a very cool 2017 study by Erez et al. that described a new and interesting mechanism by which bacteriophages communicate using their bacterial hosts. This really is a well written and elegant study that I highly suggest you read. In this post, I want us to cover the highlights of the study, and then discuss what this will mean for future research endeavors.

Erez et al. began their work by testing the hypothesis that "bacteria secrete communication molecules to alert other bacteria of phage infection", but what they ended up finding was arguably much more interesting. They began their series of experiments by simply growing bacteria in liquid media with and without bacteriophages (see the figure to the right). They let the mixture sit long enough for the phages to infect their bacterial hosts for a couple of replication cycles (3 hours), and then removed all of the bacteria and phages from the liquid by filtration. At this point, if there was a signaling molecule released during the infection, it would still be in the media even though the phages and bacteria were removed. Additionally, if there was a signaling molecule released during the phage infection period, repeating an infection in that same media would result in altered growth patterns (for example, fewer bacteria killed when the molecule is present). As it turns out, this is exactly what they observed.

The group found that phage infections were much less efficient when done in media that had already been used for phage infections. After careful study, the researchers found that the signaling molecule was in fact a small protein that was associated with the phage, not the bacterial host. The signal was highly phage specific. This meant that their observation was not of bacterial warning as initially hypothesized, but rather phage signaling to other phages. This is the first time such an extracellular signaling mechanism has been described between phages (at least as far as I know), which is pretty significant.

Following further characterization, the group found that the protein is (or could be) used by many different phages to signal to other phages of the same type whether they should enter a lytic replication cycle (reproduce and kill the bacterial host) or a lysogenic cycle (integrate into the bacterial genome and exist silently). The authors end their paper with a proposed mechanistic model for the phage to phage communication. They call this system the arbitrium system, after the Latin word for decision.

The authors propose this mechanistic model for phage signaling.

What really makes this study cool is the implications it could have for microbiology and associated clinical applications. I think this finding could be especially important for our understanding of the human microbiome and virome. As we study the microbiome we strive, in part, to understand how bacteria and phages interact in human systems such as the gut, and understanding phage to phage signaling will be important for obtaining a more accurate picture of the system.

These findings could lead to some very interesting experiments. How would a cocktail of this type of signaling molecule (the authors identify many) alter gut virus or bacterial communities? How would this impact microbiome stability? Would a decrease in phage lytic capabilities significantly disrupt the kill-the-winner dynamics we see in the human microbiome, and result in low bacterial diversity with some un-checked bacteria taking over? The human microbiome is a complicated system, but this could be a step toward better understanding its dynamics, and maybe even contribute toward therapeutic applications.

I also think that these findings could be important for phage engineering and phage therapy. One of the big challenges in phage therapy is obtaining lytic bacteriophages that can effectively kill the pathogenic bacterial target. Lysogenic phages can also be effective in phage therapy, although they may be more effective if lysogeny could be avoided. It may also be advantageous to knock this gene out of phage therapy candidates.

In the end, this study has a lot of implications and I bet microbiologists are already thinking of hundreds of experiments they can conduct. And that is really what makes this study cool. It not only offers important information to the field, but it really captures and inspires the imagination of other scientists who read it. So if you have not read it yet, I highly suggest you go check it out.

What were your thoughts about the study? What implications do you think this will have for microbiology and the human microbiome? Let us know in the comments section, along with all of your questions, comments, and concerns. You can always reach out by Twitter or email as well.

References

Erez, Z., Steinberger-Levy, I., Shamir, M., Doron, S., Stokar-Avihail, A., Peleg, Y., Melamed, S., Leavitt, A., Savidor, A., Albeck, S., Amitai, G., & Sorek, R. (2017). Communication between viruses guides lysis–lysogeny decisions. Nature, 541 (7638), 488-493. DOI: 10.1038/nature21049

How to Write a Manuscript Submission Cover Letter

2016-12-11T17:49:00.000-05:00

The communication of our research findings is a foundational pillar of our careers as scientists. One of the most common ways we scientists share information is by publishing papers in peer-reviewed journals. This primary method of information dissemination allows us to share our research findings both with our colleagues and with the public at large. When preparing a manuscript for submission to a journal for peer review and subsequent publication, a lot of work goes into preparing a variety of documents. One of the important documents is a cover letter to the editor. This letter represents a significant hurdle for new and young researchers because it is often unclear what a cover letter should actually look like, and what information should be included. In this week's post I want to go over what a good cover letter could look like and how you can write your own. I say this is what it could look like because there is certainly a lot of room for interpretation and personal style, and there are many correct ways to do it. Here I am just going to cover one potential way to tackle the problem.

Before we get into the specifics, let's first discuss what a cover letter actually is. Again the exact answer can vary between people, but I think most could agree that it is an opportunity to introduce the journal editor to the manuscript you are submitting. This is an opportunity for you to briefly introduce the problem you are addressing, explain why your manuscript is important, and discuss why your manuscript should be published in that journal. Additionally, you can provide some of the subtle information associated with the paper, such as suggested reviewers and whether the article is already available in pre-print. This is not supposed to be a repeat of your abstract, but really just a brief letter providing an introduction to the entire work you are submitting.

So this description is fine and you can probably find something like that on some journal websites, but it is still vague. What does all of that look like in practice? To make it clearer, let's go through an example that I wrote out for this blog. The content is just a fictional example for a manuscript written by Jane Appleseed (first author) and Marissa Mayer (corresponding author). While the specific content is nonsense, the structure and themes for each section are real. Here is the general structure that you could follow for your own manuscript submission.

So there you have it, an example of how to write a cover letter for your next manuscript submission. As I said above, this is meant to be an example of how you could do it, but there are many good ways to write submission cover letters. The best way to learn how to write a good cover letter is to ask to read many of your colleagues' letters to see what you like about their style and structure.

If you have your own advice on how to write a successful cover letter, or have further questions, let us know in the comments below. As always, you can feel free to reach out to me on Twitter and by email as well. Happy submitting!

Summary of the 2016 International Human Microbiome Congress

2016-11-13T11:58:00.001-05:00

Kicking off the IHMC meeting for 2016.

This week I had the privilege of attending the 2016 International Human Microbiome Congress which was hosted in Houston, Texas in the United States. The goal of this recurring meeting is to get the worldwide human microbiome community together to discuss recent progress, current challenges, and future directions. In this post I want to give a summary of the meeting for anyone who could not attend.

Top Three Research Picks

Of course I cannot go into all of the meeting in detail, but I will provide some highlights and encourage you to keep a close eye on the literature as the work presented was either published or near publication. Here are my top three picks for the talks. I should also mention that this is based on the talks I was able to attend. I missed many of the talks during the concurrent sessions (as did everyone since many talks are given at the same time), and because I had to leave before the end of the last day. So these are the top three of what I saw.

Kjersti Aagaard is well known for her placenta microbiome work, which has been met with skepticism around whether the results truly represent a placenta microbiome or whether they are contaminants. It was clear that she is aware of this criticism and is working to address it (in addition to her other very cool research projects). The coolest part was that she is using microscopy techniques like FISH to visualize bacteria that appear to be colonizing. Unfortunately it was a fast talk, so I’m not going to try going too much into it. They anticipate publishing the results in the near future, however, so it will be worth reading for sure.
Ami Bhatt was doing some very cool work with Triclosan and the microbiome. This is an interesting study on a unique cohort since Triclosan is now banned by the FDA in the US. She was also presenting some interesting FMT work. Not only were the results cool, but I thought her use of metagenomics was interesting, refreshing, and represented an understanding I wish I could say was ubiquitous throughout the meeting. Not only was she using shotgun metagenomic sequencing to get at the presence of functional genes, but was doing some cool work to look at SNP concordance between FMT donors and long-term recipients. She demonstrated a unique and informative approach that I really appreciated seeing.
Morgan Langille presented some very cool work utilizing a wide range of techniques to detect microbiome signatures that can be used to predict inflammatory bowel disease. Not only was the machine learning presented well, but I thought this was a really cool example of how we can effectively use multiple techniques to understand disease and the human microbiome. We often see a push to use different “-omics” techniques (for lack of a better term) but the studies are often implemented poorly, I think because of the difficulty in understanding how to effectively use them together. This seems like a good example of how it can be done well. The techniques can be used together to classify disease states using stool, and then we can go back to determine which factors were most important, and how much more information we really get from each technique. It was another refreshing metagenomics approach that I appreciated seeing.

The Virome

I know that this is mostly expected, but I feel it is worth mentioning again. There was a large focus on bacteria without many talks for fungi and viruses (including bacteriophages). There were a couple, but the almost exclusive focus on bacteria has been a common theme in human microbiome research and I’m not surprised this conference also focused on bacteria. I just feel it is worth mentioning that the future of the human microbiome does not only include bacteria.

Metagenomics and the Microbiome

The theme for the meeting was “frontiers of microbiome science and metagenomic medicine”. This meant that there was a heavy focus on microbiome studies that utilized metagenomic shotgun sequencing to understand the human microbiome. I honestly felt throughout the meeting that this choice might have been a little too restrictive and had the focus too much on the method and not enough on the actual biology and medicine. There was certainly some excellent science, but it would have been nice to have the focus more on how we can use tools to answer important questions instead of looking for questions we can answer with a tool. But I could do a whole post on this so for now I am going to leave it at that. In the end, I think that the next meeting could really benefit from a broader theme that focuses less on a specific method. For example, I preferred the broad theme last year: “future directions for human microbiome research in health and disease”.

Wrap Up

So there you have it, my almost criminally short summary of the 2016 International Human Microbiome Congress. It was a meeting with highs and lows, and I was happy I was able to meet some cool people and see some interesting science. If you are interested in seeing the live tweeting archives, check out #IHMC2016 on Twitter. Questions, comments, or concerns? Please leave a post in the comment section, or reach out via Twitter or email. I always love hearing from readers.

Global Online Office Hours

2016-10-23T18:58:00.000-04:00

Global online office hours will be held monthly
through Google Chat.

Interest in the microbiome has continued to skyrocket. It seems like there is a new microbiome commercialization strategy every day, and more and more scientists are looking to incorporate the microbiome into their research programs. It is certainly an exciting time for the microbiome. Unfortunately the increased demand has been met with a somewhat insufficient supply of information and resources. Of course there are some excellent resources out there, but a lot of people don't have access to a "microbiome researcher" to answer their questions. Sometimes this means newcomers to the field make some crucial mistakes because they are forced to "go it alone". In an effort to generate even more unique resources for all of the microbiome folks out there, I decided to hold Global Online Office Hours.

As the name suggests, these office hours will be held online and are open to anybody in the world. The idea is that anybody who has questions about microbiome research, the current state of the field, a recent study, or anything else, now has an opportunity to ask a real life microbiome researcher. Of course I encourage students to attend, but anybody can join in. This includes academics, industry scientists, journalists, etc.

Right now this is still in the experimental stage. I am trying to evaluate both the level of interest (is this a good idea?) and what the most effective format will be. For now I am holding monthly office hours through the rest of the year. If there is a lot of interest, I am totally ready to bump it up to more frequent times. I also currently have these scheduled for a fixed time (in my afternoon), which means some time zones will have a hard time attending (e.g. it will be 0300 for some people). So again, if there is interest I will try to stagger my times to allow for more general audience participation.

If this sounds cool and you would like to attend and ask some questions, feel free to read more on the website. I also encourage you to sign up for the Google Group here because that is how I will communicate with the group members (by email).

So that's it for this week. If you have any questions, comments, or concerns, please feel free to let me know in the comments below, on Twitter, by email, or even in office hours! Hope to see you there!

A New Look At Inflammatory Bowel Disease and Viruses: The Core Human "Phageome"

2016-10-02T18:39:00.001-04:00

An illustration of the core protein clusters (PCs; groups
of similar genes) found in the photic and aphotic zones
of the ocean. This new study applies a similar approach
using phage genomes instead of genes. Source

Ongoing research has continued to implicate the microbiome in a variety of human diseases. We often hear about this in the context of bacterial communities. Certain bacterial communities appear to be associated with health, and disrupting these communities seems to be associated with disease. To better understand these bacterial communities, we sometimes group the shared members together as the "core bacterial community" that is associated with health or disease. In some ways these core bacteria are considered important to the system because they are found in every instance of health or disease. But what about the core phages (bacterial viruses) of these communities? A few weeks ago Manrique et al published a study that began addressing this question.

Their study, published in PNAS, looked at the core human "phageome" in health and disease. The goal of the study was to identify the core set of phages that are part of the human gut phageome and observe how they change in disease states. The ultimate purpose is to identify the phages that are likely to play roles in maintaining health, by finding phages that are present in health and absent in disease. Overall I liked this paper and I will leave you to read it yourself for the study specifics. Here I just want to briefly summarize the paper while highlighting the most important points.

The group began by assembling a small human cohort consisting of two subjects whose stool was sampled at two different time points. They purified the viruses out of the stool and sequenced the genomic DNA using whole genome shotgun sequencing. They combined the sequences from the four samples and used them to assemble approximately 4,000 contigs. As was expected, they identified a core set of phages that were present in all of the samples.
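The "present in all of the samples" idea can be sketched in a few lines of Python. This is only an illustration of the concept and not the pipeline used in the paper: the sample names and contig IDs are made up, and I am assuming you already have a table of which contigs were detected in which sample.

```python
from typing import Dict, Set


def core_members(presence: Dict[str, Set[str]]) -> Set[str]:
    """Return the contigs detected in every sample.

    `presence` maps a sample name to the set of contig IDs detected in it.
    How "detected" is decided (read mapping, coverage cutoff, etc.) is an
    assumption left outside this sketch.
    """
    samples = list(presence.values())
    if not samples:
        return set()
    # The core is simply the intersection of all per-sample sets.
    return set.intersection(*samples)


# Hypothetical example: two subjects, two time points each.
example = {
    "subject1_t1": {"contig_1", "contig_2", "contig_3"},
    "subject1_t2": {"contig_1", "contig_2"},
    "subject2_t1": {"contig_1", "contig_2", "contig_4"},
    "subject2_t2": {"contig_1", "contig_2", "contig_5"},
}
print(sorted(core_members(example)))
```

A strict intersection like this is sensitive to sequencing depth, since a phage missed in a single sample drops out of the core, so a real analysis would probably want a more forgiving presence cutoff.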

This was interesting, but what really made the paper cool was the expansion of their methods to a more robust, disease-associated virome dataset. The group performed their analysis on the Norman et al virome dataset, which includes purified virus (mostly phage) genomic DNA from the stool of healthy subjects, as well as subjects suffering from inflammatory bowel disease (IBD) conditions. This dataset allowed the group to investigate how the core phage communities differed between healthy and diseased (IBD) states. The geographic diversity of the sampling also allowed them to account for location variation in the core virome.

Heatmap of the core, common, and unique phage
genomes found in the Manrique et al study.

The takeaway points were as follows:

A core gut virome exists.
The core gut virome is conserved across geographically distant populations.
The core gut virome signatures change in disease states.
Sequence homology clustering reduces core virome dimensionality while preserving population signatures.

In the end, what does this all mean? I think the biggest strength of this paper is that they are laying important groundwork for future studies of the human virome in the context of the "core virome". By identifying those phages present in all healthy states, the group has identified targets for future study that are likely to be important for a healthy system. This also establishes a new way for other researchers to start thinking about the viromes in their systems of interest.

So what's next? Here are my predictions for the future directions of this study:

The group will likely expand to additional body sites and disease states.
The group may go on to define the functional and predatory implications of the core virome.
They or others will begin establishing an understanding of the associations between the core virome and the core bacterial communities.

Again, this is a cool paper and I suggest you check it out. I also presented this paper for our lab journal club a couple of weeks ago, and I made my slide deck available here if you want to check it out. Finally, and as always, feel free to reach out either in the comments below, on Twitter, or by email. I am always excited to hear from my readers!

Works Cited

Manrique P, Bolduc B, Walk ST, van der Oost J, de Vos WM, & Young MJ (2016). Healthy human gut phageome. Proceedings of the National Academy of Sciences of the United States of America, 113 (37), 10400-5 PMID: 27573828

How to Detect Circular Virus Genomes from Metagenomes

2016-09-18T15:59:00.000-04:00

Many virus genomes are circular, like the Olympic rings.

When analyzing virus metagenomic data, we often find it helpful to identify contigs that represent complete circular genomes [1,2]. In addition to offering biological information, this is used as a quality control technique to evaluate whether the sequencing efforts were robust enough to allow for complete genome assembly. This approach has the advantage of reference independence because it does not require aligning reads to a reference genome to evaluate sequence completion.

Because there is no end or start to a circular genome, a linear contig assembled from one tends to contain a repeat: once the whole genome has been covered, the contig starts over, so the sequence at the end of the contig repeats the sequence at its beginning. This trait is used to detect circular genomes by "closing" the contigs, which means identifying that repeated sequence. This can be done by aligning the contig nucleotide sequence to itself to "close the circle", as represented in the figure below.

A linear contig representing a circular virus can be closed by detecting
sequence similarity at each end.
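To make the "closing" idea concrete, here is a minimal Python sketch. It is not the ccontigs.jl implementation: it only looks for an exact match between the two ends of a contig, where a real tool would use an alignment that tolerates mismatches. The function names and the `min_overlap` default are my own choices for illustration.

```python
from typing import Optional


def find_end_overlap(contig: str, min_overlap: int = 10) -> int:
    """Return the length of the longest exact match between the start and
    the end of the contig, or 0 if no match of at least min_overlap exists."""
    seq = contig.upper()
    # Only consider overlaps up to half the contig length, so the two
    # copies of the repeat cannot overlap each other.
    max_overlap = len(seq) // 2
    for k in range(max_overlap, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def close_circle(contig: str, min_overlap: int = 10) -> Optional[str]:
    """Return the contig with the repeated end trimmed off if it looks
    circular, otherwise None."""
    k = find_end_overlap(contig, min_overlap)
    if k == 0:
        return None
    return contig[:-k]


# Made-up example: a 24 bp "genome" whose first 12 bases were assembled
# again at the end of the contig.
genome = "ATGCGTACGTTAGCCGATAGGCTA"
contig = genome + genome[:12]
print(find_end_overlap(contig))
print(close_circle(contig) == genome)
```

Exact matching keeps the sketch short, but a single sequencing error in the repeat would hide the overlap, which is why an alignment-based comparison is the more robust choice in practice.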

Because this approach is primarily implemented as "custom in-house scripts", it is hard to actually find good, freely available resources without hunting them down from their authors. In the interest of adding to the valuable open-source virome analysis resources available online, I wrote out a script that detects circular virus contigs. The script is titled ccontigs.jl and is available on the GitHub ccontigs repository. See the documentation there for details.

The first question you might be asking yourself is what the ".jl" file extension means. What language is that? The .jl extension means that this script was written in Julia, a programming language that I am liking more and more for bioinformatics. I originally tried writing the program in Python using some BioPython tools, but I found the pairwise alignment tool was quite slow and resource (memory) intensive. I had good experiences with Julia before, especially with regard to performance, so I rewrote the script in Julia and tried it out. The Julia script drastically outperformed the Python version, so I stuck with it. The downside is that you need to install Julia on your computer/server, but this is pretty easy with instructions found here.

As I was validating the script I noticed an important caveat to this approach that I hadn't really seen mentioned in the literature. Some linear virus genomes actually contain a repeat of the beginning of their genome again at the end of the genome (e.g. Staphylococcus phage MSA6). This means that a sequence similarity approach would "close" the contig as a circle even though it's a linear genome. Is this a problem? The answer depends on your question.

If you are claiming a contig truly represents a completed circular genome, this "closing" method alone is somewhat insufficient and will need to be supplemented with a different approach. The method will however provide strong support for using this as a QC measure to support sequencing efforts as representing a large fraction of a virome. Even if the genome is linear, "closing" it still provides strong evidence that you sequenced enough to cover the entire genome.

Moving forward from this post, we now have an efficient, open-source tool for detecting circular contigs. We are also aware of the caveat that some linear genomes may be mis-annotated as representing circular genomes, but the impact of this caveat varies with the experimental question being asked.

As always, please leave any questions, comments, or concerns in the comment section below, or reach out through Twitter or email. I am always happy to get feedback and help out other virome researchers.

And yes, I know the metaphor in the first figure is a bit of a stretch. :)

Works Cited

1. Minot, S., Sinha, R., Chen, J., Li, H., Keilbaugh, S., Wu, G., Lewis, J., & Bushman, F. (2011). The human gut virome: Inter-individual variation and dynamic response to diet. Genome Research, 21 (10), 1616-1625. DOI: 10.1101/gr.122705.111

2. Manrique, P., Bolduc, B., Walk, S., van der Oost, J., de Vos, W., & Young, M. (2016). Healthy human gut phageome. Proceedings of the National Academy of Sciences, 113 (37), 10400-10405. DOI: 10.1073/pnas.1601060113

Improving Human Virome Studies: Updates to Virus Classification

2016-08-27T19:44:00.000-04:00

The proposed phage proteomic tree by Rohwer and Edwards.

Taxonomy is an important aspect of microbiome research. Whether we are studying communities of bacteria, viruses, or other microbes, there are benefits to labeling microbes. Taxonomic names immediately give us information about their relationships to each other, such as similar bacteria being grouped into the same genus. Taxonomic identities also provide some information about an organism's functionality and/or clinical pathology. For example, if I mention that a bacterium is a member of the genus Staphylococcus, you might think that it is a round, gram-positive bacterium that might inhabit the skin and is otherwise related to other members of that genus (including genomic relationships). In the end, the practice does what it aims to do, which is classify organisms in an informative way.

Although it might seem like a simple practice at first, it is actually a very complicated field that continues to improve due to the effort of many talented scientists. This is especially true for virus taxonomy. Although improving, phage taxonomy has continued to suffer from issues of ambiguity and inconsistency. In this post I want to go over the recently proposed improvements to phage taxonomic conventions. I feel this is particularly important to go over because it will impact the analyses done by human virome researchers, as well as virome researchers in general.

The manuscript outlining the changes is actually a very nice, short, and easy read, so I will direct you to it for details if you are interested. Overall, the changes reduce ambiguity and foster greater consistency in naming phages. Here is a list of the proposed changes, which are listed with greater detail in the manuscript itself.

1. Replace "phage" with "virus" in bacteriophage taxonomy names.

Example: "Escherichia phage T4" will become "Escherichia virus T4".

2. Removal of "like" from phage genus names.

Example: "Lambdalikevirus" will become "Lambdavirus".

3. Discontinuation of "phi" and other transliterated Greek letters.

Greek letters will be discouraged in names going forward.

4. Elimination of hyphens from taxon names.

Example: "Yersinia phage L-413C" will become "Yersinia virus L413C".

5. Specificity of isolation host in taxon name.

Example: "Enterobacteria phage T7" will become "Escherichia virus T7".
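If you maintain your own reference database, the mechanical parts of these proposals are easy to script. Below is a rough Python sketch that applies changes 1, 2, and 4, plus change 5 when you supply the isolation host yourself. The function name and the `isolation_host` parameter are my own inventions for illustration, and change 3 is left out because it is guidance for future names and not a rewrite rule.

```python
import re
from typing import Optional


def update_taxon_name(name: str, isolation_host: Optional[str] = None) -> str:
    """Apply the proposed naming changes to a single taxon name."""
    # 1. Replace the word "phage" with "virus".
    name = re.sub(r"\bphage\b", "virus", name)
    # 2. Drop "like" from genus names such as "Lambdalikevirus".
    name = re.sub(r"like(?=virus\b)", "", name)
    # 4. Remove hyphens.
    name = name.replace("-", "")
    # 5. Optionally swap the leading host term for the isolation host.
    #    The host has to be supplied by the caller; it cannot be guessed
    #    from the old name.
    if isolation_host is not None:
        parts = name.split(" ", 1)
        if len(parts) == 2:
            name = isolation_host + " " + parts[1]
    return name


print(update_taxon_name("Escherichia phage T4"))
print(update_taxon_name("Lambdalikevirus"))
print(update_taxon_name("Yersinia phage L-413C"))
print(update_taxon_name("Enterobacteria phage T7", isolation_host="Escherichia"))
```

Treat the output as a first pass to review by hand and not as official names, since the authoritative names are whatever the ICTV ultimately ratifies.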

The group also discusses the ongoing efforts to use genomic similarity to classify viruses. For example, viruses with greater than 40% amino acid sequence similarity have been categorized as being in the same genus. As is perhaps expected, this can result in somewhat uninformative categorizations that group together fairly dissimilar viruses. This will be an area of development that we will have to continue watching.

So what can we take away from this? Honestly this is an important paper for those of us interested in virus ecology, and especially the human virome. In a lot of ways, our understanding of the human virome is only as good as our reference databases. By clearing up ambiguities and inconsistencies in these databases, we can improve our ability to discuss the communities we observe and better equip ourselves with an understanding of the phage relationships.

Thanks for hanging in there to the end. I know taxonomy can be a bit of a dry topic for people, but it really is important and something we all need to try to stay current with, particularly if we think a lot about microbiology. Be sure to check out the manuscript itself for the whole story. For even further reading, check out the paper by Thompson et al. Finally, if you have any questions, comments, or concerns, please feel free to reach out either through the comment section below, Twitter, or email. I always love hearing from readers!

Works Cited

Krupovic, M., Dutilh, B., Adriaenssens, E., Wittmann, J., Vogensen, F., Sullivan, M., Rumnieks, J., Prangishvili, D., Lavigne, R., Kropinski, A., Klumpp, J., Gillis, A., Enault, F., Edwards, R., Duffy, S., Clokie, M., Barylski, J., Ackermann, H., & Kuhn, J. (2016). Taxonomy of prokaryotic viruses: update from the ICTV bacterial and archaeal viruses subcommittee. Archives of Virology, 161 (4), 1095-1099. DOI: 10.1007/s00705-015-2728-0

Thompson, C., Amaral, G., Campeão, M., Edwards, R., Polz, M., Dutilh, B., Ussery, D., Sawabe, T., Swings, J., & Thompson, F. (2014). Microbial taxonomy in the post-genomic era: Rebuilding from scratch? Archives of Microbiology, 197 (3), 359-370. DOI: 10.1007/s00203-014-1071-2

Antibiotics, Birth, and the Microbiome: A Personal Experience

2016-07-31T21:46:00.000-04:00

The new addition to our family!

Well July has shaped up to be an incredible month. In addition to working on some cool projects whose results you will be seeing in the near future, my wife delivered our first child. Her name is Clara and we are very excited to be welcoming her into our family. Unfortunately the road to delivery was a little bumpy (although not nearly as bad as it could have been). One aspect of the process that stood out to me was the use of antibiotics during delivery. I thought this was interesting because we hear so much about the microbiome differences between vaginal and c-section births, but not much about antibiotic treatment. This week I wanted to share my experience with you, both to shed some light on what can happen during delivery, and to provide my own thoughts on the subject.

To jump right in, the delivery process started with my wife's water breaking, just as it does with many women. The only problem here was that labor didn't start after. As you might guess, this is particularly troubling because the open, moist, and incubated amniotic sac is an ideal environment for an infection. So once the water broke, the infection clock started ticking. The standard practice for this situation dictates that labor needs to start within 24 hours of water breaking, whether it be natural, augmented, or induced.

For us the 24 hours came and went, and despite our efforts to get labor started (walking, positions, etc), the contractions were only weakly progressing. This meant the team needed to augment my wife's labor, which involved providing a hormone (pitocin to be exact) to ramp up the contractions and start working the baby out. To cut a long story short, this was a very long process that my wife went through. And remember, this whole time we were racing to avoid an infection.

Because of the infectious risk, my wife's temperature was taken every half hour to an hour. A fever is one immediate indication of a potential infection. After about another day passed (about 48 hours after her water broke), her fever started to spike, which suggested the bacteria finally caught up to us and were starting to infect. Because the goal was to avoid infection (a pretty important goal for both the mom and baby), my wife was immediately administered broad-spectrum antibiotics to kill off the infecting bacteria. Luckily it seemed to work and her fever went back to normal for the remainder of labor.

As a microbiologist and microbiome researcher, I thought this experience was pretty interesting. We are constantly worried about the detrimental effects of antibiotics, and it's certainly true that antibiotics are misused. But I think we also need to talk about situations where a somewhat liberal use of broad-spectrum antibiotics really is the best course of action. If we think about the situation I described, we never actually knew that my wife had an infection. We only knew that she had started a fever (there was not time for culturing at that point, although they did follow up with that). All we knew was that there was a chance of an infection, and the benefits of avoiding such an infection outweighed the risks associated with those antibiotics. An altered microbiome might be bad, but an infected newborn baby is likely to be much worse.

There is a time and place for antibiotics, although
they are still misused.

So what am I trying to say? Should we keep throwing around antibiotics at every sign of a cough? Certainly not. Antibiotics are misused in many ways, and it is clear that we can benefit from more targeted treatment approaches (such as phage therapy of course!). On the other hand I think it is worth pointing out that there are still situations that necessitate the use of broad antibiotics. Antibiotics can cause problems, but they are still a miracle of modern medicine and will have a place in medical practice for a very long time.

In the end, both the mom and the baby left the hospital happy and healthy, and that is what I am grateful for. I am also very happy with the care we received at the University of Michigan hospital. They did a brilliant job!

So that was my recent experience with antibiotics. Thanks for bearing with a different kind of post this week, but I thought it might be interesting to share a personal story that relates to the research I write so much about. As always, feel free to reach out and let me know what you thought, or if you have any questions. Finally, I will wrap things up with a disclaimer that this was our experience, and every experience is different, so be sure to talk with your doctor if you find yourself in similar circumstances.

The Up-And-Coming Bioinformatics Language: A First Look At Julia

2016-06-26T21:41:00.001-04:00

Programming is a dynamic field that transitions from one language to another over the years. A classic example is the transition to Perl, which then transitioned into Python. The R language has also exploded in recent years, and all of these languages are used heavily in bioinformatics. Instead of focusing on the current state of bioinformatics, I want to focus this post on where we could be going in the future. More specifically, I want to discuss an up-and-coming programming language named Julia, which has potential for use in bioinformatics.

Julia is a new language that first appeared in 2012 and has been gaining attention ever since. The creators have focused on creating an efficient and fast language that is also relatively easy to use. Because people are talking more about it each day, and because I think it shows exceptional promise, I wanted to try it out for myself.

The Benchmarking

I was a little bummed when I saw their homepage benchmarking failed to include Perl, my go-to language for a lot of the data munging associated with bioinformatics. Perl is also lightning fast for a scripting language, which makes it handy. I decided I would familiarize myself with the Julia language by setting up some basic benchmarking.

To get a feel for Julia's speed, I decided to recreate a Perl script that I use to calculate the median length of sequences in a fasta file. I downloaded Julia from the Julia website, installed it on my computer, and rewrote the Perl script in Julia. In total this took me about 1-1.5 hours, which highlights the ease of writing in Julia. It really took no time at all before I was writing a decent Julia script. I had never used the language before, but it is familiar to any Python or R user.
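For readers who have not written this kind of script before, the task itself is small. Here is a minimal Python sketch of the same calculation (median sequence length in a fasta file). It is neither the Perl nor the Julia script from the benchmark, and the function names are just my own.

```python
import io
import statistics
from typing import Iterable, Iterator


def sequence_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each record in FASTA-formatted lines."""
    length = None  # stays None until the first header is seen
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            # Sequences may be wrapped over several lines.
            length += len(line)
    if length is not None:
        yield length


def median_sequence_length(lines: Iterable[str]) -> float:
    """Return the median record length (raises StatisticsError if empty)."""
    return statistics.median(sequence_lengths(lines))


# Tiny in-memory example; for a real file, pass an open file handle.
example = io.StringIO(">a\nACGT\n>b\nAC\nGTAC\n>c\nA\n")
print(median_sequence_length(example))
```

The parser streams through the file one line at a time, so only the list of lengths is held in memory, not the sequences themselves.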

Once I had the two scripts, I ran them on the same example fasta file and compared the execution time required for both. I got the following results.

Comparison of Perl and Julia speeds for calculating the median sequence lengths in
an increasingly larger fasta file. Code is found here.

So the Perl script clearly ran faster than the Julia script, and both increased in time at about the same rate as I added sequences. So what can we say from these results? I would conclude that although Julia is fast, it still can't beat Perl for parsing data and making quick calculations. Of course this comes with the caveat that I have very little experience writing in Julia and could have written it poorly (I did try to make it efficient to give it a fair chance though). I also only tested the two on relatively small files, and the results may be different for very large files. Regardless, I still think this is informative.

Check out the associated data and code on the JuliaPerlBenchmark GitHub page.

Julia Pros

After spending some time with the Julia language, I really liked the familiarity of the syntax and data structures. Anybody with exposure to Python, R, or any similar high-level scripting/programming language will easily pick up Julia in about an hour or two.

I like that Julia seems to be a bit of a hybrid between R and Python. It seems like it could be really good for bioinformatics by allowing easy data formatting, analysis, and presentation in one cohesive and fast language environment.

Although it was a little slower than Perl for parsing sequencing data files, Julia is still a fast language and I think this will draw more and more bioinformaticians to use it.

Finally, Julia allows for easy integration with C, which I think will help with future development.

Benchmarking results provided on the Julia homepage.

Julia Cons

Although I like Julia, there are certainly some problems that will prevent me from switching over right now. The biggest issue is that it simply does not have the support and infrastructure that a language like Python or R has. Julia is still up-and-coming and the community is not at the same level as the R, Python, or Perl communities. I expect it will pick up in the coming years, but for now it just makes sense (for me) to work in the more developed communities of R, Python, and Perl.
Although Julia is fast, it still can't beat my simple and fast Perl scripting. Until it beats Perl performance in data formatting and management, I honestly won't have a strong incentive to make the move over to Julia-heavy scripting.

Final Thoughts

Julia is a promising and exciting new programming language that I think we will hear more about in the next few years. The community is small and there is less support compared to Python and R, but that could (and probably will) change over time. The general feeling I got for Julia was that it was a combination of Python and R that offered me the best of each in one language. That, in addition to the speed advantages over R and Python, could allow Julia to replace Python and R as major programming languages in the near future. I really do think it is reasonable to expect Julia to be the bioinformatics language-of-choice in the next ten to fifteen years. Ultimately though only time will tell.

Any thoughts, comments, or concerns? Any bugs in my code or errors in my interpretations? Let me know in the comments below. You are also always welcome to reach out on Twitter or by email. I always love to hear from Prophage readers.

Update

I have been getting incredible feedback on this blog post and I wanted to update the readers with what I have learned, and how the data has improved. Thanks to the readers in the comments below, as well as on the GitHub repository, we have addressed two issues with the benchmark.

The script I wrote needed to be written more efficiently. Ismael rewrote the script to run more efficiently, and also provided a solid explanation of what they did.
As you can see in the comments, the problem with this test is that Julia takes time to start and compile the code. The time required to get started is considerably greater for Julia, which is the biggest reason why Perl appears to perform better. Given this information, you might predict that Julia could outperform Perl on larger file sizes where the startup time becomes negligible. I quickly bolstered the size of my file to about 500MB (from 30MB) and reran the benchmark. Wouldn't you know it, Julia begins to outperform Perl at larger file sizes, which is awesome. The updated results are below.

Updated comparison of Perl and Julia speeds for calculating the median sequence lengths in
an increasingly larger fasta file. Larger file than figure above. Code is found here.
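The startup-versus-throughput trade-off can be put into a toy model: total time is a fixed startup cost plus a per-megabyte processing cost. The Python sketch below finds the file size at which a slow-starting but faster-processing program catches up. The numbers in the example are invented purely for illustration and are not my measured benchmark results.

```python
from typing import Optional


def run_time(startup_s: float, seconds_per_mb: float, size_mb: float) -> float:
    """Toy model: fixed startup cost plus a linear per-megabyte cost."""
    return startup_s + seconds_per_mb * size_mb


def crossover_size_mb(
    quick_startup_s: float,
    quick_seconds_per_mb: float,
    slow_startup_s: float,
    slow_seconds_per_mb: float,
) -> Optional[float]:
    """File size at which the slow-starting program becomes the faster one.

    Returns None if the slow starter is not actually faster per megabyte,
    because then it never catches up.
    """
    if slow_seconds_per_mb >= quick_seconds_per_mb:
        return None
    return (slow_startup_s - quick_startup_s) / (
        quick_seconds_per_mb - slow_seconds_per_mb
    )


# Hypothetical numbers: a quick starter at 0.01 s startup and 0.010 s/MB,
# and a slow starter at 0.5 s startup and 0.005 s/MB.
print(crossover_size_mb(0.01, 0.010, 0.5, 0.005))
```

Below the crossover size the startup cost dominates and the quick starter wins; above it the per-megabyte speed dominates. This is the same pattern as in the two benchmark figures, where the small files favored Perl and the larger file favored Julia.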

So what can we take away from this? It turns out that while Julia startup takes longer, it is blazing fast and actually outperforms Perl when using larger but reasonable files. With this new and more correct knowledge, I am happy to say that I am even more excited about Julia and think that it has a place in bioinformatics. Speed for me is a big thing, so I can see incorporating this into my own work.

I finally want to thank all of the readers who contributed to this blog post. I love that people were able to help make this little piece of data accurate and fair, and I feel like we all benefitted from the improved results. Thank you so much and please feel free to continue commenting.

Tips For Getting The Optimal Postdoc

2016-06-12T22:00:00.000-04:00

So you've been in grad school for a while, you've published some cool papers, and you are ready to graduate with your PhD and take the next step in your career. For many, this means pursuing a postdoc. But how do you get started, and what should you be thinking about? Since I was in this position only a short time ago, I felt I would share my thoughts on the process, hoping that it helps any readers getting ready for that same next step.

Before I go any further though, I want to get everyone on the same page (this is not just a blog for grad students). A postdoc (short for postdoctoral research fellow) is someone who has graduated with their PhD and is conducting supervised research but in a more independent capacity than during their thesis.

One of the first steps in preparing to embark on a postdoctoral research fellowship is figuring out which labs you should be considering, and then finally choosing one. But how are you supposed to decide? And even after you interview, how are you supposed to choose between the many excellent labs out there? Here are some points I considered during the process.

Define Your Next Step

Before you do anything, be sure you have a very clear idea of what you want out of your postdoc. Do you ultimately want an academic position? Are you aiming for an industry position? Are you unsure and want to keep your options open? All of these are wonderful choices, but each impacts the process in a different way. Without a clear idea of where you are going, you are going to have a difficult time deciding on the best option. I suggest actually writing out what you want your post-postdoc step to be, and then figure out which next step best prepares you for that.

Look For a Great Mentor

One of the most important aspects of a postdoc, and scientific training in general, is taking a position under an excellent mentor. I see a great mentor as someone who advocates for you, challenges you to do better, and supports your career goals. I could go on, but defining a great mentor warrants its own dedicated blog post. As you start the process of looking for labs, write down some qualities that your ideal mentor would have. When you start considering labs, think about whether the PI and other leadership meet those criteria. And be careful of the "prestige pitfall". I have seen many people take positions with less-than-ideal mentors (based on their individual criteria) because those mentors are "famous" or "prestigious". Maybe that can work for you, but I have seen many people enter difficult situations in this way, so at least be aware of it.

Look For a Lab With Great Resources

Chances are that you want to ramp up your research once you start your postdoc. Chances are that you want some confidence in your ability to stay in the lab as well. Both of these come with solid lab resources. Having a well funded PI means you are more likely to have your position next year. It also means that you can ramp up your research program, get data and papers, and be more competitive in grant applications. You can certainly be successful without a lot of lab money and other resources, but being in a well funded lab means that is one less (big) limiting factor that you are going to have to worry about.

Location, location, location!

Look For a Lab in a Great Location

You are a scientist AND a human being. That means you likely want to be happy in your life and enjoy your environment. To this end, I encourage you to think about the location of each lab you are considering. For example, if you love skiing, Florida might be a less ideal fit for you. Conversely, if you absolutely hate the snow, Minnesota would be a poor fit.

Look At The Lab Track Record

Talk is cheap. Don't just ask if a lab is good, but look at whether it is producing (or capable of producing) the type of scientist you want to be. If you want to get into industry, you might want to take a second look at that lab that had a few members go on to industry positions. Of course this is more difficult for newer labs that simply don't have any track record, but I still believe this is an important process to go through.

Make The Most Of The Interview

Once you have narrowed your list down to a few labs, you are going to travel to the lab and interview. Remember that this is as much about you interviewing them, as it is them interviewing you. Be sure to prepare for the interview with a list of questions to ask, and a set of goals you want to achieve. And actually write it out! This includes questions to ask the PI, as well as the other lab and department members. This is the best time to get a feel for the lab and figure out if it is a good fit.

Go With Your Gut

You have been around a lot of labs at this point, so you have a good idea of what you are looking for. Even if you have a hard time articulating the exact feeling you have for different labs, you probably have a good "gut feeling" for what will work best for you. Trust your instincts.

Choose A Lab

The final and most difficult step of the process is choosing a lab. This is especially hard because you have already narrowed down your options to great labs, and they are honestly all probably good choices (I know they were in my experience). In the end you have to talk to your loved ones, go for a walk to think, and go with your gut on what option you want to commit to. It's a near impossible decision to make for most people, but take comfort in knowing that all of the choices are probably great.

Final Thoughts

So there you have it. Some general thoughts I have on the whole postdoc hunting process. Of course these are just my opinions and musings, and the process is different for each person. But hopefully this will be a good starting place for thinking about finding that perfect postdoc position. And if you are a non-professional scientist reading this, I hope this gave you some insight into what we think about in our scientific careers. Additionally, high-five for making it to the end of a very long post!

Do you have any thoughts about the postdoc search process? Did I miss any crucial pointers? Do you have questions as you start the process? Feel free to let me know in the comments below, in an email, or on Twitter. I always love it when people reach out.

Piggybacking Instead of Killing: New Insights Into Virus Community Dynamics

2016-05-22T20:17:00.000-04:00

The human microbiome is an important component of human health and disease. It is an ecosystem of microbes that exists in and on humans, and can affect disease states through disturbances in composition, diversity, metabolism, etc. Understanding the human microbiome will not only allow us to better understand human health, but it will also allow us to treat medical conditions in new and effective ways (e.g. Fecal Microbiota Transplants).

Most studies to date have focused on understanding the bacterial component of the human microbiome. While this route has proven beneficial, it fails to consider the more complex system at large. Bacteria are interacting with communities of microbes including viruses (including bacteriophages which are viruses that infect only bacteria), and understanding these phage-bacteria dynamics is crucial for understanding the true human microbiome system. Our paper this week provides such insights into the dynamics of virus communities and their interactions with their bacterial hosts.

This paper by Knowles et al builds off of two observations. The first is that many phage-bacteria communities have been modeled to follow the "kill-the-winner" model of predation. This model states that lytic phages target and kill the most successful bacteria (the "winners"), thus preventing dominance of a single successful bacterium and maintaining relatively even bacterial distributions. The second observation is that many community phages are in fact temperate (they can exist while silently integrated in their bacterial host genome) and are poorly incorporated into the existing kill-the-winner model. To reconcile this disagreement, Knowles et al developed an extended model called "piggyback-the-winner".

Examples of cyclical predator/prey relationships which
are observed in phage-bacteria systems. SOURCE

The proposed "piggyback-the-winner" model states that instead of "killing the winner" when bacterial density increases, lytic activity is instead suppressed and an increased proportion of phages enter their dormant, integrated infectious state. This model is based largely on the observation that virus density often decreases as "microbe" density increases. The group provides a variety of sources of evidence to support their model in viral communities at large (please read the paper for details).

One point of concern with this paper is that the group relies heavily on linear relationships between bacteria and phages, when we know that these predator-prey dynamics often follow cyclical patterns. This is not to say that the study is flawed or less valuable, but it would have been nice to hear more about the implications of the more accurate cyclical models over the linear models that were used. This is especially relevant because some of the scatter plots seem to be approaching more of a cyclical pattern than linear.
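To make the "cyclical" point concrete, here is a minimal sketch of the classic Lotka-Volterra predator-prey equations, with bacteria as prey and lytic phages as predators. This is a textbook model that I am using for illustration, not the model from Knowles et al, and every parameter value and function name below is an arbitrary assumption chosen only to produce visible cycles.

```python
def simulate_predator_prey(prey=10.0, predator=5.0, steps=20000, dt=0.001,
                           growth=1.0, predation=0.1,
                           conversion=0.075, decay=1.5):
    """Integrate the Lotka-Volterra equations with a midpoint (RK2) step.

    All parameter values are illustrative assumptions, not measurements.
    Returns a list of (prey, predator) tuples, one per time point.
    """
    def rates(b, p):
        # Prey grow exponentially and are consumed in proportion to encounters.
        # Predators grow from those encounters and decay on their own.
        return growth * b - predation * b * p, conversion * b * p - decay * p

    history = [(prey, predator)]
    for _ in range(steps):
        d_prey, d_pred = rates(prey, predator)
        mid_prey = prey + 0.5 * dt * d_prey
        mid_pred = predator + 0.5 * dt * d_pred
        d_prey, d_pred = rates(mid_prey, mid_pred)
        prey += dt * d_prey
        predator += dt * d_pred
        history.append((prey, predator))
    return history


if __name__ == "__main__":
    trajectory = simulate_predator_prey()
    # Print every 2000th time point to see the rise and fall of both groups.
    for step, (bacteria, phages) in enumerate(trajectory):
        if step % 2000 == 0:
            print(f"t={step * 0.001:5.1f}  bacteria={bacteria:7.2f}  phages={phages:7.2f}")
```

In a system like this, a single snapshot of phage versus bacterial abundance can fall anywhere on the loop, which is why a straight-line fit through many snapshots may hide the underlying dynamics.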

tl;dr

So what can we take away from this paper? Knowles et al propose a new predator-prey model called the "piggyback-the-winner" model, which essentially states that more microbes equals fewer viruses. The group primarily supports their model with linear abundance modeling from a variety of microbiomes, spanning from oceans to humans. This is a valuable step toward our understanding of the entire microbiome (bacteria, viruses, etc) and will inform future studies, both environmental and medical. We are also likely to see this model develop as more sophisticated techniques are used.

If you enjoyed our discussion, go ahead and check out the full paper in Nature. There you can find all of the details that we skimmed over here in our brief discussion. It is actually a relatively short read so it is worth checking out. And of course if you have any comments to add or questions to ask, speak out in the comments below, reach out on Twitter, or send an email!

Works Cited

Knowles B, Silveira CB, Bailey BA, Barott K, Cantu VA, Cobián-Güemes AG, Coutinho FH, Dinsdale EA, Felts B, Furby KA, George EE, Green KT, Gregoracci GB, Haas AF, Haggerty JM, Hester ER, Hisakawa N, Kelly LW, Lim YW, Little M, Luque A, McDole-Somera T, McNair K, de Oliveira LS, Quistad SD, Robinett NL, Sala E, Salamon P, Sanchez SE, Sandin S, Silva GG, Smith J, Sullivan C, Thompson C, Vermeij MJ, Youle M, Young C, Zgliczynski B, Brainard R, Edwards RA, Nulton J, Thompson F, & Rohwer F (2016). Lytic to temperate switching of viral communities. Nature, 531 (7595), 466-70 PMID: 26982729

Real Time Code Correction with Linters

2016-05-15T22:45:00.000-04:00

If you are a regular around here, or if you even took a look at the date since the last post, you may have noticed a gap. As tends to happen in the blog world, I took a hiatus to focus on other projects and research. This was actually very productive and I am excited for you to see the fruits of those labors in the coming months. So thanks for sticking with us and joining in the return of Prophage activity.

This week I want to talk about improving our everyday programming using something called a linter. A linter is simply a program that runs with a text editor and checks for stylistic and programming errors (if you already know about these, I apologize as this is a simplified explanation). So put another way, it is like the spellcheck and grammar check functions we have seen in word processors (e.g. Microsoft Word) except it works with programming languages. Now I have only started using linters in the past couple of months, but I have totally fallen in love and wish I had been using them a long time ago. In this post we are going to familiarize ourselves with linters and hopefully end up downloading them to use in our own programming.

The first question we might ask is why we should use a linter? It sounds like just another complicated program to have to deal with. Well just like when we type emails or text messages, we make mistakes like typos and we rely on spell check and grammar check to alert us when a potential mistake has occurred. In the end we get a much cleaner, clearer, and professional document. A linter does this with code. It will alert us when it appears we have made a syntax error, or may have written in something unstable that could behave unexpectedly. An example of such a correction is changing working directories in a bash script.

# If I write this
cd ~/documents
# My linter tells me "Use cd ... || exit in case cd fails".
# So I change it to this safer line
cd ~/documents || exit

Not only did the linter save me from an unsafe directory change, but if I was unaware of that danger in the past, I now know and can avoid it in the future. So in the end it can save you from errors, as well as teach you along the way.

So now that we have seen linters can be helpful, how do we set them up on our computers? There is a linter for almost every language, and these linters can be used in almost every text editor. I spend a lot of time programming in Bash, R, and Perl, and I have found the associated linters to be incredibly beneficial.

R

Lintr

Bash

ShellCheck

Perl

Perl Linter

Perl Critic

I use all of these in Sublime Text, but you can also use them in most other text editors. Check this out for help getting started with Sublime Linter. Please note that downloading Sublime Linter does not include all of the linters for various languages, so you have to install those additionally. However this installation is easy and they walk you through the process.

And there you have it. Everything you need to get started with linters in your own programming practices. As always, if you have any questions, comments, or concerns about this post, please leave me a message in the comments. You can also reach out to me by email or Twitter.

Happy coding!


The Illumina Error Profile for Metagenomic Sequencing

2016-04-03T21:20:00.000-04:00

Microbiology, and especially microbial ecology, has become increasingly dependent on advanced DNA and RNA sequencing technologies. This is most evident with the increasing popularity of the human microbiome and its various impacts on human health. While using DNA sequencing sometimes appears relatively simple (a result of the great efforts made to simplify the user experience), it is actually still a very complicated technique that requires a lot of thought and skill. One aspect that genomic scientists (whether focusing on human or microbial DNA) must always consider is the bias introduced by the sequencing platform itself. This week I want to focus on a recently published manuscript that describes the sequencing error profile associated with some of the most popular Illumina platforms.

We know that sequencing platforms introduce systematic biases. Last year a group showed this to be true when performing 16S rRNA amplicon sequencing on Illumina platforms [1]. This year Schirmer et al (from the same lab) expanded on that work by characterizing the errors associated with metagenomic sequencing techniques (i.e. random shotgun sequencing) [2].

The paper aims to address four points:

Define error rates of substitutions and indels between platforms.
Identify sequence motifs associated with errors.
Evaluate the ability of quality scores to predict different error types.
Compare error removal approaches across platforms.

In the end they come to the following conclusions:

Substitutions are more frequent than indels and their frequency varies by platform.
Errors are associated with trimer motifs that are consistent across sequencing platforms.
Base errors are associated with low quality scores.
Quality trimming and BayesHammer are most effective for reducing errors when used together.
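For readers who have not worked with quality scores before, here is a rough sketch of what "quality trimming" means in practice. It assumes the common Phred+33 FASTQ encoding and an arbitrary cutoff of Q20, and the function names are my own. This is not the trimming tool evaluated in the paper, just the general idea: a Phred score Q corresponds to an error probability of 10^(-Q/10), and low-scoring bases at the 3' end of a read are cut off.

```python
def phred_to_error_probability(quality_char, offset=33):
    """Convert one FASTQ quality character to its error probability.

    Assumes Phred+33 encoding (the offset is an assumption about the file).
    """
    score = ord(quality_char) - offset
    return 10 ** (-score / 10)


def trim_low_quality_tail(sequence, qualities, min_quality=20, offset=33):
    """Drop bases from the 3' end until one meets the quality cutoff.

    The default cutoff of 20 is an illustrative choice, not a recommendation.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality strings must be the same length")
    keep = len(sequence)
    while keep > 0 and ord(qualities[keep - 1]) - offset < min_quality:
        keep -= 1
    return sequence[:keep], qualities[:keep]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(phred_to_error_probability("I"))
    print(trim_low_quality_tail("ACGTAC", "IIII##"))
```

Real trimming tools usually work with sliding windows or running sums instead of a single-base cutoff, so treat this as the simplest possible version.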

There was one additional point that I thought was worth noting since Schirmer et al didn't really get into it in the paper. The group talks about nucleotide motifs associated with errors, and makes a note of error-associated adenine and thymine residues. This is interesting because adenines are used at the end of a sequence after the sequencer has read through the DNA fragment. Said another way, when a DNA fragment is shorter than what the sequencing platform is reading, it will read through the DNA and, once it falls off the end of the fragment, start inserting a string of A's as a placeholder. As far as I can tell, the research group did not perform the quality control step of trimming these A (and T for the reverse complement) strings, meaning their analysis could be picking these up. This would mean that the A's could be throwing off their other analyses such as motif identification and sequence alignments. Because there were other error-associated motifs, it seems unlikely that this point ruins the paper, but it is important to note when interpreting their results.
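The trimming step I am describing can be sketched in a few lines. This is a simplified illustration with hypothetical function names and an arbitrary minimum run length of 10, not a replacement for a dedicated read-trimming tool.

```python
def trim_trailing_run(read, base="A", min_run=10):
    """Remove a run of `base` from the 3' end if it is at least `min_run` long.

    Note that any genuine copies of `base` directly before the run are
    removed as well, because they cannot be told apart from the placeholder.
    """
    trimmed = read.rstrip(base)
    if len(read) - len(trimmed) >= min_run:
        return trimmed
    return read


def trim_leading_run(read, base="T", min_run=10):
    """Remove a run of `base` from the 5' end (the reverse complement case)."""
    trimmed = read.lstrip(base)
    if len(read) - len(trimmed) >= min_run:
        return trimmed
    return read


if __name__ == "__main__":
    print(trim_trailing_run("ACGTACGT" + "A" * 15))  # ACGTACGT
    print(trim_trailing_run("ACGTAAA"))              # ACGTAAA (run too short)
    print(trim_leading_run("T" * 12 + "GCA"))        # GCA
```

The minimum run length matters: set it too low and you start clipping real sequence, set it too high and short placeholder runs slip through.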

Overall this is a really cool paper filled with a lot of important information for anybody interested in doing microbial metagenomics. I definitely suggest reading it, as well as keeping it around as a reference. Additionally, since I presented this paper in our lab journal club, I have my slide deck freely available for you to download. Check it out here.

Works Cited

1. Schirmer, M., Ijaz, U., D'Amore, R., Hall, N., Sloan, W., & Quince, C. (2015). Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform Nucleic Acids Research, 43 (6) DOI: 10.1093/nar/gku1341

2. Schirmer, M., D’Amore, R., Ijaz, U., Hall, N., & Quince, C. (2016). Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data BMC Bioinformatics, 17 (1) DOI: 10.1186/s12859-016-0976-y

My Experience Sharing Protocols in the New "protocols.io" Environment

2016-03-20T19:13:00.000-04:00

Scientists publish methods in their manuscripts, but these summaries can fail to capture the technical details involved in the described processes. Many scientists get around this by making the actual step-by-step protocols freely available to the public. There are a variety of avenues for accomplishing this. Some scientists publish their protocols with their manuscripts, some post them in public archives, and others publish them on their lab websites. There are advantages and disadvantages to these approaches, and most of us are always learning about new and improved resources to facilitate the sharing process. I recently learned about the online resource protocols.io, which is a surprisingly robust and free resource for sharing experimental protocols.

I originally learned about protocols.io when their group VERVE Net (of the Hurwitz lab) graciously transcribed our group's published computational protocols over to the protocols.io environment. Our computational protocols were originally archived on FigShare. Although the protocols were only recently uploaded to protocols.io, I have been impressed with what it offers.

The most compelling benefit I can see with using protocols.io is that it offers a new degree of visibility to your research. By being part of their environment, your protocols are searched for, and viewed by, a wider scientific audience. This means that more people will learn about your research, use the approaches you developed, and cite your work as a beneficial contribution to the field.

The other benefit I can see from using protocols.io is the user-friendly interface optimized for scientific use. Not only can you search for and view protocols, but you can follow along with them in their app step-by-step, check off completed tasks, use integrated timers, etc. You can also fork protocols (i.e. make your own copy) that you can update to meet your own experimental needs. It definitely feels like it was influenced by common software version control resources such as Git.

Now this is all fine, but wouldn't a system like this only be good for "wet lab" protocols and not computational workflows? It certainly seems to have been built for "wet lab" protocols, but it works surprisingly well with computational workflows too. I honestly don't see it replacing source code repositories such as GitHub, but I do think it has a place for publishing widely-used standard operating procedures (SOPs) for various bioinformatics tasks. As an example, the Mothur SOP for processing 16S rRNA sequencing data is available on the Mothur Wiki as a step-by-step workflow. I could see a reference workflow like this being published in the protocols.io environment.

So in the end, I would suggest checking protocols.io out. It is a cool effort toward promotion of scientific transparency and collaboration, and I think you could benefit from using it.

Did I miss something or fail to elaborate on a point you want to hear more about? As always, I invite you to let me know in the comments below. I would love to hear from you!

Helping Both Humans and Dogs: A Recent Study of Canine Atopic Dermatitis

2016-03-06T21:46:00.001-05:00

Example of canine Atopic Dermatitis, as seen
in the manuscript we are discussing.

Atopic dermatitis (AD), which is also referred to as eczema, is a very common dermatological disease, especially in children. It is estimated that AD affects 10% of children. The disease presents as dry, scaly, itchy skin. Atopic dermatitis can be especially problematic when the patient (often a child) scratches the skin extensively, thereby increasing susceptibility to skin infections. Treatment of the disease ranges from controlling the itchy skin with soothing topical medication to bathing the patient in dilute bleach (the bleach bath technique).

In addition to genetics, there is evidence that AD has a microbial component. More specifically, research has linked the disease to Staphylococcus bacteria colonization that may play a role in flares and disease control. The bacteria and human genetics are thought to be linked in part by the impaired skin barrier function (e.g. control of water loss, acidity, etc) that results in an altered environment for the bacteria, and especially Staphylococcus, to grow.

What makes this week's study by Bradley et al particularly interesting over existing AD microbiome studies is that they investigate both the altered bacterial communities, as well as the altered barrier function of the diseased skin itself. Their study focused on canine AD, and so was conducted entirely in dogs. Canine AD affects approximately 10% of dogs, and perhaps more importantly, it closely resembles the human disease, thus providing information relevant to human medicine.

The group conducted their study with a cohort of 32 dogs, 15 of which were diagnosed with canine AD. Each dog had various skin sites swabbed for microbiome analysis by 16S rRNA gene sequencing (the standard approach for studying the microbiome). Sampling was done over time, so the temporal dynamics of the disease could also be visualized. Like previous studies, the group found that flaring skin was associated with an increased dominance of Staphylococcus in the microbiome (measured as relative abundance). They also found the diseased skin was associated with altered bacterial diversity, and that antimicrobial therapy restored the microbiome to a healthier state.

Example of a non-invasive device used to measure skin
barrier function.

The study really got cool when they evaluated the barrier function of the diseased skin and linked that data to their microbiome data. In the end, they found some links between microbiome diversity and some aspects of impaired barrier function. I emphasize some because the correlations were only between certain microbiome and barrier signatures. Overall this may suggest the AD microbiome signatures are the result of an altered skin environment due to impaired barrier function. Perhaps the presence of the bacteria is feeding into the progression of the skin flare? There are a lot of interesting research directions that this could go, and it will be exciting to watch where the group takes it next.

In the end, this is a cool study and it is worth reading. The group provides valuable insight into a common disease both for humans and dogs. Moving forward, I would be very interested in seeing the group look more into the links between skin barrier function and Staphylococcus colonization. This might include a heavier immunological study that further investigates the molecular response of the AD skin to the microbes. It will be interesting to see what they come up with.

So to totally wrap things up, I want to thank you for reading. This blog is not possible without you the reader. I would also love to hear from you about any questions, comments, or concerns. Feel free to leave a comment below, email me, or Tweet me.

Works Cited

Bradley, C., Morris, D., Rankin, S., Cain, C., Misic, A., Houser, T., Mauldin, E., & Grice, E. (2016). Longitudinal evaluation of the skin microbiome and association with microenvironment and treatment in canine atopic dermatitis Journal of Investigative Dermatology DOI: 10.1016/j.jid.2016.01.023

Methods Matter: Getting Started with the Skin Microbiome

2016-02-14T16:00:00.003-05:00

Your choice of sequencing approach matters. Think
about your goals and the methodological caveats
before starting your experiments.

The field of microbiome research has been hugely popular in the past few years. It has forced us to rethink our approaches to various medical practices, and has captured the imaginations of both amateur and professional scientists. With this popularity has come an influx of scientists trying to incorporate the microbiome into their own research. It is of course great that people want to get into the field, but unfortunately it is deceptively difficult for newcomers who are not always aware of how best to get started. This has led to the execution of poorly designed studies that could have been improved by more methodological resources in the literature. To this end, my colleague (and lab mate) led a research project to evaluate the differences between sequencing methods of the skin microbiome, a consideration that is often overlooked by newcomers to the field. This week I want to briefly hit the highlights of the paper and suggest that you read it if you are interested in starting any skin microbiome work.

The study was led by Jacquelyn Meisel in Elizabeth Grice's laboratory, and was published in the Journal of Investigative Dermatology (the premier dermatology research journal). In their study, Meisel et al evaluated the effects of three different sequencing methods for studying the skin microbiome.

Whole metagenome shotgun (WMS) sequencing, which means the entire genomes (or genome fragments called contigs) of the skin bacteria were sequenced instead of a specific region (e.g. 16S rRNA). This method is costly and more difficult to analyze, but can provide answers to many questions regarding the genomic structure of the communities that cannot be answered using techniques involving marker genes.
16S rRNA V4 region gene sequencing, which means the fourth variable region (V4) of all bacteria within the bacterial community is sequenced and used to provide taxonomic/phylogenetic information. Variable regions are used because the high throughput sequencing technologies cannot span the entire length of the gene, and the variable regions allow for the greatest differentiation between different bacteria (if we used a conserved region, they would all look the same). This method is great because it is cheaper, provides strong taxonomic/phylogenetic information about the community, and is sufficient to answer many research questions. It does not provide sequences for the entire genomes however.
16S rRNA V1-3 region gene sequencing, which is the same approach as V4, although it is covering variable regions 1-3 instead of four. Different variable regions provide different resolution between members of the community because they are differentially variable between groups of bacteria. This region in particular is longer than V4, which means it can provide more information at the expense of being more difficult to sequence.

Illustration of the variable regions within the 16S rRNA
gene. The valleys are regions of low conservation, and
are labeled as variable regions 1-9. <Source>

So what did the group find? The highlight was that the V4 region poorly characterized the skin community, while the V1-3 and metagenomic approaches were much more accurate (accuracy was determined by sequencing a known community and comparing the results to the known composition). The most striking limitation to sequencing the V4 region was its inability to capture Propionibacteria.

The reason for using metagenomic approaches over 16S sequencing is thought to be that the metagenomic data allow for an understanding of the functional potential of the community. Meisel et al found that the functional predictions made using 16S data were similar to those found in the metagenome samples, meaning you are getting comparable results but paying considerably more for the metagenomic data.

The group also evaluated the effects of these methods on the resulting diversity calculated for the communities. They found that the resulting diversity was in fact impacted by the sequencing approach, highlighting a danger in comparing results from different studies that used different sequencing methods.
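If "diversity" sounds abstract, here is a small sketch of one widely used measure, the Shannon index, calculated from raw taxon counts. I am using it only as an example and am not claiming it is the specific metric reported in the paper. The function names and the example counts are made up for illustration.

```python
import math


def relative_abundance(counts):
    """Convert raw counts per taxon into proportions that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [count / total for count in counts]


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln(p)) over taxa with non-zero counts."""
    proportions = relative_abundance(counts)
    return -sum(p * math.log(p) for p in proportions if p > 0)


if __name__ == "__main__":
    # Hypothetical counts for the same sample under two sequencing methods.
    even_community = [25, 25, 25, 25]
    dominated_community = [97, 1, 1, 1]
    print(shannon_diversity(even_community))
    print(shannon_diversity(dominated_community))
```

Because the index depends entirely on which taxa are detected and in what proportions, a method that misses a major group will shift the result even when the underlying sample is identical.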

Now I know I have an obvious bias since I was a part of this research, but Jackie (Jacquelyn) led an excellent study that provides an important resource to the field. If you are curious about the importance of sequencing methods, or if you yourself want to incorporate this type of study into your research, I suggest checking this paper out. It can help you to interpret other skin microbiome studies, and could prevent you from making costly mistakes in your own research.

As always, I would love to hear your questions, comments, and concerns in the comment section below, or through email/Twitter. You can find my information to the right.

Works Cited

Meisel JS, Hannigan GD, Tyldsley AS, SanMiguel AJ, Hodkinson BP, Zheng Q, & Grice EA (2016). Skin microbiome surveys are strongly influenced by experimental design. The Journal of investigative dermatology PMID: 26829039

The Open Metagenome Toolkit Project

2016-02-07T21:20:00.001-05:00

Almost two years ago I started collecting some scripts that I wrote for my own microbial metagenomic analyses. These are some relatively simple Perl and Python scripts that do some common tasks that are required when studying bacterial or viral metagenomes. This collection of scripts is called the Open Metagenome Toolkit. I recently added a few more scripts that I think are helpful, including a script to translate nucleotide sequences and a script to calculate the average lengths of reads. This week our post is about the Open Metagenome Toolkit because it is a cool opportunity for collaborative programming in our microbiome community.

Of course I hope you will use this toolkit in your own research because I think it will make your life a little easier. But even more so, I hope you will head over to the Github repository and show off some of your coding skills by contributing some scripts or adding to the scripts that are already there. If you are just getting started with coding, you can use this as a learning opportunity by adding to the existing scripts and getting some feedback.

The point of this project is to facilitate collaboration. With that comes proper credit to every contributor. Therefore the least we can do is include the names of the contributors on the project homepage, along with a link to their homepage. So go ahead and contribute, and actually be a part of the project no matter what skill level you are at.

In addition to its focus on collaboration, this toolkit focuses on mobility. It relies only on Perl and Python, which are so common that they actually come pre-installed on many operating systems. There is no requirement for installing additional programs or modules, including BioPerl and BioPython. This is a major strength because it means the user does not have to install any dependencies. This is also a nice exercise in programming, and I think offers a high degree of control to the programmer.

So now that you have read the intro, go over and check out the toolkit. It is easy to download, easy to use, and easy to get involved with. If you have any ideas for functions that should be added, go ahead and add them in the issue section. Otherwise you can directly add to the scripts.

Any questions or comments? Let me know in the comments below, on Twitter, or by email. I would love to hear from you!

Recent Study Reveals Role for Bacterial Viruses in Microbiome Evolution

2016-01-24T20:46:00.000-05:00

The microbiome is a complex community of bacteria,
viruses, and other microbes.

Microbial communities are fierce battlegrounds between bacteria and other microbes competing for limited resources. One method some bacteria use to kill their competitors is the production of bacteriocins. Bacteriocins are protein toxins produced by bacteria to limit the growth of related bacteria, thereby providing a competitive advantage to the bacteriocin-producing bacteria. This dynamic is important to our health because it can impact bacterial infections and overall microbiome composition. The group of Nedialkova et al recently added a whole new level of insight into bacteriocins and microbial ecology by linking bacteriocin production to the presence of bacteriophages (bacterial viruses).

Overall this was a pretty straightforward study and a nice read. The research group recognized that Salmonella enterica genomes contain a myriad of prophage genomes, which means the virus genomes are integrated into the bacterial genome and are waiting to come out into an infectious cycle when the bacterium is stressed (a process called phage induction). This is medically relevant, because many antibiotics can induce bacteriophages.

The group provided evidence for phages playing an important role in colicin release (the Salmonella bacteriocin) by removing the viruses from the cultured bacterial genomes and observing a resulting decreased ability of the bacteria to release their bacteriocin. They attempted to pinpoint the phage genes involved in bacteriocin release from the bacteria, but this ultimately served to highlight the complex cell signaling involved in bacteriocin production and release. The group wrapped their study up by showing that by affecting colicin "use", the phages impact the evolution of S. enterica by affecting their competitive advantages. This was tested by competing the Salmonella with E. coli bacteria that are commonly found in the human gut.

I really like this paper because it provides even more evidence of how important phages are for bacterial functionality and evolution. This role for phages is relevant to isolated bacterial systems, but is also very important for the human microbiome. Phages are important for the structure and function of the human microbiome, and thereby impact human health in a big way. Overall this really shows how complex the human microbiome is, and how important it is to study the phages in these communities, instead of focusing only on the bacteria.

So now that we have previewed the paper, I suggest looking it up and reading the real thing. It is a well written and straightforward paper that is worth reading. And finally, if you noticed I left anything out or missed a point you think is worth bringing up, shoot me a comment below. You should also always feel free to reach out on Twitter or by email.

Works Cited

Nedialkova LP, Sidstedt M, Koeppel MB, Spriewald S, Ring D, Gerlach RG, Bossi L, & Stecher B (2015). Temperate phages promote colicin-dependent fitness of Salmonella enterica serovar Typhimurium. Environmental microbiology PMID: 26439675

A Primer on Linear Regression and its Associated Misconceptions

2016-01-17T22:11:00.000-05:00

Welcome to the new year and the first Prophage blog post for 2016! This is already looking like it will be a great year for science and blogging. But enough with the pleasantries, let's dive into some science.

I wanted to start the year off with a post about math. I know, I know, math is an intimidating way to start the year, but don't run off yet! I swear that this will be painless and we will even learn something new! We are going to keep things simple and focus on an elegant paper that presents some misconceptions about a complicated topic. This topic is multiple linear regression. My goal is to introduce you to the topic of linear regression and prepare you to read this week's paper.

What is Linear Regression & When Should I Use It?

Before we talk about multiple linear regression, let's cover simple linear regression. In its simplest form, linear regression is a method for modeling the relationship between an independent (i.e. explanatory) variable and a dependent variable. This is often plotted as a scatter plot with the dependent variable on the y axis, the independent variable on the x axis, and the linear regression model drawn as a line (see figure below).

We commonly use this approach when we want to predict a dependent value given an independent value. An example of this (in the plot below) is tree age vs diameter. We know that tree diameter depends on age, but what if we want to predict the diameter (dependent variable) of a tree at a given age (independent explanatory variable)? We can perform a linear regression to create a simple predictive model (shown as the line) to tell us what the diameter is likely to be at a given age. In our example, at age 30 it looks like the tree diameter will be 5 inches. The slope of the line is a coefficient that represents the relationship between the explanatory (age) and dependent (diameter) variables.
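
To make this concrete, here is a minimal Python sketch of a simple linear regression fit. The tree measurements are made up for illustration (they are not the data behind the figure), and the names fit_line and predict_diameter are my own hypothetical helpers.

```python
# Hypothetical tree measurements, invented for illustration only.
ages = [5, 10, 15, 20, 25, 30, 35, 40]                 # years
diameters = [1.0, 1.9, 2.6, 3.4, 4.1, 5.1, 5.8, 6.7]   # inches


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(ages, diameters)


def predict_diameter(age):
    """Predicted diameter (inches) for a tree of the given age (years)."""
    return intercept + slope * age


print(f"slope: {slope:.3f} inches per year")
print(f"predicted diameter at age 30: {predict_diameter(30):.2f} inches")
```

With these invented numbers the prediction at age 30 comes out close to 5 inches, in line with the example above. Real measurements would of course give different values.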

A simple example of linear regression modeling.
Here we are modeling the relationship between
tree age and diameter. SOURCE

What is Multiple Linear Regression & When Should I Use It?

Now what if we want a better model that includes more than one explanatory variable? For instance, what if we want to predict tree diameter given its age and the average summer temperature of the climate the tree lives in? We might expect a tree in a colder climate to have a smaller diameter than a tree in a warm climate. Once we start considering more than one explanatory variable, we are doing a multiple linear regression. It's that simple. Much like in a simple linear regression, each explanatory variable (age and temperature) has a coefficient that represents the relationship between that explanatory variable and the dependent variable. Think of this relationship coefficient as the slope for each explanatory variable.
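
Written out as an equation, the model described above might look like the following. The symbols (beta for the coefficients, epsilon for the leftover error) are standard textbook notation that I am adding here, not something taken from the paper.

```latex
\text{diameter} = \beta_0 + \beta_{\text{age}} \cdot \text{age} + \beta_{\text{temp}} \cdot \text{temperature} + \varepsilon
```

Here \beta_0 is the intercept, while \beta_{\text{age}} and \beta_{\text{temp}} are the slopes for age and temperature.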

What is the Misconception?

As Frasier expertly points out, there is a lot of confusion around interpreting these relationship coefficients. People often interpret them as the independent relationships between each explanatory variable (age or temperature) and the dependent variable (diameter) across the full range of values of the other explanatory variable. Unfortunately, this is not true once the model also includes an interaction term between the explanatory variables (for example, age × temperature). In that case, each coefficient only represents the relationship (i.e. slope) between its associated independent variable and the dependent variable when the other independent variable is zero. So to use our example, the coefficient associated with age only represents the relationship (slope) between age and diameter when the temperature is zero. Frasier outlines why this is actually a nontrivial point that has likely led to many erroneous scientific conclusions. His explanation is incredibly well done, so I will direct you to follow up on this post by reading the paper and seeing his examples for why this distinction is important.
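
Here is a rough Python sketch of this point. It assumes the model includes an age × temperature interaction term, which is the case where a coefficient is tied to the other variable being zero. The data are noise-free and generated from coefficients I picked for illustration, and the helpers solve and fit are hypothetical names of my own, not anything from Frasier's paper.

```python
# Hypothetical coefficients and noise-free data, chosen for illustration only.
TRUE_B0, TRUE_AGE, TRUE_TEMP, TRUE_INTERACTION = 0.5, 0.02, 0.05, 0.006

ages = [10, 20, 30, 40]    # years
temps = [10, 15, 20, 25]   # average summer temperature, degrees C
points = [(a, t) for a in ages for t in temps]
diameters = [TRUE_B0 + TRUE_AGE * a + TRUE_TEMP * t + TRUE_INTERACTION * a * t
             for a, t in points]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(points, y, temp_shift=0.0):
    """Least-squares fit of y = b0 + b_age*age + b_temp*temp + b_int*age*temp,
    with temperature measured relative to temp_shift (normal equations)."""
    rows = [[1.0, a, t - temp_shift, a * (t - temp_shift)] for a, t in points]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Fit with raw temperature: the age coefficient is the age slope at 0 degrees.
b0, b_age, b_temp, b_int = fit(points, diameters)

# Fit with temperature centered at its mean: the age coefficient is now
# the age slope at the mean temperature instead.
mean_temp = sum(t for _, t in points) / len(points)
_, b_age_centered, _, _ = fit(points, diameters, temp_shift=mean_temp)

print(f"age coefficient, raw temperature:      {b_age:.4f}")
print(f"age slope at 20 degrees:               {b_age + 20 * b_int:.4f}")
print(f"age coefficient, centered temperature: {b_age_centered:.4f}")
```

Because the data are generated without noise, the fit recovers the chosen coefficients: the age coefficient is about 0.02 with raw temperature, but about 0.125 once temperature is centered at its mean of 17.5 degrees. Same data, same model, yet a very different "effect of age", simply because the meaning of "temperature is zero" changed.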

Wrapping It Up

I know this was a math heavy post, but I hope you enjoyed it and even learned a little. After reading these brief paragraphs, you should have a general feel for what linear regression is and why it is useful. This will prepare you to dive into the Frasier paper, which is absolutely worth a read. And of course, I want to end by pointing out that this is a complicated topic that you can read entire books about. We did not even scratch the surface in this post, but at least we took the first step toward a better understanding of math and how it can be used for prediction.

Questions, comments, or concerns? Want to discuss any of these points? Add a comment below. I would love to hear what you think.

Works Cited

Frasier TR (2015). A note on the use of multiple linear regression in molecular ecology. Molecular ecology resources PMID: 26650184

Understanding How Silent Phages Can Prevent Detection of Potentially Deadly Food Contaminants

2015-12-27T23:19:00.000-05:00

Many bacteria are detected by culturing, or growing them
out on plates of artificial media.

Contamination of food with bacteria is a huge issue that can sometimes cause life-threatening illness. The bacterial culprits can include E. coli, as well as Listeria monocytogenes. L. monocytogenes is a potent bacterium that can very effectively infect its human host. This bacterium is especially problematic for pregnant women, whose newborn children can develop meningitis that can lead to complications as severe as death. Because this is a serious infectious agent, there have been a lot of quality control efforts towards detecting this bacterium in food before it is sold. In this week's post, we are going to discuss a relatively recent study that highlights the role of phages in these efforts. The study does this by showing that nutrients used in the tests can activate silent phage infections and prevent bacterial detection.

L. monocytogenes, like many other bacteria, is often detected using culturing techniques. This means that the bacteria are actually grown on special media in a petri dish. Simply put, if we streak our sample across the plate and see the dangerous bacteria growing, we conclude that the bacterium is present and can potentially cause an infection. Like most tests, these culture techniques are not 100% accurate and can have erroneous results. Of these incorrect results, false negatives (failure to detect bacteria that are present) are of particular concern because they allow the contaminated food to be sold.

In a recent report, Lemaître et al. investigated a potential cause of false negative results by evaluating the roles of phages in the culturing process. We know that bacteria are capable of being silently infected by phages that can switch to an active infection and kill the host bacterium when it is stressed. These stresses can include nutrient conditions. In their paper, Lemaître et al. found that these mechanisms may be responsible for false negative results in L. monocytogenes tests.

Phages and tails that were detected in the study.

The study is overall fairly straightforward and a good read. The group tested a variety of components from standard test media that are widely used in the detection of L. monocytogenes contaminants. They found that many components are in fact capable of inducing bacteriophages, which means the compounds in the media are capable of killing the bacteria by activating silent phage infections. When the bacteria are killed through phage induction, the contaminating bacteria will not be detectable by the culture test, and the final result will incorrectly indicate a lack of L. monocytogenes. In the end, this highlights the importance of understanding how phages impact the quality control tests that are being used, and also suggests that different media should be considered in quality control detection of L. monocytogenes.

As I mentioned above, this is a fairly straightforward read and I would suggest checking it out, especially if you are interested in the details. Any questions, comments, or concerns? Let me know in the comments below, or feel free to shoot me an email anytime. And remember to always consider the phage component.

Works Cited

Lemaître JP, Duroux A, Pimpie R, Duez JM, & Milat ML (2015). Listeria phage and phage tail induction triggered by components of bacterial growth media (phosphate, LiCl, nalidixic acid, and acriflavine). Applied and environmental microbiology, 81 (6), 2117-24 PMID: 25595760