Quality-filtering, processing and analysis of 16S ribosomal RNA tag datasets using Tornado pipeline

This is a tutorial on our process of analyzing 16S rRNA tag datasets. In this tutorial we show how one can use our Tornado pipeline to clean up and align the data, and how to quantify the degree of neutral/niche community dynamics in microbial metagenomic samples.

Tutorial steps

arrow Preparation and alignment. How to remove short reads, dereplicate sequences and remove chimeras. How to align the 16S reads using Mothur (with SILVA database). How to visually inspect large metagenomic datasets, and how to hand-curate the primer regions.

arrow Clustering. How to cluster sequences into OTUs using the complete linkage clustering algorithm. How to use the resulting files to analyze OTU diversity information.

arrow PCA snapshot of the community. How to create a PCA decomposition "snapshot" of the OTUs.

arrow Quantifying neutral/niche dynamics. How to quantify the community assembly of the OTUs and the degree to which their dynamics is neutral or niche-driven.

Preparation and alignment

Note: Make video full-screen and choose HD resolution to see full detail.

What you see in the video

This tutorial uses the software packages Mothur, SeaView and splicer. Beginning with a set of sequences in FASTA format, the first step is to remove sequences shorter than N bases (typically N=100):

splicer remove-short FASTA.fa N > FASTA.trim.fa

We can then use uchime in Mothur to detect chimeras in the set of sequences, and use splicer to remove the detected chimeras. In the command line:

mothur "#chimera.uchime(fasta=FASTA.trim.fa, reference=silva.gold.align)"

splicer cull FASTA.trim.fa FASTA.trim.uchime.accnos > qc.fa

We are now ready to align the clean FASTA file using Mothur:

mothur "#align.seqs(candidate=qc.fa, template=silva.bacteria.fasta, flip=t, threshold=0.7)"

A more sophisticated alignment process that we have used in the past involves simultaneously aligning the sequences using the RDP/Infernal online alignment tool and then merging the two alignments. However, we now find it sufficient to use just the single alignment from Mothur.

The video then shows how to remove short sequences and empty columns from the aligned sequences using splicer. The final step in the alignment process is hand curation. In the video, we decide to trim off the left and right edges of the sequences using

splicer cut FASTA.aligned 23-424 > FASTA.handcurated

Here, the first 22 bases and all bases beyond position 424 are removed.

Back to tutorial steps


Note: Make video full-screen and choose HD resolution to see full detail.

What you see in the video

In this tutorial, we use the psi-distance and C-linkage packages from Tornado. The first step is to identify the unique sequences in the aligned, hand-curated FASTA file. We do this using splicer:

splicer dereplicate aligned.fa

We can then calculate the distance matrix for the alignment:

psi-distance aligned.fa.derepl > pdist.mat

Using the distance matrix file pdist.mat, we can cluster the sequences into OTUs at 97% PSI:

c-linkage pdist.mat aligned.fa.names only 0.03

The resulting list file contains a list of all the OTUs and their constituent sequences.

Back to tutorial steps

PCA snapshot of the community

This video shows how to create a PCA decomposition of the OTU data coming from a 16S metagenomic survey. The full description of this method is given in the 2012 PNAS paper by Jeraldo et al.

Note: Make video full-screen and choose HD resolution to see full detail.

What you see in the video

Once files are downloaded and upacked: First we must convert the OTUs into a numeric vector format. On the command line:

python to-binary.py DATA.FASTA DATA.LIST > DATA.BINARY

Then we run the PCA transformation. This will generate a 'DATA.BINARY.pca' file. Start Octave, and on the Octave command prompt:


Finally we can plot the PCA decomposition as a scatter plot in two principal axes. The size and color of the circles in the plot indicates the size of the OTU. To do this, on the command line:

python plot_pca.py DATA.BINARY.pca

Back to tutorial steps

Quantifying neutral-niche dynamics

This video shows how to create a minimum-distance-to-niche metric. The full description of this metric and related methods is given in the 2012 PNAS paper by Jeraldo et al.

Note: Make video full-screen and choose HD resolution to see full detail.

What you see in the video

Once files are downloaded and upacked: You will need to compile a fast C program to compute sequence distances directly from python. On the command line:


First, we need to create a JSON formatted file that includes both clustering and sequence information. On the command line:

python make-efasta.py DATA.FASTA DATA.LIST > DATA.EFASTA

Then we can compute the distance to the nearest niche. On the command line:

python calc-mindist-efasta.py DATA.EFASTA > DATA.MINDIST

Finally, we can plot the minimum distance information like in our paper. On the command line:

python plot-mindist.py DATA.MINDIST

Back to tutorial steps

Related downloads

arrow Download the example files, a group of files used as a starting point for tutorials 3 and 4.

arrow Download the NNP bundle, a group of scripts in Python and GNU Octave. These scripts allow one to compute the weighted PCA decomposition of a large, clustered 16S metagenomic dataset. They also include a tool to combine clustering, abundance and sequence (FASTA) information into a JSON format. Furthermore, the bundle includes a tool to generate the nearest distance metric. Requirements: GNU Octave, GNU Octave Statistics package, Python 2, Python matplotlib and numpy.

License and copyright information

arrow All software available for download on this website is copyrighted by its authors and available under the GNU General Public License version 3.

Funding Agency Disclaimer      License and Terms of Use

Logo of the University of Illinois at Urbana-Champaign