Harvest software suite for rapid genome alignment and visualization

Release status

07/21/14: v1.0

Overview

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes. Harvest includes Parsnp, a fast core-genome multi-aligner, harvest-tools, a binary archive format and format conversion tool, and Gingr, a dynamic visual platform. Combined they provide rapid core-genome alignments, variant calls, recombination detection, and phylogenetic trees.

_images/example.png

Contents:

fixme

fixme

Gingr: Interactive visualization of multi-alignments

Gingr is an interactive tool for exploring large-scale phylogenies in tandem with their corresponding multi-alignments. Gingr can display informative overviews for hundreds or thousands of genomes, while allowing researchers to move quickly to more detailed views of specific subclades and genomic regions, even down to the nucleotide level of their multi-alignments. Additionally, its dynamic display of variants allows interactive selection of various filters, such as indels, poorly aligned regions and suspected sites of recombination. Gingr works chiefly in tandem with Parsnp, an efficient tool for core-genome multi-alignment and phylogenetic reconstruction. It is also applicable, however, to other analytical tools, accepting standard file formats such as multi-Fasta, XMFA, Newick and VCF.

Contents:

Quickstart

Browsing a Gingr file

_images/ggr.png
  • The phylogeny appears on the left. Hover over a clade to highlight and outline the corresponding tracks to the right.
_images/clade.png
  • Click to zoom in on the clade
_images/zoomed.png
  • The multiple alignment appears on the right, shown as a SNP heatmap when zoomed out. To see the full alignment, zoom in with the mouse wheel or by selecting a region in the ruler.
_images/bases.png

Importing other files

  • Create a new workspace (File->New)
_images/new.png
  • Download the data files
  • Import the alignment with the refrence (File->Import Alignment (XMFA & Fasta))
_images/xmfa.png
  • The track highlighted in blue (“england.gbk.fna.srt”) is the current reference for variants. Select a new reference by right-clicking on a track.
_images/reref.png
  • Next, import the phylogenetic tree (File->Import tree (Newick))
_images/tree.png
  • Reroot the tree at the midpoint (Tree->Reroot at midpoint)
_images/reroot.png
  • The tree will now be balanced at the center of the longest path
_images/rerooted.png
  • Finally, import the annotations (File->Import annotations (Genbank))
_images/annotated.png
  • The workspace can be saved to share or return to later (File->Save)

fixme

fixme

fixme

fixme

fixme

fixme

harvest-tools: binary archive and format conversion tool

harvest-tools primarily serves as a reference-based compression tool for large genomic datasets. In addition, it serves as a proxy between several alignment and variant & annotated feature formats. While it was designed for use with both Parsnp & Gingr, it can also be used as standalone tool to efficiently compress multiple alignments, variants, phylogenies and annotations to a single convenient binary archive.

Contents:

Quickstart

Download, install & run

harvest-tools is distributed as a precompiled binary. The three steps below represent the fastest way to start using the software:

On OSX:
On Linux:
Basic usage:

From command-line:

harvest-tools –x <input xmfa> -f <input reference fasta> -g <reference genbank formatted annotations> -n <newick formatted tree>
Advanced usage:

harvest-tools quick start for three example scenarios.

With reference & genbank file as input:

harvest-tools -g <reference_genbank_file1,reference_genbank_file2,..> -r <reference fasta file> -x <XMFA file> -o hvt.ggr

With harvest-tools file as input, XMFA output:

harvest-tools -i input.ggr -X output.xmfa

With harvest-tools file as input, fasta formatted SNP file as output:

harvest-tools -i input.ggr -S output.snps
Command-line parameters:
  • -b: <bed filter intervals>,<filter name>,”<description>”
  • -B: <output backbone intervals>
  • -f: <reference fasta>
  • -F: <reference fasta out>
  • -g: <reference genbank>
  • -h: (show this help)
  • -i: <harvest input>
  • -m: <multi-fasta alignment input>
  • -n: <Newick tree input>
  • -N: <Newick tree output>
  • –midpoint-reroot
  • -o: <hvt output>
  • -q: (quiet mode)
  • -S: <output for multi-fasta SNPs>
  • -v: <VCF input>
  • -V: <VCF output>
  • -x: <xmfa alignment file>
  • -X: <output xmfa alignment file>

Primary output files

  1. Compressed binary archive (GGR)

fixme

Installation from source

Build

For your convenience, precompiled binaries provides at:

http://github.com/marbl/harvest

otherwise, to install from source:

 git clone https://github.com/marbl/harvest-tools.git hvt_src
 cd hvt_src

./bootsrap.sh
./configure [--prefix=...] [--with-protobuf=...]
make
[sudo] make install

Products:

  • command line tool ( <prefix>/bin/harvest )

  • static library ( <prefix>/lib/libharvest.a )

  • includes
    • Harvest, Protocol Buffer message class ( <prefix>/include/harvest.pb.h )
    • HarvestIO, Harvest wrapper with file IO ( <prefix>/include/HarvestIO.h )

Additional notes:

When running ./configure, use –prefix to install somewhere other than /usr/local. Use –with-protobuf if the Protocol Buffer libraries are not in their default location (/usr/local). Zlib should be installed in a standard system location. Sudo will be necessary for ‘make install’ if write permission

fixme

fixme

fixme

fixme

Parsnp: rapid core genome multi-alignment

Parsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to few hours. Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multi-alignments. Parsnp leverages contextual information provided by multi-alignments surrounding SNP sites for filtration/cleaning, in addition to existing tools for recombination detection/filtration and phylogenetic reconstruction.

Contents:

Quickstart

Download, install & run

Parsnp is distributed as a precompiled binary that should be devoid of external dependencies (all included in dist). The three steps below represent the fastest way to start using the software:

On OSX:
On Linux:
Basic usage:

From command-line:

parsnp –p <threads> –d <directory of genomes> –r <ref genome>
Advanced usage:

Parsnp quick start for three example scenarios.

With reference & genbank file:

parsnp -g <reference_genbank_file1,reference_genbank_file2,..> -d <genome_dir> -p <threads>

With reference but without genbank file:

parsnp -r <reference_genome> -d <genome_dir> -p <threads>

Autorecruit reference to a draft assembly:

parsnp -q <draft_assembly> -d <genome_db> -p <threads>
Command-line parameters:

Input/output:

-c = <flag>: (c)urated genome directory, use all genomes in dir and ignore MUMi? (default = NO)
-d = <path>: (d)irectory containing genomes/contigs/scaffolds
-g = <string>: Gen(b)ank file(s) (gbk), comma separated list (default = None)
-o = <string>: output directory? default [./P_CURRDATE_CURRTIME]
-q = <path>: (optional) specify (assembled) query genome to use, in addition to genomes found in genome dir (default = NONE)
-r = <path>: (r)eference genome (set to ! to pick random one from genome dir)

MUMi:

-M = <flag>: calculate MUMi and exit? overrides all other choices! (default: NO)
-U = <float>: max (M)UMi distance (default: autocutoff based on distribution of MUMi values)

MUM search:

-a = <int>: min (a)NCHOR length (default = 1.1*Log(S))
-C = <int>: maximal cluster D value? (default=100)
-z = <path>: min LCB si(z)e? (default = 25)

LCB alignment:

-D = <float>: maximal diagonal difference? Either percentage (e.g. 0.2) or bp (e.g. 100bp) (default = 0.12)
-e = <flag> greedily extend LCBs? experimental! (default = NO)
-n = <string>: alignment program (default: libMUSCLE)

SNP filters:

-R = <flag>: disable (R)epeat filtering?
-x = <flag>: enable recombination filtering? (default: NO)

Misc:

-h = <flag>: (h)elp: print this message
-P = <int>: max partition size? limits memory usage (default= 15000000)
-p = <int>: number of threads to use? (default= 1)
-v = <flag>: (v)erbose output? (default = NO)

Output Files

  1. Newick formatted core genome SNP tree: $outputdir/parsnp.tree
  2. SNPs used to infer phylogeny: $outputdir/parsnp.vcf
  3. Gingr formatted binary archive: $outputdir/parsnp.ggr
  4. XMFA formatted multiple alignment: $outputdir/parsnp.xmfa

fixme

Installation from source

Required before building:

  • 64-bit Linux/*nix or OSX (>= v10.7)
  • autoconf && automake
  • gcc (>= v4.2.*)
  • Python (>= 2.6.*)

Build (local)

For your convenience, precompiled binaries provides at:

http://github.com/marbl/harvest

otherwise, to install from source:

git clone https://github.com/marbl/parsnp.git parsnp_src
cd parsnp_src

First (important!), build libMUSCLE:

cd muscle
./autogen.sh
./configure --prefix=`pwd`
make install

Then, build Parsnp:

cd ..
./autogen.sh
./configure
make install

Once both installed (to cwd install by default):

export PARSNPDIR=/path/to/parsnp/install

Alternatively build instructions (requires sudo)

First (important!), build libMUSCLE:

cd muscle
./autogen.sh
./configure --prefix=/usr/local/
sudo make install

Then, build Parsnp:

cd ..
./autogen.sh
./configure --prefix=/usr/local/
sudo make install

FAQs

  1. What is Parsnp ?
  1. Parsnp takes both draft and finished genomes of closely related strains as input, performs conservative core genome alignment and as output returns multi-alignments (XMFA), variants (VCF), core genome phylogeny (Newick) and Gingr input format (GGR).
  1. Why should I use Parsnp?
  1. The main advantages of Parsnp over alternative approaches is robust filtration of variant (SNP) calls, multiple alignments as output and superior speed. Parsnp can align 200-300 bacterial strains in <30 minutes on a 16-core server and ~1000 in a couple of hours.
  1. Why should I *not* use Parsnp?
  1. If you are interested in pan genome/whole genome alignment, existing tools for the job that perform well include Mauve, Mugsy, among others. In addition, Parsnp is tailored for intraspecific genome analysis (outbreak analysis of a pathogen, etc). One main limitation of Parsnp is that it cannot handle subsets (core genome only) and is not as sensitive as existing methods.
  1. How can I visualize the results?
  1. Gingr (http://github.com/marbl/gingr) can open Parsnp output and provide an interactive display of multi-alignments, variants and the phylogenetic tree estimated from the core genome alignment.
  1. Only a small percentage (<40%) of the reference genome covered by the alignments, huh?
  1. Parsnp is a conservative core genome alignment method that necessarily requires that all genomes are present in each aligned regions. The focus is on aligning 1000s of closely related bacterial strains quickly while maintaining sensitivity comparable to existing WGA methods. In additon, the core genome has been shown to contain as few as 30-40% of the gene content (even in very closely-related clades) due to reductive genome evolution and/or a large accessory genome (with plenty of IS/phage elements). However, for increased sensitivity w.r.t aligned regions, and alignments containing subsets, both Mugsy and Mauve are terrific tools for the job.
  1. I am not sure which reference genome to choose, help!
  1. Since the core necessarily includes all genomes, the choice of reference does not matter (with a couple important exceptions, continue reading). Feel free to use the parameter ‘-r !’ to randomly select a reference if you are feeling indecisive or if all of the genomes contained in the genome directory are of similar quality. Typically, finished/closed genomes are used a the reference strain to ensure they are high-quality and do not contain assembly artifacts or contaminant.
  1. How are the genomes selected from the input directory? Not all of the genomes I wanted included are in the tree!
  1. By default, parsnp calculates the MUMi distance between the reference and each of the genomes in the genome directory. All genomes with MUMi distance <= 0.01 are included, all others are discarded. To force all genomes present in the genome dir to be included simply include ‘-c’ as a command-line parameter.
  1. Parsnp fails to report a SNP position output by method X, why is this?
  1. The goal of parsnp is to capture all informative signals found in the core genome of the specified clade of interest. Any SNPs in regions not shared by all genomes will not reported. Additionally, any SNP found in a likely poorly aligned region would also be discarded. Finally, parsnp does not perform LCB extension and therefore may miss SNPs appearing at the end of locally conserved blocks or clusters.

Tutorial

To further demonstrate the functionality of Parsnp we have prepared two small tutorial datasets. The first dataset is a MERS coronavirus outbreak dataset involving 42 isolates. The second dataset is a selected set of 31 Streptococcus pneumoniae genomes. Both of these datasets should run on modestly equipped laptops in a few minutes.

  1. 42 MERS Coronavirus genomes

    • Download genomes:

    • Run parsnp

      ./parsnp -r ./mers42/England-Qatar_2012.fna -d ./mers42
      
    • Command-line output

    _images/runm1.png
    • Visualize with Gingr:
    _images/runm1.gingr.png
    • Inspect Output:

  2. 31 Streptococcus pneumoniae genomes

    • Download genomes:

    • Run parsnp

      ./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads>
      
    • Example output:

      _images/run1.png
    • Force inclusion of all genomes (-c)

      ./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads> -c
      
  • Command-line output:

    _images/run2.png
  • Visualize with Gingr

    _images/run1.gingr.png
  • Enable recombination detection/filter (-x)

    ./parsnp -r ./strep31/NC_011900.fna -d ./strep31 -p <num threads> -c -x
    
  • Re-visualize with Gingr

    • Bootstrap values have improved after running recombination filter; columns with filtered SNPs are displayed in image:
    _images/run2.gingr.png

fixme

fixme

fixme

fixme