Gubbins

Genealogies Unbiased By recomBinations In Nucleotide Sequences


Overview

Since the introduction of high-throughput, second-generation DNA sequencing technologies, there has been an enormous increase in the size of datasets being used for estimating bacterial population phylodynamics. Although many phylogenetic techniques are scalable to hundreds of bacterial genomes, methods which have been used for mitigating the effect of mechanisms of horizontal sequence transfer on phylogenetic reconstructions cannot cope with these new datasets. Gubbins (Genealogies Unbiased By recomBinations In Nucleotide Sequences) is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Simulations demonstrate the algorithm generates highly accurate reconstructions under realistic models of short-term bacterial evolution, and can be run in only a few hours on alignments of hundreds of bacterial genome sequences.

The paper

The paper is available from Nucleic Acids Research (open access): Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins.

Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S. R. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". Nucleic Acids Research, 2014. doi:10.1093/nar/gku1196.

Installation

Detailed installation instructions are available in the README file. If you already have Anaconda installed, run:

    conda create -n gubbins
    conda activate gubbins
    conda config --add channels r
    conda config --add channels defaults
    conda config --add channels conda-forge
    conda config --add channels bioconda
    conda install gubbins

User manual

Instructions on how to use the software are detailed in the manual.

Quickstart tutorial with example datasets

A small set of S. pneumoniae PMEN3 assemblies can be downloaded here (6.5 Mb archive of assemblies and a reference annotation). A brief tutorial on how to analyse such data is here.
A small S. pneumoniae PMEN1 dataset can be downloaded here (6.4 Mb multi-FASTA whole genome alignment file). The expected output is here.
A small S. aureus ST239 dataset can be downloaded here (11.3 Mb multi-FASTA whole genome alignment file). The expected output is here.

Feedback/Issues

Please report problems to the issues page.

Frequently asked questions

What type of dataset can be analysed with Gubbins?

Gubbins detects recombinations through the locally-elevated densities of polymorphisms that arise when segments of sequence are acquired from a donor that is genetically divergent from the set of sequences being analysed. We recommend that you divide your population into strains using PopPUNK, and use Gubbins on a whole genome alignment of each strain separately.
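
To make the idea of "locally-elevated densities of polymorphisms" concrete, the following is a minimal illustrative sketch, not Gubbins's actual method. It counts differences between two aligned sequences in fixed, non-overlapping windows and flags windows whose count is well above the mean. The function names, the window size and the `fold` threshold are all assumptions chosen for the toy example. Gubbins itself works on substitutions reconstructed onto the branches of a phylogeny, rather than on pairwise comparisons with an arbitrary cutoff.

```python
def snp_density_windows(reference, query, window=10):
    """Count mismatches between two aligned sequences in non-overlapping windows.

    Hypothetical helper for illustration only; gaps and ambiguous bases are ignored.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    counts = []
    for start in range(0, len(reference), window):
        ref_chunk = reference[start:start + window]
        qry_chunk = query[start:start + window]
        diffs = sum(
            1
            for r, q in zip(ref_chunk, qry_chunk)
            if r != q and r in "ACGT" and q in "ACGT"
        )
        counts.append(diffs)
    return counts


def flag_dense_windows(counts, fold=3.0):
    """Return indices of windows whose count exceeds `fold` times the mean count.

    The `fold` cutoff is an arbitrary assumption, not a Gubbins parameter.
    """
    if not counts:
        return []
    mean = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if mean > 0 and c > fold * mean]


# Toy data: one isolated substitution, plus a cluster of five substitutions
# that mimics a segment imported from a divergent donor.
reference = "A" * 50
query_bases = list(reference)
query_bases[2] = "C"
for position in (30, 31, 33, 35, 38):
    query_bases[position] = "G"
query = "".join(query_bases)

counts = snp_density_windows(reference, query, window=10)
dense = flag_dense_windows(counts, fold=3.0)
print(counts)  # [1, 0, 0, 5, 0]
print(dense)   # [3]
```

In this toy case the isolated substitution in the first window is treated as background, while the cluster in the fourth window stands out. This also shows why the approach works best within a single strain: if the sequences are divergent everywhere, the background density is high and imported segments no longer stand out.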

Can the output of Roary be used as the input to Gubbins?

The output of Roary (the pan genome pipeline) cannot be used as the input to Gubbins. Gubbins requires a whole genome alignment as input, in order to analyse the spatial distribution of base substitutions.

Can an alignment of polymorphic sites be used as the input to Gubbins?

An alignment of polymorphic sites cannot be used as the input to Gubbins. Gubbins requires a whole genome alignment as input, in order to analyse the spatial distribution of base substitutions.
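
As a rough pre-flight check, the sketch below estimates what fraction of alignment columns are variable. It is a heuristic written for this page, not part of Gubbins, and the function names are hypothetical. A whole genome alignment of closely related isolates normally has a small fraction of variable columns, whereas an alignment of polymorphic sites has a fraction at or near 1.0 because the invariant positions have been removed.

```python
def parse_fasta(text):
    """Parse multi-FASTA text into a dict of {name: sequence} (uppercase)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def variable_site_fraction(sequences):
    """Return the fraction of columns with more than one unambiguous base.

    Gaps and ambiguity codes are ignored when deciding whether a column varies.
    """
    seqs = list(sequences.values())
    if not seqs or len(seqs[0]) == 0:
        raise ValueError("alignment is empty")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences are not all the same length")
    variable = 0
    for column in zip(*seqs):
        bases = {b for b in column if b in "ACGT"}
        if len(bases) > 1:
            variable += 1
    return variable / length


# Toy whole genome alignment: 2 variable columns out of 10.
whole_genome = ">a\nACGTACGTAC\n>b\nACGTACGAAC\n>c\nACCTACGTAC\n"
# The same data reduced to its polymorphic sites: every column varies.
snps_only = ">a\nGT\n>b\nGA\n>c\nCT\n"

print(variable_site_fraction(parse_fasta(whole_genome)))  # 0.2
print(variable_site_fraction(parse_fasta(snps_only)))     # 1.0
```

Once the invariant columns are gone, the distances between substitutions along the genome are lost, and it is this spatial information that Gubbins needs.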