Overview
Since the introduction of high-throughput, second-generation DNA sequencing technologies, there has been an enormous increase in the size of datasets being used for estimating bacterial population phylodynamics. Although many phylogenetic techniques are scalable to hundreds of bacterial genomes, methods which have been used for mitigating the effect of mechanisms of horizontal sequence transfer on phylogenetic reconstructions cannot cope with these new datasets. Gubbins (Genealogies Unbiased By recomBinations In Nucleotide Sequences) is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Simulations demonstrate the algorithm generates highly accurate reconstructions under realistic models of short-term bacterial evolution, and can be run in only a few hours on alignments of hundreds of bacterial genome sequences.
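To give a feel for what "elevated densities of base substitutions" means, here is a minimal sliding-window sketch in Python. It is not the Gubbins implementation: the function name `dense_windows`, the window and step sizes, and the fold-change threshold are all assumptions made for illustration, and the actual method is the one described in the paper.

```python
from bisect import bisect_left
from typing import List, Tuple


def dense_windows(
    snp_positions: List[int],
    genome_length: int,
    window: int = 1000,   # assumed window size, for illustration only
    step: int = 500,      # assumed step between window starts
    fold: float = 5.0,    # assumed threshold relative to the genome-wide mean
) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) for each window whose substitution count
    exceeds `fold` times the count expected from the genome-wide density."""
    if genome_length <= 0 or window <= 0 or step <= 0:
        raise ValueError("genome_length, window and step must be positive")

    positions = sorted(snp_positions)
    # Expected substitutions per window if they were spread evenly.
    expected = len(positions) * window / genome_length

    flagged = []
    for start in range(0, max(genome_length - window, 0) + 1, step):
        end = start + window
        # Count positions in the half-open interval [start, end).
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count > fold * expected:
            flagged.append((start, end, count))
    return flagged


if __name__ == "__main__":
    # Hypothetical substitution coordinates with a cluster near 5 kb.
    snps = [100, 5000, 5010, 5020, 5030, 5040, 5050, 9000]
    for start, end, count in dense_windows(snps, genome_length=10_000):
        print(f"{start}-{end}: {count} substitutions")
```

A fixed fold-change cut-off like this is crude. It is used here only because it is easy to follow, and it says nothing about statistical significance or about which branch of the tree a substitution occurred on.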
The paper
The paper is available from Nucleic Acids Research (open access): Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S. R. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". Nucleic Acids Research, 2014. doi:10.1093/nar/gku1196.
Installation
Detailed installation instructions are available in the README file.
If you already have Anaconda installed, run:
conda create -n gubbins
conda activate gubbins
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install gubbins
User manual
Instructions on how to use the software are detailed in the manual.
Quickstart tutorial with example datasets
- A small set of S. pneumoniae PMEN3 assemblies can be downloaded here (6.5 Mb archive of assemblies and a reference annotation). A brief tutorial on how to analyse such data is here
- A small S. pneumoniae PMEN1 dataset can be downloaded here (6.4 Mb multi-FASTA whole genome alignment file). The expected output is here
- A small S. aureus ST239 dataset can be downloaded here (11.3 Mb multi-FASTA whole genome alignment file). The expected output is here
Feedback/Issues
Please report problems to the issues page.
Frequently asked questions
What type of dataset can be analysed with Gubbins?
Gubbins detects recombinations through the locally-elevated densities of polymorphisms that arise when segments of sequence are acquired from a donor that is genetically divergent from the set of sequences being analysed. We recommend that you divide your population into strains using PopPUNK, and use Gubbins on a whole genome alignment of each strain separately.
Can the output of Roary be used as the input to Gubbins?
The output of Roary (the pan genome pipeline) cannot be used as the input to Gubbins. Gubbins requires a whole genome alignment as input, in order to analyse the spatial distribution of base substitutions.
Can an alignment of polymorphic sites be used as the input to Gubbins?
An alignment of polymorphic sites cannot be used as the input to Gubbins. Gubbins requires a whole genome alignment as input, in order to analyse the spatial distribution of base substitutions.
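One rough way to catch a polymorphic-sites-only alignment before running an analysis is to measure what fraction of alignment columns vary. The sketch below is a hypothetical helper, not part of Gubbins: the function names are assumptions made for illustration. In a whole genome alignment of closely related isolates most columns are invariant, whereas in an alignment of polymorphic sites nearly every column varies.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse multi-FASTA text lines into {name: sequence} (upper-cased)."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def variable_site_fraction(seqs: Dict[str, str]) -> float:
    """Fraction of columns with more than one distinct A/C/G/T base.

    Gaps and ambiguous characters are ignored when comparing bases.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("expected non-empty sequences of equal length")
    length = lengths.pop()

    variable = 0
    for column in zip(*seqs.values()):
        bases = {b for b in column if b in "ACGT"}
        if len(bases) > 1:
            variable += 1
    return variable / length


if __name__ == "__main__":
    # Toy alignment; in practice, pass an open file handle to read_fasta.
    toy = [">a", "ACGTACGTAC", ">b", "ACGTACGTAT", ">c", "ACGAACGTAC"]
    print(variable_site_fraction(read_fasta(toy)))
```

A value close to 1 suggests that invariant sites have been stripped out, so the spatial distribution of substitutions can no longer be recovered. Where to draw the line is a judgment call that depends on how divergent the sample is.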