TURNIP Howto

This page gives a quick overview of how to set up the TURNIP suite and the relevant data files and directories.

Installation and Prerequisites

Installation is a quick process, once the following prerequisites have been satisfied.

In addition to Perl >= 5.8.0, you will need to install the following modules prior to running TURNIP:

BioPerl >= 1.6.3
Benchmark
Data::Dumper
List::Util
Parallel::ForkManager
Set::Scalar
Spreadsheet::WriteExcel

You can install them all by using CPAN:
perl -MCPAN -e shell

Then:
install <packagename>
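
For example, a single module can be installed non-interactively from your shell (Parallel::ForkManager is used here purely as an illustration):
perl -MCPAN -e 'install "Parallel::ForkManager"'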

Or, you can just unpack the libraries directly to somewhere on your perl @INC path.

We have found problems with the Fastq module's next_seq() method (needed by the process_fastq.pl script) in older versions of BioPerl (see the bug report here), so please install BioPerl 1.6.3 or greater.

You will also need to install:

- BLAST: make sure that the executable directory (e.g. /path/to/blast-2.2.22/bin) is on your PATH (see the example command after this list, or check your operating system documentation on how to do this). PLEASE NOTE: TURNIP only supports the legacy versions of BLAST and not the newer BLAST+ executables (legacy BLAST can be found here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/). BLAST+ will be supported in the next release of TURNIP.
- ImageMagick: the convert program from the ImageMagick suite is used by tree_gen.pl to convert the PostScript files generated by newicktops into PNG files.
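
For example, on a Linux system with a bash shell, the legacy BLAST binaries could be added to your PATH with a line such as the following in your ~/.bashrc (the installation path below is just a placeholder):
export PATH=/path/to/blast-2.2.22/bin:$PATH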

Once the prerequisites above have been satisfied, installation is very simple. Please download the TURNIP source package and unpack it, copy the whole TURNIP distribution directory to a sensible place if necessary, configure TURNIP, and run the scripts as detailed in the howto below.

Configuration

TURNIP is configured via a single file with simple key-value pairs describing parameters for various processes within the TURNIP suite. An example configuration is supplied as the conf/example.conf file. Precise details about each option are available on the Prefs documentation page, but the options that you are most likely to change are outlined below (a sample configuration sketch follows the list):

  • datapath: Setting up the data directories is the initial step. The datapath property in the conf file specifies an absolute or relative directory in which the separate FASTA and quality files for the strain(s) to be analysed are stored. Each organism or strain should have two files: one for the FASTA read sequences and one for the respective read quality scores. If you have a single FASTQ file, TURNIP contains the process_fastq.pl script to separate this into the two required file formats. Please see the relevant page on how to do this if necessary.
  • qual_db_suffix: the suffix of the quality files supplied to TURNIP in the datapath directory, e.g. '.qual'
  • fasta_db_suffix: the suffix of the FASTA files supplied to TURNIP in the datapath directory, e.g. '.fasta'
  • smp: this should be set to '1' if you would like to use multiple cores
  • processes: this should represent the number of cores to use, e.g. '4'. This option will only come into effect if smp is set to '1'
  • strain_list: this is a comma-separated list of strains to analyse. Each entry, followed by the respective suffix specified in the two options above, should match the filename of the corresponding FASTA/quality file. For example, a strain_list entry of '273614X' and a suffix of '_qual.clipped.sc' would look for the quality score filename '273614X_qual.clipped.sc'
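
Putting these options together, a minimal configuration sketch might look like the following (the key=value layout, paths, suffixes and strain names are purely illustrative; see conf/example.conf and the Prefs page for the exact syntax and values):
datapath=data/
fasta_db_suffix=.fasta
qual_db_suffix=.qual
smp=1
processes=4
strain_list=273614X,BC187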

TURNIP can run multiple worker processes concurrently. If your machine has multiple cores, set the smp property to '1' and set the number of processes to fork using the processes property. TURNIP can sensibly run with 2 processes per core, i.e. a dual-core machine should have processes=4, and a cluster machine with 64 cores can be set at processes=128. You may have issues with BLAST if you specify many cores on a 32-bit operating system. To be able to specify more cores (usually anything greater than 8), you should consider running TURNIP on a 64-bit operating system with the 64-bit BLAST version.
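
This multi-process behaviour builds on the Parallel::ForkManager module listed in the prerequisites. The following standalone sketch shows the general forking pattern only; it is not TURNIP's actual code and the strain names are hypothetical:
use strict;
use warnings;
use Parallel::ForkManager;

my $processes = 4;                            # the 'processes' option
my $pm = Parallel::ForkManager->new($processes);

for my $strain ('273614X', 'BC187') {         # hypothetical strain list
    $pm->start and next;                      # fork; the parent moves on to the next strain
    # ... the per-strain analysis would run here in the child process ...
    $pm->finish;                              # the child exits once its work is done
}
$pm->wait_all_children;                       # the parent waits for all workers to complete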

Running TURNIP

The TURNIP distribution comes packaged with the SGRP project S. cerevisiae data in the data/ directory. The supplied conf/example.conf file is set up so that TURNIP is immediately runnable with the given distribution configuration.

Assuming that your configuration file is correctly pointing to the relevant data directories (whether for the supplied data or your own datasets), running TURNIP is simple:
perl turnip.pl -c conf/example.conf
Each of the relevant separate scripts will be called in turn by the "parent" turnip.pl script, so explicit calling of these "child" scripts is unnecessary. Furthermore, if any problems occur during a TURNIP run, it should be a simple matter of re-running the same command: TURNIP will resume from a suitable previous point in the analysis. In-depth perldoc documentation is nonetheless available for each of the individual scripts, both as part of this website and packaged up with the TURNIP distribution.

You should then see output scrolling down the screen reporting information about each stage of the TURNIP variation detection mechanisms. You can pipe this output to a file if you want to check it or store it for later perusal. An example of this output can be found here.
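
For example, to store the run output in a log file while still watching it on screen (the log filename is arbitrary):
perl turnip.pl -c conf/example.conf 2>&1 | tee turnip_run.log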

Following a successful run, TURNIP will output data in four formats: text (simple human-readable output), GFF (for use with GBrowse), SQL (for importing into a database) and XLS (for use in a spreadsheet program). The Excel workbook contains two sheets: the first with "raw" variation totals, and a second sheet (suffixed "2") which attempts to cluster the indels into sequential groups and sum these groups instead of the raw counts. The different variation types are colour-coded; low frequency pSNPs are coloured dark green whereas high frequency pSNPs are light green. Examples of these outputs for the S. cerevisiae strain BC187 are available here: TXT, GFF, SQL, XLS

TURNIP also contains scripts to generate distance matrices and subsequent phylogenetic trees using the GENDIST and quicktree programs. Please see the matrix_gen.pl and tree_gen.pl documentation for how to run these scripts, and visit the online TURNIP variation maps page for on-the-fly generation of these matrices and trees.

The TURNIP algorithm

An overview of the TURNIP algorithms is available in the supporting paper (DOI: 10.1101/gr.084517.108) and presentations (PDF, PDF). The general algorithm is outlined in a supplementary flow diagram, and described in detail in the following section.

The algorithm needs three inputs: a text file containing a consensus sequence, a FASTA file containing all the reads of the query organism, and a file containing the quality scores for the corresponding reads. The HitSeries.pm->generate() subroutine takes the consensus file and the FASTA reads file and runs a liberal blastall on all the reads against the consensus, resulting in a set of 'seed' reads that loosely map to the consensus region and will be used for the following stages (the hitseries.out file). All read qualities are stored on disk in binary hashed indices (<strain>.qual.index), using the read ID as the key to improve the speed of lookups in all subsequent relevant stages of the algorithm. Please see the IndexDict.pm script documentation for this step.
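
For orientation, a legacy BLAST search of reads against a consensus takes roughly the following form; the filenames, e-value and output settings below are illustrative and are not TURNIP's exact parameters:
formatdb -i consensus.fasta -p F
blastall -p blastn -d consensus.fasta -i strain_reads.fasta -e 1e-5 -m 8 -o blast_hits.out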

hitseries.out is then processed by the main Turnip.pm script. The generate_positional_attempts() subroutine generates sequential 100mer slices of the consensus, each acting as the reference anchor 'window', together with the distinct read sequences for that window (different reads with the same 100mer sequence are not represented, and therefore not checked, more than once). A read sequence may be shorter than 100 bases because the read only partially overlaps the consensus at the given position. This is stored in the positional_attempts.dat file, and an example of this output is below (a small windowing sketch follows the example):
>Test sequence
atcatcgatcgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo1.p1k
atcatcgatcgatgccgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo4.p1k
atcatcgatcgatgcatcgaaaatagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo5.p1k
atcatcgatcgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo8.q1k
cgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo2.q1k
atcatcgatcgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcg
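As an illustration of the windowing idea only (this is not the generate_positional_attempts() implementation, and it assumes simple non-overlapping slices), sequential 100mer consensus windows can be cut with substr:
use strict;
use warnings;

my $consensus   = 'atcatc...';   # the consensus sequence (read from the consensus file in practice)
my $window_size = 100;

for (my $offset = 0; $offset < length($consensus); $offset += $window_size) {
    my $window = substr($consensus, $offset, $window_size);
    # ... gather the distinct read sequences overlapping this window here ...
}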
positional_attempts.dat is then processed by the main Turnip.pm script. The generate_positional_alignments() subroutine runs a gapped alignment using MUSCLE on each distinct 'seed' against the consensus sequence 100mer. This is stored in the positional_alignments.dat file, and an example of this output is below:
>Test sequence
atcatcgatcgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo1.p1k
atcatcgatcgatgc--cgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo4.p1k
atcatcgatcgatgcatcgaaaatagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo5.p1k
atcatcgatcgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo8.q1k
---------cgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcgatcgtacgggcggggcggatcggct
>foo2.q1k
atcatcgatcgatgcatcgatcgtagcgagcagctactatcgatcgatcgatcactacgtacgatcgatcgatcg-------------------------
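For reference, a MUSCLE 3.x alignment of a window's reads plus the consensus 100mer can be produced with a command along these lines (the filenames are illustrative, and the exact options TURNIP passes may differ):
muscle -in window_reads.fasta -out window_aligned.fasta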
As an optimisation step, the original consensus file, hitseries.out and positional_alignments.dat are used to find all the seed reads that match the consensus exactly (such as foo5.p1k, foo8.q1k and foo2.q1k in the example above). These reads are removed from the next stage of the analysis, but are kept as a reference so that variation frequencies can still be calculated, and are stored in the <strain>_consensus_reads.dat file.

The hashed quality index, the consensus-identical reads and the gapped alignment reads (positional_alignments.dat) are then used by the HitSeries.pm->process() subroutine to produce validated 'ascriptions' for each window position. This is the most computationally intensive section of the algorithm.

Each base of the 100mer gapped alignments is compared in sliding 20mer windows to the corresponding consensus 20mer (if the read ID is not listed in <strain>_consensus_reads.dat). At each position in the 20mer window, the algorithm records the position, consensus letter, quality, target letter and read name only if the quality is above the minimum quality threshold. In this way, the 20mer representing the consensus window can be extended (by inserting dashes) upon finding an insertion in a seed read, and similarly the seed window can be extended if a deletion is found, resulting in precise determination of indels. Each 20mer window ascription set is stored in a <position>_ascriptions.dat file, in the form of a set of arrays comprising the relative window position, the consensus letter, the quality score, the read letter and the read ID. An example of this output is below:
...
{5140 =>
    [0, g, 32, g, foo1.p1k],
    [1, g, 34, g, foo1.p1k],
    [2, -, 42, a, foo1.p1k],
    [3, -, 56, t, foo1.p1k],
    [4, a, 56, a, foo1.p1k],
    [5, t, 56, t, foo1.p1k],
    [6, c, 56, c, foo1.p1k],
    [7, c, 56, c, foo1.p1k],
    [8, t, 56, t, foo1.p1k],
    [9, g, 56, g, foo1.p1k],
    [10, a, 56, a, foo1.p1k],
    [11, a, 54, a, foo1.p1k],
    [12, t, 56, t, foo1.p1k],
    [13, a, 56, a, foo1.p1k],
    [14, g, 56, g, foo1.p1k],
    [15, g, 56, g, foo1.p1k],
    [16, g, 56, g, foo1.p1k],
    [17, c, 20, c, foo1.p1k],
    [18, a, 18, a, foo1.p1k],
    [19, a, 20, a, foo1.p1k]
},
{5140 =>
    [0, g, 32, g, foo2.q1k],
    [1, g, 34, g, foo2.q1k],
    [2, -, 42, a, foo2.q1k],
    [3, -, 56, t, foo2.q1k],
    [4, a, 56, a, foo2.q1k],
    [5, t, 56, t, foo2.q1k],
    [6, c, 56, c, foo2.q1k],
    [7, c, 56, a, foo2.q1k],
    [8, t, 56, t, foo2.q1k],
    [9, g, 56, g, foo2.q1k],
    [10, a, 56, a, foo2.q1k],
    [11, a, 56, a, foo2.q1k],
    [12, t, 56, t, foo2.q1k],
    [13, a, 56, a, foo2.q1k],
    [14, g, 56, g, foo2.q1k],
    [15, g, 56, g, foo2.q1k],
    [16, g, 56, g, foo2.q1k],
    [17, c, 56, c, foo2.q1k],
    [18, a, 56, a, foo2.q1k],
    [19, a, 24, a, foo2.q1k]
}
...
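In outline, the per-base comparison described above amounts to something like the following much-simplified sketch; this is not the HitSeries.pm->process() code, and the variable names and quality threshold are hypothetical (the window data is taken from the foo1.p1k example above):
use strict;
use warnings;

# one 20mer window from the ascriptions example above; gaps ('-') in the
# consensus mark an insertion in the read, and qualities are per read base
my $read_id          = 'foo1.p1k';
my $consensus_window = 'gg--atcctgaatagggcaa';
my $read_window      = 'ggatatcctgaatagggcaa';
my @window_quals     = (32, 34, 42, 56, 56, 56, 56, 56, 56, 56,
                        56, 54, 56, 56, 56, 56, 56, 20, 18, 20);
my $min_quality      = 15;                    # hypothetical threshold; every base here passes

my @ascriptions;
for my $i (0 .. 19) {
    my $quality = $window_quals[$i];
    next if $quality < $min_quality;          # discard low-quality base calls
    push @ascriptions, [ $i,
                         substr($consensus_window, $i, 1),
                         $quality,
                         substr($read_window, $i, 1),
                         $read_id ];
}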
Finally, the ascription files are processed and each position is called for variation. This step records:
  • pos - the actual position on the consensus that relates to this 20mer
  • consensus - the consensus letter
  • good_hits - the number of validated ascribed bases
  • total_mismatch - the total number of reads that have mismatching bases
  • max_mismatch - the maximum number of reads with a detected variation
  • m_dict - a histogram recording the frequency of each variation sequence (single letter for SNPs and pSNPs, dash for deletions and a string longer than 1 for insertions)
  • consensus_reads - the list of reads that match the consensus at this position
  • non_consensus_reads - the list of reads that do not match the consensus
  • dels - the number of detected deletions
  • del_list - the list of positions with deletions (if any)
  • del_reads - the list of reads with deletions (if any)
  • ins - the number of detected insertions
  • ins_list - the list of positions with insertions (if any)
  • ins_reads - the list of reads with insertions (if any)
  • error_rate - the combined error rate for all variations at this position
This output is stored in the agreement.dat file. An example of the output of this step is below:
...
pos => 3630
consensus => t
good_hits => 27
total_mismatch => 16
max_mismatch => 16
m_dict => $VAR1 = {'-' => 0,'c' => 0,'n' => 0,'a' => 0,'g' => 0,'gttaa' => 8,'t' => 19};
consensus_reads => $VAR1 = ['gnl|ti|1250552624','gnl|ti|1250549048','gnl|ti|1250549240','gnl|ti|1250549273','gnl|ti|1250549770',
'gnl|ti|1250550212','gnl|ti|1250550480','gnl|ti|1250551756','gnl|ti|1250552033','gnl|ti|1250552252',
'gnl|ti|1250552615','gnl|ti|1250552777','gnl|ti|1250554812','gnl|ti|1250554999','gnl|ti|1250556640',
'gnl|ti|1250556691','gnl|ti|1250556857','gnl|ti|1250558084','gnl|ti|1250559430'];
non_consensus_reads => $VAR1 = [];
dels => 0
del_list => $VAR1 = [];
del_reads => $VAR1 = [];
ins => 8
ins_list => $VAR1 = [
    [3630,'gttaa','-----','51'],
    [3630,'gttaa','-----','51'],
    [3630,'gttaa','-----','51'],
    [3630,'gttaa','-----','56'],
    [3630,'gttaa','-----','56'],
    [3630,'gttaa','-----','56'],
    [3630,'gttaa','-----','51'],
    [3630,'gttaa','-----','43']
];
ins_reads => $VAR1 = ['gnl|ti|1250549039','gnl|ti|1250549654','gnl|ti|1250551222','gnl|ti|1250551478','gnl|ti|1250555498',
'gnl|ti|1250557019','gnl|ti|1250557187','gnl|ti|1250558732'];
error_rate => 4.03058831829047e-07
...
agreement.dat is used by the Turnip.pm->parse_agreements() subroutine to generate the txt, sql, gff and xls output files.
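
As a pointer to how the XLS output described earlier can be produced, the following is a minimal Spreadsheet::WriteExcel sketch of a two-sheet, colour-coded workbook; the filename, sheet names, columns and colour are illustrative and not necessarily what parse_agreements() uses:
use strict;
use warnings;
use Spreadsheet::WriteExcel;

# create a workbook with the two sheets described in the output section above
my $workbook = Spreadsheet::WriteExcel->new('BC187_variation.xls');
my $raw      = $workbook->add_worksheet('BC187');
my $grouped  = $workbook->add_worksheet('BC187 2');

# a cell format for colour-coding (further formats would be added per variation type)
my $green = $workbook->add_format(bg_color => 'green');

# header row plus one data row taken from the agreement.dat example above
$raw->write_row(0, 0, ['pos', 'consensus', 'good_hits', 'max_mismatch']);
$raw->write_row(1, 0, [3630, 't', 27, 16], $green);   # colour applied purely for illustration
$grouped->write_row(0, 0, ['pos', 'grouped indel count']);

$workbook->close;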