GenePainter

Conserved intron positions carry information that is useful for de novo gene prediction as well as for analyzing the origin of introns. Here, we present GenePainter, a standalone tool for mapping gene structures onto protein multiple sequence alignments (MSAs). Gene structures, as provided by WebScipio, are aligned with respect to the exact positions of the introns (down to the nucleotide level) and the intron phase. The output is available in various formats, ranging from plain text to graphics.

Usage

ruby gene_painter.rb -i <alignment> -p <yaml-files> [<options>]

<alignment>  MSA in FASTA format.
<yaml-files>  Path to the directory containing the gene structures in YAML format. Please note: Only those genes are analyzed for which both a gene structure and a sequence in the MSA are available.

Options

-o <project_name>  Base name of the output file(s) (default: 'genepainter').
-a  Output the alignment file with additional lines containing intron phases.
-n  Mark introns by their phase instead of the vertical bar "|".
-phylo  For phylogenetic analysis: mark exons and introns by "0" and "1", respectively. The result is saved in FASTA format.
-s  Mark non-intron positions by spaces instead of "-".
-svg <width> <height> [extended|normal|reduced]  Create an SVG file of size <width> x <height>.
  Use the additional parameter extended to create a more detailed SVG.
  Use the additional parameter normal to create the normal SVG.
  Use the additional parameter reduced to create an SVG focused on introns.
-start <value>  Alignment position to start at (default: position 1).
-stop <value>  Alignment position to stop at (default: last position).
-pdb <file> [chain]  Provide two scripts for execution in PyMOL. In color_exons.py the consensus exons are colored, and in color_splicesites.py the splice junctions of the consensus exons are marked for <chain> (default: chain A).
-pdb_prot <prot_name>  Use protein <prot_name> as reference for the alignment with the PDB sequence (default: first protein in <alignment>).
-f  Force the alignment between the PDB sequence and the first protein sequence of the MSA, or protein <prot_name> if specified. This overrides the default, under which intron positions are mapped only if the alignment score exceeds 70%.
-consensus <value>  Color only intron positions conserved in at least <value> percent of all genes (default: 80%).
-ref_prot_struct  Color only the intron positions occurring in the gene of the reference protein. May be combined with "-consensus".
-penalize_endgaps  Penalize gaps at the end of the alignment, as in the standard Needleman-Wunsch algorithm (default: gaps at the end of the alignment are not penalized).


Examples

$ ruby gene_painter.rb -i test_data/coronin_alignment.fas -p test_data/coronin_genes/ \
  -o coronin -svg 1000 500 extended -pdb test_data/2AQ5.pdb -pdb_prot HsCoro1A
$ ruby gene_painter.rb -i test_data/coronin_alignment.fas -p test_data/coronin_genes/ \
  -o coronin -svg 1000 500 reduced


Input

Gene structures  Gene structures need to be in YAML format as obtained from WebScipio. Further information about this format is listed here.
Alignment  FASTA-formatted multiple sequence alignment. The FASTA headers must not contain any blanks or special characters, as they must be exactly identical to the file names of the corresponding YAML files.
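
The naming rule above can be checked before a run. The following is a minimal sketch, not part of GenePainter: the method names fasta_headers and check_inputs are hypothetical, the gene structure files are assumed to be named "<fasta header>.yaml", and the set of characters accepted as "not special" is likewise an assumption.

```ruby
# Hypothetical pre-flight check, not part of GenePainter.
# Assumption: gene structure files are named "<fasta header>.yaml".

# Collect all FASTA headers (without the leading ">") from an alignment file.
def fasta_headers(path)
  File.readlines(path, chomp: true)
      .select { |line| line.start_with?('>') }
      .map { |line| line[1..-1].strip }
end

# Report headers with suspicious characters and genes missing on either side.
def check_inputs(alignment, yaml_dir)
  headers = fasta_headers(alignment)
  genes = Dir.glob(File.join(yaml_dir, '*.yaml')).map { |f| File.basename(f, '.yaml') }
  {
    # Assumed whitelist: letters, digits, underscore, dot and hyphen.
    invalid_headers: headers.reject { |h| h.match?(/\A[A-Za-z0-9_.-]+\z/) },
    without_structure: headers - genes,
    without_sequence: genes - headers
  }
end

# Command-line use: ruby check_inputs.rb <alignment> <yaml-files>
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  check_inputs(ARGV[0], ARGV[1]).each do |kind, names|
    puts "#{kind}: #{names.join(', ')}" unless names.empty?
  end
end
```

Genes listed under without_structure or without_sequence would be skipped, since only genes with both a structure and an aligned sequence are analyzed.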


Meaning of the parameters

The following figures illustrate some of GenePainter's output formats and options. All figures were generated with the test data included in the archive, which comprise coronin genes.

The basic output format is a plain text file in which exons are represented as minus signs and introns as vertical bars (Figure 1A). With the -s option (Figure 1B), the minus signs are replaced by spaces, so that only the introns are shown as "|". A more detailed output including intron phases can be obtained with the -n option (Figure 1C). Moreover, intron phases can be included as additional lines in the given alignment (Figure 2A; option -a), or an alignment based on the presence (1) and absence (0) of introns can be generated for further phylogenetic analyses (Figure 2B; option -phylo).
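
The text representations described above can be illustrated with a short sketch. It is not GenePainter's own code: the method render_pattern and its keyword arguments are hypothetical, and it assumes that each intron is drawn in the alignment column at which it is located.

```ruby
# Hypothetical renderer, not GenePainter's own code.
# `introns` maps a 1-based alignment column to the intron phase (0, 1 or 2).
def render_pattern(length, introns, phases: false, spaces: false, phylo: false)
  # Character used for positions without an intron.
  exon_char = if phylo then '0'
              elsif spaces then ' '
              else '-'
              end
  (1..length).map do |col|
    if !introns.key?(col) then exon_char
    elsif phylo then '1'                  # presence of an intron, as with -phylo
    elsif phases then introns[col].to_s   # intron phase, as with -n
    else '|'                              # default intron marker
    end
  end.join
end

introns = { 4 => 0, 9 => 2 }  # made-up intron positions and phases
puts render_pattern(12, introns)                # default output
puts render_pattern(12, introns, spaces: true)  # corresponds to -s
puts render_pattern(12, introns, phases: true)  # corresponds to -n
puts render_pattern(12, introns, phylo: true)   # corresponds to -phylo
```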

Figure 1

Figure 2

Apart from text-based output formats, graphical output can be generated. To this end, the SVG parameters -svg <width> <height> [extended|normal|reduced] need to be set. Figure 3A illustrates a detailed SVG of size 1000 x 500 pixels (extended is set). In contrast, the basic SVG (same size, but normal is set) is shown in Figure 3B, and the reduced SVG (same size, but reduced is set) in Figure 3C. To restrict the analysis to a certain part of the alignment, the first and last positions to be considered can be set by -start and -stop, respectively. Figure 4 shows a detailed SVG based on only part of the given alignment (domain of interest, alignment positions 1-612).

Figure 3

Figure 4

Additionally, if a PDB file is specified via -pdb, intron positions and phases are mapped onto the protein structure. Figure 5 shows the exons of the human coronin gene HsCoro1A (-pdb_prot HsCoro1A) mapped onto the protein structure of mouse coronin MmCoro1A (the PDB file is part of the test data set, -pdb test_data/2AQ5.pdb). While this figure considers all exons that are conserved in at least 80% of all proteins (default), Figure 6 displays all exons present in the reference sequence (-ref_prot_struct). The corresponding splice sites are shown in Figures 7 and 8. This output draws attention to intron phases, which are denoted by a three-color scheme and numbers.
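
The effect of the conservation threshold can be sketched as follows. This is an illustration under assumptions, not the GenePainter implementation: the method conserved_introns is hypothetical, the gene names and intron positions are made up, and introns are assumed to be given per gene as pairs of alignment position and phase.

```ruby
# Hypothetical helper, not part of GenePainter: returns the introns that
# occur in at least `percent` percent of all genes.
# `introns_by_gene` maps a gene name to an array of [alignment_position, phase].
def conserved_introns(introns_by_gene, percent = 80)
  counts = Hash.new(0)
  introns_by_gene.each_value do |introns|
    # Count each intron once per gene.
    introns.uniq.each { |intron| counts[intron] += 1 }
  end
  needed = introns_by_gene.size * percent / 100.0
  counts.select { |_intron, n| n >= needed }.keys.sort
end

# Made-up example data.
genes = {
  'GeneA' => [[12, 0], [40, 1]],
  'GeneB' => [[12, 0], [40, 1], [77, 2]],
  'GeneC' => [[12, 0]],
  'GeneD' => [[12, 0], [40, 1]],
  'GeneE' => [[12, 0], [40, 1]]
}
p conserved_introns(genes)      # threshold of 80 percent
p conserved_introns(genes, 20)  # a looser threshold keeps rarer introns
```

In this sketch an intron counts as shared only if both position and phase agree.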

Figure 5

Figure 6

Figure 7

Figure 8

Part of the underlying algorithm is the calculation of a global alignment between the reference and the PDB sequence. Although this alignment is a simple implementation of the Needleman-Wunsch algorithm, some adjustments have been made: by default, gaps at the end of the alignment are not penalized (the standard behavior can be restored with -penalize_endgaps). This is particularly useful because the PDB sequence and the reference sequence may differ in length. Alignments with and without free end gaps are compared in Figure 9.
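
The difference between the two modes can be made concrete with a minimal scoring sketch. It is not the GenePainter implementation: the method alignment_score is hypothetical, the scoring values (match +1, mismatch -1, linear gap cost -2) are assumptions, the sequences are toy data, and end gaps are treated as free at both ends of both sequences.

```ruby
# Minimal sketch, not GenePainter's implementation.
# Assumed scoring: match +1, mismatch -1, linear gap cost -2.
def alignment_score(a, b, penalize_endgaps: false, match: 1, mismatch: -1, gap: -2)
  rows = a.length
  cols = b.length
  # score[i][j]: best score for aligning the first i residues of a
  # with the first j residues of b.
  score = Array.new(rows + 1) { Array.new(cols + 1, 0) }
  if penalize_endgaps
    # Standard Needleman-Wunsch: leading gaps cost the full gap penalty.
    (1..rows).each { |i| score[i][0] = i * gap }
    (1..cols).each { |j| score[0][j] = j * gap }
  end
  (1..rows).each do |i|
    (1..cols).each do |j|
      diag = score[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch)
      score[i][j] = [diag, score[i - 1][j] + gap, score[i][j - 1] + gap].max
    end
  end
  return score[rows][cols] if penalize_endgaps
  # Free end gaps: the alignment may end anywhere in the last row or column.
  (score[rows] + score.map(&:last)).max
end

reference = 'ACDEFGHIKLMNPQRS'  # toy reference sequence
fragment  = 'FGHIKL'            # toy PDB sequence covering only part of it
puts alignment_score(reference, fragment)
puts alignment_score(reference, fragment, penalize_endgaps: true)
```

With free end gaps the short fragment is not punished for the unaligned flanks of the longer sequence, whereas the penalized variant charges for every overhanging residue.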

Figure 9

Download

tgz compressed archive
zip compressed archive

Version

1.0  GenePainter goes public!
