GenePainter v.2.0

The conservation of intron positions carries information that is useful for de novo gene prediction as well as for analyzing the origin of introns. Here, we present GenePainter, a standalone tool for mapping gene structures onto protein multiple sequence alignments (MSAs). Gene structures, as provided by WebScipio, are aligned with respect to the exact positions of the introns (down to the nucleotide level) and the intron phase. Output is available in various formats, ranging from plain text to graphics.

If you use GenePainter, please cite:

Björn Hammesfahr †, Florian Odronitz †, Stefanie Mühlhausen, Stephan Waack & Martin Kollmar (2013) GenePainter: a fast tool for aligning gene structures of eukaryotic protein families, visualizing the alignments and mapping gene structures onto protein structures. BMC Bioinformatics 14, 77.

Usage

ruby gene_painter.rb -i <path_to_alignment> -p <path_to_genestructure_folder-files> [<options>]

-i or --input            Path to fasta-formatted multiple sequence alignment
-p or --path             Path to folder containing gene structures in YAML or GFF format
Standard output format   Mark exons by '-' and introns by '|'

Options

Text-based output format

--intron-phase         Mark introns by their phase instead of '|'
--phylo                Mark exons by '0' and introns by '1'
--spaces               Mark exons by space (' ') instead of '-'
--no-standard-output   Specify to skip standard output format.
--alignment            Output the alignment file with additional lines containing intron phases
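
The marker conventions above can be illustrated with a minimal sketch. It is not GenePainter's own code: the gene names, the intron positions and the helper render_pattern are hypothetical, and the layout (one column per intron position found in any gene, with an exon placeholder around each column) is a simplifying assumption.

```ruby
# Hypothetical input: alignment column of each intron => intron phase.
GENES = {
  "GeneA" => { 12 => 0, 40 => 2 },
  "GeneB" => { 12 => 0, 57 => 1 }
}.freeze

# Render one gene as a text pattern.
# exon/intron are the marker characters; phase: true prints the
# intron phase instead of the intron marker.
def render_pattern(introns, all_positions, exon: "-", intron: "|", phase: false)
  cells = all_positions.map do |pos|
    if introns.key?(pos)
      phase ? introns[pos].to_s : intron
    else
      exon
    end
  end
  # Exon placeholder before, between and after the intron columns.
  exon + cells.join(exon) + exon
end

# All intron positions seen in any gene, in alignment order.
ALL_POSITIONS = GENES.values.flat_map(&:keys).uniq.sort

GENES.each do |name, introns|
  puts "#{name}  #{render_pattern(introns, ALL_POSITIONS)}"
end
```

Passing exon: "0", intron: "1" mimics the --phylo convention, exon: " " mimics --spaces, and phase: true mimics --intron-phase.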

Graphical output format

--svg                    Draw a graphical representation of genes in SVG format.
--svg-format FORMAT      Switch between different formats.
                         FORMAT must be one of "normal", "reduced" or "both"
                         "normal" draws details of aligned exons and introns [default]
                         "reduced" focuses on common introns only
                         "both" draws both formats
--pdb FILE               Mark consensus or merged gene structure in pdb FILE
                         Consensus gene structure contains introns conserved in N % of all genes
                         Specify N with option --consensus N; [default: 80%]
                         Two scripts for execution in PyMOL are provided:
                         'color_exons.py' to mark consensus exons
                         'color_splicesites.py' to mark splice junctions of consensus exons
--pdb-chain CHAIN        Mark gene structures for chain CHAIN. [default: Use chain A]
--pdb-ref-prot PROT      Use protein PROT as reference for alignment with pdb sequence. [default: First protein in alignment]
--pdb-ref-prot-struct    Color only intron positions occurring in the reference protein structure.
--tree                   Generate newick tree file and SVG representation

Meta information and statistics

--consensus N   Mark all introns conserved in N % of genes. Specify N as a decimal number between 0 and 1
--merge         Merge all introns into a single exon-intron pattern
--statistics    Output additional file with statistics about common introns.
                To include information about taxonomy, specify the '--taxonomy' and '--taxonomy-to-fasta' options
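
The difference between a consensus and a merged gene structure can be sketched as follows. This is an illustration under assumptions, not GenePainter's implementation: the gene names and intron positions are made up, and "conserved in N % of genes" is read here as "present in at least a fraction N of the genes".

```ruby
# Hypothetical input: gene name => alignment columns that carry an intron.
INTRONS = {
  "GeneA" => [12, 40],
  "GeneB" => [12, 40, 57],
  "GeneC" => [12]
}.freeze

# Introns present in at least `threshold` (fraction, 0..1) of all genes.
def consensus_introns(genes, threshold = 0.8)
  counts = Hash.new(0)
  genes.each_value { |positions| positions.uniq.each { |pos| counts[pos] += 1 } }
  counts.select { |_pos, n| n.to_f / genes.size >= threshold }.keys.sort
end

# Union of all introns, i.e. a single merged exon-intron pattern.
def merged_introns(genes)
  genes.values.flatten.uniq.sort
end

puts "consensus (0.8): #{consensus_introns(INTRONS).inspect}"
puts "consensus (0.6): #{consensus_introns(INTRONS, 0.6).inspect}"
puts "merged:          #{merged_introns(INTRONS).inspect}"
```

Lowering the threshold admits introns shared by fewer genes, while the merged pattern keeps every intron regardless of how often it occurs.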

Taxonomy

--taxonomy FILE                Use this option to mark introns by taxonomy.
                               FILE is an NCBI taxonomy database dump file
                               OR an excerpt of the NCBI taxonomy. Lineage must be a semicolon-separated list of taxa from root to species.
--taxonomy-to-fasta FILE       Text-based file mapping gene structure file names to species names.
                               One or more genes, given as a semicolon-separated list, and the species name.
                               Delimiter between gene list and species name must be a colon. The species name itself must be enclosed by double quotes like this "SPECIES"
--taxonomy-common-to X,Y,Z     Mark introns common to taxa X,Y,Z. List must consist of at least one NCBI taxon (scientific name)
--[no-]exclusively-in-taxa     Mark introns occurring (not) exclusively in listed taxa.
                               [default: not exclusively]
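
To make the --taxonomy-to-fasta file format concrete, here is a minimal parsing sketch. It assumes one entry per line; the gene names, the species and the helper parse_taxonomy_to_fasta are hypothetical and not part of GenePainter.

```ruby
# Hypothetical file content: gene list, colon, species name in double quotes.
EXAMPLE = <<~TEXT
  HsMyo1a;HsMyo1b:"Homo sapiens"
  DmMyo1:"Drosophila melanogaster"
TEXT

# Returns a hash mapping each gene structure name to its species name.
def parse_taxonomy_to_fasta(text)
  mapping = {}
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    # Gene list, then a colon, then the double-quoted species name.
    match = line.match(/\A(.+?):\s*"([^"]+)"\z/)
    raise ArgumentError, "unexpected line: #{line}" unless match
    species = match[2]
    match[1].split(";").map(&:strip).reject(&:empty?).each do |gene|
      mapping[gene] = species
    end
  end
  mapping
end

parse_taxonomy_to_fasta(EXAMPLE).each do |gene, species|
  puts "#{gene} -> #{species}"
end
```

Lines that do not follow the pattern raise an error in this sketch, which makes a missing colon or missing quotes easy to spot before running an analysis.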

Analysis and output of all or subset of data

--analyse-all-output-all                            Analyse all data and provide full output [default]
--analyse-all-output-selection                      Analyse all data and provide text-based and graphical output for selection only. All introns are analysed, including those not present in selection
--analyse-selection-output-selection                Analyse selected data and provide output for selection only
--analyse-selection-on-all-data-output-selection    Analyse intron positions of selected data in all data and provide output for selection only. Introns present in selection are analysed in all data

Selection criteria for data and output selection

--select-all                            No selection applied (default)
--selection-based-on-regex "REGEX"      Regular expression applied on gene structure file names. Regex must be enclosed by double quotes
--selection-based-on-list X,Y,Z         List of gene structures to be used
--selection-based-on-species SPECIES    Use all gene structures associated with species. Specify also --taxonomy-to-fasta to map gene structure file names to species names
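
The three selection criteria can be pictured with the following sketch. It only illustrates the idea: the names, the mapping and the helper methods are hypothetical, and GenePainter's own matching rules (for example, handling of file extensions) may differ.

```ruby
# Hypothetical gene structure names and gene-to-species mapping.
NAMES = %w[HsMyo1a HsMyo1b DmMyo1].freeze
SPECIES_OF = {
  "HsMyo1a" => "Homo sapiens",
  "HsMyo1b" => "Homo sapiens",
  "DmMyo1"  => "Drosophila melanogaster"
}.freeze

# Keep names matching the regular expression given as a string.
def select_by_regex(names, pattern)
  regex = Regexp.new(pattern)
  names.select { |name| regex.match?(name) }
end

# Keep names that appear in an explicit list.
def select_by_list(names, wanted)
  names & wanted
end

# Keep names that the mapping assigns to the given species.
def select_by_species(names, species_of, species)
  names.select { |name| species_of[name] == species }
end

puts select_by_regex(NAMES, "^Hs").inspect
puts select_by_list(NAMES, %w[DmMyo1 HsMyo1a]).inspect
puts select_by_species(NAMES, SPECIES_OF, "Homo sapiens").inspect
```

Selection by species depends on a mapping from gene structures to species, which is why --taxonomy-to-fasta has to be given alongside --selection-based-on-species.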

General options

-o or --outfile FILENAME                         Prefix of the output files.
--path-to-output PATH                            Path to the location where output files should be stored.
--range START,STOP                               Restrict genes to range START-STOP in alignment
--[no-]delete-range                              (Not) Delete specified range
--keep-common-gaps                               Keep common gaps in alignment. This option affects only the output of --alignment
--no-best-position-introns                       Always plot introns onto the beginning of a gap.
                                                 Default: Align introns if their position differs by alignment gaps only
--[no-]separate-introns-in-textbased-output      (Not) Separate each consecutive pair of introns by an exon placeholder in text-based output formats.
                                                 Default: Separate introns unless the output lines get too long.
-h or --help                                     List all options available.

For a complete list of all options available, please refer to the documentation.

Changes in command line parameters from v.1.0 to v.2.0

v.1.0 parameter               v.2.0 parameter
-a                            --alignment
-n                            --intron-phase
-phylo                        --phylo
-s                            --spaces
-svg WIDTH,HEIGHT FORMAT      --svg and --svg-format
-start START and -stop STOP   --range START,STOP
-pdb                          --pdb
-pdb_prot                     --pdb-ref-prot
-ref_prot_struct              --pdb-ref-prot-struct
-consensus                    --consensus, no longer restricted to combination with -pdb
-f and -penalize_endgaps      obsolete
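
For scripts that still call v.1.0, the one-to-one renames in the table can be applied mechanically. The helper below is a hypothetical sketch, not part of the GenePainter distribution. It rewrites only the simple renames and merely reports the options that changed shape or became obsolete, because it does not know which values belonged to them.

```ruby
# One-to-one renames taken from the table above.
RENAMED = {
  "-a" => "--alignment",
  "-n" => "--intron-phase",
  "-phylo" => "--phylo",
  "-s" => "--spaces",
  "-pdb" => "--pdb",
  "-pdb_prot" => "--pdb-ref-prot",
  "-ref_prot_struct" => "--pdb-ref-prot-struct",
  "-consensus" => "--consensus"
}.freeze

# Options whose arguments changed and need rewriting by hand.
RESTRUCTURED = %w[-svg -start -stop].freeze
# Options that no longer exist in v.2.0.
OBSOLETE = %w[-f -penalize_endgaps].freeze

# Returns [migrated_args, warnings]; unknown arguments pass through unchanged.
def migrate_args(args)
  warnings = []
  migrated = args.map do |arg|
    if RENAMED.key?(arg)
      RENAMED[arg]
    else
      warnings << "#{arg} changed in v.2.0, rewrite by hand" if RESTRUCTURED.include?(arg)
      warnings << "#{arg} is obsolete in v.2.0, remove by hand" if OBSOLETE.include?(arg)
      arg
    end
  end
  [migrated, warnings]
end

migrated, warnings = migrate_args(%w[-i aln.fas -p genes -n -s -f])
puts migrated.join(" ")
warnings.each { |w| warn w }
```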

Download

tgz compressed archive (click to download archive)
zip compressed archive (click to download archive)
Documentation (click to download pdf)

Archive content

gene_painter.rb      Main script
lib/                 Folder containing library
example/             Folder containing test data
tools/               Folder containing additional scripts, i.e. to obtain NCBI taxonomy dump and YAML files
Documentation.pdf    Exhaustive documentation
README               Information about installation and usage of GenePainter

Version

v.2.0, June 2014         Incorporation of NCBI taxonomy
v.1.0, September 2012    GenePainter goes public!
