The conservation of intron positions comprises information useful for de novo gene prediction as well as for analyzing the origin of introns. Here, we present GenePainter, a standalone tool for mapping gene structures onto protein multiple sequence alignments (MSA). Gene structures, as provided by WebScipio, are aligned with respect to the exact positions of the introns (down to nucleotide level) and intron phase. Output can be viewed in various formats, ranging from plain text to graphical output formats.
If you use GenePainter, please cite:
Björn Hammesfahr †, Florian Odronitz †, Stefanie Mühlhausen, Stephan Waack & Martin Kollmar (2013) GenePainter: a fast tool for aligning gene structures of eukaryotic protein families, visualizing the alignments and mapping gene structures onto protein structures. BMC Bioinformatics 14, 77.
ruby gene_painter.rb -i <path_to_alignment> -p <path_to_genestructure_folder-files> [<options>]
|-i or --input||Path to fasta-formatted multiple sequence alignment|
|-p or --path||Path to folder containing gene structures in YAML or GFF format|
|Standard output format||Mark exons by '-' and introns by '|'|
Text-based output format
|--intron-phase||Mark introns by their phase instead of '|'|
|--phylo||Mark exons by '0' and introns by '1'|
|--spaces||Mark exons by space (' ') instead of '-'|
|--no-standard-output||Specify to skip standard output format.|
|--alignment||Output the alignment file with additional lines containing intron phases|
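The marker conventions of the text-based formats can be illustrated with a small sketch. This is not GenePainter's own code: the data model (an array holding `:exon` or an intron phase 0, 1 or 2 per position) and the method name `render_pattern` are hypothetical, and the precedence of `phylo` over `intron_phase` is an assumption made only for this example.

```ruby
# Toy renderer for the exon/intron marker conventions listed above.
# columns: Array whose entries are :exon or an Integer intron phase (0, 1, 2).
# Hypothetical data model, chosen for illustration only.
def render_pattern(columns, intron_phase: false, phylo: false, spaces: false)
  # Exon marker: '0' for --phylo, ' ' for --spaces, '-' in the standard format.
  exon_mark = if phylo then '0' elsif spaces then ' ' else '-' end
  columns.map do |col|
    next exon_mark if col == :exon
    # Intron marker: '1' for --phylo, the phase for --intron-phase, else '|'.
    if phylo then '1'
    elsif intron_phase then col.to_s
    else '|'
    end
  end.join
end

gene = [:exon, :exon, 0, :exon, 2, :exon, :exon, 1, :exon]
puts render_pattern(gene)                      # standard format
puts render_pattern(gene, intron_phase: true)  # as with --intron-phase
puts render_pattern(gene, phylo: true)         # as with --phylo
puts render_pattern(gene, spaces: true)        # as with --spaces
```

The real output files contain one such line per gene, aligned to the MSA; the exact layout should be checked against the files in the example folder.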
Graphical output format
|--svg||Draw a graphical representation of genes in SVG format.|
|--svg-format FORMAT||Switch between different formats.
FORMAT must be one of "normal", "reduced" or "both"
"normal" draws details of aligned exons and introns [default]
"reduced" focuses on common introns only
"both" draws both formats
|--pdb FILE||Mark consensus or merged gene structure in pdb FILE
Consensus gene structure contains introns conserved in N % of all genes
Specify N with option --consensus N; [default: 80%]
Two scripts for execution in PyMOL are provided:
'color_exons.py' to mark consensus exons
'color_splicesites.py' to mark splice junctions of consensus exons
|--pdb-chain CHAIN||Mark gene structures for chain CHAIN. [default: Use chain A]|
|--pdb-ref-prot PROT||Use protein PROT as reference for alignment with pdb sequence. [default: First protein in alignment]|
|--pdb-ref-prot-struct||Color only intron positions occurring in the reference protein structure.|
|--tree||Generate newick tree file and SVG representation|
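The difference between a consensus and a merged gene structure can be sketched as follows. This is a minimal illustration, not GenePainter's implementation: the data model (a hash from gene name to intron positions in alignment coordinates), the method names, and the reading of "conserved in N % of all genes" as "present in at least that fraction of genes" are all assumptions.

```ruby
# genes: Hash mapping a gene name to an Array of intron positions.
# Hypothetical data model; positions are alignment columns.

# Introns present in at least `threshold` (fraction between 0 and 1) of genes.
def consensus_introns(genes, threshold = 0.8)
  counts = Hash.new(0)
  genes.each_value do |positions|
    positions.uniq.each { |pos| counts[pos] += 1 }
  end
  counts.select { |_pos, n| n.to_f / genes.size >= threshold }.keys.sort
end

# Union of all intron positions, i.e. a single merged exon-intron pattern.
def merged_introns(genes)
  genes.values.flatten.uniq.sort
end

genes = {
  'geneA' => [12, 40, 77],
  'geneB' => [12, 40],
  'geneC' => [12, 77],
  'geneD' => [12, 40, 90],
  'geneE' => [12, 40]
}
p consensus_introns(genes)       # default threshold of 0.8
p consensus_introns(genes, 0.3)  # a more permissive threshold
p merged_introns(genes)          # every intron seen in any gene
```

Lowering the threshold moves the consensus towards the merged pattern; a threshold of 1 keeps only introns shared by all genes.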
Meta information and statistics
|--consensus N||Mark all introns conserved in N % of genes. Specify N as a decimal number between 0 and 1|
|--merge||Merge all introns into a single exon intron pattern|
|--statistics||Output additional file with statistics about common introns.
To include information about taxonomy, specify '--taxonomy' and '--taxonomy-to-fasta' options
|--taxonomy FILE||Use this option to mark introns by taxonomy.
FILE is either an NCBI taxonomy database dump file
or an excerpt of the NCBI taxonomy, in which each lineage is a semicolon-separated list of taxa from root to species.
|--taxonomy-to-fasta FILE||Text-based file mapping gene structure file names to species names.
One or more genes given as semicolon-separated list and species name.
Delimiter between gene list and species name must be a colon. The species name itself must be enclosed by double quotes like this "SPECIES"
|--taxonomy-common-to X,Y,Z||Mark introns common to taxa X,Y,Z. List must consist of at least one NCBI taxon (scientific name)|
|--[no-]exclusively-in-taxa||Mark introns occurring (not) exclusively in listed taxa.
[default: not exclusively]
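The file format expected by --taxonomy-to-fasta (semicolon-separated gene list, a colon, then the species name in double quotes) can be made concrete with a short parsing sketch. It is not taken from GenePainter: the method names and the gene names in the example are hypothetical, and the sketch assumes that gene names contain no colon and that whitespace around the delimiters may be ignored.

```ruby
# Parse one line of the form  gene1;gene2:"Species name"
# Returns [genes, species], or nil if the line does not match the format.
def parse_mapping_line(line)
  match = /\A(?<genes>[^:]+):\s*"(?<species>[^"]+)"\s*\z/.match(line.strip)
  return nil unless match

  genes = match[:genes].split(';').map(&:strip).reject(&:empty?)
  [genes, match[:species]]
end

# Build a Hash from gene structure name to species name.
def parse_mapping(text)
  text.each_line.with_object({}) do |line, mapping|
    parsed = parse_mapping_line(line)
    next unless parsed # skip blank or malformed lines

    genes, species = parsed
    genes.each { |gene| mapping[gene] = species }
  end
end

# Hypothetical gene names, for illustration only.
example = <<~TEXT
  HsGene1;HsGene2:"Homo sapiens"
  DmGene1:"Drosophila melanogaster"
TEXT
p parse_mapping(example)
```

A mapping of this kind is what --selection-based-on-species needs in order to resolve a species name to its gene structure files.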
Analysis and output of all or subset of data
|--analyse-all-output-all||Analyse all data and provide full output [default]|
|--analyse-all-output-selection||Analyse all data and provide text-based and graphical output for selection only. All introns are analysed, including those not present in selection|
|--analyse-selection-output-selection||Analyse selected data and provide output for selection only|
|--analyse-selection-on-all-data-output-selection||Analyse intron positions of selected data in all data and provide output for selection only. Introns present in selection are analysed in all data|
Selection criteria for data and output selection
|--select-all||No selection applied (default)|
|--selection-based-on-regex "REGEX"||Regular expression applied on gene structure file names. Regex must be enclosed by double quotes|
|--selection-based-on-list X,Y,Z||List of gene structures to be used|
|--selection-based-on-species SPECIES||Use all gene structures associated with species. Specify also --taxonomy-to-fasta to map gene structure file names to species names|
|-o or --outfile FILENAME||Prefix of the output files.|
|--path-to-output PATH||Path to the location where output files should be stored.|
|--range START,STOP||Restrict genes to range START-STOP in alignment|
|--[no-]delete-range||(Not) Delete specified range|
|--keep-common-gaps||Keep common gaps in alignment. This option affects only the output of --alignment|
|--no-best-position-introns||Always plot introns at the beginning of a gap.
Default: Align introns if their position differs by alignment gaps only
|--[no-]separate-introns-in-textbased-output||(Not) Separate each consecutive pair of introns by an exon placeholder in text-based output formats.
Default: Separate introns unless the output lines get too long.
|-h or --help||List all options available.|
Changes in command line parameters from v.1.0 to v.2.0
|v.1.0 parameter||v.2.0 parameter|
|-svg WIDTH,HEIGHT FORMAT||--svg and --svg-format|
|-start START and -stop STOP||--range START,STOP|
|-consensus||--consensus, no longer restricted to combination with -pdb|
|-f and -penalize_endgaps||obsolete|
|lib/||Folder containing library|
|example/||Folder containing test data|
|tools/||Folder containing additional scripts, i.e. to obtain NCBI taxonomy dump and YAML files|
|README||Information about installation and usage of GenePainter|
|v.2.0, June 2014||Incorporation of NCBI taxonomy|
|v.1.0, September 2012||GenePainter goes public!|