Annotate Documentation

Origins

Annotate is one of the programs using code built at Massey and MSKCC, including Peek3. It is designed to merge information about a sequence, or a set of related sequences, that has been calculated by external programs, and also to calculate a few functions of its own.

The latest bleeding-edge version of Annotate is found in xray0/bin/annotate. Unlike the rest of the Ogre suite, Annotate cannot be copied and redistributed without permission from Nikola, because his algorithm for structure propensity is not in the public domain.

Caveat

Annotate is very much still under development. Its function and its syntax have a nasty habit of changing. The built-in documentation is minimal, but the "?" command will often bring up a list of options.

Introduction

Annotate reads single sequences or multiple-sequence alignments. It is designed to mark up (i.e. annotate) these sequences with properties derived for them: secondary structure predictions (PhD, PROF), secondary structure (DSSP), and structure propensity (built-in implementation). It can average certain properties and output them in various formats.

Annotate stores data in objects. Objects have no mystical property; they are just the thing that Annotate stores the data in. Each object has a name, assigned automatically, that can be used to distinguish between multiple objects. There can be multiple objects in Annotate, but at present I recommend dealing with one at a time. A new object is created each time you invoke the read command, so each sequence alignment is in a different object.

Each time more derived data (DSSP, PHD output etc.) is added to a sequence object, it is put in a property. The name of the property is supposed to be related to what the data is, but the creativity of the programmer is always limited. Properties are associated with each sequence within an object, not just with the object itself, so you can load and store multiple PhD runs for multiple different sequences within an object. If you load multiple PhD runs for the same sequence, then multiple properties will be created, probably with the same name, and mayhem may result.
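
To make the object / sequence / property relationship concrete, here is a minimal Python sketch. It is not Annotate's internal representation, which this document does not describe. The class names, the add_property and find methods, and the sequence name "cul1" are all invented for illustration. It only shows why loading the same kind of data twice for one sequence leaves two properties with the same name.

```python
from dataclasses import dataclass, field


@dataclass
class Property:
    name: str      # e.g. "phd" or "dssp_ss"
    values: list   # one entry per residue


@dataclass
class Sequence:
    name: str
    residues: str
    # A list, not a dict: nothing stops two properties sharing a name.
    properties: list = field(default_factory=list)

    def add_property(self, name, values):
        self.properties.append(Property(name, list(values)))

    def find(self, name):
        """Return every property with this name (possibly more than one)."""
        return [p for p in self.properties if p.name == name]


@dataclass
class AlignmentObject:
    name: str
    sequences: dict = field(default_factory=dict)  # sequence name -> Sequence


# Hypothetical object and sequence names, for illustration only.
obj = AlignmentObject("cullins")
obj.sequences["cul1"] = Sequence("cul1", "MSSTR")
obj.sequences["cul1"].add_property("phd", "LLHHH")
obj.sequences["cul1"].add_property("phd", "LHHHH")  # second run, same name
duplicates = obj.sequences["cul1"].find("phd")
print(len(duplicates))
```

A lookup by name now returns two candidates, which is the "mayhem" mentioned above. Loading each kind of derived data once per sequence avoids it.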

Annotate does not do sequence alignments so there is no way to update or build a multiple sequence alignment within the program. Use CLUSTALX or DNASTAR for that purpose. This was a design decision within the program, since multiple sequence alignments are a challenge and still major research projects for bioinformatics labs. Unless I wake up one morning and have huge insight as to how to implement multiple-sequence alignments, this lack-of-feature will remain a feature of the program.

Commands

On startup, Annotate offers only 5 commands:

read	Read a sequence/alignment
exit	Exit the program
quit	Exit the program
?	Show help info
help	Show help info

No other commands are shown because they are unavailable: you don't have any sequences read into the program, so there's no data to act on. Once you have read data with the read command, the extended list is:

average	Average sequence properties
conservation	Calculate sequence conservation
dssp	Annotate from DSSP results
dump	List ALL object data
excel	EXCEL-style data dump
kill	Delete current object
list	List object sequence data
npp	Calculate structure propensity
omac	Write Grasp and O macros
pdj	Phil's tweaked structure propensity
phd	Annotate from PHD results
plot	Write data to plot file (portrait)
lplot	Write data to plot file (landscape)
read	Read a sequence/alignment
start	Define sequence start number
write	Write a sequence (alignment)
end	Exit the program
exit	(ditto)
quit	(ditto)
?	Show help info

Command Description

Commands are grouped according to function: reading data, annotating data, calculating data, outputting data.

Reading data into the program

Read filename	Reads sequence data into a new object.
	Formats: MSF/PILEUP, FASTA, BLAST, PHD
	The PHD format extracts the MAXHOM alignment from PHD outputs, but does not append the PHD and PROF data. FASTA is designed to read single sequences, although it will read multiple sequences within a file. It cannot align those sequences, however. BLAST attempts to extract multiple sequence alignments from BLAST web output. It cannot handle HTML tags.
Start number	Define the sequence start number.
	Often multiple sequence alignments start at 1 by default, and Annotate tries to extract the start number anyway, but you can always attempt to redefine the start point if you know better.
Kill	Deletes an object and all its contents.
	It is not necessary to do this before exiting the program, so it's not obvious why you would want to do this ;)
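
The effect of a start number is easiest to see as a mapping from alignment columns to residue numbers. The sketch below is an illustration in Python, not Annotate's code. It assumes that gaps are written as "-" or "." and that gap columns carry no residue number. The function name residue_numbers is hypothetical.

```python
GAP_CHARS = "-."  # assumption: characters used for alignment gaps


def residue_numbers(aligned, start=1):
    """Return one entry per alignment column: a residue number, or None for a gap."""
    numbers = []
    current = start
    for ch in aligned:
        if ch in GAP_CHARS:
            numbers.append(None)
        else:
            numbers.append(current)
            current += 1
    return numbers


numbering = residue_numbers("MK--LV", start=15)
print(numbering)  # [15, 16, None, None, 17, 18]
```

With a start number of 15 the first residue is numbered 15, and the count advances only on residues, never on gaps.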

Reading derived data into the program

Phd filename	Reads PhD and PROF data from a PHD run.
	Attempts to deal with the vagaries of the format. Currently does not handle HTML format output correctly. I am working on this feature. Creates properties called obj.phd_aa (sequence), obj.phd (secondary structure prediction), obj.phd_rel (reliability of prediction), obj.phd_prH, obj.phd_prE, obj.phd_prL (probabilities of helix, strand, loop). The "obj" part is replaced by the name of your sequence object. Also now extracts PROF data from the same file and stores obj.prof (secondary structure prediction) and obj.prof_rel (reliability of prediction). The phd_aa property will probably vanish - it was created for debugging purposes.
Dssp filename	Extracts the deduced secondary structure from DSSP
	Stores it as obj.dssp_ss. Also stores sequence from the file as obj.dssp_aa, but this is a debug feature that will ultimately disappear.

Calculating derived data within the program

Npp exponent	Nikola's structure propensity plot.
	You need to supply the exponent value for the smoothing (default is 1.2). Creates obj.sp_smooth for the smoothed data and obj.sp_smoothexp for the exponentially smoothed data. The version in this program corresponds to version 3.6 of Nikola's stand-alone program.
Pdj exponent	Phil's tweaked structure propensity plot.
	Creates obj.pdj_smooth for the smoothed data and obj.pdj_smoothexp for the exponentially smoothed data. This is my somewhat tweaked version of Nikola's algorithm and is very much a developmental version. Currently the parameters are much the same, and the results do not differ much from those of npp.
Average [npp|phd|prof] sequence	Average sequence properties
	Associates averaged values with the reference sequence. One can average structure propensity (NPP), PhD (PHD) and Prof (PROF) properties. It correctly handles insertions. It inherently averages all applicable properties in all sequences in the object. For PhD and Prof averaging the averaged secondary structure is that which has the greatest summed probability across the sequences.
Conservation sequence	Calculate sequence conservation.
	There are two styles for doing this: either with respect to the whole object or with respect to a reference sequence. Produces obj.conservation (numerical % conservation across all sequences), obj.ref_cons (numerical % conservation with respect to reference sequence), obj.consensus (most populous amino acid at that position), and obj.identity (ASCII representation of 50, 75 and 100% identity as . : and *).
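
Two of the rules above are concrete enough to sketch in Python: picking the averaged secondary structure by greatest summed probability, and turning one alignment column into a consensus residue, a % conservation and an identity mark. This is an illustration under stated assumptions, not Annotate's code. It assumes that gaps are "-" or ".", that % conservation is the share of all sequences carrying the most common residue, and that the 50, 75 and 100% thresholds are inclusive. The function names are hypothetical.

```python
from collections import Counter

GAP_CHARS = "-."  # assumption: gap characters, ignored when counting residues


def average_state(probabilities):
    """Pick the state with the greatest summed probability at one position.

    `probabilities` holds one dict per sequence, e.g. {"H": 0.7, "E": 0.1, "L": 0.2}.
    """
    totals = Counter()
    for per_sequence in probabilities:
        totals.update(per_sequence)  # adds the values key by key
    return max(totals, key=totals.get)


def column_conservation(column):
    """Return (consensus residue, percent conservation, identity mark) for one column."""
    residues = [ch for ch in column if ch not in GAP_CHARS]
    if not residues:
        return None, 0.0, " "
    consensus, count = Counter(residues).most_common(1)[0]
    percent = 100.0 * count / len(column)  # assumption: gapped sequences count in the total
    if percent >= 100:
        mark = "*"
    elif percent >= 75:
        mark = ":"
    elif percent >= 50:
        mark = "."
    else:
        mark = " "
    return consensus, percent, mark


state = average_state([{"H": 0.6, "E": 0.1, "L": 0.3}, {"H": 0.2, "E": 0.1, "L": 0.7}])
print(state)  # L

alignment = ["MKLV", "MKIV", "MRLV", "MK-V"]  # invented toy alignment
columns = ["".join(col) for col in zip(*alignment)]
marks = "".join(column_conservation(col)[2] for col in columns)
print(marks)  # *:.*
```

In the first call helix wins in one sequence, but loop has the larger sum (1.0 against 0.8), so the averaged state is loop.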

Output

excel sequence	Writes an EXCEL-style data dump
	Comma-delimited with sequence numbers relative to a reference sequence. Mostly tweaked to Nikola's preferences, since he was the one who requested the feature.
plot	Write data to plot file as PostScript
	In portrait orientation. I strongly recommend using landscape instead. See below. Creates test.ps.
lplot	Write data to plot file as PostScript
	In landscape orientation. Writes sequence data as characters, character data as characters, but writes numerical data in a simple graphed format. Highlights identical sequences but otherwise does no additional data post-processing. Creates test.ps.
omac	Write Grasp and O macros
	To color your molecule based on sequence conservation. Conservation cutoffs are prompted for. Could be more user-friendly.
list	List object sequence data
	Just lists the sequence alignment. Good for sanity checking.
dump	Crude data dump
	Dumps all sequences and properties. Mainly for debugging, but the output could also be processed by a program of your own.
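
If you want to post-process the comma-delimited excel output yourself, something along these lines may be a starting point. The column layout is not documented here, so this Python sketch makes no assumption about it. It reads each row as a list and converts numeric-looking fields to numbers. The sample text is invented and is NOT real Annotate output.

```python
import csv
import io

# Invented sample, NOT real Annotate output: the actual columns are not documented here.
sample = "1,M,0.52\n2,K,0.48\n3,L,0.61\n"


def read_rows(handle):
    """Read comma-delimited rows, converting numeric-looking fields to float."""
    rows = []
    for record in csv.reader(handle):
        row = []
        for cell in record:
            try:
                row.append(float(cell))
            except ValueError:
                row.append(cell)  # keep residues and other text as-is
        rows.append(row)
    return rows


rows = read_rows(io.StringIO(sample))
print(rows[0])  # [1.0, 'M', 0.52]
```

For a real file, pass a handle opened with newline="" in place of the io.StringIO object.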

Examples

These examples are written for batch mode, with the commands fed to Annotate on standard input. You can also enter the same commands interactively.

Read a multiple sequence alignment and list it to the screen:

xray0/bin/annotate << EOF
read tests/cullins.gcgpileup pileup
list
quit
EOF

Read a multiple sequence alignment and calculate structure propensity:

xray0/bin/annotate << EOF
read tests/cullins.gcgpileup pileup
list
npp 1.2
dump
quit
EOF

Add some PhD/PROF output to the above:

xray0/bin/annotate << EOF
read tests/cullins.gcgpileup pileup
list
npp 1.2
phd tests/cullins.phd
dump
quit
EOF

Finally do something useful and plot it out:

xray0/bin/annotate << EOF
read tests/cullins.gcgpileup pileup
list
npp 1.2
phd tests/cullins.phd
lplot
quit
EOF

Read a MAXHOM alignment from PhD, add the PhD/PROF predictions, add structure propensity and plot it:

xray0/bin/annotate << EOF
read tests/mouse_pred.txt phd
phd tests/mouse_pred.txt
list
npp 1.2
lplot
quit
EOF
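
The same batch runs can be driven from a script instead of a shell here-document. The Python sketch below is one way to do that. The helper names build_script and run_annotate are hypothetical, the path is the one given in this document, and check=True assumes that Annotate exits with a non-zero status on failure, which I have not verified.

```python
import subprocess

ANNOTATE = "xray0/bin/annotate"  # path as given in this document; adjust for your system


def build_script(alignment, fmt, extra_commands=()):
    """Assemble the command lines for one batch run, ending with quit."""
    lines = [f"read {alignment} {fmt}"]
    lines.extend(extra_commands)
    lines.append("quit")
    return "\n".join(lines) + "\n"


def run_annotate(script, executable=ANNOTATE):
    """Feed the script to Annotate on standard input and return its standard output."""
    result = subprocess.run(
        [executable], input=script, capture_output=True, text=True, check=True
    )
    return result.stdout


script = build_script("tests/cullins.gcgpileup", "pileup", ["npp 1.2", "lplot"])
print(script)
# run_annotate(script)  # uncomment where the annotate binary is installed
```

Building the command text separately from running it makes it easy to check what will be sent before Annotate is involved at all.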

Future Plans

A more flexible, customisable output mode
Sequence readable from PDB files (but multiple chains complicate the situation)
B-factors extracted from PDB files
Automatic parsing of HTML formats (mainly PHD option)
A version that actually works under OSX
Further development of the structure propensity algorithm