JOY

NAME

joy  --  A tool for protein sequence-structure representation and analysis (version 5.0)

SYNOPSIS

joy [options] [file]

DESCRIPTION

joy is an analysis and formatting program for multiple protein sequence alignments or single protein structures. It was developed to display three-dimensional (3D) structural information in a sequence alignment and to help understand the conservation of amino acids in their specific local environments. joy produces several files and has many options, but the defaults are usually suitable for a basic alignment. First, save your alignment in a file with the suffix .ali (e.g., family.ali) and simply type:

	joy family.ali
	
If the alignment includes entries whose 3D structures are known, the program calculates local environments (such as secondary structure and solvent accessibility), stores them in various files (see the FILES section below) and produces formatted alignments in PostScript and HTML files.

If you want to format a single protein structure, type:

	joy 1abcd.pdb
	
where 1abcd.pdb is a PDB-formatted coordinate file.

It is recommended that the PDB file you give to joy has a .pdb suffix. joy preprocesses the PDB file (removes alternate atoms and hydrogens, selects only the first model of an NMR entry, etc.) and creates a .atm file before use. Alternatively, if your PDB file is already clean, you can use the suffix .atm and joy will not preprocess the file.

If you are a new user, you should first try formatting a single structure. In the above example, joy first extracts the amino acid sequence from the ATOM records of the PDB file 1abcd.pdb (note that the SEQRES records are not read). The sequence is stored in the file 1abcd.ali. This is the format that joy uses, and you should become familiar with it (see FORMAT OF THE ALIGNMENT FILE for details). joy then looks for the data files that store information about the local environments of the structure. The first time you run the program, these data files normally do not exist and joy creates them automatically. Each data file has a specific suffix (see FILES). Finally, formatted alignments are produced (1abcd.html and 1abcd.ps).

When the input file is an alignment, what joy does next depends on the existence of various files, but you should get something on standard output. First the .ali file is read and parsed. For each entry in the alignment whose 3D structure is available, the data files are searched or created as above. Then formatted alignments are produced.

OPTIONS

Here is a summary of options classified into groups. Detailed explanations will follow.

Output file type

--tem --html --ps --rtf

Output control for both HTML and PS

--key --consensus-ss --alignment-pos --nwidth --maxcodelen --fontsize --lc --seqcolour

Output control for HTML

--bgcolor

Output control for PS

--pscolour --psfont --seqfontsize

General Options

--feature-set --dir --seg --check --psacutoff

Output file type

--tem

Output .tem file (default). Use --notem to suppress this.

--html

Output .html file (default). Use --nohtml to suppress this.

--ps

Output .ps file (default). Use --nops to suppress this. The default PostScript output is in black and white. Use --pscolour to produce a colour PostScript file.

--rtf

Output .rtf file. This is an experimental option.

Output control for both HTML and PS

--key

Display key to joy format.

--consensus-ss

Display consensus secondary structure (default). Use --noconsensus-ss to suppress this.

--alignment-pos

Display alignment position (default). Use --noalignment-pos to suppress this. Note that the numbering based on a specific structure is not supported in HTML output.

--nwidth n

Specify the width of the output alignment (in number of characters). The default value is 50.

--maxcodelen n

Specify the maximum number of characters for sequence codes. The default value is 10, i.e., the sequence codes are truncated at the tenth character. To retain all the characters, specify zero or a negative number.
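
The truncation rule described for --maxcodelen can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not joy's own code; the function name truncate_code is hypothetical.

```python
def truncate_code(code: str, maxcodelen: int = 10) -> str:
    """Truncate a sequence code the way --maxcodelen is documented to.

    Hypothetical helper: zero or a negative value keeps every character,
    otherwise the code is cut after `maxcodelen` characters.
    """
    if maxcodelen <= 0:
        return code
    return code[:maxcodelen]
```

With the default of 10, a code such as gamma_crystallin_B would be shown as gamma_crys, while a short code such as 4gcr is unchanged.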

--fontsize n

Specify the fontsize. The default value is 10. In HTML output, all sequences are displayed with the same fontsize. In PS output, sequence-only entries (those with no structural information) are displayed in a smaller font, which can be controlled by the --seqfontsize option.

--lc

Display sequences with no structural information in lowercase. By default, all sequences are displayed in uppercase letters.

--seqcolour n

Specify the sequence colouring scheme. By default, the residues of sequence-only entries are coloured using the Taylor colours (Protein Eng. 10:743-746 (1997)). This feature helps examine amino acid conservation along with the structural environments of one or more homologues. Other colours can be chosen by specifying an integer n:

0: no colour
1: clustalx colours
2: Zappo colours
3: Taylor colours

Output control for HTML

--bgcolor name

Specify the background colour for HTML output.

Output control for PS

--pscolour

Produce colour PostScript. See the KEY TO FORMATTED ALIGNMENT section below. The default is --nopscolour (black and white).

--psfont font

Specify font for PostScript output, where font can be times, courier or helvetica. The default is times.

--seqfontsize n

Specify the fontsize for sequence-only entries. The default value is 8. See also --fontsize.

General Options

--feature-set default | ext | j216 | j4

Select one of the predefined sets of environments (which in turn can be defined by a combination of particular structural features). You normally do not have to use this option unless you interpret the .tem file for further analysis. See the LIST OF LOCAL ENVIRONMENTS section for more information. Note that if you specify a feature-set other than default, no HTML or PS output will be produced.

--dir dir

Change directory to search for data files. By default, all the data files are assumed to be in the current directory.

--seg

Display a segment alignment. A segment alignment contains entries that consist of a contiguous fragment of a larger structure. For example, an alignment can be constructed for the first blade of one beta-propeller protein, the second blade of another beta-propeller protein, and so on. This option allows the formatting of this type of alignment by using the environments calculated from the original data files, i.e., those corresponding to the entire structures. If this option is specified, the structure line (the line below >P1;...) for each sequence is parsed and it is assumed that the alignment contains only the segment specified by this line. The structural information is taken from the corresponding data files, which may include additional parts of the protein. This option is useful, for example, when you generate a structure-based alignment for the protomers of an oligomeric protein but want to display the structural information (accessibility, H-bonds, etc.) in their original biological unit.

--check

Check data for consistency (default). Use --nocheck to skip this operation (not recommended).

--psacutoff value

Specify the accessibility cutoff (see the FILES section below). The default value is 7.0.
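
When joy is driven from a script, the command line can be assembled from the options listed above. The following Python sketch shows one way to do this. The function build_joy_command and its parameter names are hypothetical, only a handful of the documented options are covered, and passing each value as a separate argument is an assumption based on the "--nwidth n" notation used in this page.

```python
from typing import List, Optional


def build_joy_command(input_file: str,
                      nwidth: Optional[int] = None,
                      seqcolour: Optional[int] = None,
                      pscolour: bool = False,
                      html: bool = True,
                      data_dir: Optional[str] = None) -> List[str]:
    """Assemble an argument list following the synopsis `joy [options] [file]`."""
    cmd = ["joy"]
    if nwidth is not None:
        cmd += ["--nwidth", str(nwidth)]
    if seqcolour is not None:
        # Only the four schemes listed under --seqcolour are accepted here.
        if seqcolour not in (0, 1, 2, 3):
            raise ValueError("seqcolour must be 0, 1, 2 or 3")
        cmd += ["--seqcolour", str(seqcolour)]
    if pscolour:
        cmd.append("--pscolour")
    if not html:
        cmd.append("--nohtml")
    if data_dir is not None:
        cmd += ["--dir", data_dir]
    cmd.append(input_file)
    return cmd
```

The resulting list can be handed to subprocess.run() on a machine where joy is installed; the sketch itself never runs the program.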

FILES

To produce a formatted alignment, you must have a set of data files for each sequence in the alignment. These must have the same name as the title of the sequence in the alignment and be present in either the current directory or the one specified by the --dir option. If needed joy can create all the data files automatically. This feature relies on the presence of the programs hbond and sstruc in your command search path.

The accessibility data comes from a file with the suffix .psa, which can be produced by joy itself. By default the cutoff value for deciding if a residue is inaccessible is a relative total sidechain accessibility of 7%; this can be changed by using the command line option --psacutoff value.
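
As an illustration of how the cutoff works, the Python sketch below classifies a residue from its relative sidechain accessibility (in percent) and renders it in the case used by the formatted alignment (upper case for inaccessible, lower case for accessible). The function names are hypothetical, and treating a value exactly equal to the cutoff as accessible is an assumption; this page does not state how joy handles the boundary.

```python
def is_inaccessible(relative_accessibility: float, cutoff: float = 7.0) -> bool:
    """True if the residue counts as solvent inaccessible.

    Assumption: the comparison is strict, so a value equal to the
    cutoff is treated as accessible.
    """
    return relative_accessibility < cutoff


def render_residue(aa: str, relative_accessibility: float, cutoff: float = 7.0) -> str:
    """Upper case for inaccessible residues, lower case for accessible ones."""
    if is_inaccessible(relative_accessibility, cutoff):
        return aa.upper()
    return aa.lower()
```

Raising the cutoff (the effect of --psacutoff) makes more residues appear in upper case.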

The secondary structure and mainchain torsion data come from the .sst file, which is produced by the program sstruc written by David Smith.

The hydrogen bond data come from a file with the suffix .hbd, which is produced by the program hbond written by John Overington. You are strongly discouraged from interpreting the contents of this file directly or using them for your own analysis.

In addition to the annotated alignments, joy produces a file with a .tem suffix. This is the main input to the program melody, which converts the alignment into a profile used by the sequence-structure homology detection program fugue. See the FUGUE home page for more information. The .tem file is also the main input to the program subst, which derives environment-specific substitution tables.

Here is a summary of the default file suffixes used by joy.

suffix  contents
ali     Input alignment
pdb     Raw PDB format coordinates
atm     Processed PDB format coordinates
hbd     Hydrogen-bonding data
psa     Accessibility data
sst     Secondary structure data
ps      PostScript file containing annotated alignment
html    HTML file containing annotated alignment
tem     File containing a `template' representation of structure
fug     fugue profile
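
Because the data files must carry the same name as the sequence code, it is easy to check in advance which of them joy would have to create. The Python sketch below does this for the .hbd, .psa and .sst files. The function missing_data_files is hypothetical, and which suffixes to check is a choice made for this example.

```python
from pathlib import Path
from typing import List

# Per-structure data files described in the FILES section.
DATA_SUFFIXES = ("hbd", "psa", "sst")


def missing_data_files(code: str, directory: str = ".") -> List[str]:
    """Return the names of data files for `code` that are absent from `directory`."""
    base = Path(directory)
    return [
        f"{code}.{suffix}"
        for suffix in DATA_SUFFIXES
        if not (base / f"{code}.{suffix}").is_file()
    ]
```

The directory argument plays the role of the --dir option; the default is the current directory.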

FORMAT OF THE ALIGNMENT FILE

joy takes an input alignment (or a single sequence) in a format similar to NBRF/PIR (see the example below). Several extensions have been introduced to allow easy labelling of the alignment and the mixing of sequence and structural information. The same alignment format is used in the modelling program MODELLER. The only restriction is that the sequences must be aligned, i.e., all sequences (including N- and C-terminal gaps) should be the same length.

An example alignment file (for the crystallin family) is given below:

C; family: crystallin
C; class: all beta
>N1;!4gcr
>P1;4gcr
structureX:4gcr:   1 : : 174 : :gamma-II crystallin:Bos taurus: 1.50:18.10
--GKITFYEDRGFQGHCYECSS-DCPNLQP-----YFSRCNSIRVD-SGCWMLYERPNYQGHQYFLRRGDYPDYQ
QWMGF--NDSIRSCRLIPQHTGTFRMRIYERDDFRGQMSEITD-DCP--SLQDRFHLT-EVHSLNVLEGSWVLYE
MPSYRGRQYLLRPGEYRRYLDWGAMNAKVGSLRRVMDFY-*
>P1;1a45
structureX:1a45:   1 : : 174 : :gamma-IV crystallin:Bos taurus: 2.30:18.60
--GKITFYEDRGFQGRHYECSS-DHSNLQP-----YFSRCNSIRVD-SGCWMLYEQPNFQGPQYFLRRGDYPDYQ
QWMGL--NDSIRSCRLIP-HTGSHRLRIYEREDYRGQMVEITE-DCS--SLHDRFHFS-EIHSFNVLEGWWVLYE
MTNYRGRQYLLRPGDYRRYHDWGATNARVGSLRRAVDFY-*
>P1;2bb2
structureX:2bb2:  -2 : : 175 : :beta-B2 crystallin:Bos taurus: 2.10:18.60
LNPKIIIFEQENFQGHSHELNG-PCPNLKET----GVEKAGSVLVQ-AGPWVGYEQANCKGEQFVFEKGEYPRWD
SWTSSRRTDSLSSLRPIKVDSQEHKITLYENPNFTGKKMEVIDDDVP--SFHAHGYQE-KVSSVRVQSGTWVGYQ
YPGYRGLQYLLEKGDYKDSGDFGAPQPQVQSVRRIRDMQW*
	

Here are the details of the format:

C;

Introduces a comment

#

Introduces a comment

These comments are ignored, with one exception: if the token family: follows a C; or #, the text after family: (in this example, crystallin) is used as the title of the output alignment (currently supported only in HTML).

>N1;!code

Denotes that the alignment will be labelled according to the PDB residue numbers of the given entry

>P1;code

Denotes a sequence with the name code

sequence | structure

sequence means a sequence with no structural information, structure means that there is structural information for the sequence and thus the output alignment will include annotations. (For compatibility reasons, the token structure can also be structureM, structureX and structureN, indicating model, x-ray and NMR, respectively.)

In the above example, the colon-separated fields after structureX are required only for the WWW version of joy. These are used to extract the relevant segment of the PDB file from our local database (the WWW server requires that all the structure entries in the user-supplied alignment have atomic coordinates in the PDB). These fields are also required when the --seg option is specified. The format of this line is:

	structure[X|N]:code:start_res:start_chain:final_res:final_chain:

where code is the four-letter PDB entry code, start_res is the PDB residue number of the starting residue, start_chain is the chain identifier of the starting residue, final_res is the PDB residue number of the final residue and final_chain is the chain identifier of the final residue. For the command-line version of joy, this line can simply be either structure or sequence, except when the --seg option is used.
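
For those writing their own tools, the full form of the structure line can be split on colons as in the Python sketch below. The names StructureLine and parse_structure_line are hypothetical, residue numbers are kept as strings rather than converted, and the bare forms structure and sequence are deliberately rejected because they carry no segment information.

```python
from typing import NamedTuple


class StructureLine(NamedTuple):
    kind: str          # e.g. "structureX"
    code: str          # PDB entry code
    start_res: str     # PDB residue number of the first residue
    start_chain: str   # chain identifier, empty if blank
    final_res: str     # PDB residue number of the last residue
    final_chain: str   # chain identifier, empty if blank


def parse_structure_line(line: str) -> StructureLine:
    """Split the full form of a structure line into its first six fields."""
    fields = [field.strip() for field in line.split(":")]
    if not fields[0].startswith("structure"):
        raise ValueError("not a structure line")
    if len(fields) < 6:
        raise ValueError("structure line has no segment fields")
    return StructureLine(*fields[:6])
```

Applied to the 4gcr line of the crystallin example, this gives the code 4gcr, residues 1 to 174 and blank chain identifiers; the remaining fields of that line are ignored.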

The easiest way to generate this line is to run joy for individual structures (it will produce an .ali file with a correct structure line), or to use the atm2seq server.

Character sequences can be mixed with amino acid sequences:

>T1;code

Indicates text data will follow

text

must follow previous line

Dots ('.') in a character sequence will be replaced with spaces in the formatted alignment. Both sequence and text data must be terminated by an asterisk ('*').

In the usual NBRF/PIR format, gaps are indicated by hyphens ('-'), but joy also accepts slashes ('/'). The latter is translated to a space in the output alignment and is normally used to indicate a chain break (e.g., missing electron density, multi-chain alignments). Remember the program does no alignment; what you put in is what you get out.
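
Since joy performs no alignment itself, it can be useful to check an .ali file before formatting it. The Python sketch below reads the >P1; entries, picks up the family: title from a comment line and verifies that all sequences have the same length. It is a minimal reader written for this example, not joy's parser: the function names are hypothetical, and >N1; and >T1; records are simply skipped.

```python
from typing import Dict, Optional, Tuple


def parse_ali(text: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (family title, {code: aligned sequence}) for the >P1; entries."""
    title: Optional[str] = None
    entries: Dict[str, str] = {}
    lines = iter(text.splitlines())
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("C;") or stripped.startswith("#"):
            body = stripped[2:] if stripped.startswith("C;") else stripped[1:]
            body = body.strip()
            if body.startswith("family:"):
                title = body[len("family:"):].strip()
            continue
        if stripped.startswith(">P1;"):
            code = stripped[4:].strip()
            next(lines, None)  # skip the sequence/structure line
            chunks = []
            for seq_line in lines:
                seq_line = seq_line.strip()
                if seq_line.endswith("*"):  # an asterisk terminates the entry
                    chunks.append(seq_line[:-1])
                    break
                chunks.append(seq_line)
            entries[code] = "".join(chunks)
    return title, entries


def check_aligned(entries: Dict[str, str]) -> None:
    """Raise ValueError unless every sequence has the same length."""
    lengths = {code: len(seq) for code, seq in entries.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"sequences differ in length: {lengths}")
```

Gap characters ('-' and '/') are counted like residues, so the length check covers N- and C-terminal gaps as well.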

KEY TO FORMATTED ALIGNMENT

The Key to the alignment depends on the output format chosen by the user. Below is the Key for the HTML output.

feature                                 representation  example
alpha helix                             red             x
beta strand                             blue            x
3-10 helix                              maroon          x
solvent accessible                      lower case      x
solvent inaccessible                    UPPER CASE      X
hydrogen bond to main-chain amide       bold            x
hydrogen bond to main-chain carbonyl    underline       x
disulfide bond                          cedilla         ç
positive phi torsion angle              italic          x

If colour PostScript output is chosen, two further features are shown in addition to these: a hydrogen bond to another sidechain is indicated by a tilde (~) over the amino acid concerned, and a cis peptide by a breve over the amino acid concerned. In a black and white PostScript file, secondary structures are distinguished by different shades.

By default, alignment positions are shown at the top. This can be turned off by the --noalignment-pos option. If the .ali file contains the line

	>N1;!code

the alignment is labelled according to the PDB residue number of the entry code rather than the alignment position.

In addition, the consensus secondary structure is shown underneath the alignment. A conformational state is taken as the `consensus' at a given position if a fraction greater than 0.7 of the residues at that position are in that state. The consensus output can be turned off by the --noconsensus-ss option.
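
The consensus rule can be sketched in Python as follows. The function consensus_state and the one-letter state labels used in the usage note are hypothetical, and the sketch assumes the caller passes only the entries that have a secondary structure assignment at the position; how joy treats gaps and sequence-only entries is not described here.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_state(states: Sequence[str], threshold: float = 0.7) -> Optional[str]:
    """Return the state shared by more than `threshold` of the entries, or None.

    `states` holds one secondary-structure label per entry at a single
    alignment position.
    """
    if not states:
        return None
    label, count = Counter(states).most_common(1)[0]
    if count / len(states) > threshold:
        return label
    return None
```

For example, three helical residues out of four (0.75) give a helix consensus, whereas two out of three (about 0.67) give none.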

LIST OF LOCAL ENVIRONMENTS

This section is intended for those who develop software to parse .tem files. joy currently defines four sets of environments and these can be specified by the --feature-set option.

default

To be added

ext

extended set of environments including hydrogen bond to mainchain (merging NH and CO), hydrogen bond (merging all) and secondary structure (merging positive phi and coil).

j4

environments used in joy 4.0.

j216

environments used in joy 216.

CAVEATS

The following features of joy 4.0 are not currently supported in this version. Use the previous version if they are required.

  1. LATEX output

  2. PS output in landscape

  3. Produce a MNYFIT input file

  4. Efimov output

ENVIRONMENT

None.

DIAGNOSTICS

To be added.

BUGS

Many.

AUTHOR

Kenji Mizuguchi

COPYRIGHT

Copyright 1998-2001 Kenji Mizuguchi

REFERENCE

K. Mizuguchi, C.M. Deane, T.L. Blundell, M.S. Johnson and J.P. Overington. (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics 14:617-623.

SEE ALSO

fugue(1), subst(1), JOY version 5, Step-by-step instructions for running JOY