joy is an analysis and formatting program for multiple protein sequence alignments or single protein structures. It was developed to display three-dimensional (3D) structural information in a sequence alignment and to help understand the conservation of amino acids in their specific local environments. joy produces several files and has many options, but the defaults are usually suitable for a basic alignment. First, save your alignment in a file with the suffix .ali (e.g., family.ali) and simply type:
joy family.ali
If the alignment includes entries whose 3D structures are known, the program calculates local environments (such as secondary structure and solvent accessibility), stores them in various files (see the FILES section below) and produces formatted alignments in PostScript and HTML files.
If you want to format a single protein structure, type:
joy 1abcd.pdb
where 1abcd.pdb is a PDB-formatted coordinate file.
The default suffix for an alignment file is .ali.
The default suffix for a raw PDB format coordinate file is .pdb.
The default suffix for a preprocessed PDB format coordinate file is .atm.
It is recommended that you give your PDB file a .pdb suffix when using it as input to joy. joy preprocesses the PDB file (removes alternate atoms and hydrogens, selects only the first model of an NMR entry, etc.) and creates a .atm file before use. Alternatively, if your PDB file is already clean, you can use the suffix .atm and joy will not preprocess the file.
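The suffix conventions above can be summarised in a small lookup. This is only an illustrative Python sketch: the function name `describe_input` and the returned labels are hypothetical and not part of joy, and how joy itself treats other suffixes is not stated here.

```python
from pathlib import Path

# Suffix roles as described in the text above. The labels are
# illustrative wording, not output produced by joy.
SUFFIX_ROLES = {
    ".ali": "alignment",
    ".pdb": "raw PDB coordinates (preprocessed into a .atm file)",
    ".atm": "preprocessed PDB coordinates (used without preprocessing)",
}

def describe_input(filename: str) -> str:
    """Return the documented role of an input file, based on its suffix."""
    suffix = Path(filename).suffix
    # Unknown suffixes are simply reported; joy's behaviour for them
    # is not described in this document.
    return SUFFIX_ROLES.get(suffix, "unknown suffix")
```

For example, `describe_input("family.ali")` returns `"alignment"`.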
If you are a new user, you should first try formatting a single structure. In the above example, joy first extracts the amino acid sequence from the ATOM records of the PDB file 1abcd.pdb (note that the SEQRES records are not read). The sequence is stored in the file 1abcd.ali. This is the format that joy uses, and you should become familiar with it (see FORMAT OF THE ALIGNMENT FILE for details). joy then looks for various data files that store information about the local environments of the structure. The first time you run the program, there are normally no data files and joy creates them automatically. Each data file has a specific suffix (see FILES). Finally, formatted alignments are produced (1abcd.html and 1abcd.ps).
When the input file is an alignment, what joy does next depends on the existence of various files, but you should get something on standard output. First the .ali file is read and parsed. For each entry in the alignment whose 3D structure is available, the data files are searched or created as above. Then formatted alignments are produced.
Here is a summary of options classified into groups. Detailed explanations will follow.
--tem --html --ps --rtf
--key --consensus-ss --alignment-pos --nwidth --maxcodelen --fontsize --lc --seqcolour
--pscolour --psfont --seqfontsize
--feature-set --dir --seg --check --psacutoff
--tem
Output .tem file (default). Use --notem to suppress this.
--html
Output .html file (default). Use --nohtml to suppress this.
--ps
Output .ps file (default). Use --nops to suppress this. The default PostScript output is in black and white. Use --pscolour to produce a colour PostScript file.
--rtf
Output .rtf file. This is an experimental option.
--key
Display key to joy format.
--consensus-ss
Display consensus secondary structure (default). Use --noconsensus-ss to suppress this.
--alignment-pos
Display alignment position (default). Use --noalignment-pos to suppress this. Note that the numbering based on a specific structure is not supported in HTML output.
--nwidth
Specify the width of the output alignment (in number of characters). The default value is 50.
--maxcodelen
Specify the maximum number of characters for sequence codes. The default value is 10, i.e., the sequence codes are truncated at the tenth character. To retain all the characters, specify zero or a negative number.
--fontsize
Specify the fontsize. The default value is 10. In HTML output, all sequences are displayed with the same fontsize. In PS output, sequence-only entries (those with no structural information) are displayed in a smaller font, which can be controlled by the --seqfontsize option.
--lc
Display sequences with no structural information in lowercase. By default, all sequences are displayed in uppercase letters.
--seqcolour
Specify the sequence colouring scheme. By default, the residues of sequence-only entries are coloured using the Taylor colours (Protein Eng. 10:743-746 (1997)). This feature helps examine amino acid conservation along with the structural environments of one or more homologues. Other colours can be chosen by specifying an integer n
--pscolour
Produce colour PostScript. See the KEY TO FORMATTED ALIGNMENT section below. The default is --nopscolour (black and white).
--psfont
Specify font for PostScript output, where font can be times, courier or helvetica. The default is times.
--seqfontsize
Specify the fontsize for sequence-only entries. The default value is 8. See also --fontsize.
--feature-set
Select one of the predefined sets of environments (which in turn can be defined by a combination of particular structural features). You normally do not have to use this option unless you interpret the .tem file for further analysis. See the LIST OF LOCAL ENVIRONMENTS section for more information. Note that if you specify a feature-set other than default, no HTML or PS output will be produced.
--dir
Change directory to search for data files. By default, all the data files are assumed to be in the current directory.
--seg
Display a segment alignment. A segment alignment contains entries that consist of a contiguous fragment of a larger structure. For example, an alignment can be constructed for the first blade of one beta-propeller protein, the second blade of another beta-propeller protein, and so on. This option allows alignments of this type to be formatted using the environments calculated from the original data files, i.e., those corresponding to the entire structures. If this option is specified, the structure line (the line below >P1;...) for each sequence is parsed and it is assumed that the alignment contains only the segment specified by this line. The structural information is taken from the corresponding data files, which may include additional parts of the protein. This option is useful, for example, when you generate a structure-based alignment for the protomers of an oligomeric protein but want to display the structural information (accessibility, H-bonds, etc.) in their original biological unit.
--check
Check data for consistency (default). Use --nocheck to skip this operation (not recommended).
--psacutoff
Specify the accessibility cutoff (see the FILES section below). The default value is 7.0.
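When joy is driven from a script, it can help to assemble the argument list in one place. The following Python sketch builds such a list using only flags named in this section. The helper `build_joy_command` is hypothetical, and placing the options before the input file is an assumption, as this document only shows invocations without options.

```python
def build_joy_command(input_file, html=True, ps=True,
                      pscolour=False, psacutoff=None):
    """Assemble an argument list for joy from a few documented options.

    Assumption: options are given before the input file name.
    """
    cmd = ["joy"]
    if not html:
        cmd.append("--nohtml")      # suppress the default .html output
    if not ps:
        cmd.append("--nops")        # suppress the default .ps output
    if pscolour:
        cmd.append("--pscolour")    # colour PostScript instead of black and white
    if psacutoff is not None:
        # Written as "--psacutoff value", following the FILES section.
        cmd += ["--psacutoff", str(psacutoff)]
    cmd.append(input_file)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where joy is installed.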
To produce a formatted alignment, you must have a set of data files for each sequence in the alignment. These must have the same name as the title of the sequence in the alignment and be present in either the current directory or the one specified by the --dir option. If needed, joy can create all the data files automatically. This feature relies on the presence of the programs hbond and sstruc in your command search path.
The accessibility data comes from a file with the suffix .psa, which can be produced by joy itself. By default the cutoff value for deciding if a residue is inaccessible is a relative total sidechain accessibility of 7%; this can be changed by using the command line option --psacutoff value.
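The cutoff rule can be sketched in Python as follows. The function name is hypothetical, and whether a value exactly equal to the cutoff counts as inaccessible is not stated in this document; the sketch assumes a strict comparison.

```python
def is_inaccessible(relative_sidechain_accessibility: float,
                    cutoff: float = 7.0) -> bool:
    """Classify a residue from its relative total sidechain accessibility (%).

    The default cutoff of 7.0 mirrors the documented default of
    --psacutoff. Assumption: values strictly below the cutoff are
    treated as inaccessible; joy's handling of the boundary value is
    not documented here.
    """
    return relative_sidechain_accessibility < cutoff
```

With the default cutoff, a residue at 3.2% is classified as inaccessible and one at 25% as accessible; raising the cutoff to 30 would classify both as inaccessible.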
The secondary structure and mainchain torsion data come from the .sst file, which is produced by the program sstruc written by David Smith.
The hydrogen bond data come from a file with the suffix .hbd, which is produced by the program hbond written by John Overington. You are strongly discouraged from interpreting the contents of this file directly or using them for your own analysis.
In addition to the annotated alignments, joy produces a file with a .tem suffix. This is the main input to the program melody, which converts the alignment into a profile used by the sequence-structure homology detection program fugue. See the FUGUE home page for more information. The .tem file is also the main input to the program subst, which derives environment-specific substitution tables.
Here is a summary of the default file suffixes used by joy.
|pdb||Raw PDB format coordinates|
|atm||Processed PDB format coordinates|
|sst||Secondary structure data|
|ps||PostScript file containing annotated alignment|
|html||HTML file containing annotated alignment|
|tem||File containing a `template' representation of structure|
joy takes an input alignment (or a single sequence) in a format similar to NBRF/PIR (see the example below). Several extensions have been introduced to allow easy labelling of the alignment and mixing of sequence and structural information. The same alignment format is used in the modelling program MODELLER. The only restriction is that the sequences are aligned, i.e., all sequences (including N- and C-terminal gaps) should be the same length.
An example alignment file (for the crystallin family) is given below:
C; family: crystallin
C; class: all beta
>N1;!4gcr
>P1;4gcr
structureX:4gcr: 1 : : 174 : :gamma-II crystallin:Bos taurus: 1.50:18.10
--GKITFYEDRGFQGHCYECSS-DCPNLQP-----YFSRCNSIRVD-SGCWMLYERPNYQGHQYFLRRGDYPDYQ
QWMGF--NDSIRSCRLIPQHTGTFRMRIYERDDFRGQMSEITD-DCP--SLQDRFHLT-EVHSLNVLEGSWVLYE
MPSYRGRQYLLRPGEYRRYLDWGAMNAKVGSLRRVMDFY-*
>P1;1a45
structureX:1a45: 1 : : 174 : :gamma-IV crystallin:Bos taurus: 2.30:18.60
--GKITFYEDRGFQGRHYECSS-DHSNLQP-----YFSRCNSIRVD-SGCWMLYEQPNFQGPQYFLRRGDYPDYQ
QWMGL--NDSIRSCRLIP-HTGSHRLRIYEREDYRGQMVEITE-DCS--SLHDRFHFS-EIHSFNVLEGWWVLYE
MTNYRGRQYLLRPGDYRRYHDWGATNARVGSLRRAVDFY-*
>P1;2bb2
structureX:2bb2: -2 : : 175 : :beta-B2 crystallin:Bos taurus: 2.10:18.60
LNPKIIIFEQENFQGHSHELNG-PCPNLKET----GVEKAGSVLVQ-AGPWVGYEQANCKGEQFVFEKGEYPRWD
SWTSSRRTDSLSSLRPIKVDSQEHKITLYENPNFTGKKMEVIDDDVP--SFHAHGYQE-KVSSVRVQSGTWVGYQ
YPGYRGLQYLLEKGDYKDSGDFGAPQPQVQSVRRIRDMQW*
Here are the details of the format:
C;
Introduces a comment
#
Introduces a comment
These comments are ignored, except when the token family: follows a C; or #. In this case, the text after family: (in this example, crystallin) is used for the title of the output alignment (currently supported only in HTML).
>N1;!code
Denotes that the alignment will be labelled according to the PDB residue numbers of the given entry
>P1;code
Denotes a sequence with the name code
sequence means a sequence with no structural information, structure means that there is structural information for the sequence and thus the output alignment will include annotations. (For compatibility reasons, the token structure can also be structureM, structureX and structureN, indicating model, x-ray and NMR, respectively.)
In the above example, the colon-separated fields after structureX are required only for the WWW version of joy. These are used to extract a relevant segment of the PDB file from our local database (the WWW server requires that all the structure entries in the user-supplied alignment have atomic coordinates in the PDB). These fields are also required when the --seg option is specified. The format of this line is:
structure[X|N]:code:start_res:start_chain:final_res:final_chain:
where code is the four-letter PDB entry code, start_res is the PDB residue number of the starting residue, start_chain is the chain identifier of the starting residue, final_res is the PDB residue number of the final residue and final_chain is the chain identifier of the final residue. For the command-line version of joy, this line can simply be either structure or sequence, except when the --seg option is used.
The easiest way to generate this line is to run joy for individual structures (it will produce an .ali file with a correct structure line), or use the atm2seq server.
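For software that reads these lines, the field layout can be unpacked with a simple split on colons. The Python sketch below is illustrative only: the class and function names are hypothetical, residue numbers are kept as strings rather than converted, and any fields after final_chain (such as the protein name in the crystallin example) are ignored.

```python
from typing import NamedTuple

class StructureLine(NamedTuple):
    kind: str          # e.g. "structureX"
    code: str          # four-letter PDB entry code
    start_res: str     # PDB residue number of the starting residue
    start_chain: str   # chain identifier, may be empty
    final_res: str     # PDB residue number of the final residue
    final_chain: str   # chain identifier, may be empty

def parse_structure_line(line: str) -> StructureLine:
    """Split a full structure line into its first six fields."""
    fields = [field.strip() for field in line.split(":")]
    if not fields[0].startswith("structure"):
        raise ValueError("not a structure line: %r" % line)
    if len(fields) < 6:
        # A bare "structure" line carries no segment information.
        raise ValueError("structure line has no segment fields: %r" % line)
    return StructureLine(*fields[:6])
```

Applied to the 2bb2 line of the crystallin example, this yields the code `2bb2`, start residue `-2`, final residue `175` and empty chain identifiers.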
Character sequences can be mixed with amino acid sequences:
Indicates text data will follow
must follow previous line
Dots ('.') in a character sequence will be replaced with spaces in the formatted alignment. Both sequence and text data must be terminated by an asterisk ('*').
In the usual NBRF/PIR format, gaps are indicated by hyphens ('-'), but joy also accepts slashes ('/'). The latter is translated to a space in the output alignment and is normally used to indicate a chain break (e.g., missing electron density, multi-chain alignments). Remember the program does no alignment; what you put in is what you get out.
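The format rules above can be illustrated with a minimal reader. This Python sketch is not joy's own parser: the function names are hypothetical, only comment lines, the >N1;! line and >P1; entries are handled, and text entries are skipped. The entry codes in the usage note are made up.

```python
def parse_ali(text):
    """Return (family, numbering_code, entries) from alignment text.

    Each entry is a dict with the keys "code", "description" (the
    sequence/structure line) and "sequence" (gaps kept, '*' removed).
    """
    family = None
    numbering_code = None
    entries = []
    lines = iter(text.splitlines())
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("C;") or stripped.startswith("#"):
            # Comment; only "family:" is interpreted.
            body = stripped[2:] if stripped.startswith("C;") else stripped[1:]
            body = body.strip()
            if body.startswith("family:"):
                family = body[len("family:"):].strip()
        elif stripped.startswith(">N1;!"):
            numbering_code = stripped[len(">N1;!"):].strip()
        elif stripped.startswith(">P1;"):
            code = stripped[len(">P1;"):].strip()
            description = next(lines, "").strip()
            chunks = []
            for seq_line in lines:          # read until the terminating '*'
                seq_line = seq_line.strip()
                if seq_line.endswith("*"):
                    chunks.append(seq_line[:-1])
                    break
                chunks.append(seq_line)
            entries.append({"code": code,
                            "description": description,
                            "sequence": "".join(chunks)})
    return family, numbering_code, entries

def all_same_length(entries):
    """Check the one stated restriction: all sequences have equal length."""
    return len({len(entry["sequence"]) for entry in entries}) <= 1
```

As a usage example, parsing a two-entry file with a hypothetical structure entry `1abc` and a sequence-only entry `seq2`, then calling `all_same_length(entries)`, confirms that the entries are aligned before they are passed on.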
The Key to the alignment depends on the output format chosen by the user. Below is the Key for the HTML output.
|solvent accessible||lower case||x|
|solvent inaccessible||UPPER CASE||X|
|hydrogen bond to main-chain amide||bold||x|
|hydrogen bond to main-chain carbonyl||underline||x|
|positive phi torsion angle||italic||x|
If colour PostScript output is chosen, the following are shown in addition to these: a hydrogen bond to another sidechain is indicated by a tilde (~) over the amino acid concerned, and a cis peptide by a breve over the amino acid concerned. In a black and white PostScript file, secondary structures are distinguished by different shades.
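The HTML key above can be read as a set of independent decorations applied to each residue. The Python sketch below expresses that reading. The function name is hypothetical, and the use of `<b>`, `<u>` and `<i>` tags is an assumption made for illustration; the markup that joy actually writes is not described here.

```python
def render_residue(aa, accessible, hbond_amide=False,
                   hbond_carbonyl=False, positive_phi=False):
    """Decorate one residue according to the HTML key.

    Assumption: bold, underline and italic are written as <b>, <u>
    and <i>; joy's real HTML may differ.
    """
    # Solvent accessible residues are lower case, inaccessible UPPER CASE.
    html = aa.lower() if accessible else aa.upper()
    if positive_phi:
        html = "<i>" + html + "</i>"      # positive phi torsion angle
    if hbond_carbonyl:
        html = "<u>" + html + "</u>"      # H-bond to main-chain carbonyl
    if hbond_amide:
        html = "<b>" + html + "</b>"      # H-bond to main-chain amide
    return html
```

For instance, an inaccessible lysine with a hydrogen bond to a main-chain amide is rendered as `<b>K</b>` under these assumptions.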
By default, alignment positions are shown at the top. This can be turned off by the --noalignment-pos option. If the .ali file contains the >N1;!code line, the alignment is labelled according to the PDB residue number of the entry code rather than the alignment position.
In addition, underneath the alignment is the consensus secondary structure. The definition of `consensus' is that a fraction of greater than 0.7 is in a particular conformational state at a given position. The consensus output can be turned off by the --noconsensus-ss option.
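The consensus rule can be sketched for a single alignment position as follows. The function name and the one-letter state labels used in the usage note are hypothetical, and the sketch assumes that the fraction is taken over the states supplied for that position; how joy counts gaps or sequence-only entries is not stated here.

```python
from collections import Counter

def consensus_state(states, threshold=0.7):
    """Return the consensus state for one position, or None.

    `states` holds one conformational-state label per entry that has
    a state at this position. A state is the consensus only if its
    fraction is strictly greater than the threshold.
    """
    if not states:
        return None
    state, count = Counter(states).most_common(1)[0]
    return state if count / len(states) > threshold else None
```

With four entries of which three are labelled `H`, the fraction is 0.75 and `H` is returned; with two out of three, the fraction falls below 0.7 and no consensus is reported.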
This section is intended for those who develop software to parse .tem files. joy currently defines four sets of environments and these can be specified by the --feature-set option.
extended set of environments including hydrogen bond to mainchain (merging NH and CO), hydrogen bond (merging all) and secondary structure (merging positive phi and coil).
The following features in joy 4.0 are not supported at the moment. Use the previous version if these are required.
PS output in landscape
Produce a MNYFIT input file
K. Mizuguchi, C.M. Deane, T.L. Blundell, M.S. Johnson and J.P. Overington. (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics 14:617-623.