xalign - Multiple Sequence Alignment (Version 4.0)

Purpose: xalign is a graphical program which does multiple sequence alignment based on sequence homology and secondary structure. The following documentation describes the xalign program for the camra suite of programs.

  1. Introduction

  2. Preparations

  3. Running the Program


Overview

This graphical program does a multiple alignment of sequences based on a comprehensive dynamic programming algorithm. The alignment is based on amino acid similarity, secondary structure similarity, and various gapping penalties. These parameters have been generalized to align the majority of sequences in a reasonable manner.
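
As a rough illustration of how a dynamic programming alignment can combine these three ingredients, here is a minimal pairwise sketch. It is not xalign's actual algorithm: the function name `global_score`, the score values (2.0, -1.0, 1.0) and the single linear gap penalty are all assumptions chosen to keep the example small.

```python
def global_score(seq_a, seq_b, ss_a, ss_b, gap=-2.0):
    """Best global alignment score for two sequences with structure strings.

    Hypothetical scoring: +2 for identical residues, -1 otherwise,
    +1 extra when both positions share a known structure code.
    """
    def pair_score(i, j):
        s = 2.0 if seq_a[i] == seq_b[j] else -1.0
        if ss_a[i] == ss_b[j] and ss_a[i] != "x":
            s += 1.0
        return s

    n, m = len(seq_a), len(seq_b)
    # prev[j] holds the best score for aligning seq_a[:i-1] with seq_b[:j].
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + pair_score(i - 1, j - 1),  # align the two residues
                prev[j] + gap,                           # gap in seq_b
                cur[j - 1] + gap,                        # gap in seq_a
            )
        prev = cur
    return prev[m]
```

Raising the gap penalty or the structure bonus in a scheme like this changes which of the three moves wins at each cell, which is the kind of trade-off the parameter file exposes.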


[Figure: xalign main screen snapshot]


Capabilities

xalign has all of the following attributes which make it a very powerful yet relatively easy to use program:
  1. No hard limits on sequence length or number of sequences (limited only by the amount of memory).
  2. The user has control of the alignment and can include specific insights or knowledge.
  3. Secondary structure information can be included for any or all of the sequences.
  4. Multiple alignment has a consensus sequence.
  5. Detailed pairwise alignments can be printed.
  6. The ability to change the amino acid similarity matrix, the gapping penalties, and even the order in which the sequences are aligned.
  7. The user can anchor the multiple alignment wherever he or she sees fit (e.g., what would the alignment look like if amino acids X and Y had to line up?).
  8. The user can ensure that certain amino acids are not broken up by a gap within an alignment.


Reference

David Wishart, Robert Boyko, and Brian Sykes. Constrained multiple sequence alignment using XALIGN. CABIOS, Vol. 10, no. 6, 1994, pages 687-688.

Copyright (C) 1994 - No portion of this program may be incorporated into other programs or sold for profit without express written consent of the authors. Funding for this project has been provided by the Medical Research Council of Canada and the Protein Engineering Network of Centres of Excellence (Canada).


Installation and Download

Executable versions of this program for Sun and SGI machines are freely available at our ftp site. First you will need to download the software from our website.

Even though xalign is a standalone program, we strongly recommend getting any optional software as outlined on the download page. Once you have downloaded the software, uncompress and untar the files:

	uncompress myfile.tar.Z
	tar xvf myfile.tar

You should then take a look at the README file to understand what files are being installed and the installation options you have. After this, type "Install" to put the files in the appropriate places.

The current version of this software comes with an expiry date. If your software has expired, check out the website above for further instructions or new versions.


Basic Sequence Input

An input file contains two or more sequences to align. Although there is no maximum number of sequences you can align, you are limited by the amount of memory on the machine you are running on.

Each input sequence must contain these minimum attributes:

  1. A right angle bracket ">" signals the beginning of a sequence.
  2. An ID code for the sequence, which is an alphanumeric character string (1-8 characters in length). Use as descriptive a name as possible.
  3. The sequence name and other details on the remainder of the line.
  4. The amino acid sequence on all subsequent lines in one letter code notation (upper or lower case).

The number of amino acid codes per line does not matter. However, it is easier to check your input for correctness if you decide on a constant number such as 50. Blanks found in the amino acid sequence are ignored. Alternative amino acid codes such as 'B', 'X', and 'Z' are acceptable input, but they will have no scoring value during the alignment process (unless the amino acid scoring matrix is changed).

Here is an example input file:

	>CaM Calmodulin - Drosophila melanogaster (1-148)
	ADQLTEEQIA EFKEAFSLFD KDGDGTITTK ELGTVMRSLG QNPTEAELQD
	MINEVDADGN GTIDFPEFLT MMARKMKDTD SEEEIREAFR VFDKDGNGFI
	SAAELRHVMT NLGEKLTDEE VDEMIREANI DGDGQVNYEE FVTMMTSK

	>TnC Troponin C, cloned chicken skeletal muscle (1-162)
	ASMTDQQAEA RAFLSEEMIA EFKAAFDMFD ADGGGDISTK ELGTVMRMLG
	QNPTKEELDA IIEEVDEDGS GTIDFEEFLV MMVRQMKEDA KGKSEEELAN
	CFRIFDKNAD GFIDIEELGE ILRATGEHVI EEDIEDLMKD SDKNNDGRID
	FDEFLKMMEG VQ
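
To check an input file before handing it to xalign, the rules above can be turned into a small reader. This is only a sketch of the format as described here, not xalign's own parser; the function name `parse_sequences`, the dictionary layout and the choice to normalize residues to upper case are assumptions.

```python
def parse_sequences(text):
    """Parse basic xalign-style input into a list of records."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First token after '>' is the ID; the rest is the description.
            parts = line[1:].split(None, 1)
            if not parts:
                raise ValueError("header line has no ID")
            seq_id = parts[0]
            if not (1 <= len(seq_id) <= 8 and seq_id.isalnum()):
                raise ValueError(f"ID must be 1-8 alphanumeric characters: {seq_id!r}")
            description = parts[1] if len(parts) > 1 else ""
            records.append({"id": seq_id, "description": description, "sequence": ""})
        else:
            if not records:
                raise ValueError("sequence data found before the first '>' line")
            # Blanks inside the sequence are ignored; case is normalized (assumption).
            records[-1]["sequence"] += "".join(line.split()).upper()
    return records
```

Printing `len(record["sequence"])` for each record is a quick way to compare against the residue range given in the header line.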

To include secondary structure in a sequence, this information is placed on the line directly below the primary sequence (upper or lower case letters acceptable). Use "h" for helical regions, "b" for beta strand, "c" for random coil, "t" for beta turn and "x" for regions you don't know or care about.

Here is an example input file with secondary structure information included:

	>CaM Calmodulin - Drosophila melanogaster (1-148)
	ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQD
	ccccchhhhhhhhhhhhhhccccccbbbhhhhhhhhhhcccccchhhhhh
	MINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGFI
	hhhhhccccccbbbhhhhhhhhhhhhhcccchhhhhhhhhhhhcccccbb
	SAAELRHVMTNLGEKLTDEEVDEMIREANIDGDGQVNYEEFVTMMTSK
	bhhhhhhhhhhcccccchhhhhhhhhhcccccccbbbhhhhhhhhhcc

	>TnC Troponin C, cloned chicken skeletal muscle (1-162)
	ASMTDQQAEARAFLSEEMIAEFKAAFDMFDADGGGDISTKELGTVMRMLG
	cccchhhhhhhhhcchhhhhhhhhhhhhhccccccbbbhhhhhhhhhhcc
	QNPTKEELDAIIEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAN
	cccchhhhhhhhhhhccccccbbbhhhhhhhhhhhhhcccccccchhhhh
	CFRIFDKNADGFIDIEELGEILRATGEHVIEEDIEDLMKDSDKNNDGRID
	hhhhhccccccbbbhhhhhhhhhhhccccchhhhhhhhhhhccccccbbb
	FDEFLKMMEGVQ
	hhhhhhhhhhcc


Note: If you choose to enter secondary structure information, then you must enter it for all amino acids.
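
Because every residue needs a structure code, a mismatch in line lengths is an easy mistake to make. The following sketch checks the body of one sequence entry (the lines after the ">" line) under the assumption that residue and structure lines strictly alternate; the names `split_interleaved` and `STRUCTURE_CODES` are illustrative and not part of xalign.

```python
STRUCTURE_CODES = set("hbctx")  # helix, beta strand, coil, turn, unknown

def split_interleaved(lines):
    """Split alternating residue/structure lines into two joined strings."""
    if len(lines) % 2 != 0:
        raise ValueError("expected residue and structure lines in pairs")
    residues, structure = [], []
    for i in range(0, len(lines), 2):
        res = "".join(lines[i].split())
        sec = "".join(lines[i + 1].split()).lower()
        if len(res) != len(sec):
            raise ValueError(
                f"line pair {i // 2 + 1}: {len(res)} residues but {len(sec)} structure codes"
            )
        bad = set(sec) - STRUCTURE_CODES
        if bad:
            raise ValueError(f"line pair {i // 2 + 1}: unknown structure codes {sorted(bad)}")
        residues.append(res)
        structure.append(sec)
    return "".join(residues), "".join(structure)
```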


Advanced Sequence Input

The following section explains how the user can enter specific knowledge into the alignment process.

Sometimes xalign will insert gaps into an alignment where you think they are not correct. You could change any of the various gapping penalties in the parameter file, but this will likely change your entire alignment (which you may already be happy with). To prevent the program from breaking up a certain section of amino acids, just type asterisks above those amino acids. Because the program ignores blanks in the input sequence, the amino acids without asterisks must get some kind of default character. In this case, use a "-" or dash character.

Here is an example of a sequence input file which has this amino acid clustering:

         >unkn1 unknown protein mouse 
         -------------******-------------------------------
         SRTEYDPLKFWPITHYCPHSARKDTYPERFYANMPKLDNQGPLSTYPLST
         cchhhhhhhhhchhhhhhhccccccccccbbbbccccccccccccchhhh
         ---------------
         QWPIIVDTASATLMS
         hhcbbbbbbcccccc

In the example above, the asterisks over "THYCPH" will ensure that the program will not break up these amino acids in the alignment.
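
Counting dashes by hand is error-prone, so the marker line can be generated instead. This is a convenience sketch only; `cluster_mask` is a hypothetical helper, and it marks just the first occurrence of the motif within one residue line.

```python
def cluster_mask(residue_line, motif):
    """Return a '-'/'*' line with asterisks over the first occurrence of motif."""
    start = residue_line.find(motif)
    if start < 0:
        raise ValueError(f"{motif!r} not found in residue line")
    end = start + len(motif)
    return "-" * start + "*" * len(motif) + "-" * (len(residue_line) - end)
```

For the example above, `cluster_mask("SRTEYDPLKFWPITHYCPHSARKDTYPERFYANMPKLDNQGPLSTYPLST", "THYCPH")` places the six asterisks in columns 14 to 19.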

Another potentially useful tool is the ability to anchor a certain amino acid in one sequence to a certain amino acid in another sequence. One can imagine a scenario where a user knows that two amino acids line up but, because of remote homology, xalign can't recognize the significance of that particular match.

To implement this anchoring procedure, the user specifies a number from 1 to 5 above the amino acid to anchor in the first sequence. The user then specifies that same number above the matching amino acid in the second sequence.

Here is an example of anchoring one amino acid in one sequence to another amino acid in another sequence:


         >unkn1 this protein unknown for mouse
         ---------1----------------------------------------
         SRTEYDPLKFWPITHYCPHSARKDTYPERFYANMPKLDNQGPLSTYPLST
         cchhhhhhhhhchhhhhhhccccccccccbbbbccccccccccccchhhh
         ---------------
         QWPIIVDTASATLMS
         hhcbbbbbbcccccc

         >unkn2 some other protein
         ---------1----------------------------------------
         SRSELDPLKFMPLPITYCGHSAREATYPERDDANMPKLENSTGPLQTYPL
         ------------------
         LSYQCPIIVDTAKHLLNS

The anchoring procedure can be applied to any number of sequences. You can also have the same anchoring number appear more than once in a sequence; in that case the program chooses the anchor which maximizes the total alignment score.
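
An anchor number only makes sense if it appears in more than one sequence. The sketch below reads the marker lines and reports anchor numbers that occur in a single sequence only. It is a sanity check written for this document, not something xalign provides; the helper names are hypothetical, and it assumes a marker line may contain only "-", "*" and the digits 1 to 5.

```python
def anchor_positions(mask):
    """Map each anchor digit to the 0-based columns where it appears."""
    found = {}
    for col, ch in enumerate(mask):
        if ch in "12345":
            found.setdefault(ch, []).append(col)
        elif ch not in "-*":
            raise ValueError(f"unexpected character {ch!r} in column {col + 1}")
    return found

def unpaired_anchors(masks_by_id):
    """Return anchor digits that occur in only one sequence."""
    owners = {}
    for seq_id, mask in masks_by_id.items():
        for digit in anchor_positions(mask):
            owners.setdefault(digit, set()).add(seq_id)
    return sorted(d for d, ids in owners.items() if len(ids) < 2)
```

Here `masks_by_id` maps each sequence ID to its marker lines joined into one string.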


Running xalign

  1. Type "xalign".

    If you do not get a graphical window, check with your system administrator to make sure the program has been installed and is accessible to you. A common problem is that your PATH environment variable needs to be changed to include the location of the installed xalign program.

    If you are logged in remotely, then enter the first command in the console window and the second in your remote login window:

    		xhost +remoteMachine
    		setenv DISPLAY hostMachine:0
    
    This allows xalign to run on the remote machine but the display will go to the host computer.

  2. Enter your xalign sequence datafile.

  3. Click the button indicating how sequences are to be aligned.

    • align sequences to one selected protein
    • the computer decides the order of sequences
    • align sequences in the order in which they are input

    If you click the first button, the sequences are displayed and the user clicks the sequence to align to.

    Since the multiple sequence alignment algorithm is heuristic, xalign can generate different alignments depending on the order in which the sequences are processed. The default computer algorithm is to align sequences from most to least homologous, starting first with those sequences that have structure determined. You as the user have the choice of selecting the initial sequence to align to or even deciding the complete order for processing sequences. This freedom is basically allowed for experimental purposes. Most of the time your best alignment should occur when you select the option that allows the computer to decide the alignment order.

  4. Enter your output file.

  5. Click the "execute" button.

  6. Finally click the "display results" button.


Output

The output of xalign consists of the following:
  1. program version and current date
  2. list of alignment parameters used
  3. (optional) pairwise alignments for each pair of sequences
  4. final multiple alignment with consensus sequence

The printing of pairwise alignments is an option for the user in the xalign.parms file. The bars "|" in the pairwise alignment indicate amino acids which are identical, and the asterisks "*" denote amino acids which are similar. The current amino acid number is printed at the end of each sequence line.
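
To make the marker convention concrete, the sketch below builds such a line for two already-aligned strings. The similarity groups are a made-up stand-in for whatever the scoring matrix defines as "similar", and the use of "-" for gaps in the aligned strings is also an assumption.

```python
# Hypothetical similarity groups; xalign derives similarity from its scoring matrix.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]

def marker_line(aligned_a, aligned_b, groups=SIMILAR_GROUPS):
    """Return '|' for identical, '*' for similar, ' ' otherwise."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    out = []
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            out.append(" ")   # a gap never counts as a match
        elif x == y:
            out.append("|")
        elif any(x in g and y in g for g in groups):
            out.append("*")
        else:
            out.append(" ")
    return "".join(out)
```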

The "percent sequence homology" is calculated as the score of the current alignment divided by the score of the perfect alignment. It is not the number of amino acids which match over the length of the alignment.
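
A small worked example shows why this differs from a simple identity count. The text does not define the "perfect alignment"; the sketch assumes it means the first sequence aligned against itself, and it uses a toy scoring function and gap penalty rather than xalign's real matrix.

```python
def toy_score(x, y):
    """Hypothetical scoring: 2 for identity, 1 for D/E, 0 otherwise."""
    if x == y:
        return 2.0
    if {x, y} == {"D", "E"}:
        return 1.0
    return 0.0

def alignment_total(aligned_a, aligned_b, score=toy_score, gap=-2.0):
    """Sum column scores; any column containing '-' costs the gap penalty."""
    total = 0.0
    for x, y in zip(aligned_a, aligned_b):
        total += gap if "-" in (x, y) else score(x, y)
    return total

def percent_homology(aligned_a, aligned_b, score=toy_score, gap=-2.0):
    """Alignment score as a percentage of the assumed perfect score."""
    reference = aligned_a.replace("-", "")
    perfect = alignment_total(reference, reference, score, gap)
    return 100.0 * alignment_total(aligned_a, aligned_b, score, gap) / perfect
```

With these toy values, aligning "ADKL" to "AEK-" scores 2 + 1 + 2 - 2 = 3 against a perfect score of 8, giving 37.5, even though 2 of the 4 columns (50%) are identical.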

The ranking of pairwise sequences is determined by the "alignment score". The alignment score is the percent sequence homology score plus a constant if the sequence has secondary structure determined.

The order of sequences in the multiple alignment is based on the sequences which occur first within the ranked pairwise alignments. Usually it is better to align sequences from most to least homologous starting first with those that have structure determined.
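
The ranking and ordering rules in the last two paragraphs can be sketched as follows. The bonus value of 10.0 is invented for illustration, and the sketch assumes the bonus applies when either sequence of a pair has secondary structure; the text does not say which. The function names are hypothetical.

```python
def ranking_score(percent, has_structure, struct_bias=10.0):
    """Percent homology plus a constant when structure is available."""
    return percent + (struct_bias if has_structure else 0.0)

def processing_order(pair_percent, with_structure, struct_bias=10.0):
    """Order sequences by their first appearance in the ranked pairs.

    pair_percent maps (id_a, id_b) to percent sequence homology;
    with_structure is the set of IDs that have secondary structure.
    """
    def key(item):
        (a, b), percent = item
        bonus = a in with_structure or b in with_structure
        return ranking_score(percent, bonus, struct_bias)

    order = []
    for (a, b), _ in sorted(pair_percent.items(), key=key, reverse=True):
        for seq_id in (a, b):
            if seq_id not in order:
                order.append(seq_id)
    return order
```

With pairs A-B at 80, A-C at 75 and B-C at 60, the order is A, B, C when no structure is known, but becomes A, C, B once C has structure, because the A-C pair then ranks first.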

Note that the consensus sequence is based on a threshold percent identity which is set in the parameter file. If the threshold is reached, then that amino acid is printed; otherwise a dash "-" is printed.
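
The threshold rule can be illustrated with a short sketch. Two details are assumptions: the percentage is taken over all sequences in the column (gaps included in the denominator), and "reached" is read as greater than or equal to the threshold. The function name and default threshold are hypothetical.

```python
from collections import Counter

def consensus(aligned_rows, threshold=50.0):
    """Return the most common residue per column if it meets the threshold."""
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    out = []
    for col in range(width):
        counts = Counter(
            row[col].upper() for row in aligned_rows if row[col] != "-"
        )
        symbol = "-"
        if counts:
            residue, n = counts.most_common(1)[0]
            if 100.0 * n / len(aligned_rows) >= threshold:
                symbol = residue
        out.append(symbol)
    return "".join(out)
```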


Alignment Analysis

First it is important for the user to realize that the programming model makes a number of assumptions and simplifications in order to turn multiple sequence alignment into a mathematical problem. Secondly, the user should realize that solving this particular mathematical problem "perfectly" is impractical for 3 or more sequences.

The xalign program was developed to handle the majority of alignment requests in a reasonable manner. Compromising the relatively straightforward algorithm for special classes of alignments of probable nature seemed beyond the intent of the program. Since the tools are available for the user to correct the errors, let him or her use them.

The following suggestions can help you use the xalign program to arrive at the best alignment possible. Some of these suggestions involve modifying variables (denoted as XALIGN:) in the xalign.parms file.

  1. If you see an obvious alignment mistake that can be corrected, first try the advanced sequence input file clustering or anchoring options. Most alignment problems can eventually be solved this way.

  2. Because related sequences can be so remote, it is possible that xalign is unable to find the key alignment areas. Help the program by using the anchoring capabilities available.

  3. Gapping on both sides of an amino acid can be part of an "optimal" solution though it is neither realistic nor appealing (especially if gap penalties are cheap). One solution is to increase your gap or gap size penalties found in the XALIGN:SCOR_MATRIX entry of the xalign.parms file. If you like the gap penalties the way they are, then try using the clustering character '*' to span over several amino acids in your sequence input.

  4. If the alignment has gaps in your beta or helical regions, you may want to increase the secondary structure gap penalties in XALIGN:SCOR_MATRIX. The default values are set fairly low to allow xalign to find the correct alignment even if there are mistakes in secondary structure assignment.

  5. Look at the weights assigned to the amino acid similarity matrix found in XALIGN:SCOR_MATRIX. If certain amino acid or structure matching is very important, you may want to increase these scores. The defaults should handle most cases though.

  6. The slowest part of the multiple alignment algorithm is determining pairwise alignments. You can greatly increase the speed of xalign by not printing pairwise alignments AND by pre-ordering your sequences. Be careful if you decide to order your sequences in the alignment rather than having xalign do it. You can easily get some pretty strange alignments if a remote sequence is processed near the beginning of a multiple alignment.

  7. Consider the case where you have a remote sequence which gets processed early in the multiple alignment because it has secondary structure determined. In this case it may be better to order the sequences so that the remote one is near the end OR drastically lower the XALIGN:STRUCT_BIAS parameter.

  8. Sometimes the weight of several identical or extremely homologous sequences can greatly constrain the options available for adding a remote sequence to the alignment. If this is your scenario, attempt the alignment of the remote sequence with only a couple of the extremely homologous sequences and compare the results.

  9. Sometimes it is difficult to decide if a remote homology is "real" or just a chance occurrence of amino acids. First try doing a multiple alignment of those sequences which you know are similar. If adding the remote sequence greatly changes the alignment, be suspicious.

  10. If the alignment seems to have too many or too few gaps, try changing the gap penalty in XALIGN:SCOR_MATRIX.

  11. If the alignment gaps seem unreasonably big, try increasing the gap size penalty in XALIGN:SCOR_MATRIX.


Alignment Parameters

The "xalign.parms" parameter file contains default settings for gap penalties and amino acid similarity, as well as some useful output options. The program looks for an "xalign.parms" file in the current directory; if one does not exist, it uses the one in $INSTALL/lib/xalign/xalign.parms.

Users who are interested in changing some of the default settings in order to get better alignments may want to copy the above file to their current directory and try various changes.
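
To see which parameter file a run would pick up, the lookup order can be mimicked in a few lines. This sketch treats $INSTALL as an environment variable, which is an assumption (it may simply stand for the installation directory), and `find_parms` is a hypothetical helper, not part of xalign.

```python
import os

def find_parms(cwd=".", environ=None):
    """Return the xalign.parms path that the lookup order would select."""
    environ = os.environ if environ is None else environ
    local = os.path.join(cwd, "xalign.parms")
    if os.path.isfile(local):
        return local  # a file in the current directory takes precedence
    install = environ.get("INSTALL")
    if install:
        return os.path.join(install, "lib", "xalign", "xalign.parms")
    return None  # neither location is known
```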


Last modified: Mar 19, 1997

Robert Boyko - robert.boyko@ualberta.ca