Dali1dPlugin - Dali1d sequence alignment plugin for Geneious
I've created a Dali 1d sequence alignment plugin in Java for Geneious 3.0.6.
The following example shows the alignment of 1LYZ to 2LZM, as described in Holm L, Sander C. 'Protein structure
comparison by alignment of distance matrices.' J Mol Biol. 1993 Sep 5;233(1):123-38 (the 'H&S paper' below).
In Geneious:
- Select the structures 1LYZ and 2LZM:
- Click the 'Alignment' tool button, and select 'Pre-align 3d structures'. You
should see the following default settings:
These settings are:
- Pattern length: The number of contiguous residues in each protein which are inter-compared to build
initial alignments. The H&S paper uses hexapeptides.
- Distance threshold (A): The maximum intra-pattern (of two fragments on one protein) distance to consider when
creating the pattern similarity list. The H&S paper suggests 25 A, but a shorter distance produces smaller
lists (and shorter runtimes) without significantly degrading the initial alignments.
- Initial pattern list size: The maximum number of positively scoring pattern x pattern comparisons between the two proteins to
save. 80000 in the H&S paper.
- Final pattern list size: The truncated number of pattern x pattern comparisons to retain after comparisons have
been performed. 40000 in the H&S paper. The final pattern list is used to create seed alignments for the cycle of iterations.
- Population size: The number of candidate alignments to try during each iteration. A number from 1000 - 5000 is usually
sufficient. Larger populations allow for more variation in the alignment search, but take longer to run.
- Number of parents to keep: The number of high-scoring alignments to keep across iterations. All other
candidate alignments in the population are created from these. The H&S paper suggests 10. With a population size of
1010 and 10 parents, 1000 new alignments are created each iteration, or 100 per parent.
- Stability threshold (iters): The maximum number of iterations during which the top alignment score does
not change (freezes at a non-optimal solution) before all parent alignments are subjected to bit-clearing 'knockout'
(see below). The H&S paper performs 'knockout'
after a fixed number of iterations, but I have found that allowing iterations to continue until the leader does
not improve for a while seems to work better. 5-10 iterations is a good choice. Set this to a larger value to allow all trajectories to continue
to evolve without being subjected to periodic bit-clearing.
- Maximum number of iterations: The maximum number of iterations to perform before returning the best alignment.
- Minimum alignment length: The minimum number of aligned residues before returning the best alignment.
- Minimum similarity score: The minimum similarity score before returning the best alignment.
Iterations proceed until one of these three criteria is met.
- 'Clear bits' probability: The probability that mutation of a new alignment will involve clearing bits in the alignment
mask. The H&S paper performs a 'trimming cycle' on all alignments every 5 iterations. However, I have found that performing
trimming, expansion, and bit swapping on a continuous basis with fixed probabilities seems to work better.
- 'Swap bits' probability: The probability that mutation of a new alignment will involve moving bits in the alignment
mask from one place to another without changing the length of the alignment. There is no analog of this in the H&S paper.
- 'Set bits' probability: The probability that mutation of a new alignment will involve setting bits in the alignment
mask. The H&S paper performs an 'expansion cycle' on all alignments 4 out of every 5 iterations.
The H&S paper confines all trimming and expansion operations to units of 4 peptides. However, I have found that randomly
choosing peptide sizes of 1, 2, 4, or 6 contiguous bits each operation seems to work better.
- Knockout length (fraction): The fraction of bits to randomly clear in each parent alignment after the leading alignment has
reached stability according to the setting above. The H&S paper uses 20% of the total length.
- Print informational output to console: Print ongoing run information to the console, including intra-pattern lengths, pattern
similarity size and scores, seed construction, and best alignment during each iteration. See below for an example of such output.
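The three mutation probabilities and the random unit sizes described above can be pictured with the following Java sketch.
It is not the plugin's source code: the class and method names are hypothetical, the alignment mask is assumed to be a
java.util.BitSet with one bit per position, and each operator is assumed to fire independently with its own probability.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical sketch (not the plugin's code) of clear/swap/set mutation on an alignment mask. */
final class MaskMutationSketch {
    // Unit sizes picked at random for every operation: 1, 2, 4, or 6 contiguous bits.
    private static final int[] UNIT_SIZES = {1, 2, 4, 6};

    /** Returns a mutated copy of mask; the parent mask itself is left untouched. */
    static BitSet mutate(BitSet mask, int length, double pClear, double pSwap,
                         double pSet, Random rng) {
        BitSet child = (BitSet) mask.clone();
        // Assumption: each operator is tried independently with its own probability.
        if (rng.nextDouble() < pClear) {
            int size = unitSize(length, rng);
            int start = rng.nextInt(length - size + 1);
            child.clear(start, start + size);   // trimming: the alignment can only shrink
        }
        if (rng.nextDouble() < pSwap) {
            swapRuns(child, length, unitSize(length, rng), rng);
        }
        if (rng.nextDouble() < pSet) {
            int size = unitSize(length, rng);
            int start = rng.nextInt(length - size + 1);
            child.set(start, start + size);     // expansion: the alignment can only grow
        }
        return child;
    }

    /** Picks one of the unit sizes, capped at the mask length. */
    private static int unitSize(int length, Random rng) {
        return Math.min(UNIT_SIZES[rng.nextInt(UNIT_SIZES.length)], length);
    }

    /** Exchanges two disjoint runs of equal size, so the number of set bits is unchanged. */
    private static void swapRuns(BitSet mask, int length, int size, Random rng) {
        if (length < 2 * size) {
            return;                             // no room for two disjoint runs
        }
        int a = rng.nextInt(length - 2 * size + 1);
        int b = a + size + rng.nextInt(length - 2 * size - a + 1);
        for (int i = 0; i < size; i++) {
            boolean x = mask.get(a + i);
            boolean y = mask.get(b + i);
            mask.set(a + i, y);
            mask.set(b + i, x);
        }
    }
}
```

Exchanging two equal-sized runs is only one way to move bits without changing the alignment length. It was chosen here
because it keeps the count of set bits constant by construction.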
- Click OK to start the alignment. A progress dialog will appear, showing an estimate of the percentage of the alignment
calculation that has been performed. Click 'Cancel' to abort the alignment in progress. If you have chosen to print to the console,
information about the progress of the alignment will appear there.
Here is an annotated example of output displayed during the calculation of an alignment.
- When the alignment is complete (about 6 minutes for the default settings on a 2 GHz MacBook) it will be displayed in
This particular alignment has 75 corresponding residues and a similarity score of 243.6, which is exceptionally good.
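The stability threshold, knockout fraction, and stopping criteria from the Dali1d settings can be read as a small piece of
loop control. The Java sketch below is an illustration under stated assumptions, not the plugin's implementation: all
names are hypothetical, the knockout fraction is assumed to apply to the bits currently set in a parent, and the run is
assumed to stop as soon as any one of the three criteria is reached.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of stability tracking, knockout, and stopping criteria. */
final class IterationControlSketch {
    private final int stabilityThreshold;
    private double bestScore = Double.NEGATIVE_INFINITY;
    private int staleIterations = 0;

    IterationControlSketch(int stabilityThreshold) {
        this.stabilityThreshold = stabilityThreshold;
    }

    /** Call once per iteration; returns true when the parents should be knocked out. */
    boolean leaderIsStuck(double topScore) {
        if (topScore > bestScore) {
            bestScore = topScore;       // leader improved: restart the count
            staleIterations = 0;
            return false;
        }
        staleIterations++;
        if (staleIterations >= stabilityThreshold) {
            staleIterations = 0;        // knockout fires, start counting again
            return true;
        }
        return false;
    }

    /** Clears a random fraction of the set bits (assumption: fraction of set bits, rounded). */
    static void knockout(BitSet parent, double fraction, Random rng) {
        List<Integer> setBits = new ArrayList<>();
        for (int i = parent.nextSetBit(0); i >= 0; i = parent.nextSetBit(i + 1)) {
            setBits.add(i);
        }
        Collections.shuffle(setBits, rng);
        int toClear = (int) Math.round(fraction * setBits.size());
        for (int k = 0; k < toClear; k++) {
            parent.clear(setBits.get(k));
        }
    }

    /** True when any one of the three stopping criteria is reached. */
    static boolean shouldStop(int iteration, int maxIterations,
                              int alignedResidues, int minAlignmentLength,
                              double score, double minSimilarityScore) {
        return iteration >= maxIterations
                || alignedResidues >= minAlignmentLength
                || score >= minSimilarityScore;
    }
}
```

With a stability threshold of 3, for example, a top score that stays at the same value triggers a knockout on the third
consecutive iteration without improvement.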
To perform a 3d alignment using the Align3d plugin:
- Select an existing 1d alignment made from 2 PDB files (such as the alignment just completed above), click the 'Alignment'
tool, and choose 'Align 3d structures'. You should then see the following dialog:
These settings are:
- Number of iterations: Number of steps of the McLachlan conjugate axes minimization algorithm to perform. At each step,
3 different axes will be used to rotate the movable protein into alignment with the fixed protein. Usually, just a few
iterations will be required to minimize the RMSD error.
- Index of fixed (reference) sequence/structure: 1-based index of the fixed protein's sequence in the selected alignment.
- Index of movable (aligned) sequence/structure: 1-based index of the movable protein's sequence in the selected alignment.
- Print informational output to console: Print information about the alignment to the console.
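A 3d superposition needs residue pairs taken from the selected 1d alignment. The Java sketch below shows one way to
derive them, assuming that a pair is any alignment column in which neither row has the gap character '-'. The class
name, method name, and the short example rows in the note that follows are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: derive residue pairs from two gapped alignment rows. */
final class AlignmentLinksSketch {
    /**
     * Returns {fixedResidue, movableResidue} pairs as 0-based residue indices,
     * one pair for every column where both rows hold a residue.
     */
    static List<int[]> pairedResidues(String fixedRow, String movableRow) {
        if (fixedRow.length() != movableRow.length()) {
            throw new IllegalArgumentException("alignment rows must have equal length");
        }
        List<int[]> links = new ArrayList<>();
        int fixedIndex = 0;     // residues seen so far in the fixed row
        int movableIndex = 0;   // residues seen so far in the movable row
        for (int col = 0; col < fixedRow.length(); col++) {
            boolean hasFixed = fixedRow.charAt(col) != '-';
            boolean hasMovable = movableRow.charAt(col) != '-';
            if (hasFixed && hasMovable) {
                links.add(new int[] {fixedIndex, movableIndex});
            }
            if (hasFixed) {
                fixedIndex++;
            }
            if (hasMovable) {
                movableIndex++;
            }
        }
        return links;
    }
}
```

For the made-up rows "KVFG-RC" (fixed) and "--MNI-F" (movable), this yields the 0-based pairs (2,0), (3,1), and (5,3).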
- Click 'OK' to perform the alignment. If output to the console is selected, it will be displayed there, e.g.:
===== Wednesday, August 15, 2007 10:11:21 AM US/Pacific =====
*** Align3dPlugin ***
iterations = 6
fixedIndex = 1
movableIndex = 2
outputFlag = 1
"Protein alignment 33" SequenceAlignmentDocument with 2 sequences:
1 "1LYZ" PdbDocument 1LYZ KVFGRCELAAAMKRHGLDNYRGYSL...
2 "2LZM" PdbDocument 2LZM -----------------------MN...
Fixed PDB = 1LYZ, fixed alignment = KVFGR...
Movable PDB = 2LZM, movable alignment = -----...
Calling Align3d(1LYZ, 2LZM, KVFGR..., -----...)
1LYZ has 129 nodes
2LZM has 164 nodes
2LZM has 75 links to 1LYZ
2LZM center at 40.880933 -8.357147 15.104293
1LYZ center at 0.871867 21.397107 21.101840
Initial error = 16.322712
0.574672 0.768853 -0.280385
-0.552787 0.112039 -0.825756
-0.603472 0.629533 0.489398
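The two centers and the 3x3 matrix printed above look like the ingredients of a rigid-body transform. The Java sketch
below shows how such values are commonly applied and how an RMSD over paired atoms is computed. It rests on assumptions
the output does not confirm: that the matrix is a rotation applied to column vectors, and that the transform is
y = R (x - movableCenter) + fixedCenter. The class and method names are hypothetical.

```java
/** Hypothetical sketch: apply a rotation about two centers and measure RMSD. */
final class RigidTransformSketch {
    /** Computes y = R (x - movableCenter) + fixedCenter (assumed convention). */
    static double[] transform(double[][] rotation, double[] movableCenter,
                              double[] fixedCenter, double[] x) {
        double[] y = new double[3];
        for (int i = 0; i < 3; i++) {
            double sum = 0.0;
            for (int j = 0; j < 3; j++) {
                sum += rotation[i][j] * (x[j] - movableCenter[j]);
            }
            y[i] = sum + fixedCenter[i];
        }
        return y;
    }

    /** Root-mean-square deviation between two equally long lists of paired 3d points. */
    static double rmsd(double[][] a, double[][] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("need the same non-zero number of points");
        }
        double sumSquares = 0.0;
        for (int k = 0; k < a.length; k++) {
            for (int j = 0; j < 3; j++) {
                double d = a[k][j] - b[k][j];
                sumSquares += d * d;
            }
        }
        return Math.sqrt(sumSquares / a.length);
    }
}
```

If the plugin instead stores the transpose of this matrix, or applies the translation in a different order, the same
structure applies with the indices or the order of operations swapped.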
- Once the alignment has been performed (just a second or two), the movable protein's coordinates will be transformed and
displayed in Geneious. A text view of the transformed protein will also include information about the alignment:
Both the transformed and reference proteins are included in the output PDB file, so they can be displayed in one window.
Here is 1LYZ+2LZM(aligned) in Geneious Basic:
The 1d alignment was length=78, score=245.805954, rmsd=3.757917. You may need
Geneious Pro to color the two chains differently. However, you can do it externally in the free version of Jmol:
Tips for use
- The default settings are pretty good for finding reasonable alignments in a fairly short time. To find better
alignments at the expense of time, try using a larger population (e.g. 2525 with 25 parents, or 5050 with 50 parents). To
keep the parents from being reset as often, set the stability threshold to a larger value (e.g. 10-25 iterations or more), or set the
knockout fraction to a smaller value (e.g. 0.05-0.10).
- Set the distance threshold to a larger (e.g. 25-50 A) or smaller (e.g. 10-20 A)
value to create more or fewer initial seed alignments. The only purpose of calculating the pattern similarity list is to
start the iterations with a set of good seeds. It is possible to get good alignment results with fewer initial seeds, possibly
trading additional iteration time for less pattern comparison time.
- The alignment calculation is stochastic, and it will generally not return the same answer every time. It may therefore
be best to perform the alignment several times, and then keep the result that produces the largest number of paired residues,
the greatest similarity score, or the smallest 3d RMSD error, whichever is more important to you. The alignment similarity score
is not linear: that is, a small change in alignment (1 set of residues) can have a significant effect on the overall score
and resulting RMSD value.
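Keeping the best of several runs, as suggested in the last tip, amounts to choosing a ranking criterion and taking the
maximum. The Java sketch below assumes each run has been summarized by hand or by a script into three numbers. The
RunResult class and all other names are hypothetical and are not part of the plugin.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: pick the best of several stochastic alignment runs. */
final class BestOfRunsSketch {
    /** Summary of one completed run (1d alignment followed by 3d alignment). */
    static final class RunResult {
        final int pairedResidues;
        final double similarityScore;
        final double rmsd;

        RunResult(int pairedResidues, double similarityScore, double rmsd) {
            this.pairedResidues = pairedResidues;
            this.similarityScore = similarityScore;
            this.rmsd = rmsd;
        }
    }

    static final Comparator<RunResult> MOST_PAIRS =
            Comparator.comparingInt((RunResult r) -> r.pairedResidues);
    static final Comparator<RunResult> HIGHEST_SCORE =
            Comparator.comparingDouble((RunResult r) -> r.similarityScore);
    // Reversed so that the "maximum" under this ordering is the smallest RMSD.
    static final Comparator<RunResult> LOWEST_RMSD =
            Comparator.comparingDouble((RunResult r) -> r.rmsd).reversed();

    /** Returns the run that ranks highest under the chosen criterion. */
    static RunResult best(List<RunResult> runs, Comparator<RunResult> criterion) {
        return Collections.max(runs, criterion);
    }
}
```

The three criteria can disagree, so calling best(runs, LOWEST_RMSD) and best(runs, MOST_PAIRS) on the same list may
return different runs.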
© 2007 Sky Coyote