Stochastic 3d Protein Alignment

I've started working on a way to generate synthetic 1d alignments in an attempt to find the one that produces the best 3d alignment for a fixed number of atom pairs. In the examples below, I'm aligning 1LYZ and 2LZM, as in the 1993 Holm & Sander 'Dali' paper. In each example, random subsets of N ≥ 3 atoms were paired between the two proteins, and the subset that gave the best 3d alignment is shown.
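A minimal sketch of this kind of random search, under some assumptions the text does not confirm: the 3d fit is scored by RMSD after an optimal rigid superposition (the Kabsch method is assumed here), and each protein is given as an (atoms x 3) NumPy array of coordinates. The function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) point sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)                       # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)            # SVD of the 3x3 covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # -1 would mean a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # proper rotation taking P onto Q
    diff = (R @ P.T).T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def random_pairing(n_a, n_b, n_pairs, rng):
    """Pick n_pairs distinct atoms in each protein and pair them in draw order.

    Sampling without replacement keeps one dot per row and per column.
    """
    rows = rng.choice(n_a, size=n_pairs, replace=False)
    cols = rng.choice(n_b, size=n_pairs, replace=False)
    return rows, cols

def best_random_alignment(A, B, n_pairs, n_trials, seed=0):
    """Return (rmsd, rows, cols) for the best of n_trials random pairings."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_trials):
        rows, cols = random_pairing(len(A), len(B), n_pairs, rng)
        err = kabsch_rmsd(A[rows], B[cols])
        if err < best[0]:
            best = (err, rows, cols)
    return best
```

Plotting the returned `rows` against `cols` would give a dot plot of the kind described in the examples that follow.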

With N = 3 pairs, you can get the 3d fit down to a very small RMSD value:

Even with 5 pairs, the best RMSD is below 1 Angstrom:

But, with 10 pairs, the 3d fit starts to become poorer:

With 25 pairs, the fit is worse still. Nevertheless, notice that there is structure in the dot plot, and the 'best' synthetic 1d alignment tends to show clustering among nearby atoms. Some of these clusters even fall near the Dali alignment:

50 pairs:

100 pairs:

In this case, even the synthetic 1d alignment is better than Needleman-Wunsch for a 3d fit, although it uses slightly fewer pairs. This work is promising, but there appear to be two problems:

  1. Entropy. Based on the Needleman-Wunsch (NW) and Dali alignments, one might expect a good 1d/3d alignment to consist of short line segments of nearby atoms, with gaps in between. These segments could run from bottom left to top right for subsequences going in the same direction, or from top left to bottom right for antiparallel subsequences. Purely random alignments are unlikely to converge on this kind of coherent structure. What I might try is to start with a linear sequential alignment and perturb it away from linearity, rather than the other way around, to see if it converges.

  2. Freezing. Since there can only be one dot per row and column, alignments with many pairs tend to get stuck in a particular configuration with nowhere to go. I need to find a way to 'unfreeze' these configurations away from local minima so that the error can continue to drop.
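One possible way to approach both problems, sketched here purely as an illustration and not as the method actually used: start from the linear sequential alignment, perturb it with moves that always keep one dot per row and column, and accept occasional uphill moves (a simulated-annealing rule is assumed) so that a frozen configuration can still escape a local minimum. The alignment is represented as a list `cols`, where `cols[k]` is the atom in the second protein paired with atom `k` of the first. All names and parameters are hypothetical.

```python
import math
import random

def diagonal_alignment(n_pairs):
    """Linear sequential start: atom k is paired with atom k."""
    return list(range(n_pairs))

def propose(cols, n_b, rng):
    """Return a perturbed copy of cols in which every column is still unique."""
    new = list(cols)
    free = sorted(set(range(n_b)) - set(new))
    if free and rng.random() < 0.5:
        # Relocate one dot to a column that is not in use.
        new[rng.randrange(len(new))] = rng.choice(free)
    else:
        # Swap the columns of two dots; this is always possible,
        # even when every column is occupied.
        i, j = rng.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    return new

def anneal(cols, n_b, cost, steps=2000, t0=1.0, seed=0):
    """Minimise cost(cols) with a linearly cooled annealing schedule."""
    rng = random.Random(seed)
    cur, cur_cost = list(cols), cost(cols)
    best, best_cost = cur, cur_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9
        cand = propose(cur, n_b, rng)
        c = cost(cand)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost
```

In practice `cost` would be the RMSD of the 3d superposition for the paired atoms; any function that maps a list of columns to a number will work with this sketch.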

I've solved these problems, and now when aligning on 100 atom pairs, the best I have done is an RMSD of 2.88 Angstroms. The 1d alignment shows significant clustering:

Another 3d alignment at about the same RMSD value shows a similar 1d pattern:

Other alignments with slightly higher RMSD values show 1d patterns that differ slightly from these but resemble one another:

However, an alignment with a better (lower) RMSD value shows a very different 1d pattern:

For 77 atom pairs, the best I have done was 2.54 Angstroms. The Dali result in the Holm & Sander paper was 4.2 Angstroms, although the 1d alignment was somewhat different:

For 42 atom pairs, the best I have done was 1.86 Angstroms. The Dali result in the Holm & Sander paper was 2.2 Angstroms:

Although I am now getting good alignments, the biological significance of the results is questionable:

  1. Why do 3d alignments with similar RMSD values result from very different 1d alignments? Do these represent equally likely conformational geometries of actual molecules (e.g. enzymes, prions)?

  2. Although there is clustering in the 1d alignments, the clusters are generally not linear, contiguous subsequences of atoms, whereas real proteins appear to have linear homologous segments. Perhaps I should try constraining the random 1d alignments to allow only contiguous linear subsequences of length N ≥ 4, as suggested by the Holm & Sander paper.
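The constraint proposed in the second point could be sketched roughly as follows. This is an assumption about how such a generator might look, not an existing implementation: it places random diagonal segments of at least `min_len` atoms, either parallel or antiparallel, and rejects any segment that would reuse a row or column. The function name and all parameters, including `max_len` and `max_tries`, are hypothetical.

```python
import random

def random_segment_alignment(n_a, n_b, n_segments, min_len=4, max_len=8,
                             seed=0, max_tries=1000):
    """Build a 1d alignment from contiguous segments of length >= min_len.

    Returns a sorted list of (row, col) pairs with no row or column reused.
    Fewer than n_segments may be placed if max_tries runs out.
    """
    rng = random.Random(seed)
    used_rows, used_cols, pairs = set(), set(), []
    placed = tries = 0
    while placed < n_segments and tries < max_tries:
        tries += 1
        length = rng.randint(min_len, max_len)
        r0 = rng.randrange(n_a - length + 1)
        c0 = rng.randrange(n_b - length + 1)
        rows = list(range(r0, r0 + length))
        cols = list(range(c0, c0 + length))
        if rng.random() < 0.5:
            cols.reverse()                      # antiparallel segment
        # Reject the segment if it collides with one already placed.
        if used_rows.isdisjoint(rows) and used_cols.isdisjoint(cols):
            used_rows.update(rows)
            used_cols.update(cols)
            pairs.extend(zip(rows, cols))
            placed += 1
    return sorted(pairs)
```

Each candidate alignment produced this way could then be scored by its 3d RMSD in the same manner as the unconstrained random alignments.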

© Sky Coyote 2007