SPI-EM: Detecting the CATH superfamily in 3D-EM maps

 
Velazquez-Muriel, J. A., Sorzano, C. O., Scheres, S. H. and Carazo, J. M. (2005) SPI-EM: Towards a Tool For Predicting CATH Superfamilies in 3D-EM Maps. J Mol Biol, 345, 759-771

 
SPI-EM is a new tool for determining the homologous superfamily to which a protein domain belongs from its three-dimensional electron microscopy (3D-EM) map. The homologous superfamily is assigned according to the domain-architecture database CATH. Our method follows a probabilistic approach applied to the results of fitting protein domains into maps of proteins and computing local cross-correlation coefficients. The method has been tested, and its usefulness demonstrated, with isolated domains at resolutions of 8 and 12 Å. Results obtained with simulated and experimental data at 10 Å suggest that it is also feasible to detect the correct superfamily of the domains in electron microscopy maps containing multi-domain proteins. Our procedure is complementary to other techniques in the field that detect structural elements such as α-helices and β-sheets in electron microscopy maps.

Introduction

Different manual and computational techniques for finding protein domains in 3D-EM maps have been reported. In the manual approach, the researcher employs visualization programs to fit a high-resolution structure, typically obtained by X-ray crystallography, into a medium-resolution 3D-EM map. Most computational approaches find the orientation of the high-resolution structure in the 3D-EM map by maximizing a measure of fitness. The most popular measures of fitness are the cross-correlation coefficient (CCC) and its variants: the local cross-correlation coefficient (LCCC) and the rotational correlation function. All computational techniques tend to fail when only a part of the structure is fitted, because the densities of surrounding domains blur when the resolution is lowered. As the resolution decreases, every atom blurs into a larger region of space, and this region can overlap the space occupied by the domain of interest. The maximum of the CCC can then place the high-resolution structure in a wrong position, because surrounding electron densities interfere with the density at the correct position. The LCCC minimizes the influence of the surrounding density, because the correlation is computed only in the voxels occupied by the fitted domain. In both the manual and the computational approach, the researcher assumes that the high-resolution structure being fitted is actually present in the 3D-EM map and in the same structural conformation.

Because a given domain may not be present, or may appear in a slightly different conformation, our aim is not to find that particular domain, but to determine with high probability to which homologous superfamily (if any) the domain belongs. Superfamilies are defined by domain databases such as CATH, SCOP or FSSP.
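The difference between the CCC and the LCCC described in this introduction can be illustrated with a minimal sketch. It assumes a mean-subtracted (Pearson-style) correlation and treats the maps as flat lists of voxel densities; the function names, the masking threshold and the toy densities are illustrative assumptions, not the implementation used by any fitting program.

```python
from math import sqrt


def ccc(a, b):
    """Mean-subtracted cross-correlation coefficient of two equal-length density lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def lccc(map_density, model_density, threshold=0.0):
    """CCC restricted to voxels occupied by the fitted model (model density > threshold).

    The threshold is a hypothetical way of defining 'occupied' voxels.
    """
    pairs = [(m, d) for m, d in zip(map_density, model_density) if d > threshold]
    if len(pairs) < 2:
        return 0.0
    return ccc([p[0] for p in pairs], [p[1] for p in pairs])


# Toy 1D example: the model occupies the three central voxels; the map has
# strong density from neighbouring domains on both sides.
model_density = [0, 0, 1, 2, 3, 0, 0]
map_density = [5, 4, 1, 2, 3, 6, 5]

global_score = ccc(map_density, model_density)
local_score = lccc(map_density, model_density)
```

In this toy case the global CCC is dragged down by the neighbouring density, while the LCCC only sees the voxels covered by the model and reports a perfect match.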

 

SPI-EM: Probabilistic approach to superfamily assignment

We follow this approach to determine the superfamily to which a given 3D-EM map belongs:

  • First, the LCCCs obtained by fitting the domains of a CATH superfamily into the 3D-EM map are combined.
  • Second, we compare MCC, the arithmetic mean of the LCCCs, with a background distribution of arithmetic means for the CATH superfamily being tested. We use the set of all H-level representatives (Hreps) given by CATH v2.4 to generate the background distribution. It is obtained by fitting each member of the superfamily against every Hreps domain and computing the mean of the LCCCs for each Hreps domain.
  • Finally, the significance of MCC is assessed as the fraction of means in the background distribution with a poorer value. This fraction is our P-value. It is read directly as the value of the cumulative background distribution at MCC. We interpret this P-value as the probability that the domain present in the 3D-EM map belongs to the CATH superfamily tested. This test is repeated with every superfamily, and the superfamily with the highest P-value should be the one to which the domain belongs.
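The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the LCCC values are taken as already computed by a fitting program, "poorer" is read as "strictly lower", and all function names and numbers are hypothetical.

```python
from bisect import bisect_left
from statistics import mean


def p_value(lcccs, background_means):
    """Return (MCC, P-value) for one superfamily.

    lcccs: LCCCs of fitting each superfamily member into the query map.
    background_means: mean LCCC of each Hreps domain against the same superfamily.
    The P-value is the fraction of background means strictly below MCC.
    """
    mcc = mean(lcccs)
    ordered = sorted(background_means)
    return mcc, bisect_left(ordered, mcc) / len(ordered)


def best_superfamily(lcccs_by_superfamily, background_by_superfamily):
    """Repeat the test for every superfamily and return the one with the highest P-value."""
    scores = {
        name: p_value(lcccs, background_by_superfamily[name])[1]
        for name, lcccs in lcccs_by_superfamily.items()
    }
    return max(scores, key=scores.get), scores


# Made-up numbers for two hypothetical superfamilies "A" and "B".
background = {"A": [0.2, 0.3, 0.4, 0.5, 0.6], "B": [0.3, 0.5, 0.6, 0.7, 0.8]}
query_lcccs = {"A": [0.5, 0.6], "B": [0.4, 0.5]}

winner, scores = best_superfamily(query_lcccs, background)
```

Sorting the background once and using a binary search is equivalent to reading the empirical cumulative distribution at MCC.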

The fitting method can be any of those described in the introduction. In this work, we chose FRM because of its high speed.


Figure 1. Flow diagram of the methodology. Each Hreps domain is fitted into all the elements of the superfamily, and the mean is taken to obtain the set of MCC values. The cumulative distribution of MCC is built with them. To determine the probability that a 3D-EM map belongs to superfamily 1, the map is fitted into all the elements of the superfamily, the MCC is computed, and its P-value is read from the cumulative distribution. D and Q-values are computed using the MCC of the Hreps representative of superfamily 1 and the superfamily internal variability (standard deviation of the LCCC values).

Figure 2. Cumulative distribution for the mean of LCCC for 1.20.1060.10 (Taq DNA Polymerase; Chain T, domain 4) at 8, 10 and 12 Å resolution. The Y and X-axes represent the P-value and the mean MCC, respectively.

D-value and Q-value

We developed two quantities to complement the P-value in measuring the absolute error of MCC. First, the value of the difference between MCC and the mean of LCCCs for the superfamily representative in Hreps (MCC,Hrep) is computed:

$$D = M_{CC} - M_{CC,\mathrm{Hrep}} \qquad (1)$$

A positive D-value indicates that the domain present in the 3D-EM map is more similar to the domains of the superfamily tested than the superfamily representative present in Hreps. Therefore, it is an indication that the map's domain belongs to the superfamily. A D-value below zero indicates the opposite situation: the map's domain is less similar to the superfamily members than the representative. D is affected by the resolution: the higher the resolution, the smaller the D-value if the domain in the 3D-EM map does not belong to the superfamily.

Second, we measure the relationship between the D-value and the standard deviation, $\sigma_{\mathrm{Hrep}}$, of the LCCCs observed when the superfamily representative domain in Hreps is fitted into the rest of the domains. We call it the Q-value:

$$Q = \frac{|D|}{\sigma_{\mathrm{Hrep}}} \qquad (2)$$

A value of Q close to zero is desirable, as it indicates that the difference in similarity between the query map and the superfamily representative is not large in terms of the similarity among the superfamily members.
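As a small worked sketch of the two quantities, the code below assumes that D is the difference between MCC and the representative's mean LCCC, and that Q is the absolute D-value divided by the standard deviation of the representative's LCCCs. The use of the population standard deviation, the function name and the numbers are assumptions made for illustration.

```python
from statistics import mean, pstdev


def d_and_q(map_lcccs, hrep_lcccs):
    """Return (D, Q) for a query map against one superfamily.

    map_lcccs: LCCCs of fitting the superfamily members into the query map.
    hrep_lcccs: LCCCs of fitting the superfamily representative (from Hreps)
                into the other members of the superfamily.
    """
    mcc = mean(map_lcccs)
    mcc_hrep = mean(hrep_lcccs)
    sigma_hrep = pstdev(hrep_lcccs)  # population standard deviation (assumption)
    d = mcc - mcc_hrep
    q = abs(d) / sigma_hrep if sigma_hrep > 0 else float("inf")
    return d, q


# Made-up LCCC values.
d, q = d_and_q(map_lcccs=[0.5, 0.7], hrep_lcccs=[0.7, 0.9])
```

Here the query scores lower than the representative (D is about -0.2), and that gap is about two standard deviations of the representative's own LCCCs (Q is about 2).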

Use of the P, D and Q-values

To test the general applicability of these measures, we used the SPI-EM approach to determine, for 28 different domains, the correct superfamily among all CATH superfamilies with more than 10 elements (to obtain useful mean values) and fewer than 100 elements (for reasons of computational speed). Each of the 28 domains was selected from a different CATH A-level, to form a test set that is as diverse as possible and represents all protein topologies. In total, 477 CATH superfamilies satisfied the criteria mentioned above, and background distributions were computed for all of them.

In 19 of the 28 cases, the correct superfamily was detected with the highest P-value. In 5 cases, the correct superfamily was among the 4 highest P-values, and its D and Q-values were comparable to or better than those of the other three top-ranked superfamilies. In the remaining 4 cases, the correct superfamily could not be detected. The criteria used to consider a domain as belonging to a superfamily are not fixed, and depend on the resolution and on the variability among the superfamily members.

Superfamily aggregation: Towards topologies

At lower resolutions, different domains become more similar to each other and it becomes harder to distinguish one fold from another. At sufficiently low resolutions a given superfamily may become indistinguishable from another one, even if the domain architectures are quite different. This makes it impossible to assign the correct superfamily based only on the LCCC at very low resolutions.

We studied the aggregation at 12 Å for superfamilies 1.20.1060.10 and 1.10.530.40. Both superfamilies have a mainly-alpha architecture, with 5-6 and 11 α-helices respectively. Each member of 1.20.1060.10 was fitted into all the domains of 1.10.530.40 and vice versa. The criteria used to consider a domain as belonging to a superfamily were P-value>0.90, D-value>-0.1 and Q-value<1. Several domains of 1.20.1060.10 could be assigned to superfamily 1.10.530.40, and this effect is even clearer when fitting domains of 1.10.530.40 into 1.20.1060.10. This implies that the two superfamilies, or at least parts of them, are nearly indistinguishable at 12 Å, although they are quite different at atomic resolution. As an example, Figure 3(a) shows a domain from superfamily 1.20.1060.10 fitted into the 12 Å density of a domain belonging to 1.10.530.10. Although the secondary structure organization of the two domains differs considerably, at 12 Å the volumes are very similar, resulting in a relatively high LCCC value of 0.8533. This illustrates that high LCCC values do not necessarily imply that the underlying atomic structures are similar. If the study is repeated at a resolution of 8 Å, the two superfamilies are clearly separated (data not shown).

To show that superfamilies from very different classes can also aggregate, a similar experiment was performed with superfamilies 1.20.1060.10 (mainly-alpha) and 2.60.11.10 (mainly-beta). At 12 Å these superfamilies aggregate, with P-value>0.99, D-value>-0.1 and Q-value<0.2, but at a resolution of 8 Å they can be separated (Q-value between 2 and 14). Figure 3(b) shows an example of a domain from superfamily 1.20.1060.10 fitted into a member of superfamily 2.60.120.40 at 12 Å. Although the backbone structures of the two domains are very different, a high LCCC value of 0.879 was obtained. This is explained by the fact that many of the voxels of one domain are contained in the other.
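The acceptance criteria quoted above for the 12 Å study (P-value>0.90, D-value>-0.1 and Q-value<1) amount to a simple threshold test. The sketch below is only an illustration: the class and function names are hypothetical, and the thresholds are kept as parameters because the criteria are not fixed and depend on the resolution and on the variability within the superfamily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    """Thresholds used in the 12 Å aggregation study; adjustable for other cases."""
    min_p: float = 0.90
    min_d: float = -0.1
    max_q: float = 1.0


def accepts(p, d, q, criteria=Criteria()):
    """True if a domain would be considered a member of the tested superfamily."""
    return p > criteria.min_p and d > criteria.min_d and q < criteria.max_q


# A hypothetical strong match passes the default criteria.
strong_match = accepts(p=0.95, d=-0.05, q=0.5)

# A hypothetical weaker match fails on all three thresholds.
weak_match = accepts(p=0.60, d=-0.30, q=2.5)
```

Passing a different `Criteria` instance is one way to express relaxed thresholds for harder cases such as multi-domain maps.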

Figure 3. (a): Fitting of 2KZZA2 (1.20.1060.10, mainly-alpha, yellow map and blue skeleton) into 150LD0 (1.10.530.10, mainly-alpha, red map and skeleton) at 12 Å. (b): Fitting of domain 3BDPA2 (1.20.1060.10, mainly-alpha, green) into 4TSVA0 (2.60.120.40, mainly-beta, purple) at 12 Å.

Application to multi-domain maps with SITUS-COLORES

The statistical approach introduced in this paper can be extended to multi-domain 3D-EM maps, provided that individual domains are correctly located within the multi-domain map. We used the program COLORES, integrated in the SITUS suite, for this task. Since the LCCC is used as the similarity measure in our methodology, surrounding voxels of other domains should not interfere with the correlation computation. Therefore, our methodology, which was developed for single-domain 3D-EM maps, is still valid for the detection of superfamilies within multi-domain 3D-EM maps.
 

Application to experimental multi-domain maps

We studied the 10.3 Å EM structure of the unliganded E. coli GroEL chaperonin (EMD database code: EMD-1042) to illustrate the applicability of SPI-EM to experimental 3D-EM maps. GroEL contains 3 different domains: GroEL domain 1 (1.10.560.10), GroEL domain 2 (3.30.260.10) and the apical domain (3.50.7.10), which are repeated 14 times throughout the structure. The results of their superfamily detection are presented in Table 1.

Table 1. Superfamily detection for domains of experimental 3D-EM map EMD-1042 (GroEL chaperonin, unliganded, from E. coli). Resolution: 10.3 Å. Rows marked with * (shaded in the original) are the three superfamilies present in the map.

3D-EM map   Superfamily      MCC     P-value   D-value   Q-value
EMD-1042    1.10.560.10 *    0.642   0.663     -0.285    1.987
EMD-1042    3.50.7.10 *      0.584   0.206     -0.231    2.298
EMD-1042    1.10.220.10      0.521   0.134     -0.417    4.858
EMD-1042    3.30.260.10 *    0.553   0.118     -0.413    6.077
EMD-1042    1.10.1250.10     0.495   0.108     -0.374    3.639
EMD-1042    1.20.1060.10     0.448   0.105     -0.409    4.715

Superfamily 1.10.560.10 was readily detected, and superfamily 3.50.7.10 emerged as the second candidate, despite its relatively low P-value. Superfamily 3.30.260.10 was not detected. To compare these results with those that would be obtained from a density map of better quality, the experiment was repeated with a 10 Å simulated EM map based on the PDB structure corresponding to EMD-1042 (1GR5). In this case, superfamilies 1.10.560.10 and 3.50.7.10 are both correctly detected, with P-values of 0.962 and 0.847, respectively. For superfamily 3.30.260.10 a P-value of 0.431 was obtained, which is relatively low compared to the values for the other two superfamilies. This indicates that this superfamily is more difficult to detect, and suggests that this fold lies at the limit of the method's ability.

As in the simulated multi-domain case, the superfamily rejection criteria based on P, D and Q-values need to be relaxed when the method is applied to the experimental multi-domain case. Here, the noise present in the EM micrographs and introduced by the reconstruction procedure, as well as a potentially lower signal power at frequencies close to 10 Å, are additional sources of error that lower the LCCC values.

Discussion

Multi-domain 3D-EM maps represent a further challenge to superfamily detection. Especially at lower resolutions, domain boundaries cannot be recognized, and the blurring of neighboring domains into the relevant one lowers the sensitivity of superfamily detection. The SPI-EM approach was tested on both simulated and experimental multi-domain maps. Although the rejection criteria need to be greatly relaxed in these cases, our results suggest that superfamily detection may still be possible in such maps. Finally, it is worth noting that the statistical approach presented here is independent of the fitting program used, provided that this program allows fitting of single domains (of any architecture) in multi-domain maps. This implies that SPI-EM may benefit from any advances in the field of domain fitting in low-resolution maps.