Abstract

Motivation: Functionally related genes involved in the same molecular-genetic, biochemical or physiological process are often regulated coordinately. Such regulation is provided by precisely organized binding of a multiplicity of special proteins [transcription factors (TFs)] to their _target sites (cis-elements) in regulatory regions of genes. Cis-element combinations provide a structural basis for the generation of unique patterns of gene expression.

Results: Here we present a new approach for defining promoter models based on the composition of TF binding sites and their pairs. We utilize a multicomponent fitness function for selection of the promoter model that fits best to the observed gene expression profile. We demonstrate examples of successful application of the fitness function with the help of a genetic algorithm for the analysis of functionally related or co-expressed genes as well as testing on simulated and permutated data.

Availability: The CMA program is freely available for non-commercial users. URL Author Webpage. It is also a part of the commercial system ExPlain™ (Author Webpage) designed for causal analysis of gene expression data.

Contact:  [email protected]

1 INTRODUCTION

Massive application of microarray measurements of gene expression is a common route in the current studies of disease mechanisms. Hundreds of genes are revealed whose change of expression is associated with the disease. However, changed expression of these genes often represents just an ‘echo’ of the real molecular processes in the cells. Still, the available means of analysis and interpretation of these mechanisms are very limited.

Regulation of gene expression is accomplished through binding of transcription factors (TFs) to distinct regions of DNA. It is clear by now that combinations of TFs rather than single a TF drive gene transcription and define its specificity. At the level of DNA, the blueprints for assembling such variable TF complexes on promoter regions may be seen as specific combinations of TF binding sites located in close proximity to each other. We call such structures ‘composite regulatory modules’.

Known functional combinations of TF binding sites were used before in a number of approaches of promoter analysis: for the identification of muscle-specific promoters (Frech et al., 1998; Wasserman and Fickett, 1998), promoters of liver-enriched genes (Tronche et al., 1997), of yeast genes (Brazma et al., 1997), of immune-specific genes (Boehlk et al., 2000; Fessele et al., 2001; Kel et al., 1999), promoters of genes regulated during cell cycle (Kel et al., 2001) or antibacterial defense response (Shelest and Wingender, 2005). In the cases when the functional site composition is not known, a number of new predictive approaches identifying composite motifs can be used: BioProspector (Liu et al., 2001), Co-Bind (GuhaThakurta and Stormo, 2001), MITRA (Eskin and Pevzner, 2002), dyad search (van Helden et al., 2000). These programs help to discover new regulatory sites for yet unknown TFs, but an ‘ab initio’ motif finding method is limited in their ability to distinguish signal from noise in the long regulatory regions (10 Kb and more) of genes of human and other higher eukaryotic organisms. Novel methods have been developed that utilize information on known and potential binding sites in promoters (ClusterScan: Kel-Margoulis et al., 2002b; TOUCAN system: Aerts et al., 2003; probabilistic methods: Sinha et al., 2003). But still analysis of long regulatory regions remains a challenge for these approaches.

In this paper we are describing a novel tool, Composite Module Analyst (CMA), which has been created to make the next step in applying combinatorial approach to the analysis of regulatory regions of functionally related or co-expressed genes. This tool applies a novel approach for defining the promoter model based on composition of single TF binding sites as well as their pairs located inside local regulatory domains (corresponding to enchancer/silencer subregions). We utilize a multicomponent fitness function for selection of the promoter model, which fits best to the observed gene expression profile. We demonstrate some successful applications of the fitness function with the help of a genetic algorithm for the analysis of functionally related or co-expressed genes as well as testing of the tool on simulated data.

2 COMPOSITE MODULES

2.1 Databases and tools for identification of TF binding sites

In this study we used three databases: (1) TRANSFAC® is a database on gene regulation (Matys et al., 2006). It provides data on TFs and their binding sites in promoters and enhancers of eukaryotic genes as well as a library of position weight matrices (PWMs). This work has been done with release 8.4. (2) TRANSCompel® (rel.6.4) which contains known composite regulatory elements in mammalian genes (Kel-Margoulis et al., 2002a). (3) TRANSPRO™ is a database on promoters (Chen et al., 2006). Positions of transcription start sites (TSS) in genomes (human, mouse, rat) are collected from EPD, DBTSS and Ensembl, providing the possibility to retrieve about 13 000 human and about 10 000 murine promoters. Potential binding sites for TFs were identified by the TRANSFAC® tool Match™ (Kel et al., 2003) that uses a library of about 500 PWMs for vertebrate TFs (TRANSFAC® release 8.4).

2.2 Definition of composite promoter model

We model the promoter structure by a Boolean function θ from which the promoter score ps can be computed:
(1)
where bi are the (0,1) outputs of m independent composite modules (CMs) cmi (i = 1, 2, … , m).
Each composite module is a triplet (Φ,M,Ψ), where Φ is the set of TFs regulating the promoters; M is a set of PWMs that are used to identify potential binding sites for the TFs from the set Φ; and Ψ is a set of rules and parameters that are applied to the set M to search for the potential DNA binding sites in promoters. In the current realization of the method we consider the following rules and parameters Ψ: First of all, the program searches for potential TF sites in the promoters using Match™ algorithm by applying the matrices M and corresponding cut-off values. Next, in each sliding window of the length w, the program selects the predefined maximal number of the best matrix matches and checks if the found sites obey the predefined distance and orientation rules. A maximal normalized composite score value cmsi is calculated among all window positions using the following equation:
(2)
where qi(k)(X)<qcut-off(k) and q1,2(r)(X)<qcut-off(r); and distance between matches of two matrices of a pair (r) : dmin(r)<d(r)<dmax(r); Max (cms) is the maximal possible value of cms.
  1. w  i, the length of the window in promoter sequences that covers all sites included in the CM;

  2. Ki, the number of individual PWMs in the module,

  3. Ri, the number of pairs of PWMs,

  4. cut-off value qcut-off(k), relative impact values ϕ(k), maximal number of best matches κ(k) are assigned to every individual weight matrix k (k = 1, Ki),

  5. cut-off value qcut-off(r), relative impact values ϕ(r), orientation values o(r) and maximal dmax(r) and minimal dmin(r) distances and maximal number of best pairs of matches κ(r) are assigned to every matrix pair r (r = 1,Ri) in the CM.

A threshold parameter Pi is defined for each composite module. It is used to compute the corresponding Boolean value:
(3)
Finally, the promoter score ps is calculated using the Boolean function (1). If, for a given promoter, ps = 1 we consider that this promoter matches the defined promoter model.
In order to score the values obtained with the Boolean function θ we applied the fuzzy logic approach. For each promoter we define a fuzzy score λ (or fuzzy promoter score fps) which is based on the scores λi = cmsi of each component of the predicate of the function θ. The following calculation rules are applied:
(4)

2.3 Fitness function

Based on the assumption that promoters of co-regulated genes should share a common composition of TF binding sites we propose an approach which allows defining a composite promoter model for sets of co-regulated genes. We have constructed a fitness function that provides a means to direct the search in a wide space of various parameters and to find the optimal promoter model settings. In this approach we take as input a set of promoters of potentially co-regulated genes (set A) and a background set of promoters (set B) and design a fitness function, which is based on several properties of distributions of the θ function and cmsi values in these two promoter sets. As an alternative input we consider a single set of promoters with assigned expression values.

The components of our fitness function are the following:

  • R, regression value;

  • T, Student t-test value;

  • E, specificity and sensitivity value;

  • N, normality index;

  • P, penalty on the complexity of the model.

The fitness function is defined as linear combination of these components with the specified relative weights:
(5)
The weights a, b, c, d, e can be modified by users. We will describe the components in detail.

2.4 Regression value

This component shows how well the fuzzy promoter scores fps(n) of the promoters (n = 1, NP-total number of promoters) fit the expression values ξ(n) assigned to the corresponding genes. The ξ(n) values can be assigned either as 1/0 values to the promoters of two corresponding sets of genes A and B, or as any continuous function of the observed differential expression value (for instance as the log ratio of gene expression provided by Affymetrix software). The linear regression is done by fitting the fps(n) to the curve αξ(n) + β. We compute then the regression value R2 for the promoter model and normalize it to the maximal possible value. It is taken then as the component R in Equation (4) which shows the fit of the considered composite modules in the promoters to the differential expression of the corresponding genes.

2.5 Student's t-test value calculation using fuzzy logic

This component is equal to the value of the two-sided Student's t-test with different variances. Here we consider two alternative promoter sets, A and B, described above or, in the case of the alternative input of a single set of promoters with assigned expression values, the promoter set is divided into two groups A and B: with high and low differential expression values (in this case the value P0 dividing these two groups is specified by the user).

After that, the t-test value T is calculated:
(6)
Here, niλi¯,σi, are the number of promoters in each group; an average value of the fuzzy score λ (fps) in each group of promoters and the standard deviation, respectively. T99.9 is the 99.9% confidence interval values of the Student's t-distribution for the corresponding number of degrees of freedom. If T becomes higher than T99.9 then T = 1.0. The Tcomponent of the fitness function shows the statistical significance of the difference between two distributions of composite module scores.

2.6 Specificity and sensitivity value

This component of the fitness function regulates the mean error rate. It equals to
(7)
where FN is the false-negative rate, i.e. proportion of promoters in group A (with ξ = 1, or ξ ≥ P0 in the alternative input), getting ps = 0 (Equation 1), and FP is the false positive rate, i.e. proportion of promoters in group B (with ξ = 0, or ξ < P0 in the alternative input), getting ps = 1. The constant 0 ≤ γ ≤ 1 is defined by the user and gives relative impact of true positives versus true negatives. It is set to 0.5 by default.

2.7 Normality index of the distribution

This component shows how close the distribution of the fuzzy promoter score is to the normal distribution. Let λ¯,σ be the average fuzzy score and their standard deviation for promoters of the group A; p is the fraction of promoters whose fuzzy score belongs to (;λ¯σ)(λ¯+σ;+). Then the N component of the fitness function is defined as
(8)
The normality index allows downgrading promoter models that are characterized by irregular distribution of the scores among the promoters under study. It prevents from overfitting and guarantees stability of the predictions on the control data.

2.8 Penalty on the complexity of the model

This component prevents unnecessary growth of complexity of the promoter model. In the case when the number of composite modules (M), the number of matrices (Ki) and pairs (Ri) in each composite module are minimal (the minimal number is defined by the user), P becomes equal to 1, otherwise it decreases according to the growth of the size of the model.
(9)
Here, max and min values for the parameters M, K and R are defined by the user of the program based on the expectations of the optimal complexity of the model. The P component of the fitness function is used to avoid adding some unnecessarily complex composite modules and adding new matrices or pairs that actually do not have any effect on the other components of the fitness function, or exert a very minor influence.

2.9 Genetic algorithm implementation

In order to identify a composite promoter model best fitting the given data on gene expression we use the fitness function described above and apply an optimization strategy based on a genetic algorithm. The algorithm takes as an input two sets of promoters (the set of promoters of differentially expressed genes, A group; and a set of promoters of genes whose expression does not differ significantly between experiment and control, B group) or a set of promoters and the relative expression values assigned to the corresponding genes. On the first step, the program Match™ is used to identify all potential TF binding sites in these two sets of promoters. It uses a predefined subset of PWMs from the TRANSFAC® database, usually with the minFN cut-offs (minimizing the false-negative rate; see Kel et al., 2003 for details) that allow to identify most of the sites in the sequences. The genetic algorithm works then further with the binding sites found by Match™ and identifies the promoter model best fitting to the expression data.

Each model is characterized by its own set of parameters described in Subsection 2.2 and is represented in the genetic algorithm by 2 × m different ‘chromosomes’—2 chromosomes per composite module cmi (i = 1, 2, … , m). One chromosome to encode Ki individual matrices is associated with their cut-off values qcut-off(k), relative impact values ϕ(k) and maximal number of best matches κ(k) (k = 1, Ki). And the other chromosome to encode Ri pairs of PWMs is associated with their cut-off values qcut-off(r), relative impact values ϕ(r), (r) orientation values o(r) and maximal dmax(r) and minimal dmin(r) distances, and maximal number of best pairs of matches κ(r) (r = 1, Ri).

Initially, the program CMA generates a set of random models according to the predefined limits of the maximal and minimal number of matrices, pairs, distances between sites, matrix cut-offs and other parameters. These models are considered as a population of ‘organisms’ subjected to mutations, recombination, selection and multiplication. Mutations are done by random exchange of matrices in the model into any other matrices from the library and by random small increase or decrease of the associated parameters such as cut-off values, impact values and others. Recombination is done by random exchange of the parts of chromosomes between different models. During a number of iterations the program tries to find the ‘best’ model that is characterized by the highest value of fitness function. It identifies the set of matrices, optimizes their cut-offs, the relative impact values, maximum number of best matches in promoters and some other optional parameters such as site orientation and distance range and in pairs. The output of the CMA tool is the best discriminative promoter model with the optimized parameters.

More details of the encoding schema of the models in the genetic algorithm and definition of the mutation and recombination operators as well as the selection and multiplication rules and other options of CMA program are described in the documentation of CMA (see e.g. Author Webpage).

3 RESULTS

3.1 Testing on the simulated data

First, we did initial testing of the ability of the program to reveal combinations of single matrices. To test this we performed a series of simulation experiments. We generated sets of 1000 random nucleotide sequences of a length of 1000 bp each, ‘implanted’ a certain number of site combinations in random positions of these sequences based on a predefined composite module and let the program ‘blindly’ reveal the composite module back or at least some parts of it.

First of all, we observed that with increase of the size of the matrix library the accuracy of recognition (the percentage of correctly identified matrices as components of implanted composite modules) falls linearly (data not shown), but this can be generally recovered by increasing the number of iterations of the genetic algorithm. We also observed that the algorithm is able to recapture correctly not only CMs containing unique matrices but also modules comprising matrices that are similar to other matrices for other TFs characterizing by similar DNA binding patterns (e.g. V$GATA3_03 and V$MYOD_01). We found that the optimal parameters of the algorithm are the ratio of the population size to the number of iterations should be between 0.1 and 2; the optimal selection pressure is ∼40% (so the best 40% of the population goes into the next generation) and the optimal mutation level is ∼30% of population and recombination level ∼60%. With these optimal parameters we performed further testing. The results of the testing are shown in Tables 1–3.

Table 1

Revealing combination of three single matrices (V$AP4_01, V$E2F_02, V$GATA1_02) in the window of 200 bp

Site scoresProbability of site insertion
0.90.60.3
High+++++++++
Low+++++−++−
Site scoresProbability of site insertion
0.90.60.3
High+++++++++
Low+++++−++−

We implanted sites into 30% of the sequences (set A); other sequences in the set are taken as background (set B). Scores of the implanted sites are high (optimum to reduce false positives) or low (optimum to reduce false negatives). Sites were implanted probabilistically, so the probability that a site is implanted in a given sequence (P) varies from 0.3 to 0.9. +++ indicates that all implanted matrices were correctly identified back, ++− that only two were identified correctly and one was wrongly identified, in 100 iterations; population size was 200.

Table 1

Revealing combination of three single matrices (V$AP4_01, V$E2F_02, V$GATA1_02) in the window of 200 bp

Site scoresProbability of site insertion
0.90.60.3
High+++++++++
Low+++++−++−
Site scoresProbability of site insertion
0.90.60.3
High+++++++++
Low+++++−++−

We implanted sites into 30% of the sequences (set A); other sequences in the set are taken as background (set B). Scores of the implanted sites are high (optimum to reduce false positives) or low (optimum to reduce false negatives). Sites were implanted probabilistically, so the probability that a site is implanted in a given sequence (P) varies from 0.3 to 0.9. +++ indicates that all implanted matrices were correctly identified back, ++− that only two were identified correctly and one was wrongly identified, in 100 iterations; population size was 200.

The result of this simulation shows that the CMA program is able to determine implanted matrices correctly in most cases, even if just a few sequences of set A contain all of the sites of the module (e.g. for P = 0.3, only about 8 sequences out of 300 contain all 3 sites).

Microarray data usually contain a high level of noise and reveal no clear differentiation between ‘changed’ and ‘unchanged’ genes. The expression values vary considerably. In order to test the ability of our method to deal with such kind of noisy data we generated test data with various ratios of differential expression values (e) and the random variance of the expression value Δe (as a measure of noise). The result of this simulation (Table 2) shows that the CMA was able to reveal correct CM in cases of Δe < e and Δee, although the fitness is decreasing in the second case. In case of Δe > e the method was able to reveal correctly only two matrices used for the sites implantation.

Table 2

Testing the CMA program to analyze noisy expression data

Expression parametersFitness (Z)Random fitnessCM
Δe < e (x = 5)0.60.0045 ± 0.0035+++
Δee (x = 10)0.40.0029 ± 0.036+++
Δe > e (x = 15)0.30.0032 ± 0.0028++−
Expression parametersFitness (Z)Random fitnessCM
Δe < e (x = 5)0.60.0045 ± 0.0035+++
Δee (x = 10)0.40.0029 ± 0.036+++
Δe > e (x = 15)0.30.0032 ± 0.0028++−

‘Changed genes’ comprised 30% of the random sequences. Sites were implanted in these sequences only (P = 0.9). Matrices were the same as in the Table 1. High cut-offs were used for implantation. We assigned to each ‘changed gene’ an expression value randomly generated from the interval [10, 10 + x]. Other ‘non-changed’ genes got the lower expression value from the interval [0,x]. By varying x = 5, 10, 15 we simulated 3 variants of the data with gradually increased noise. ‘Random fitness’—values of fitness function obtained in the shuffling experiments.

Table 2

Testing the CMA program to analyze noisy expression data

Expression parametersFitness (Z)Random fitnessCM
Δe < e (x = 5)0.60.0045 ± 0.0035+++
Δee (x = 10)0.40.0029 ± 0.036+++
Δe > e (x = 15)0.30.0032 ± 0.0028++−
Expression parametersFitness (Z)Random fitnessCM
Δe < e (x = 5)0.60.0045 ± 0.0035+++
Δee (x = 10)0.40.0029 ± 0.036+++
Δe > e (x = 15)0.30.0032 ± 0.0028++−

‘Changed genes’ comprised 30% of the random sequences. Sites were implanted in these sequences only (P = 0.9). Matrices were the same as in the Table 1. High cut-offs were used for implantation. We assigned to each ‘changed gene’ an expression value randomly generated from the interval [10, 10 + x]. Other ‘non-changed’ genes got the lower expression value from the interval [0,x]. By varying x = 5, 10, 15 we simulated 3 variants of the data with gradually increased noise. ‘Random fitness’—values of fitness function obtained in the shuffling experiments.

In order to estimate the significance of the fitness values obtained in the analysis we have performed multiple shuffling experiments. In each such experiment we took all the expression values in the set and reassigned them to randomly chosen sequences. After that, we applied CMA to these shuffled samples and computed the average and standard deviation of the observed fitness values (Table 2, ‘random fitness’).

Next, we tested the functionality of the program on revealing pairs of matrices that reflect composite elements composed of two closely situated sites. One pair of sites was implanted into the 30% of random sequences (pair: V$AP4_01/V$MEF2_01; distance vary: 5–25; cut-offs ‘high’). After that, we tried to reveal this CM back by varying the parameters of the search (Table 3).

Table 3

Testing the CMA program to reveal implanted pairs of matrices

Parameters of the search of the CM structureProbability of site insertion
0.90.60.3
1 pair(++)(++)(++)
1–3 pair(++)(++)(++)
0–1 pair, 0–2 single matrices(++)++++
Parameters of the search of the CM structureProbability of site insertion
0.90.60.3
1 pair(++)(++)(++)
1–3 pair(++)(++)(++)
0–1 pair, 0–2 single matrices(++)++++

(++) indicates that the correct pair was found, ++ that two single matrices were revealed and they are correct components of the pair.

Table 3

Testing the CMA program to reveal implanted pairs of matrices

Parameters of the search of the CM structureProbability of site insertion
0.90.60.3
1 pair(++)(++)(++)
1–3 pair(++)(++)(++)
0–1 pair, 0–2 single matrices(++)++++
Parameters of the search of the CM structureProbability of site insertion
0.90.60.3
1 pair(++)(++)(++)
1–3 pair(++)(++)(++)
0–1 pair, 0–2 single matrices(++)++++

(++) indicates that the correct pair was found, ++ that two single matrices were revealed and they are correct components of the pair.

When the frequency of the pairs in the sequences is high (P = 0.9), the program is able to reveal the correct matrix pair under a wide range of search parameters. If the frequency is low, it becomes more difficult to predict the correct structure of the CM without sufficient knowledge on the expected structure which can help to set optimal parameters of the search. Anyway, in most of the cases the program is able to predict correctly the components of the CM even if its fine structure is not correctly predicted.

3.2 Analysis of promoters of co-regulated genes with the help of CMA

3.2.1 Test 1: T-cell specific genes

In order to check the ability of CMA to reveal known composite promoter modules we analyzed a set of promoters of T-cell specific genes that have been shown to be regulated by a very specific type of composite elements: NF-AT/AP-1. The set includes genes for several interleukins/(IL-1,4,5,8); their receptors, signaling molecules such as TNF-alpha, IFN and others. It is based on the collection of experimentally proven NF-AT/AP-1 composite elements in TRANSCompel database (Kel-Margoulis et al., 2002a). In our earlier work (Kel et al., 1999) we showed that the promoters of these genes contain many copies of the NF-AT/AP-1 composite elements. Here, we would like to check if CMA would be able to reveal these composite elements without having explicit knowledge about their composition. Results of this test are given in Figure 1.

Example of CMA output of the composite promoter model revealed in T-cell specific promoters by optimization of the fitness function with the genetic algorithm. Set A contains 26 promoters of an average length 1000 bp; background set B contains 250 randomly generated sequences with the same nucleotide composition as in the set A. The promoter model found by the program consists of a Boolean function (Predicate) of two small CMs (K1, K2) connected by the logical OR. Each CM is represented by a pair of matrices found in a window of 50 bp. NF-AT/AP-1 pair is correctly revealed by the program.
Fig. 1

Example of CMA output of the composite promoter model revealed in T-cell specific promoters by optimization of the fitness function with the genetic algorithm. Set A contains 26 promoters of an average length 1000 bp; background set B contains 250 randomly generated sequences with the same nucleotide composition as in the set A. The promoter model found by the program consists of a Boolean function (Predicate) of two small CMs (K1, K2) connected by the logical OR. Each CM is represented by a pair of matrices found in a window of 50 bp. NF-AT/AP-1 pair is correctly revealed by the program.

The (V$AP1_Q4@0.872000<*V$NFAT_Q6@0.791000,3,1..14) line in the output describes a pair of matrices found by the system with all the parameters defined by the genetic algorithm. The values after the sign ‘@’ give the cut-offs for the corresponding matrices; ‘<*’ means that the orientation of the matches of the AP-1 matrix should have a definite orientation relative to the matches of the second matrix, whereas the orientation of the NF-AT is not fixed; values ‘3,1..14’ mean that the distance between matches of two matrices should be in the range from 1 to 14 bp and the maximal number of considered pairs in one window is limited to three pairs.

All these values found by the genetic algorithm correspond well to the known nature of NF-AT/AP-1 composite elements in T-cell specific promoters (Kel et al., 1999). The orientation of the matches was found correctly as well, since the crystal structure of the ternary complex of NF-ATp/AP-1-DNA reveals only one valid orientation of NF-ATp factor in this complex (Chen et al., 1998).

Another pair of matrices (V$PU1_Q6V$AP1_Q4) that was found by the program, in fact, corresponds to known composite elements PU.1/AP-1 described in promoters of several genes regulating their expression in immune cells—macrophages (see TRANSCompel acc: C00251 in the promoter of mouse macrosialin gene). Interesting to note that consensus of NF-AT binding sites (AGGAAA) is similar to that of PU.1 binding sites (AGGAAC).

In Figure 2 we show the histogram of the fps score of the revealed composite module in promoters of T-cell specific genes in comparison with the randomly generated sequences. We see clear differentiation of these two sets of sequences. Although not in all promoters we can find sites for both matrix pairs, we still are able to identify the main components of the specific gene regulatory modules in the considered immune cells.

Histograms of the fuzzy scores of the composite promoter model (K1∣K2) computed for the promoters of T-cell specific genes. The majority of the T-cell promoters are characterized by the high values of the fuzzy score with the average value 0.402 (see the fitting Normal distribution).
Fig. 2

Histograms of the fuzzy scores of the composite promoter model (K1∣K2) computed for the promoters of T-cell specific genes. The majority of the T-cell promoters are characterized by the high values of the fuzzy score with the average value 0.402 (see the fitting Normal distribution).

3.2.2 Test 2: 11 Kb upstream regions: T-cell specific genes versus housekeeping genes

In order to demonstrate the ability of the CMA algorithm to analyze long regulatory regions we compiled 5′ regulatory regions of the T-cell specific genes used in the test above. We extracted 11 Kb regions around TSS (10 Kb upstream and 1 Kb downstream). To test the algorithm on the real genomic data we have prepared a background set of 11 Kb upstream regions of 100 housekeeping genes from the list derived from an analysis of public gene expression data (Eisenberg and Levanon, 2003).

By running CMA we again revealed correctly the NFAT/AP-1 composite elements [with parameters of two modules slightly different from the test above: ([email protected]<<[email protected],1,1..16) ([email protected]**[email protected],2,1..16)]. In Figure 3 we show two histograms of the fuzzy promoter scores of the upstream regions of T-cell specific and housekeeping genes, respectively. The two distributions differ significantly, giving rise to the high fitness value = 0.738.

Two histograms of the fuzzy promoter scores of the composite promoter model computed for the 11 Kb upstream regions of T-cell specific genes versus upstream regions of 100 housekeeping genes.
Fig. 3

Two histograms of the fuzzy promoter scores of the composite promoter model computed for the 11 Kb upstream regions of T-cell specific genes versus upstream regions of 100 housekeeping genes.

3.2.3 Permutation test

Since the method introduced here relies on the optimization of parameters of the model with the genetic algorithm there is a risk of overfitting. To evaluate the potential overfitting effect we performed a permutation test. We did 100 random shuffling of the grouping labels of the data used in the Test 2 (11 Kb upstream regions: 20 T-cell specific genes versus 100 housekeeping genes). Running the CMA algorithm 100 times with the randomized data resulted in the distribution of the obtained values of the fitness function that is shown in Figure 4.

Distribution of the observed values of fitness function in 100 runs of the CMA algorithm with the randomized data from Test 2. The value of fitness function obtained in the real data (0.738) is indicated for comparison.
Fig. 4

Distribution of the observed values of fitness function in 100 runs of the CMA algorithm with the randomized data from Test 2. The value of fitness function obtained in the real data (0.738) is indicated for comparison.

As one can see from the distribution, the value of fitness function obtained in the analysis of T-cell specific promoters (0.738; see above) is much higher than expected by random chance. This provides a good evidence against the risk of overfitting and in favor of the realness of the promoter models revealed by the analysis.

3.2.4 Test 3: yeast cell cycle

In another example we performed an analysis of gene expression data on yeast cell cycle taken from Spellman et al. (1998). We selected five sets of genes according to their expression in different cell cycle phases: genes specific for G1, S phases and S/G2, M/G1, G2/M transitional states. We retrieved promoters of the length 1100 bp (−1000, +100) for these genes and applied CMA program to find cell cycle phase-specific composite promoter models.

To estimate whether the results are consistent with the known facts we compared them with the experimental data on ChIP (chromatin immunoprecipitation) analysis and microarray gene expression data summarized in a recent publication (Kato et al., 2004). The parameters of the search were set to reveal a model, which consists of a single CM containing sites for 3–4 single matrices co-localized in a window of 200 bp. The total number of yeast matrices considered was 31 including all yeast matrices from TRANSFAC and several matrices derived from the DNA binding consensi for the corresponding factors published in Kato et al. (2004). In Table 4 we present the list of TFs whose weight matrices have been revealed by CMA as components of the best promoter models and compared these factors with the factors associated with the corresponding cell cycle phase in the mentioned paper (Kato et al., 2004).

Table 4

Comparison of TFs predicted by CMA and experimentally known TFs regulating different yeast cell cycle phases

Table 4

Comparison of TFs predicted by CMA and experimentally known TFs regulating different yeast cell cycle phases

Factors found by CMA for specific cell cycle phases are marked with ‘+’. Gray color of the table cells indicates this factor was shown experimentally to be involved in regulating genes in the corresponding cell cycle phases. Factors FKH1 and FKH2 are very similar to each other in their DNA pattern. That is why the results for these two factors are summarized together. One can see a very good agreement of the predictions made by CMA with the experimental knowledge.

4 CONCLUSION

In this paper we describe a novel method for the analysis and interpretation of gene regulatory regions. The method identifies composite modules—stable combinations of TF binding sites that are shared by the most of the co-regulated promoters. It is generally accepted that such modules are responsible for a function-specific regulation of transcription.

In comparison to most of the previously published approaches that consider combinatorics of TF binding sites, the method described here has several advantages, such as (1) capability to work with data of microarray experiments; (2) optimization not only of the matrix sets, but also of cut-off values for each matrix; (3) analysis of large regulatory regions; (4) search for pairs of matrices, selecting best distance and orientation.

Testing our method on simulated and real data has shown: (1) it is able to correctly reveal CMs that are overrepresented in the set of sequences; (2) it can be used to analyze data and propose factor combinations that are playing key roles in transcriptional regulation in the given biological context. In our previous work we have demonstrated that the combinatorial approach described here allows increasing significantly the precision of the computational prediction of _target genes for TFs (Kel et al., 2004a; Kel et al., 2004b). Application of this approach to the analysis of microarray gene expression data is very promising. The Composite Module Analyst is implemented now as a part of the commercial software system ExPlain™ that provides a wide range of tools and databases for causal interpretation of gene expression data.

The authors would like to thank all co-workers of the BIOBASE GmbH who have been contributing to the integration of CMA into the ExPlain™ system. Parts of the work were funded by grant of the German Ministry of Education and Research (BMBF) together with BioRegioN GmbH ‘BioProfil’, Grant No. 0313092; ‘Intergenomics’ Project, Grant No. 031U110C/031U210C; by European Commission under FP6-‘Life sciences, genomics and biotechnology for health’ contract LSHG-CT-2004-503568 ‘COMBIO’; by European Commission under ‘Marie Curie research training networks’ contract MRTN-CT-2004-512285 ‘TRANSISTOR’ and by INTAS Grant No. 03-51-5218. Funding to pay the Open Access publication charges for this article was provided by BIOBASE GmbH.

Conflict of Interest: none declared.

REFERENCES

Aerts
S.
et al. 
Computational detection of cis -regulatory modules
Bioinformatics
2003
Suppl. 2
(pg. 
II5
-
II14
)
Boehlk
S.
et al. 
ATF and Jun transcription factors, acting through an Ets/CRE promoter module, mediate lipopolysaccharide inducibility of the chemokine RANTES in monocytic Mono Mac 6 cells
Eur. J. Immunol.
2000
, vol. 
30
 (pg. 
1102
-
1112
)
Brazma
A.
Vilo
J.
Ukkonen
E.
Frishman
D.
Mewes
H.W.
Finding transcription factor binding site combinations in yeast genome
1997
Proceedings of the German Conference on Bioinformatics GCB ‘97
Martinsried, Germany
(pg. 
57
-
59
)
Chen
L.
et al. 
Structure of the DNA-binding domains from NFAT, Fos and Jun bound specifically to DNA
Nature.
1998
, vol. 
392
 (pg. 
42
-
48
)
Chen
X.
et al. 
TiProD: The Tissue-specific Promoter Database
Nucleic Acids Res.
2006
, vol. 
34
 (pg. 
D104
-
D107
)
Eisenberg
E.
Levanon
E.Y.
Human housekeeping genes are compact
Trends Genet
2003
, vol. 
19
 (pg. 
362
-
365
)
Eskin
E.
Pevzner
P.A.
Finding composite regulatory patterns in DNA sequences
Bioinformatics
2002
, vol. 
18
 
Suppl. 1
(pg. 
S354
-
S363
)
Fessele
S.
et al. 
Molecular and in silico characterization of a promoter module and C/EBP element that mediate LPS-induced RANTES/CCL5 expression in monocytic cells
FASEB J.
2001
, vol. 
15
 (pg. 
577
-
579
)
Frech
K.
et al. 
Muscle actin genes: a first step towards computational classification of tissue specific promoters
In Silico Biol.
1998
, vol. 
1
 (pg. 
29
-
38
)
Guha
T.D.
Stormo
G.D.
Identifying _target sites for cooperatively binding factors
Bioinformatics
2001
, vol. 
17
 (pg. 
608
-
621
)
Kato
M.
et al. 
Identifying combinatorial regulation of transcription factors and binding motifs
Genome Biology
2004
, vol. 
5
 pg. 
R56
 
Kel
A.E.
et al. 
MATCH: A tool for searching transcription factor binding sites in DNA sequences
Nucleic Acids Res.
2003
, vol. 
31
 (pg. 
3576
-
3579
)
Kel
A.
et al. 
Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells
J. Mol. Biol.
1999
, vol. 
288
 (pg. 
353
-
376
)
Kel
A.E.
et al. 
Computer-assisted identification of cell cycle-related genes: new _targets for E2F transcription factors
J. Mol. Biol.
2001
, vol. 
309
 (pg. 
99
-
120
)
Kel
A.
et al. 
A novel computational approach for the prediction of networked transcription factors of aryl hydrocarbon-receptor-regulated genes
Mol. Pharmacol.
2004
, vol. 
66
 (pg. 
1557
-
1572
)
Kel
A.E.
Voss
N.
Konovalova
T.
Tchekmenev
D.
Wabnitz
P.
Kel-Margoulis
O.V.
Wingender
E.
Giegerich
R.
Stoye
J.
From composite patters to pathways—prediction of key regulators of gene expression
2004
Proceedings of the German Conference on Bioinformatics (GCB 2004)
Bielefeld, Germany
(pg. 
189
-
198
)
Kel-Margoulis
O.
et al. 
TRANSCompel: a database on composite regulatory elements in eukaryotic genes
Nucleic Acids Res.
2002
, vol. 
30
 (pg. 
332
-
334
)
Kel-Margoulis
O.V.
et al. 
Automatic annotation of genomic regulatory sequences by searching for composite clusters
Pac. Symp. Biocomput.
2002
, vol. 
7
 (pg. 
187
-
198
)
Liu
X.
et al. 
BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes
Pac. Symp. Biocomput.
2001
, vol. 
6
 (pg. 
127
-
138
)
Matys
V.
et al. 
TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes
Nucleic Acids Res.
2006
, vol. 
34
 (pg. 
D108
-
D110
)
Shelest
E.
Wingender
E.
Construction of predictive promoter models on the example of antibacterial response of human epithelial cells
Theor. Biol. Med. Model.
2005
, vol. 
2
 pg. 
2
 
Sinha
S.
et al. 
A probabilistic method to detect regulatory modules
Bioinformatics
2003
, vol. 
19
 
Suppl 1
(pg. 
i292
-
i301
)
Spellman
P.T.
et al. 
Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization
Mol Biol Cell
1998
, vol. 
9
 (pg. 
3273
-
3297
)
Tronche
F.
et al. 
Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome
J. Mol. Biol.
1997
, vol. 
266
 (pg. 
231
-
245
)
van Helden
J.
et al. 
Discovering regulatory elements in non-coding sequences by analysis of spaced dyads
Nucleic Acids Res.
2000
, vol. 
28
 (pg. 
1808
-
1818
)
Wasserman
W.W.
Fickett
J.W.
Identification of regulatory regions which confer muscle-specific gene expression
J. Mol. Biol.
1998
, vol. 
278
 (pg. 
167
-
181
)

Author notes

Associate Editor: Thomas Lengauer

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]