Composite Module Analyst: a fitness-based tool for identification of transcription factor binding site combinations

Table 1

Revealing a combination of three single matrices (V$AP4_01, V$E2F_02, V$GATA1_02) in a window of 200 bp

Site scores	Probability of site insertion (P)
	0.9	0.6	0.3
High	+++	+++	+++
Low	+++	++−	++−

Sites were implanted into 30% of the sequences (set A); the remaining sequences served as background (set B). Scores of the implanted sites were either high (cut-offs optimized to reduce false positives) or low (cut-offs optimized to reduce false negatives). Sites were implanted probabilistically, with the probability P that a site is implanted in a given sequence varying from 0.3 to 0.9. +++ indicates that all three implanted matrices were correctly recovered; ++− indicates that two were recovered correctly and one was misidentified. Each run used 100 iterations and a population size of 200.

The result of this simulation shows that the CMA program is able to determine implanted matrices correctly in most cases, even if just a few sequences of set A contain all of the sites of the module (e.g. for P = 0.3, only about 8 sequences out of 300 contain all 3 sites).
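The figure of about 8 sequences out of 300 can be checked with a short calculation. The sketch below assumes that the three sites are implanted independently, each with probability P, so that a sequence carries the complete module with probability P³; the function name is hypothetical and is not part of CMA.

```python
def expected_full_modules(n_sequences: int, p_insert: float, n_sites: int = 3) -> float:
    """Expected number of sequences carrying all n_sites implanted sites.

    Assumption: each site is implanted independently with probability p_insert.
    """
    return n_sequences * p_insert ** n_sites


# Set A in the simulation contains 300 sequences.
for p in (0.9, 0.6, 0.3):
    print(f"P = {p}: about {expected_full_modules(300, p):.1f} of 300 sequences")
```

Under this independence assumption, P = 0.3 gives 300 × 0.027 ≈ 8 complete modules, which matches the number quoted in the text.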

Microarray data usually contain a high level of noise and show no clear separation between ‘changed’ and ‘unchanged’ genes; the expression values vary considerably. To test the ability of our method to deal with such noisy data, we generated test data with various ratios of the differential expression value (e) to the random variance of the expression value Δe (as a measure of noise). The result of this simulation (Table 2) shows that CMA was able to reveal the correct CM in the cases Δe < e and Δe ∼ e, although the fitness decreases in the second case. In the case Δe > e the method correctly revealed only two of the matrices used for site implantation.

Table 2

Testing the CMA program to analyze noisy expression data

Expression parameters	Fitness (Z)	Random fitness	CM
Δe < e (x = 5)	0.6	0.0045 ± 0.0035	+++
Δe ∼ e (x = 10)	0.4	0.0029 ± 0.036	+++
Δe > e (x = 15)	0.3	0.0032 ± 0.0028	++−

‘Changed genes’ comprised 30% of the random sequences. Sites were implanted in these sequences only (P = 0.9). The matrices were the same as in Table 1, and high cut-offs were used for implantation. Each ‘changed gene’ was assigned an expression value drawn at random from the interval [10, 10 + x]; the remaining ‘non-changed’ genes received a value from the lower interval [0, x]. By varying x = 5, 10, 15 we simulated three variants of the data with gradually increasing noise. ‘Random fitness’: values of the fitness function obtained in the shuffling experiments.
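The way the noise parameter x controls the overlap between the two groups can be illustrated with a minimal sketch. It assumes uniform sampling within each interval (the text only says ‘randomly generated’); the function name, the seed and the gene count are hypothetical.

```python
import random


def simulate_expression(n_genes: int, x: float, changed_fraction: float = 0.3, seed: int = 0):
    """Return a list of (is_changed, expression_value) tuples.

    Assumption: values are drawn uniformly; 'changed' genes from [10, 10 + x],
    all other genes from [0, x].
    """
    rng = random.Random(seed)
    n_changed = round(n_genes * changed_fraction)
    data = []
    for i in range(n_genes):
        changed = i < n_changed
        low = 10.0 if changed else 0.0
        data.append((changed, rng.uniform(low, low + x)))
    return data


for x in (5, 10, 15):
    data = simulate_expression(1000, x)
    lowest_changed = min(v for c, v in data if c)
    highest_unchanged = max(v for c, v in data if not c)
    # The two groups overlap as soon as an unchanged gene exceeds a changed one.
    print(x, "groups overlap:", highest_unchanged > lowest_changed)
```

For x = 5 the intervals [10, 15] and [0, 5] are well separated, for x = 10 they touch at 10, and for x = 15 they overlap on [10, 15], which corresponds to the three noise regimes in Table 2.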

To estimate the significance of the fitness values obtained in the analysis we performed multiple shuffling experiments. In each experiment we took all the expression values in the set and reassigned them to randomly chosen sequences. We then applied CMA to these shuffled samples and computed the average and standard deviation of the observed fitness values (Table 2, ‘random fitness’).
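The shuffling procedure can be sketched as follows. The CMA fitness function is not reproduced here; as a stand-in, `toy_fitness` is a hypothetical score (the absolute Pearson correlation between a fixed per-sequence module score and the expression values), and all data values are invented for illustration only.

```python
import random
import statistics


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equally long lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def random_fitness(expression, fitness_fn, n_shuffles=100, seed=0):
    """Reassign expression values to random sequences and score each shuffle.

    Returns the mean and standard deviation of the shuffled fitness values.
    """
    rng = random.Random(seed)
    values = list(expression)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(values)
        scores.append(fitness_fn(values))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical data: 30 sequences with a module, 70 without.
module_scores = [1.0] * 30 + [0.0] * 70
expression = [12.0 + 0.1 * i for i in range(30)] + [0.05 * i for i in range(70)]


def toy_fitness(expr):
    # Stand-in for the CMA fitness; NOT the fitness function of the paper.
    return abs(pearson(module_scores, expr))


observed = toy_fitness(expression)
mean_random, sd_random = random_fitness(expression, toy_fitness)
print(f"observed = {observed:.3f}, random = {mean_random:.4f} ± {sd_random:.4f}")
```

Reporting the shuffled values as mean ± standard deviation mirrors the ‘random fitness’ column of Table 2.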

Next, we tested the ability of the program to reveal pairs of matrices that reflect composite elements composed of two closely situated sites. One pair of sites was implanted into 30% of the random sequences (pair: V$AP4_01/V$MEF2_01; distance varying from 5 to 25; ‘high’ cut-offs). We then tried to recover this CM by varying the parameters of the search (Table 3).
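A minimal sketch of such an implantation step is given below. It makes several assumptions that are not taken from the text: sites are fixed illustrative strings rather than sequences sampled from the TRANSFAC matrices, ‘distance’ is interpreted as the gap between the end of the first site and the start of the second, and the function and variable names are hypothetical.

```python
import random


def implant_pair(seq, site_a, site_b, min_gap=5, max_gap=25, rng=None):
    """Overwrite part of seq with site_a, a random gap, and site_b.

    Returns the modified sequence and the start positions of both sites.
    """
    rng = rng or random.Random()
    gap = rng.randint(min_gap, max_gap)          # inclusive on both ends
    block_len = len(site_a) + gap + len(site_b)
    if block_len > len(seq):
        raise ValueError("sequence too short for the requested pair")
    a_start = rng.randrange(len(seq) - block_len + 1)
    b_start = a_start + len(site_a) + gap
    chars = list(seq)
    chars[a_start:a_start + len(site_a)] = site_a
    chars[b_start:b_start + len(site_b)] = site_b
    return "".join(chars), a_start, b_start


rng = random.Random(42)
background = "".join(rng.choice("ACGT") for _ in range(200))
# Illustrative site strings only; not derived from V$AP4_01 or V$MEF2_01.
new_seq, a_pos, b_pos = implant_pair(background, "CAGCTG", "CTAAAAATAG", rng=rng)
print(a_pos, b_pos, b_pos - (a_pos + 6))
```

Overwriting rather than inserting keeps the sequence length constant, so implanted and background sequences remain directly comparable.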

Table 3

Testing the CMA program to reveal implanted pairs of matrices

Parameters of the search of the CM structure	Probability of site insertion
	0.9	0.6	0.3
1 pair	(++)	(++)	(++)
1–3 pairs	(++)	(++)	(++)
0–1 pair, 0–2 single matrices	(++)	++	++

(++) indicates that the correct pair was found, ++ that two single matrices were revealed and they are correct components of the pair.

When the frequency of the pairs in the sequences is high (P = 0.9), the program is able to reveal the correct matrix pair under a wide range of search parameters. If the frequency is low, it becomes more difficult to predict the correct structure of the CM without prior knowledge of the expected structure, which would help to set optimal search parameters. Nevertheless, in most cases the program correctly predicts the components of the CM even if its fine structure is not correctly predicted.

3.2 Analysis of promoters of co-regulated genes with the help of CMA

3.2.1 Test 1: T-cell specific genes

To check the ability of CMA to reveal known composite promoter modules we analyzed a set of promoters of T-cell specific genes that have been shown to be regulated by a very specific type of composite element: NF-AT/AP-1. The set includes genes for several interleukins (IL-1, -4, -5, -8), their receptors, and signaling molecules such as TNF-alpha, IFN and others. It is based on the collection of experimentally proven NF-AT/AP-1 composite elements in the TRANSCompel database (Kel-Margoulis et al., 2002a). In our earlier work (Kel et al., 1999) we showed that the promoters of these genes contain many copies of the NF-AT/AP-1 composite elements. Here, we check whether CMA is able to reveal these composite elements without explicit knowledge of their composition. Results of this test are given in Figure 1.

Fig. 1

Example of CMA output of the composite promoter model revealed in T-cell specific promoters by optimization of the fitness function with the genetic algorithm. Set A contains 26 promoters of an average length 1000 bp; background set B contains 250 randomly generated sequences with the same nucleotide composition as in the set A. The promoter model found by the program consists of a Boolean function (Predicate) of two small CMs (K1, K2) connected by the logical OR. Each CM is represented by a pair of matrices found in a window of 50 bp. NF-AT/AP-1 pair is correctly revealed by the program.

The line (V$AP1_Q4@0.872000<*V$NFAT_Q6@0.791000,3,1..14) in the output describes a pair of matrices found by the system, with all parameters set by the genetic algorithm. The values after the ‘@’ sign give the cut-offs for the corresponding matrices. The symbol ‘<*’ means that matches of the AP-1 matrix must have a definite orientation relative to matches of the second matrix, whereas the orientation of the NF-AT matches is not fixed. The values ‘3,1..14’ mean that at most three pairs are considered in one window and that the distance between matches of the two matrices must be in the range from 1 to 14 bp.
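The structure of such an output line can be made explicit with a small parser. This is a sketch based only on the example line quoted above; the full CMA output grammar is not specified here, so the regular expression, the `PairRule` class and the set of allowed orientation symbols (‘<’, ‘>’ and ‘*’, where ‘>’ is an assumption) are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class PairRule:
    matrix1: str
    cutoff1: float
    orientation: str   # two-character code such as '<*'
    matrix2: str
    cutoff2: float
    max_pairs: int     # maximal number of pairs considered per window
    min_dist: int      # minimal distance between the two matches (bp)
    max_dist: int      # maximal distance between the two matches (bp)


PAIR_RE = re.compile(
    r"\(\s*(?P<m1>[^@\s]+)@(?P<c1>\d*\.?\d+)"
    r"(?P<orient>[<>*]{2})"
    r"(?P<m2>[^@\s]+)@(?P<c2>\d*\.?\d+)"
    r",(?P<max_pairs>\d+),(?P<dmin>\d+)\.\.(?P<dmax>\d+)\s*\)"
)


def parse_pair(line: str) -> PairRule:
    """Parse one '(M1@c1<op>M2@c2,n,dmin..dmax)' description."""
    match = PAIR_RE.search(line)
    if match is None:
        raise ValueError(f"not a recognised pair description: {line!r}")
    return PairRule(
        matrix1=match["m1"],
        cutoff1=float(match["c1"]),
        orientation=match["orient"],
        matrix2=match["m2"],
        cutoff2=float(match["c2"]),
        max_pairs=int(match["max_pairs"]),
        min_dist=int(match["dmin"]),
        max_dist=int(match["dmax"]),
    )


rule = parse_pair("(V$AP1_Q4@0.872000<*V$NFAT_Q6@0.791000,3,1..14)")
print(rule)
```

Keeping the orientation code as an opaque two-character string avoids guessing at the meaning of symbols that the text does not define.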

All these values found by the genetic algorithm correspond well to the known nature of NF-AT/AP-1 composite elements in T-cell specific promoters (Kel et al., 1999). The orientation of the matches was found correctly as well, since the crystal structure of the ternary complex of NF-ATp/AP-1-DNA reveals only one valid orientation of NF-ATp factor in this complex (Chen et al., 1998).

Another pair of matrices found by the program (V$PU1_Q6 − V$AP1_Q4) corresponds to the known PU.1/AP-1 composite elements described in the promoters of several genes, where they regulate expression in immune cells such as macrophages (see TRANSCompel acc: C00251 in the promoter of the mouse macrosialin gene). It is interesting to note that the consensus of NF-AT binding sites (AGGAAA) is similar to that of PU.1 binding sites (AGGAAC).

In Figure 2 we show the histogram of the fps score of the revealed composite module in promoters of T-cell specific genes in comparison with the randomly generated sequences. The two sets of sequences are clearly separated. Although sites for both matrix pairs cannot be found in every promoter, we are still able to identify the main components of the specific gene regulatory modules in the immune cells considered.

Fig. 2

Histograms of the fuzzy scores of the composite promoter model (K1∣K2) computed for the promoters of T-cell specific genes. The majority of the T-cell promoters are characterized by the high values of the fuzzy score with the average value 0.402 (see the fitting Normal distribution).

3.2.2 Test 2: 11 Kb upstream regions: T-cell specific genes versus housekeeping genes

In order to demonstrate the ability of the CMA algorithm to analyze long regulatory regions we compiled 5′ regulatory regions of the T-cell specific genes used in the test above. We extracted 11 Kb regions around TSS (10 Kb upstream and 1 Kb downstream). To test the algorithm on the real genomic data we have prepared a background set of 11 Kb upstream regions of 100 housekeeping genes from the list derived from an analysis of public gene expression data (Eisenberg and Levanon, 2003).

By running CMA we again revealed correctly the NFAT/AP-1 composite elements [with parameters of two modules slightly different from the test above: ([email protected]<<[email protected],1,1..16) ∣ ([email protected]**[email protected],2,1..16)]. In Figure 3 we show two histograms of the fuzzy promoter scores of the upstream regions of T-cell specific and housekeeping genes, respectively. The two distributions differ significantly, giving rise to the high fitness value = 0.738.

Fig. 3

Two histograms of the fuzzy promoter scores of the composite promoter model computed for the 11 Kb upstream regions of T-cell specific genes versus upstream regions of 100 housekeeping genes.

3.2.3 Permutation test

Because the method introduced here relies on optimizing the model parameters with a genetic algorithm, there is a risk of overfitting. To evaluate the potential overfitting effect we performed a permutation test: we randomly shuffled the grouping labels of the data used in Test 2 (11 Kb upstream regions: 20 T-cell specific genes versus 100 housekeeping genes) 100 times. Running the CMA algorithm on each of the 100 randomized data sets yielded the distribution of fitness values shown in Figure 4.

Fig. 4

Distribution of the observed values of fitness function in 100 runs of the CMA algorithm with the randomized data from Test 2. The value of fitness function obtained in the real data (0.738) is indicated for comparison.

As the distribution shows, the fitness value obtained in the analysis of T-cell specific promoters (0.738; see above) is much higher than expected by chance. This provides good evidence against overfitting and supports the validity of the promoter models revealed by the analysis.
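One common way to turn such a comparison into a number is an empirical p-value. The text itself only compares the observed value with the distribution visually, so the add-one estimator below is an assumption rather than the authors' procedure, and the null fitness values are seeded placeholders, not the values behind Figure 4.

```python
import random


def empirical_p_value(observed: float, null_values) -> float:
    """Add-one empirical p-value: share of null values >= observed.

    The +1 in numerator and denominator prevents reporting p = 0 from a
    finite number of permutations.
    """
    null_values = list(null_values)
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Placeholder null distribution for 100 label permutations (invented values).
rng = random.Random(1)
null_fitness = [rng.uniform(0.05, 0.45) for _ in range(100)]

p = empirical_p_value(0.738, null_fitness)  # 0.738 is the fitness reported for Test 2
print(f"empirical p-value: {p:.4f}")
```

With 100 permutations the smallest attainable value of this estimator is 1/101, so a larger number of permutations would be needed to resolve smaller p-values.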

3.2.4 Test 3: yeast cell cycle

In another example we analyzed gene expression data on the yeast cell cycle taken from Spellman et al. (1998). We selected five sets of genes according to their expression in different cell cycle phases: genes specific for the G₁ and S phases and for the S/G₂, M/G₁ and G₂/M transitional states. We retrieved promoters of length 1100 bp (−1000, +100) for these genes and applied the CMA program to find cell cycle phase-specific composite promoter models.

To assess whether the results are consistent with known facts we compared them with experimental data from ChIP (chromatin immunoprecipitation) analysis and microarray gene expression data summarized in a recent publication (Kato et al., 2004). The parameters of the search were set to reveal a model consisting of a single CM that contains sites for 3–4 single matrices co-localized in a window of 200 bp. In total, 31 yeast matrices were considered, including all yeast matrices from TRANSFAC and several matrices derived from the DNA-binding consensus sequences for the corresponding factors published in Kato et al. (2004). In Table 4 we list the TFs whose weight matrices were revealed by CMA as components of the best promoter models and compare them with the factors associated with the corresponding cell cycle phase in that paper (Kato et al., 2004).

Table 4

Comparison of TFs predicted by CMA and experimentally known TFs regulating different yeast cell cycle phases

Factors found by CMA for specific cell cycle phases are marked with ‘+’. Gray table cells indicate that the factor was shown experimentally to be involved in regulating genes in the corresponding cell cycle phase. Factors FKH1 and FKH2 have very similar DNA patterns, so the results for these two factors are summarized together. The predictions made by CMA agree very well with the experimental knowledge.

4 CONCLUSION

In this paper we describe a novel method for the analysis and interpretation of gene regulatory regions. The method identifies composite modules: stable combinations of TF binding sites that are shared by most of the co-regulated promoters. It is generally accepted that such modules are responsible for function-specific regulation of transcription.

In comparison to most of the previously published approaches that consider combinatorics of TF binding sites, the method described here has several advantages, such as (1) capability to work with data of microarray experiments; (2) optimization not only of the matrix sets, but also of cut-off values for each matrix; (3) analysis of large regulatory regions; (4) search for pairs of matrices, selecting best distance and orientation.

Testing our method on simulated and real data has shown that (1) it is able to correctly reveal CMs that are overrepresented in the set of sequences; (2) it can be used to analyze data and propose factor combinations that play key roles in transcriptional regulation in the given biological context. In our previous work we demonstrated that the combinatorial approach described here significantly increases the precision of the computational prediction of target genes for TFs (Kel et al., 2004a; Kel et al., 2004b). Application of this approach to the analysis of microarray gene expression data is very promising. The Composite Module Analyst is now implemented as a part of the commercial software system ExPlain™, which provides a wide range of tools and databases for causal interpretation of gene expression data.

The authors would like to thank all co-workers of the BIOBASE GmbH who have been contributing to the integration of CMA into the ExPlain™ system. Parts of the work were funded by grant of the German Ministry of Education and Research (BMBF) together with BioRegioN GmbH ‘BioProfil’, Grant No. 0313092; ‘Intergenomics’ Project, Grant No. 031U110C/031U210C; by European Commission under FP6-‘Life sciences, genomics and biotechnology for health’ contract LSHG-CT-2004-503568 ‘COMBIO’; by European Commission under ‘Marie Curie research training networks’ contract MRTN-CT-2004-512285 ‘TRANSISTOR’ and by INTAS Grant No. 03-51-5218. Funding to pay the Open Access publication charges for this article was provided by BIOBASE GmbH.

Conflict of Interest: none declared.

REFERENCES

Aerts et al. Computational detection of cis-regulatory modules, Bioinformatics, 2003, Suppl. 2 (pg. II5–II14)

Boehlk et al. ATF and Jun transcription factors, acting through an Ets/CRE promoter module, mediate lipopolysaccharide inducibility of the chemokine RANTES in monocytic Mono Mac 6 cells, Eur. J. Immunol., 2000 (pg. 1102–1112)

Brazma, Vilo, Ukkonen. Finding transcription factor binding site combinations in yeast genome, In Frishman and Mewes,H.W. (eds), Proceedings of the German Conference on Bioinformatics GCB ‘97, 1997, Martinsried, Germany

Chen et al. Structure of the DNA-binding domains from NFAT, Fos and Jun bound specifically to DNA, Nature, 1998, vol. 392

Chen et al. TiProD: The Tissue-specific Promoter Database, Nucleic Acids Res., 2006 (pg. D104–D107)

Eisenberg and Levanon,E.Y. Human housekeeping genes are compact, Trends Genet., 2003 (pg. 362–365)

Eskin and Pevzner,P.A. Finding composite regulatory patterns in DNA sequences, Bioinformatics, 2002, Suppl. 1 (pg. S354–S363)

Fessele et al. Molecular and in silico characterization of a promoter module and C/EBP element that mediate LPS-induced RANTES/CCL5 expression in monocytic cells, FASEB J., 2001 (pg. 577–579)

Frech et al. Muscle actin genes: a first step towards computational classification of tissue specific promoters, In Silico Biol., 1998

GuhaThakurta,D. and Stormo,G.D. Identifying target sites for cooperatively binding factors, Bioinformatics, 2001 (pg. 608–621)

Kato et al. Identifying combinatorial regulation of transcription factors and binding motifs, Genome Biology, 2004 (pg. R56)

Kel,A.E. et al. MATCH: A tool for searching transcription factor binding sites in DNA sequences, Nucleic Acids Res., 2003 (pg. 3576–3579)

Kel et al. Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells, J. Mol. Biol., 1999, vol. 288 (pg. 353–376)

Kel,A.E. et al. Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors, J. Mol. Biol., 2001, vol. 309 (pg. 120)

Kel et al. A novel computational approach for the prediction of networked transcription factors of aryl hydrocarbon-receptor-regulated genes, Mol. Pharmacol., 2004 (pg. 1557–1572)

Kel,A.E., Voss, Konovalova, Tchekmenev, Wabnitz, Kel-Margoulis,O.V., Wingender. From composite patterns to pathways: prediction of key regulators of gene expression, In Giegerich and Stoye (eds), Proceedings of the German Conference on Bioinformatics (GCB 2004), 2004, Bielefeld, Germany (pg. 189–198)

Kel-Margoulis et al. TRANSCompel: a database on composite regulatory elements in eukaryotic genes, Nucleic Acids Res., 2002 (pg. 332–334)

Kel-Margoulis,O.V. et al. Automatic annotation of genomic regulatory sequences by searching for composite clusters, Pac. Symp. Biocomput., 2002 (pg. 187–198)

Liu et al. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes, Pac. Symp. Biocomput., 2001 (pg. 127–138)