Promoter prediction software




















These regions are identified in genome annotation projects by mapping repeats and available EST sequences onto a genome or by applying computational gene-finding approaches. We can then search for promoters with more precise methods in the sequences just upstream of the mapped ESTs or predicted coding regions, within intergenic space. This avoids scanning the whole genome sequence and, as a result, keeps false-positive predictions at an acceptable level.
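As a rough illustration of restricting the search to upstream intergenic space, the sketch below extracts candidate windows immediately upstream of annotated gene starts. It assumes forward-strand genes given as 0-based half-open (start, end) coordinates. The function name `upstream_regions` and the `max_upstream` parameter are hypothetical and do not come from any annotation pipeline described here.

```python
def upstream_regions(genes, max_upstream):
    """Return (start, end) windows upstream of each gene, clipped to intergenic space.

    genes: iterable of (start, end) forward-strand coordinates (assumption).
    max_upstream: maximum window length in bp (hypothetical parameter).
    """
    regions = []
    prev_end = 0  # end of the furthest-reaching gene seen so far
    for start, end in sorted(genes):
        # Do not extend the window into the previous gene.
        region_start = max(prev_end, start - max_upstream)
        if region_start < start:
            regions.append((region_start, start))
        prev_end = max(prev_end, end)
    return regions


if __name__ == "__main__":
    print(upstream_regions([(1000, 2000), (2100, 3000), (5000, 6000)], 500))
```

Only these windows would then be passed to a promoter predictor, rather than the full genomic sequence. Reverse-strand genes would need the mirrored logic, which is omitted here.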

Wasserman et al. We used several types of conserved blocks to enhance the sensitivity and specificity of the TSSW algorithm, providing pairs of aligned orthologous genomic sequences as input data. Recently, the draft sequence of the mouse genome (19) and a gene expression map of human chromosome 21 orthologs in the mouse (20) were reported. The PromH program was developed previously by exploiting conserved elements in pairs of orthologous genes of human and related species. The program correctly predicted the TSS for all 21 genes of the TATA-promoter test set, with a median deviation of 2 bp from the true site location.

Only for two genes, there was a significant 46 and bp discrepancy between predicted and annotated TSS positions. While requiring the input of pairs of orthologous sequences, such an approach demonstrated better accuracy on rather limited test data than currently available promoter predictors annotating single sequences. However, it is important to improve the accuracy of promoter prediction on single sequences due to the frequent lack of information about sequences of orthologous genes.

Moreover, no promoter prediction tool has been trained and adapted for plants. They conclude that such islands do not provide a straightforward indicator of promoter location, but such features can be used as a component of a more sophisticated promoter predictor.

Here, we investigate a new learning and discriminative technique called the transductive confidence machine (TCM), which has been trained and tested on independent sets of well-known promoters. The method presented in the paper not only makes predictions but, more importantly, also gives valid measures of confidence in the prediction for each individual example in the test set.

Moreover, the method is flexible in the sense that it can be used with almost all known classifiers, such as support vector machines (SVM), decision trees and others. The method can be applied to high-dimensional data and requires just one assumption: the examples are independent and identically distributed (the iid assumption). Some characteristics of promoters, such as the density of functional motifs, do not follow a normal distribution, which was a limitation of the discriminant analysis approach we used for promoter prediction in previous works (11). The length of all promoter and non-promoter sequences was bp.

We make only one assumption about the data-generating mechanism: all the examples have been generated independently by some fixed but unknown stochastic mechanism (the iid assumption). The method, called TCM or conformal predictors, is based on the recently developed (23-26) computable approximation to algorithmic randomness.

The randomness level is a universal measure of typicalness with respect to the class of iid distributions; if the randomness level of an example is close to 0, the example is untypical or strange. Here, the optimal algorithm for making predictions is complemented by measures of confidence and credibility (24, 27):

Predict the set Y1, …, Yk corresponding to the completion with the largest randomness level. Output as the confidence in this prediction one minus the second largest randomness level. Output as the credibility of this prediction the randomness level of the output prediction Y1, …, Yk (i.e., the largest randomness level). Credibility reflects how well our new example fits into our training set.
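The three steps above can be sketched for a single test example as follows. The sketch assumes that a randomness level (P-value) has already been computed for each candidate label; the function name `tcm_predict` and the example values are illustrative only.

```python
def tcm_predict(p_values):
    """Return (label, confidence, credibility) from per-label randomness levels.

    p_values: dict mapping each candidate label to its randomness level.
    """
    if len(p_values) < 2:
        raise ValueError("need at least two candidate labels")
    # Rank candidate labels by randomness level, largest first.
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_p), (_, second_p) = ranked[0], ranked[1]
    confidence = 1.0 - second_p  # one minus the second largest randomness level
    credibility = best_p         # randomness level of the chosen completion
    return best_label, confidence, credibility


if __name__ == "__main__":
    # Made-up randomness levels for one test sequence.
    print(tcm_predict({"promoter": 0.62, "non-promoter": 0.04}))
```

A prediction with high confidence but low credibility would signal that neither label fits the training data well.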

Intuitively, low credibility means that either the training set is non-random or the test examples are not representative of the training set. One of the advantages of this newly developed algorithm is its flexibility: almost all machine learning techniques can be used for prediction.

One way to approximate the randomness level is to use the SVM. Consider, for simplicity, the problem of binary classification with one test example. An SVM maps the original set of vectors into a high-dimensional feature space and then constructs a linear separating hyperplane (or, in the regression case, a linear regression function) in this feature space.

This was done in Gammerman et al. These approximations can be used to assess the randomness level, which is reflected in the P-values we compute under the assumption that the data are iid. For the characterization of promoter sequences, we use the sequence content and signal features that were found in our previous works to differ significantly between promoter and non-promoter sequences (11). The values of the Mahalanobis distance D² of individual characteristics reflect the power of each feature to separate signal from non-signal sequences (Table 1).
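One common way to turn per-example strangeness scores into a P-value is the rank-based formula used in conformal prediction, sketched below. This is an assumption about the computation rather than a formula given in the text: the scores are taken as already available (for an SVM they could be derived from the trained model), and the name `conformal_p_value` is hypothetical.

```python
def conformal_p_value(train_scores, new_score):
    """Fraction of examples (training plus the new one) at least as strange as the new one.

    train_scores: strangeness scores of the training examples (assumed given).
    new_score: strangeness score of the test example under a tentative label.
    """
    # The +1 counts the new example itself, so the P-value is never zero.
    at_least_as_strange = sum(1 for s in train_scores if s >= new_score) + 1
    return at_least_as_strange / (len(train_scores) + 1)


if __name__ == "__main__":
    print(conformal_p_value([0.0, 0.1, 0.2, 0.9], 0.5))
```

A P-value close to 0 means the tentative label makes the example look strange relative to the training set, which matches the interpretation of the randomness level given above.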

TATA-less promoters probably possess a more gene-specific structure and will be extremely difficult to predict by any general-purpose method. Earlier (11, 21), discriminant analysis was applied to combine these features in a linear discriminant function. In this work, we applied a more powerful pattern recognition technique that requires just one assumption: the examples are independent and identically distributed.

We trained our SVM to distinguish between promoter and non-promoter sequences using the features discussed above. The TCM procedure also provides a credibility measure for each prediction.

If for any task the consequences of making a wrong prediction are quite severe, we can choose to reject those predictions that have a low credibility value associated with them. The more severe the consequences for making an incorrect prediction are, the higher we can set the rejection threshold.
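A minimal sketch of such a rejection rule is shown below. It assumes each prediction is a dictionary with a `credibility` key; both the record layout and the function name `filter_by_credibility` are hypothetical.

```python
def filter_by_credibility(predictions, threshold):
    """Split predictions into accepted and rejected by a credibility threshold."""
    accepted = [p for p in predictions if p["credibility"] >= threshold]
    rejected = [p for p in predictions if p["credibility"] < threshold]
    return accepted, rejected


if __name__ == "__main__":
    # Made-up predictions for illustration.
    preds = [
        {"position": 120, "credibility": 0.91},
        {"position": 980, "credibility": 0.18},
    ]
    accepted, rejected = filter_by_credibility(preds, threshold=0.5)
    print(len(accepted), "accepted;", len(rejected), "rejected")
```

Raising `threshold` trades coverage for reliability: fewer sequences receive a prediction, but the ones that do fit the training data better.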

For any pair of predicted TSS, located within bp of each other, only the one with the highest credibility value is retained. In every such training and testing procedure, randomly created sets of non-promoter sequences and the same known 40 TATA and 25 TATA-less promoters were used. Accuracy of recognition is presented in Table 2. Interestingly, for TATA and TATA-less promoters the error rate of testing promoter and non-promoter sequences of nt length is higher when using negative samples from introns and CDSs, respectively.
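The rule of keeping only the most credible TSS among nearby predictions can be sketched as a greedy filter. The distance is left as a parameter; the value used in the example is arbitrary and should not be read as the one used in the work. The function name `retain_best_tss` is hypothetical.

```python
def retain_best_tss(predictions, window):
    """Keep the highest-credibility TSS among predictions within `window` bp.

    predictions: list of (position, credibility) tuples.
    window: distance in bp within which two predictions compete (assumed parameter).
    """
    kept = []
    # Visit predictions from most to least credible.
    for pos, cred in sorted(predictions, key=lambda pc: pc[1], reverse=True):
        # Keep a prediction only if no already-kept one lies within the window.
        if all(abs(pos - kept_pos) > window for kept_pos, _ in kept):
            kept.append((pos, cred))
    return sorted(kept)


if __name__ == "__main__":
    candidates = [(100, 0.5), (150, 0.9), (400, 0.3), (420, 0.2)]
    print(retain_best_tss(candidates, window=100))
```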

The real task of promoter prediction is slightly different from just discriminating between promoter and non-promoter regions: we should try to identify the most probable promoter location in a long genomic sequence.

In general, sequences When we analyzed the negative data set (constructed with random sequences), we did not observe a clear clustering, since each tool presented a different level of FP results, with the lowest level observed for iProFMWin Fig.

In this case, only sequences It is worth mentioning that the three best tools (CNNProm, iProFMWin, and 70ProPred) are from to , indicating that, as expected, promoter prediction algorithms are evolving through the years. Taken together, these results indicate that four of the eight tools analyzed here display equivalent power to identify true promoter sequences, while the widely used tool BPROM exhibits a reduced predictive capability.

[Figure: Analysis of tool performance on the positive data set (natural sequences). (A) Hierarchical clustering of DNA sequences classified as promoters (blue) or nonpromoters (black). (B) Venn diagram representing the number of sequences predicted as promoters in panel A.]

[Figure: Analysis of tool performance on the negative data set (random sequences). Hierarchical clustering of DNA sequences classified as promoters (blue) or nonpromoters (black).]

As presented above, we observed a high degree of similarity between the best tools in the identification of true promoters, but a lower overlap in the random sequences erroneously classified as promoters. This could indicate that each algorithm relies on different features to classify a sequence as a promoter.

To further investigate this process, we analyzed the information content of the sequences identified as promoters in the positive and negative data sets by the top five tools analyzed here. The results of these analyses are presented as sequence logos in Fig. It is worth noticing that the information content was higher for iProFMWin up to 0. S1 in the supplemental material. This implies that these tools are sensitive to AT content, which makes sense since iPromoter-2L and CNNProm were trained with coding sequences as negative controls (34). This might be explained by these tools classifying sequences that resemble true promoters, and we could not rule out the possibility that some of these random sequences could in fact display promoter activity in E. coli.

Taken together, these results indicate that high rates of FP results observed for some of these algorithms could be due to the use of unrealistic control sequences such as coding regions that could make the algorithms sensitive to AT-rich regions, highlighting the importance of choosing appropriate nonpromoter sequences to train these tools.
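The information content behind the sequence logos discussed above can be computed per aligned position as the maximum entropy (2 bits for DNA) minus the observed Shannon entropy. The sketch below is a plain-Python illustration of that definition under the assumption of equal-length, already aligned sequences containing only A, C, G and T. It applies no small-sample correction and is not the code used to produce the published logos.

```python
import math
from collections import Counter


def information_content(sequences, alphabet="ACGT"):
    """Per-position information content (bits) of aligned, equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must all have the same length")
    max_bits = math.log2(len(alphabet))  # 2 bits for a 4-letter alphabet
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts[base] for base in alphabet)
        # Shannon entropy of the observed base frequencies at this position.
        entropy = -sum(
            (counts[base] / total) * math.log2(counts[base] / total)
            for base in alphabet
            if counts[base]
        )
        profile.append(max_bits - entropy)
    return profile


if __name__ == "__main__":
    # Position 0 is fully conserved; position 1 is uniformly distributed.
    print(information_content(["AA", "AC", "AG", "AT"]))
```

A fully conserved position scores 2 bits and a uniformly distributed one scores 0, so AT-rich but otherwise unconstrained sequences produce low, flat profiles.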

[Figure: Analysis of the information content of DNA sequences identified as promoters in the positive data set (natural sequences).]

[Figure: Analysis of the information content of sequences identified as promoters in the negative data set (random sequences).]

In this work, we performed a benchmark analysis of the performance of promoter prediction tools using well-characterized promoter sequences and random sequences. As can be seen from the results above, new tools have emerged with enhanced performance compared to widely used ones.

Although the best performing tool uses just sequence-based features a result that corroborates with Abbas et al. It is also clear from our results that choosing the appropriate control or negative data set to construct these algorithms is crucial to avoid false-positive results.

Therefore, coding sequences or sequences with different features render the tools AT sensitive, increasing the false-positive rate. Furthermore, we still need an experimentally well-validated nonpromoter data set to faithfully use as negative controls in these predictions, but these sequences are not available yet.

In this sense, we expect that the growing number of high-throughput experiments could become a great source of data for creating novel data sets to train new promoter prediction tools in the future. Another complication comes from recent evidence showing that just one mutation in a random sequence can lead to constitutive transcription in vivo, indicating that transcription is indeed a robust process. Additionally, several sources of prior information could be incorporated into prediction methods to improve the final tools.

For instance, the interaction between the UP (upstream promoter) element and the α subunit of RNAP was found to play a role in transcription initiation and promoter activity (44) and to switch the preference of sigma factors for promoters. Additionally, more than proteins in E. These proteins could thus impact promoter activity in vivo, and their binding sequence preferences could influence promoter discovery.

[Figure: A putative model for a bacterial promoter region, including a range of experimental attributes. These regions can have a positive (blue regions) or negative (red regions) effect on promoter activity.]

A notable characteristic shared by the works mentioned here is that all the available prediction tools perform only binary classification, i.e., they label a sequence as either a promoter or a nonpromoter. Therefore, there is no indication of an activity threshold for classifying a given sequence as a promoter, even though expression levels of different bacterial transcripts are known to vary over several orders of magnitude. However, there have been some attempts in the literature to perform regression analysis instead of binary classification only.

They also added variance in the sequences that surround the core promoter and that may play a role in promoter activity. Performing a fluorescence assay to measure promoter activity and applying a partial least squares regression model, they attempted to predict promoter strength for sigma Yet, only 78 variants were characterized in this experimental design, and more variants are needed to train an accurate model. Similarly, Rhodius et al. Also, a spacer and discriminator length penalty score was added.

Notably, in vivo and in vitro expression was measured in their work, and promoter activity was also tested as a function of sigma E concentration. Additionally, partial least squares regression was used to predict promoter activity. This approach is useful for finding the elements in a given promoter sequence, and, judging by cross-validation, the results appear promising, despite the small size of the data set and the use of PWM (a model that showed poor predictive results in our work).

Moreover, instead of position weight matrices, energy matrices are being successfully built to represent the sequence-dependent binding energy, using sequence libraries with a large number of variants followed by Sort-seq experiments (flow cytometry, sorting, and next-generation sequencing) (6). These energy matrices are then employed to model promoter activity. Urtecho et al.

The authors have integrated their expression cassette on different genomic locations and have investigated its effects, applying a well-suited method for expression normalization.

Their approach therefore explained most of the variance in promoter activity and also discovered nonlinear interactions between promoter elements by employing neural networks. Despite the limitation of a data set with discrete characteristics, this approach presents a reliable method to predict promoter strength in a well-defined context, but the application of these methods to natural systems has yet to be demonstrated. One final remark is that the majority of algorithms have been created using data sets of promoters from just one bacterium, E. coli.

Consequently, since each organism has its particularities in terms of DNA binding proteins and sigma factor elements, we are still far away from having a prediction tool that can be used for several organisms. To accomplish that, we would require extensive promoter data sets from several microorganisms to construct multipurpose prediction tools.

Last, we hope the approach and metrics used here can contribute to future studies aiming to construct improved promoter prediction tools.

We started this work by searching the literature for recent and available prediction tools for E. coli. When a tool was available online or as a software download, we selected it for subsequent analysis.

Table 1 shows the summarized information about the tool methodology i. All these descriptions have been extracted from the original papers describing the tools.

Next, we analyzed some usability features of the tools such as the file format accepted as input, maximal allowed file size, the output format of results, etc. Then, we selected the ones that accepted our complete data set in multifasta format as input to perform a comparative analysis.

To compare the selected tools, we used an experimentally validated promoter data set for the well-studied E. coli. We used only sigma 70-dependent promoters since they are the best characterized in bacteria and, consequently, most tools have been developed to recognize this class of elements.

Additionally, we used a negative promoter set consisting of 1, randomly generated sequences with a nucleotide distribution similar to that of the natural sequences, constructed with an ad hoc script written in Python. It is also important to stress that many tools, such as BPROM, 70ProPred, and iProFMWin, used coding and intergenic regions as negative control sequences, but this is not appropriate since coding and noncoding regions have different nucleotide compositions and structural properties (55). In the case where the tool required the entire genome, we used the E. coli genome.

The two data sets (natural and random) used here are available as Data Sets S1 and S2 in the supplemental material.

Then, we obtain a random number between 0 and 1 and check which interval this value belongs to in order to pick a random nucleotide.

The results were evaluated by comparing the accuracy and Matthews correlation coefficient (MCC) (57), calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

We adopted MCC because it is a metric that deals with unbalanced data sets i. It achieves high scores only if TP and TN are high, considering both types of correct classification in a single metric, and it has been shown that for this type of binary classification e.
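The interval-based sampling used to build the random sequences can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' ad hoc script: base frequencies are estimated from a set of natural sequences, the unit interval is split into four consecutive sub-intervals proportional to those frequencies, and each base is chosen by locating a uniform random number in them. The function names, the toy input sequences, and the length of 50 are hypothetical.

```python
import random
from collections import Counter


def nucleotide_frequencies(sequences):
    """Estimate A/C/G/T frequencies from a collection of DNA sequences."""
    counts = Counter(base for seq in sequences for base in seq if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def random_sequence(length, freqs, rng):
    """Draw each base by locating a uniform random number in cumulative intervals."""
    bases = list(freqs)
    chosen_bases = []
    for _ in range(length):
        r = rng.random()    # uniform in [0, 1)
        cumulative = 0.0
        chosen = bases[-1]  # fallback guards against floating-point round-off
        for base in bases:
            cumulative += freqs[base]
            if r < cumulative:
                chosen = base
                break
        chosen_bases.append(chosen)
    return "".join(chosen_bases)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    freqs = nucleotide_frequencies(["ATATGCAT", "AATTCGTA"])  # toy input
    print(random_sequence(50, freqs, rng))
```

Because each position is drawn independently, the generated sequences match the overall base composition of the natural set but carry no positional signal.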

Sensitivity and specificity scores were also used to give a sense of the correct classification of promoters and nonpromoters and are defined as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$
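For reference, the four metrics can be computed from confusion-matrix counts as in the sketch below, which uses the standard textbook definitions. The function name `classification_metrics` and the example counts are made up for illustration. Undefined ratios are reported as 0.0, which is a convention chosen here rather than one stated in the text.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```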

By testing the tools with our synthetic random data set, we can measure whether those tools are overfitting their test data sets, and by testing our positive data set (with strong experimental evidence), we are measuring underfitting, since some of our positive sequences have probably already been used to train the tools' algorithms. As some of the tools also predict promoters for other sigma factors, to be able to classify all predictions as correct or wrong, we considered random sequences classified as promoters of any sigma class as FP and sigma 70 sequences classified as promoters of any other sigma class as FN.

This does not mean that a sigma 70 promoter classified as belonging to another sigma factor cannot respond to that sigma factor, or even to sigma 70, in vivo, as we discuss later. For data representation, heatmaps were created by using the R package Heatmap. The logos of count matrices, probability matrices, position weight matrices, and information matrices were constructed by using the Logomaker Python library. As the results generated by the tools have different formats, they were preprocessed using a text editor or ad hoc Python scripts.
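The scoring rule for tools that report several sigma classes, described above, can be written as a small mapping from a tool's output to a confusion-matrix cell. The label strings and the use of `None` for a nonpromoter call are assumptions made for this sketch; each tool's real output would first have to be converted to this form.

```python
def outcome(true_class, predicted_class):
    """Map one prediction to TP, FP, TN or FN.

    true_class: "sigma70" for a natural promoter, "random" for a synthetic sequence.
    predicted_class: a sigma label such as "sigma70" or "sigma38", or None
        when the tool calls the sequence a nonpromoter (assumed encoding).
    """
    if true_class == "random":
        # A random sequence called a promoter of any sigma class is a false positive.
        return "FP" if predicted_class is not None else "TN"
    # A sigma 70 promoter counts as correct only when predicted as sigma 70.
    return "TP" if predicted_class == "sigma70" else "FN"


if __name__ == "__main__":
    print(outcome("sigma70", "sigma38"))
```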

The data sets used are available for download as files in the supplemental material. All authors read and approved the final manuscript.

