Impact Factor: 9.957

Evolutionary conservation of RNA sequence and structure

Abstract

An RNA structure prediction from a single‐sequence RNA folding program is not evidence for an RNA whose structure is important for function. Random sequences have plausible and complex predicted structures not easily distinguishable from those of structural RNAs. How to tell when an RNA has a conserved structure is a question that requires looking at the evolutionary signature left by the conserved RNA. This question is important not just for long noncoding RNAs which usually lack an identified function, but also for RNA binding protein motifs which can be single stranded RNAs or structures. Here we review recent advances using sequence and structural analysis to determine when RNA structure is conserved or not. Although covariation measures assess structural RNA conservation, one must distinguish covariation due to RNA structure from covariation due to independent phylogenetic substitutions. We review a statistical test to measure false positives expected under the null hypothesis of phylogenetic covariation alone (specificity). We also review a complementary test that measures power, that is, expected covariation derived from sequence variation alone (sensitivity). Power in the absence of covariation signals the absence of a conserved RNA structure. We analyze artifacts that falsely identify conserved RNA structure such as the misuse of programs that do not assess significance, the use of inappropriate statistics confounded by signals other than covariation, or misalignments that induce spurious covariation. Among artifacts that obscure the signal of a conserved RNA structure, we discuss the inclusion of pseudogenes in alignments which increase power but destroy covariation.

This article is categorized under:
RNA Structure and Dynamics > RNA Structure, Dynamics and Chemistry
RNA Evolution and Genomics > Computational Analyses of RNA
RNA Evolution and Genomics > RNA and Ribonucleoprotein Evolution

RNA structure prediction from a single sequence with and without chemical modification data. (a) Sequence and structure of a SAM‐I riboswitch from Thermoanaerobacter tengcongensis determined by X‐ray crystallography (Montange & Batey, 2006). Watson–Crick base pairs are depicted with a black line. The SAM‐I structure includes a pseudoknot and a kink‐turn motif with three non‐Watson–Crick A‐G pairs. (b–d) Secondary structure predictions for the SAM‐I riboswitch by the software ViennaRNA (program: RNAfold) (Gruber et al., 2015), NUPACK (program: mfe) (Dirks & Pierce, 2003), and CONTRAfold (Do et al., 2006). Base pairs correctly predicted are depicted in red. These prediction methods cannot find pseudoknots or non‐Watson–Crick base pairs. (e) Structure prediction using both the sequence and per‐residue chemical reactivities obtained using the SHAPE‐Seq 2.0 method and deposited at https://rmdb.stanford.edu/detail/SAMRSW_1M7_0001 (Loughrey et al., 2014), predicted using ViennaRNA (RNAfold --shape <reactivities> <sequence>). (f) Distributions of the free energies and number of base pairs observed in the best structures (calculated using ViennaRNA) for 200 randomized versions of the T. tengcongensis SAM‐I sequence. (g–i) Predicted structures for one shuffled SAM‐I sequence selected for having a similar number of base pairs to the ViennaRNA prediction for the real SAM‐I sequence. (j) Structure predicted by ViennaRNA using the same randomized SAM‐I sequence and a set of randomized SHAPE reactivities
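The randomization control of panel (f) can be sketched as follows. This is only an illustration, assuming a plain mononucleotide shuffle (the caption does not state which shuffling scheme was used); the folding step would be delegated to an external program such as RNAfold and is not shown. The function name `shuffled_versions` and the toy sequence are hypothetical.

```python
import random
from collections import Counter

def shuffled_versions(seq, n, seed=0):
    """Return n shuffled copies of seq that keep its nucleotide composition."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    versions = []
    for _ in range(n):
        residues = list(seq)
        rng.shuffle(residues)  # in-place permutation of the residues
        versions.append("".join(residues))
    return versions

# Hypothetical toy sequence, not the actual SAM-I riboswitch.
toy = "GGCUUAUCAAGAGAGGUGGAGGGACUGGCCC"
controls = shuffled_versions(toy, 200)

# Every control has the same length and composition as the input.
same_composition = all(Counter(s) == Counter(toy) for s in controls)
```

Each control would then be folded in the same way as the real sequence, so that free energies and base pair counts can be compared between the two.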
RNA structure prediction using evolutionary information. Comparison of structural predictions by RNAalifold (Bernhart et al., 2008), R‐scape covariation analysis on the RNAalifold predicted structure (Rivas et al., 2017), Rfam (Kalvari et al., 2018), and CaCoFold (Rivas, 2020) for: (a) the glutamine riboswitch, which includes a crystal structure, (b) the coronavirus 3′UTR pseudoknot, which includes results from in vivo SHAPE data (Manfredonia et al., 2020), and (c) the RtR RNA, an RNA element from the tyrT operon of Escherichia coli. The structures produced with RNAalifold incorporate the Rfam structure as a constraint (option --SS_cons). The structures produced with CaCoFold perform the covariation analysis using the R‐scape two‐set mode that tests the Rfam proposed structure (option -s --fold). In blue, motifs identified by covariation alone. Figure is derived from Figures S4, S5, and S7 of Rivas et al. (2020)
Covariation not related to RNA structure. In orange, significant covariations not attributed to RNA structure. In green, significant covariations attributed to RNA structure. (a) The tRNA alignment shows three significantly covarying pairs in the AC loop not related to the RNA structure. For each pair, we provide information about the observed correlations. A:B means that the probability of finding B at the 3′ end of the pair, given there is an A at the 5′ end, that is P(B|A), is higher than 0.85; A:Bh means that it is between 0.5 and 0.85; and A:notB means that the probability of B given A is lower than 0.05. In pink, we also show a singly hydrogen‐bonded base pair with covariation support found at the junction of the AC loop and stem (Auffinger & Westhof, 1999). Some covarying pairs between the D and T loops also due to RNA structure are omitted for clarity. (b) In 6S RNA, one covarying pair between two residues contiguous in the backbone involving the first position of the RNA product (pRNA) produced by the molecule (Chen et al., 2017). (c) Multiple short‐range covariations in the mRNA‐like domain of the tmRNA (Ramrath et al., 2012). Many of the residues also show significant covariations due to the RNA structure. The covariation analysis was performed in the corresponding Rfam seed alignments using R‐scape and CaCoFold. Figure is derived from Figure 5 and Figure S6 of Rivas et al. (2020)
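The A:B, A:Bh, and A:notB notation of panel (a) amounts to thresholding the conditional probability P(B|A). A minimal sketch, assuming the two alignment columns are given as plain strings with one residue per sequence and that gaps have already been removed; the function name `label_pairs` and the toy columns are hypothetical.

```python
from collections import Counter

def label_pairs(col5, col3):
    """Label residue pairs by P(B|A), B at the 3' position given A at the 5' position."""
    joint = Counter(zip(col5, col3))  # counts of (A, B) in the same sequence
    left = Counter(col5)              # counts of A at the 5' position
    labels = []
    for a in sorted(left):
        for b in sorted(set(col3)):
            p = joint[(a, b)] / left[a]
            if p > 0.85:
                labels.append(f"{a}:{b}")      # strong association
            elif p >= 0.5:
                labels.append(f"{a}:{b}h")     # intermediate association
            elif p < 0.05:
                labels.append(f"{a}:not{b}")   # near exclusion
    return labels

# Hypothetical columns for 20 sequences.
col5 = "U" * 10 + "C" * 10
col3 = "A" * 10 + "GGGGGGAAAA"
labels = label_pairs(col5, col3)
```

With these toy columns, P(A|U) = 1.0, P(G|U) = 0, and P(G|C) = 0.6, so the labels are C:Gh, U:A, and U:notG.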
Comparison of different measures of covariation on 19 structural RNAs. We report the number of total base pairs detected as a function of the positive predictive value (PPV, fraction of predictions that are base pairs). G‐test scores are calculated using R‐scape (option --naive). EC (evolutionary coupling) scores are those provided in Weinreb et al. (2016) on the same alignments. BL‐DCA scores are calculated with the code provided in Cuturello et al. (2020). Dashed lines correspond to 50% PPV (horizontal) and 50% sensitivity (vertical). Results for each structural RNA are given in Figure S1 and Table S1. The seed alignments come from Rfam v14.2. The full alignments are those provided in Weinreb et al. (2016). The annotations of both Watson–Crick (WC) and non‐Watson–Crick (non‐WC) base pairs are derived from the PDB files using the program RNAview (Yang et al., 2003) (R‐scape option: --pdb)
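The two axes of this comparison can be sketched as follows, assuming predicted and reference base pairs are given as pairs of alignment positions. PPV follows the definition in the caption; sensitivity is taken in its usual sense, the fraction of reference base pairs that are recovered. The function name and the toy pairs are hypothetical.

```python
def ppv_and_sensitivity(predicted, reference):
    """Return (PPV, sensitivity) for predicted pairs against reference base pairs."""
    # Normalize each pair so that (i, j) and (j, i) count as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_pos = len(pred & ref)
    ppv = true_pos / len(pred) if pred else 0.0
    sensitivity = true_pos / len(ref) if ref else 0.0
    return ppv, sensitivity

# Hypothetical example: 4 predicted pairs, 6 reference base pairs.
predicted = [(1, 10), (2, 9), (3, 20), (4, 7)]
reference = [(1, 10), (2, 9), (3, 8), (4, 7), (12, 20), (13, 19)]
ppv, sens = ppv_and_sensitivity(predicted, reference)
```

Here three of the four predictions are reference base pairs (PPV 0.75), and three of the six reference base pairs are found (sensitivity 0.5).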
Spurious covariations due to misalignments. (a, top) COOLAIR alignment presented in Hawkes et al. (2016). The green asterisks indicate the position of the five spurious covariations. (a, bottom) Realignment of the same sequences using the program Infernal. The realignment supports the same structure without substitutions or gaps in the base paired positions. Derived from Figure 2d of Rivas et al. (2020). (b) Cartoon illustrating the sequence sliding effect that results in spurious covariations. (c) CaCoFold structure prediction for a collection of 312 signal recognition particle (SRP) RNAs from different species including metazoans, protozoa, plants, and bacteria (large and small). The SRP sequences are aligned using the program MUSCLE. (d) Reanalysis of the same sequences, by creating an Infernal model for just one of the sequences (Zea mays SRP) using the CaCoFold predicted structure. An alignment is produced using the Infernal program cmsearch with default E‐value cutoff. The Infernal alignment includes 75 out of the 312 SRP sequences that report a significant hit
Covariation and power of covariation analysis of the Cyrano RNA putative structure. Proposed cloverleaf structure in the long noncoding RNA Cyrano. Boxed in black, base pairs that Jones et al. (2020) describe as evolutionarily conserved. The alignment was produced by searching 100 vertebrate genomes with an Infernal model built from the human Cyrano RNA cloverleaf sequence and structure provided in Jones et al. (2020). The hypothetical miR‐7 binding site is overlined in purple. The notation describing the alignment positions is given in Figure 4 (blue box)
Examples of misidentified conserved RNA base pairs. (a) Example of three base pairs called “significantly covarying” in HOTAIR putative helix 11. The 352:370 pair (in human sequence coordinates) was called significantly covarying in both Somarowthu et al. (2015) and Tavares et al. (2019); the 334:387 and 344:380 pairs were also called significantly covarying in Somarowthu et al. (2015) but not in Tavares et al. (2019). Somarowthu's analysis calls the three base pairs significant solely on the basis that there is one compensatory mutation (circled in green) and less than 10% of the sequences are inconsistent with a canonical base pair. Tavares' analysis still calls the 352:370 pair significantly covarying even after the residues in each column are permuted to destroy all covariation. Tavares used R‐scape with command: R-scape --RAFSp --window 500 --slide 100 HOTAIR_D1.sto. Green: compensatory base pair substitutions relative to most abundant canonical base pair; blue: “half flips” (such as GC to GU); red: substitutions inconsistent with proposed base pair. In the current R‐scape, option --RAFSp can only be used in combination with --naive to report the full list of RAFS scores without the statistical test of covariation. Derived from Supplementary Figure 4 of Rivas et al. (2017) and Figure 2 of Rivas and Eddy (2018)
Covariation versus covariation power as a function of sequence diversity. (a) An illustration of the expected covariation and power in alignments of: structural RNAs (green), not structural RNAs (red), RNAs too conserved in sequence to be able to decide whether they have a conserved structure or not (blue). Operationally, the red stripe describes the region with at least six base pairs expected to covary, and no covariations observed. We use the term “observed covariation” to describe pairs that are called significant by R‐scape with an E‐value smaller than 0.05. (b–e) For several structural RNAs, the covariation and power for three different alignments are shown: the Rfam seed alignment (black), Tavares et al. (2019) high‐id alignment (blue) derived from the seed alignments by selecting sequences with high percentage identity, and Tavares only‐mammals alignment (orange) produced from the Rfam full sequences using Infernal. (f) For MALAT1, black corresponds to a 132 vertebrates alignment, and orange corresponds to a 13 mammals alignment derived from the previous one, both introduced in Tavares et al. (2019). (g,h) For 5S rRNA and RNaseP RNA, we also analyze another mammals‐only alignment that includes the same species as Tavares but where the selected sequence is the best E‐value Infernal hit per species (maroon). Details of the alignments, their covariation, and power are given in Table 1. (i) Detail of helix 3 of the 5S rRNA Tavares mammals‐only alignment. The human sequence in this alignment is located on chromosome 8 (26,136,880‐26,136,998), and it appears to be a pseudogene. (j) Detail of helix 3 of the 5S rRNA best‐hit Infernal mammals alignment. The human sequence is 1 of 16 identical genes, and belongs to the longest tandem array of 5S rRNA genes located on chromosome 1 (Sørensen et al., 1991) (e.g., chr1:228,632,631‐228,632,749, named RNA5S11). The E‐value of the Infernal search for each species is reported next to the alignment. Human coordinates are from assembly GRCh38/hg38. The alignments are provided in the Supplemental Materials
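The operational rule of panel (a) can be written as a small decision function. This is a sketch under stated assumptions: per‐pair E‐values and the number of base pairs expected to covary are taken as inputs already computed elsewhere (for example by R‐scape), and treating any single significant pair as covariation support is a simplification made here, not a criterion stated in the caption. The function name and category labels are hypothetical.

```python
def classify_alignment(evalues, expected_covarying, e_cutoff=0.05, min_expected=6):
    """Place an alignment in one of the three regions of the covariation/power plot."""
    # "Observed covariation": pairs with an E-value smaller than the cutoff.
    observed = sum(1 for e in evalues if e < e_cutoff)
    if observed > 0:
        return "covariation support"        # simplification: any significant pair
    if expected_covarying >= min_expected:
        return "power without covariation"  # the red stripe of panel (a)
    return "inconclusive"                   # too conserved to decide

# Hypothetical inputs: E-values for the proposed base pairs, expected covarying pairs.
with_support = classify_alignment([0.001, 0.3, 2.0], expected_covarying=9.5)
red_stripe = classify_alignment([0.4, 1.2, 7.0], expected_covarying=12.0)
undecided = classify_alignment([0.4, 1.2, 7.0], expected_covarying=1.3)
```

The second case is the informative negative: the alignment has enough variation to reveal covariation if a conserved structure were present, yet none is observed.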
Covariation versus power. Correlation between covariation and power for known structural RNAs, three long noncoding RNAs (lncRNAs) with proposed conserved structures, and structures proposed in coronavirus. (a) Analysis of all 3444 RNA families in Rfam 14.3 seed alignments. (b) Extended scale of the Rfam covariation/power concordance plot showing the analysis of proposed structures of the lncRNAs HOTAIR (Somarowthu et al., 2015), noncoding steroid receptor RNA activator (ncSRA) (Novikova et al., 2012) and RepA (Liu et al., 2017). (c) Concordance plots for 14 coronavirus RNA structures reported by Rfam (violet), and 106 proposed SARS‐CoV‐2 RNA structures from Rangan et al. (2020) (cyan). The alignments for the 106 Rangan structures were generated using an Infernal model constructed for each proposed sequence/structure after searching a database of 124 RefSeq Nidovirales genomes (the viral order of which coronavirus is a family) downloaded on May 1, 2020 from NCBI. Alignments are provided in the Supplemental Material
Covariation in alternative RNA structures. (a) R‐scape covariation analysis of the SAM‐I riboswitch and the purine riboswitch from alignments that include both the aptamer and the expression platform sequences, using a consensus structure that includes both the terminator and anti‐terminator alternative and overlapping helices. For the two riboswitches, both the terminator and anti‐terminator helices have covariation support. The SAM‐I riboswitch structural alignment including the terminator sequence was produced by Zhu and Meyer (2015). The purine riboswitch extended structural alignment was obtained from Ritz et al. (2013). (b) Two sets of alternative structures in U2 spliceosomal RNA: the branching interacting stem loop (BSL)/Stem‐I (Perriman & Ares, 2010), and the Stem‐IIa/Stem‐IIc alternative structures (Perriman & Ares, 2007). There is covariation evidence for three of the alternative helices: Stem‐I, Stem‐IIa, and Stem‐IIb. The sequences forming the BSL are very conserved and lack covariation. The R‐scape analysis is performed on the Rfam U2 seed alignment (RF00004)
RNA significant covariation above phylogenetic expectation, R‐scape. Statistical test for the structures of the SAM‐I riboswitch and the vertebrate telomerase RNA (vTR) using the Rfam seed alignments (with 433 and 37 sequences respectively). The SAM‐I riboswitch consensus structure is derived from the X‐ray 3D crystal structure (Montange & Batey, 2006). The vTR consensus structure is derived from Zhang et al. (2010). Using R‐scape option -s, two independent statistical tests are performed: one for the set of base pairs in the given structures and another for the rest of the pairs (blue and red respectively in panels c and d). (a,b) Depicted in green are the base pairs in the structures that significantly covary with an E‐value <0.05 using R‐scape statistical test for the proposed structures. The top six base pairs with lowest E‐values are marked with an arrow. For the SAM‐I riboswitch structure, 31 of the 38 base pairs covary above the phylogenetic signal. For the vertebrate telomerase RNA, 27 of the 107 base pairs significantly covary. For the SAM‐I riboswitch, there are two significant triple interactions not part of the proposed structure labeled “sc” (side‐covariation) and “xc” (cross covariation) respectively. (c,d) Cumulative distribution of covariation scores for the proposed base pairs (in blue) and all the rest of the pairs (in red). Covariation scores larger than 88 for the base pairs (larger than 179 for the rest of pairs) in the SAM‐I riboswitch are significantly covarying with E‐values <0.05. For the vTR, significant scores are those larger than 23 for the set of base pairs, and larger than 40 for the rest of pairs
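Why the two sets of pairs have different score thresholds can be illustrated with a toy E‐value calculation. This sketch assumes that an E‐value is the number of pairs tested multiplied by the probability of a null score at least as large as the observed one, and it uses an empirical tail estimate over a list of null scores in place of whatever tail model a real program fits. The function name, the null scores, and the set sizes are hypothetical.

```python
def evalue(score, null_scores, n_tested):
    """E-value of a score: n_tested times an empirical tail probability under the null."""
    exceed = sum(1 for s in null_scores if s >= score)
    # Add-one estimate so a score beyond every null score does not get probability zero.
    pval = (exceed + 1) / (len(null_scores) + 1)
    return n_tested * pval

# Hypothetical null distribution of covariation scores.
null_scores = list(range(10000))

# The same score tested within a small set (proposed base pairs)
# and within a large set (all remaining pairs).
e_small_set = evalue(9999, null_scores, n_tested=38)
e_large_set = evalue(9999, null_scores, n_tested=5000)
```

Under these assumptions the same score is significant at the 0.05 cutoff in the small set but not in the large one, which is the direction of the difference between the thresholds quoted for the base pairs and for the rest of the pairs.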
Spurious pairwise covariations can arise from uncorrelated substitutions on a phylogenetic tree. Two aligned positions (gray background) with identical mutual information, (a) one resulting from two independent substitutions (C to U and G to A) that happen to occur at the same branch in the phylogenetic tree, (b) the other resulting from four pairs of compensatory substitutions preserving a canonical RNA base pair (two C:G base pairs becoming U:A, and two U:A becoming C:G). Each of the four pairs of compensatory substitutions occurs at a different branch in the phylogenetic tree
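The point of this figure is that mutual information depends only on the residue counts in the two columns, not on where the substitutions happened on the tree. A minimal sketch, with hypothetical toy columns in which row order stands in for phylogenetic grouping:

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))  # joint residue counts
    count_i = Counter(col_i)            # marginal counts, column i
    count_j = Counter(col_j)            # marginal counts, column j
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Changes clustered in one block of sequences (as if on a single branch) ...
mi_clustered = mutual_information("CCCCUUUU", "GGGGAAAA")
# ... versus the same residues scattered across the alignment.
mi_scattered = mutual_information("CUCUCUCU", "GAGAGAGA")
```

Both calls give 1 bit, because the column counts are identical. A score of this kind therefore cannot separate one coincidental event on a single branch from several independent compensatory events, which is why a phylogeny‐aware null is needed.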
Three different patterns of sequence conservation with different implications for inferring RNA structure. (a) For the vertebrate telomerase RNA, a helix from the Rfam seed alignment (RF00024). The pattern of substitutions (calculated relative to consensus CCCC…GGGG) supports the helix being conserved throughout evolution. (b) From HOTAIR domain 1, putative helix 3 from the structural alignment provided by Somarowthu et al. (2015). The substitutions are mostly incompatible with the annotated helix. (c) Putative helix 11 from the same HOTAIR structural alignment in (b). The small number of changes makes it inconclusive whether the helix exists or not. In green, residues that preserve the structural annotation by making a compensatory base pair substitution relative to the consensus base pair; in blue, a half change (such as G:C to G:U) that also preserves the base pair; in red, changes that break the proposed base pair; and in gray, residues that are not analyzed. We display the mutual information (MI) of each of the original base pairs, and after the residues in each column are permuted to destroy all covariation
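The green, blue, and red coloring can be sketched as a classification of each observed pair relative to the consensus base pair. The sketch assumes that the consensus pair is canonical and that canonical pairs are the four Watson–Crick pairs plus G:U and U:G; the function name and category labels are hypothetical.

```python
# Pairs treated as compatible with a helix (assumption: Watson-Crick plus G:U wobble).
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def classify_substitution(consensus, observed):
    """Classify an observed pair relative to a canonical consensus base pair."""
    if observed == consensus:
        return "unchanged"
    if observed not in CANONICAL:
        return "inconsistent"  # red: breaks the proposed base pair
    changed = sum(1 for o, c in zip(observed, consensus) if o != c)
    # Both residues changed and still paired: compensatory (green).
    # One residue changed and still paired: half change (blue).
    return "compensatory" if changed == 2 else "half change"

consensus = ("G", "C")
results = [classify_substitution(consensus, pair)
           for pair in [("A", "U"), ("G", "U"), ("G", "A"), ("G", "C")]]
```

Counting these categories per column pair describes the pattern of substitutions, but it is not a significance test by itself.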