RESULTS: We show that SYMRK is essential for nodulation and endomycorrhization in Parasponia andersonii. We further show that the 5'-intron donor splice site of SYMRK intron 12 is variable and that, in most dicotyledon species, it contains not the canonical dinucleotide 'GT' signature but the much less common motif 'GC'. Strikingly, in T. orientalis, this motif is converted into a rare non-canonical 5'-intron donor splice site, 'GA'. This SYMRK allele is nevertheless fully functional and spreads in the T. orientalis population of Malaysian Borneo. A further investigation into the occurrence of non-canonical GA-AG splice sites confirmed that they are extremely rare.
CONCLUSION: SYMRK function is highly conserved in legumes, actinorhizal plants, and Parasponia. The gene possesses an uncommon 5'-intron GC donor splice site in intron 12, which is converted into GA in T. orientalis accessions of Malaysian Borneo. The discovery of this functional GA-AG splice site in SYMRK highlights a gap in our understanding of splice donor sites.
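The donor-site categories described above (canonical GT, less common GC, rare others such as GA) can be illustrated with a minimal Python sketch. The function name `classify_intron`, the label strings, and the toy intron sequences are assumptions made for illustration only; they are not SYMRK sequences and not the authors' pipeline.

```python
def classify_intron(intron: str) -> tuple[str, str]:
    """Return the donor-acceptor dinucleotide pair and a label for the donor.

    The labels follow the wording of the abstract: 'GT' is canonical,
    'GC' is much less common, and any other donor is treated as rare.
    """
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("an intron needs at least 4 nucleotides")
    donor, acceptor = seq[:2], seq[-2:]  # first and last two bases of the intron
    if donor == "GT":
        label = "canonical"
    elif donor == "GC":
        label = "less common"
    else:
        label = "rare"
    return f"{donor}-{acceptor}", label


# Toy sequences (hypothetical, not real SYMRK introns)
for toy in ("GTAAGTTTTCAG", "GCAAGTTTTCAG", "GAAAGTTTTCAG"):
    print(toy, classify_intron(toy))
```

Classifying on the donor dinucleotide alone mirrors the focus of the abstract; a real survey would also need to confirm intron boundaries against transcript evidence.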
FINDINGS: We optimized the assembly of a Hevea bark transcriptome from 16 Gb of Illumina PE RNA-Seq reads using the Oases assembler across a range of k-mer sizes. We then assessed assembly quality based on transcript N50 length and transcript mapping statistics in relation to (a) known Hevea cDNAs with complete open reading frames, (b) a set of core eukaryotic genes and (c) Hevea genome scaffolds. This was followed by a systematic transcript mapping process in which sub-assemblies built from incrementally larger amounts of bark reads were aligned to transcripts from the entire bark transcriptome assembly. The exercise served to relate read amounts to the level of transcript mapping, the latter being an indicator of the coverage of gene transcripts expressed in the sample. As the read amount (data size) increased toward 16 Gb, the number of transcripts mapped to the entire bark assembly approached saturation. A colour matrix was subsequently generated to illustrate the sequencing depth required for a given degree of coverage of total sample transcripts.
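As an aside on the N50 metric used above, the following Python sketch shows the standard definition: the length L such that transcripts of length L or longer account for at least half of the total assembled bases. The function name `n50` and the example lengths are hypothetical and are not taken from the Hevea assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of transcript (or contig) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the running sum always reaches half")


# Hypothetical transcript lengths in bases
print(n50([200, 300, 400, 500, 600]))
```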
CONCLUSIONS: We devised a procedure, the "transcript mapping saturation test", to estimate the amount of RNA-Seq reads needed for deep coverage of transcriptomes. For Hevea de novo assembly, we propose generating 5 to 8 Gb of reads, with which around 90% transcript coverage could be achieved with optimized k-mers and transcript N50 length. The principle behind this methodology may also be applied to other non-model plants, or to reads from other second-generation sequencing platforms.
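The idea behind the saturation test can be sketched in Python under simplifying assumptions: for each read amount, the number of full-assembly transcripts hit by the corresponding sub-assembly is already known, and the smallest read amount that reaches a target coverage is reported. The function names, the input layout, and the toy counts below are hypothetical; they are not the published Hevea figures, and the alignment step itself is not shown.

```python
from typing import Optional


def mapping_fractions(mapped: dict[float, int], total: int) -> dict[float, float]:
    """Fraction of full-assembly transcripts mapped at each read amount (Gb)."""
    if total <= 0:
        raise ValueError("total transcript count must be positive")
    return {gb: mapped[gb] / total for gb in sorted(mapped)}


def min_reads_for_coverage(
    mapped: dict[float, int], total: int, target: float = 0.9
) -> Optional[float]:
    """Smallest read amount whose mapping fraction reaches the target, else None."""
    for gb, fraction in mapping_fractions(mapped, total).items():
        if fraction >= target:
            return gb
    return None


# Toy counts for illustration only (read amount in Gb -> transcripts mapped)
toy_mapped = {1: 450, 2: 700, 3: 850, 4: 920, 5: 960}
print(min_reads_for_coverage(toy_mapped, total=1000, target=0.9))
```

Returning `None` when the target is never reached keeps the sketch from implying saturation that the data do not support.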