RESULTS: The study generated 846,762 high quality sequence reads, with an average length of 334 bp and totalling 283 Mbp. De novo assembly generated 36,384 and 35,269 unigene sequences for M. acuminata Calcutta 4 and Cavendish Grande Naine, respectively. A total of 64.4% of the unigenes were annotated through Basic Local Alignment Search Tool (BLAST) similarity analyses against public databases.Assembled sequences were functionally mapped to Gene Ontology (GO) terms, with unigene functions covering a diverse range of molecular functions, biological processes and cellular components. Genes from a number of defense-related pathways were observed in transcripts from each cDNA library. Over 99% of contig unigenes mapped to exon regions in the reference M. acuminata DH Pahang whole genome sequence. A total of 4068 genic-SSR loci were identified in Calcutta 4 and 4095 in Cavendish Grande Naine. A subset of 95 potential defense-related gene-derived simple sequence repeat (SSR) loci were validated for specific amplification and polymorphism across M. acuminata accessions. Fourteen loci were polymorphic, with alleles per polymorphic locus ranging from 3 to 8 and polymorphism information content ranging from 0.34 to 0.82.
CONCLUSIONS: A large set of unigenes were characterized in this study for both M. acuminata Calcutta 4 and Cavendish Grande Naine, increasing the number of public domain Musa ESTs. This transcriptome is an invaluable resource for furthering our understanding of biological processes elicited during biotic stresses in Musa. Gene-based markers will facilitate molecular breeding strategies, forming the basis of genetic linkage mapping and analysis of quantitative trait loci.
FINDINGS: We optimized the assembly of a Hevea bark transcriptome based on 16 Gb Illumina PE RNA-Seq reads using the Oases assembler across a range of k-mer sizes. We then assessed assembly quality based on transcript N50 length and transcript mapping statistics in relation to (a) known Hevea cDNAs with complete open reading frames, (b) a set of core eukaryotic genes and (c) Hevea genome scaffolds. This was followed by a systematic transcript mapping process where sub-assemblies from a series of incremental amounts of bark transcripts were aligned to transcripts from the entire bark transcriptome assembly. The exercise served to relate read amounts to the degree of transcript mapping level, the latter being an indicator of the coverage of gene transcripts expressed in the sample. As read amounts or datasize increased toward 16 Gb, the number of transcripts mapped to the entire bark assembly approached saturation. A colour matrix was subsequently generated to illustrate sequencing depth requirement in relation to the degree of coverage of total sample transcripts.
CONCLUSIONS: We devised a procedure, the "transcript mapping saturation test", to estimate the amount of RNA-Seq reads needed for deep coverage of transcriptomes. For Hevea de novo assembly, we propose generating between 5-8 Gb reads, whereby around 90% transcript coverage could be achieved with optimized k-mers and transcript N50 length. The principle behind this methodology may also be applied to other non-model plants, or with reads from other second generation sequencing platforms.