IsoSeq transcriptome assembly of C3 panicoid grasses provides tools to study evolutionary change in the Panicoideae

Abstract The number of plant species with genomic and transcriptomic data has been increasing rapidly. The grasses (Poaceae) have been well represented among species with published reference genomes. However, the genomes of wild grasses are less frequently targeted by sequencing efforts. Sequence data from wild relatives of crop species in the grasses can aid the study of domestication and gene discovery for breeding and crop improvement, and can improve our understanding of the evolution of C4 photosynthesis. Here, we used long-read sequencing technology to characterize the transcriptomes of three C3 panicoid grass species: Dichanthelium oligosanthes, Chasmanthium laxum, and Hymenachne amplexicaulis. Based on alignments to the sorghum genome, we estimate that assembled consensus transcripts from each species capture between 54.2% and 65.7% of the conserved syntenic gene space in grasses. Genes co-opted into C4 were also well represented in this dataset, despite concerns that because these genes might play roles unrelated to photosynthesis in the target species, they would be expressed at low levels and missed by transcript-based sequencing. A combined analysis using syntenic orthologous genes from grasses with published reference genomes and consensus long-read sequences from these wild species was consistent with previously published phylogenies. We hope that these data, targeting underrepresented classes of species within the PACMAD grasses (wild species and species utilizing C3 photosynthesis), will aid future studies of domestication and C4 evolution by decreasing the evolutionary distance between C4 and C3 species within this clade, enabling more accurate comparisons associated with evolution of the C4 pathway.


The pace of plant genome sequencing has accelerated in recent years. However, despite decreases in sequencing costs and improvements in genome assembly quality, species selected for whole genome sequencing often meet one or more of the following criteria: A) agricultural importance, B) status as a genetic model system or C) ecological importance. Sequence data from species which lack direct economic, ecological, or genetic model importance can enable comparative analyses to address [...]

[...] variants of the C4 photosynthetic pathway (Schnable et al., 2009; Garsmeur et al., 2018; Paterson et al., 2009; Bennetzen et al., 2012). As a result, while published whole genome sequence assemblies exist for at least 14 grasses within the PACMAD clade (Table 1), only one of these (Dichanthelium oligosanthes, a wild species) (Studer et al., 2016) utilizes C3 photosynthesis. Long-read sequencing can effectively generate sequence for large numbers of full-length cDNAs even in species lacking reference genome assemblies (An et al., 2018; Zhang et al., 2019). One concern with utilizing this technology for comparative genetic studies is that the higher error rate, particularly the frequency of insertion and deletion errors, makes data from long-read-based sequencing of non-model species unsuitable for use in comparative evolutionary analyses (Gonzalez-Garay, 2016). However, we previously found that observed synonymous substitution rates calculated from consensus sequences constructed using the PacBio IsoSeq pipeline were not elevated relative to a sister lineage where gene sequences were taken from a Sanger-based whole genome assembly, indicating sequence data obtained in this manner may indeed be suitable for comparative evolutionary analyses (Yan et al., [...]

[...] required to include an in-frame stop codon but were not required to include an in-frame "ATG"
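The criterion stated in the fragment above (an in-frame stop codon is required, an in-frame "ATG" is not) can be illustrated with a minimal sketch. This is not the pipeline used in this study: the function name `longest_stop_terminated_orf`, the restriction to the forward strand, and the use of the standard stop codons are assumptions made only for illustration.

```python
# Illustrative sketch only; not the authors' implementation.
# Assumptions: forward strand only, standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_stop_terminated_orf(seq):
    """Return the longest in-frame codon run that ends at a stop codon.

    No in-frame ATG is required: a run begins at the start of the frame
    or immediately after the previous in-frame stop codon. Returns an
    empty string when no frame contains a stop codon.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = frame
        # Walk the frame one complete codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                candidate = seq[start:i + 3]  # include the stop codon
                if len(candidate) > len(best):
                    best = candidate
                start = i + 3  # next run begins after this stop
    return best


# A partial transcript with no ATG is still accepted if it has a stop.
with_stop = longest_stop_terminated_orf("GCTGCTGCTTAA")
# A sequence with no in-frame stop in any forward frame is rejected.
without_stop = longest_stop_terminated_orf("GCTGCTGCT")
```

Relaxing the start-codon requirement in this way keeps transcripts whose 5' end is incomplete, while the stop-codon requirement still anchors the reading frame at the 3' end.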

The number of raw reads generated per species was largely consistent and ranged from 708,681 to [...] in a given target species (Table 3). Here we used only a single library constructed per species, rather than multiple libraries constructed using different size fractions, used RNA from a single tissue rather than pooled RNA from multiple tissue types, and were conducting com- [...] improvements between the RSII and Sequel iterations of this sequencing technology.

Manual curation was used to assess the coverage and quality of sequences retrieved from these three C3 photosynthesis-utilizing PACMAD species for five genes known to be involved in C4 photosynthesis in C4 photosynthesis-utilizing PACMAD species: PPDK, PEPC, NADP-MDH, NAD-ME and DCT2. In four cases, the representative transcript identified from each of the three [...] these sets of conserved syntenic genes and corresponding transcripts from 0, 1, 2, or 3 of the target species is provided as part of Supplemental Material 1.

One potential concern in using transcriptome data from species utilizing C3 photosynthesis to provide sequence data for comparative genetic and evolutionary analyses of C4 is that enzymes [...]

Among the 700 remaining gene trees, 304 (43%) produced a single topology consistent with the prior literature on the relationship of these species (Figure 3). The second and third most common topologies were each represented by less than 7% of all calculated trees, 47 and 44 cases respectively.
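The proportions quoted above can be checked with a short calculation. The sketch below uses only the counts given in the text (700 trees; 304, 47 and 44 for the three most common topologies); the dictionary keys and the function name `topology_shares` are hypothetical labels introduced for illustration.

```python
# Worked arithmetic for the gene tree topology counts reported in the text.
# The labels "first", "second" and "third" are hypothetical names.
TOTAL_TREES = 700
counts = {"first": 304, "second": 47, "third": 44}


def topology_shares(counts, total):
    """Return the percentage of all trees supporting each topology."""
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = topology_shares(counts, TOTAL_TREES)

# Trees supporting any topology other than the three most common ones.
remaining = TOTAL_TREES - sum(counts.values())

for name, pct in shares.items():
    print(f"{name}: {counts[name]} / {TOTAL_TREES} = {pct:.1f}%")
print(f"other topologies: {remaining} / {TOTAL_TREES}")
```

The most common topology accounts for about 43.4% of trees, the next two for about 6.7% and 6.3%, and the remaining 305 trees are spread across other topologies.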

The second and third most common topologies differed from prior published phylogenies regarding the placement of H. amplexicaulis. In the second most common topology, H. amplexicaulis was placed sister to all other panicoid grass species other than C. laxum. In the third most common topology, H. amplexicaulis was placed sister to the Paniceae. A parallel analysis was conducted using all 4,908 conserved orthologous gene groups, including many cases with substantially shorter regions of high quality multiple sequence alignment. The pattern of trees recovered was largely consistent with that in Figure 3. In the "all genes" analysis, the same most common topology was retrieved as in the long-alignment-only analysis. The second most common topology in the "all genes" analysis corresponds to the third most common topology in Figure 3, while the third most common topology in the "all genes" analysis places C. laxum as sister to the combined Chloridoideae and Panicoideae (Supplemental Figure S3).