Functional annotation of the animal genomes: an integrated annotation resource for the horse

The genomic sequence of the horse has been available since 2009, providing critical resources for discovering important genomic variants regarding both animal health and population structures. However, to fully understand the functional implications of these variants, detailed annotation of the horse genome is required. Currently, the horse genome is annotated using limited available RNA-seq data, as well as through comparative genomics by translating human and mouse genome annotation. While this approach has served the equine researchers well and led to various discoveries that would improve care and management of horses, many important questions remain unanswered. The limitation of the current annotation is two pronged. First, the heavy reliance on short-read sequencing-based RNA-seq data meant that alternate isoforms could not be accurately resolved. Second, epigenomic regulatory elements are crucial to detailed understanding of gene expression network but are yet to be systemically identified in the horse. Many of these regulatory elements, including enhancers, promoters, and insulators, are either not transcribed or transcribed at a very low level, necessitating alternate approaches to identify them. To solve above problems, the Functional Annotation of the Animal Genomes (FAANG) project proposed a systemic approach to tissue collection, phenotyping, and data generation, adopting the blueprint laid out by the Encyclopedia of DNA Elements (ENCODE) project. Here we detail the first comprehensive overview of gene expression and regulation in the horse, providing the equine research community an expanded set of genomics resources for studies of complex traits in the horse. Author Summary Functional annotation of a reference genome provides critical information that pertains the tissue-specific gene expression and regulation. Non-model organisms often rely on existing annotations of human and mouse genomes and the conservation between species for their genome annotation. This approach has limited power in annotating transcripts and regulatory elements that are less evolutionarily conserved. Such are the cases of alternatively spliced isoforms and enhancer elements. In a large-scale collaborated effort, Functional Annotation of Animal Genome (FAANG) aims to generate species-specific and tissue-aware functional annotation for farm animals. In this study, we present the overall annotation efforts and findings from the equine FAANG group. This integrated annotation for the horse genome provides, for the first time, a comprehensive overview of gene expression and regulation landscape in nine major equine tissues, as well as an analytical framework for further inclusion of other important tissues.


62
A reference genome for the horse has been available since 2007 (1)

105
While the complex molecular mechanism through which this process is regulated 106 remains an active field of research, a growing body of evidence points to histone protein post-107 translational modifications as an important intermediary of transcription regulation (30-32).

111
Additionally, H3K27ac is enriched around active elements and associated with higher levels of 112 gene expression (38). On the other hand, H3K27me3 is usually found around genes that are not 113 active (39). Taken together, these protein modifications can be strong indicators of functional 114 activities in specific genomic regions.

115
To improve AS annotation for the horse transcriptome and to systemically categorize 116 these epigenetic features and identify potential CREs in the equine genome, we collected over

143
The tissue-specific expression of these transcripts was quantified using short-read RNA-144 seq data from 57 tissues of the same animals, nine of which were the same tissues used to 145 generate Iso-seq data. Approximately 78% of known isoforms were expressed in at least half of 146 the tissues sequenced, while novel isoforms of known genes and novel intergenic transcripts each showed a bimodal distribution, with 44.3% of novel isoforms and 56.8% of intergenic 148 transcripts detected in less than half of the tissues ( Figure 1C). We also noted that, on average,  Figure 1D).

157
Since only nine tissues were used to construct this Iso-seq transcriptome, it was 158 expected that many tissue-specific genes and transcripts will be missing. Indeed, several 159 tissues had decreased mapping rates when aligned to the Iso-seq transcriptome as compared     Table 1). For each tissue, 11-22% peaks were 199 located within promoter-TSS regions. However, the same pattern was not observed in tissue-200 specific peaks. Only 3-5% of tissue specific peaks were in the promoter-TSS regions, while 201 substantially more peaks were located in intronic or intergenic regions (17-22% intronic, 21-34% 202 intergenic peaks across tissues; 23-26% intronic, 23-45% intergenic tissue-specific peaks).

203
Motif analyses of these intergenic regions revealed a diverse range of TF binding sites, such as  Table 2).

209
Unsurprisingly, TSS accessibility showed significant correlations with their corresponding 211 gene expression. Figure 5 shows a representative differential accessibility and expression 212 analysis of parietal cortex and heart tissues. Overall, approximately 16% of peaks showed 213 differential accessibility (FDR adjusted p<0.05), with 25,144 peaks (7.6%) more accessible in 214 cortex and 29,206 peaks (8.8%) more accessible in heart ( Figure 5A). The log 2 fold-change

215
(log 2 FC) of differentially accessible regions (DAR) was significantly correlated with log 2 FC of 216 differentially expressed genes (DEG) in the same cortex and heart samples (one-sided Wald 217 test, p<1 x 10 -5 , Pearson correlation coefficient r=0.4, Figure 5B). After selecting peak-gene 218 pairs whose FDR adjusted p values from both DAR and DEG analyses were below 0.05, we 219 observed that most genes were located in quadrants 1 and 3 (Q1 and Q3 respectively), showing 220 concordant changes in promoter-TSS accessibility and gene expression ( Figure 5C). GO 221 enrichment analyses showed that Q1 genes were primarily associated with neural activities 222 while muscular and cardiac related GO terms were enriched among Q3 genes (Supp Tables 3   223  and 4). There were also 175 and 177 genes in Q2 and Q4, respectively. GO Terms related to 224 synaptic activity were enriched in Q2 (Supp Table 5) while Q4 genes were overrepresented in 225 actin-filament based processes.

227
Cis-regulatory Element Annotation

228
Chromatin states were first identified using four major histone modifications (H3K4me1,

230
14 unique states, corresponding to enhancer, promoter, and insulator states of various degrees 231 of activities, as well as polycomb repressed state were identified ( Figure 6A). Notably, the 232 CTCF-bound active TSS state (state 4), co-enriched with CTCF and active promoter marks 233 (H3K4me3 and H3K27ac), was highly enriched around TSS, whereas the CTCF-less active 234 TSS state (state 3) was more enriched at approximately 500 bp up-and down-stream of TSS. genome, with the polycomb repressed state (state 13) covering the largest portion of the 237 genome across tissues, followed by enhancer states (states 6-10, Figure 6B). While promoter 238 states only accounted for 3-5% of the genome, or around 20% of all annotated states, they 239 comprised over 50% of states annotated at TSS regions ( Figure 6B, C).

240
To correlate gene expression with chromatin state annotation, companion RNA-seq data 241 for each tissue from FAANG was used to quantify the equine FAANG transcriptome via quasi-242 mapping (45). Transcript level quantification was then summarized to gene level using tximport   of REs in these tissues. These annotated REs will be an invaluable addition to the equine 408 reference genome assembly. The similar annotation provided by ENCODE has led to 409 discoveries of many regulatory variants in various diseases (22,24,59

444
From the outset of the equine FAANG initiative, researchers were invited to "adopt"

460
Nine tissues (lamina, liver, left lung, left ventricle of heart, longissimus muscle, skin, 461 parietal cortex, testis, and ovary) from the FAANG biobank (40,41) were selected for Iso-seq to represent a wide range of biological functions and therefore, gene expression. RNA for Iso-seq 463 was extracted separately from the same tissues as mRNA-seq using the same protocol. All 464 tissues were processed in one batch for Iso-seq, except for parietal cortex, which was 465 processed in a separate batch as a pilot study. One sample per sex per tissue was selected for 466 sequencing based on sample availability and RNA integrity numbers (RINs selected > 7  procedure was employed to subsequently merge biological replicates and then all tissue peak 529 sets to generate a union set of peaks. A count matrix was constructed for the union peak set 530 containing number of transposition events per peak per sample. This count matrix was used for 531 differential accessibility analyses using DESeq2 (80). The union peak set was then intersected 532 with each tissue peak set to determine if a peak was present in each tissue. Peaks only 533 identified in one tissue type were denoted "unique" peaks while those identified in all nine 534 tissues were denoted as "conserved".

648
Changes in percentages of properly paired reads aligned to combined Iso-seq transcriptome 649 when compared to Ensembl or RefSeq transcriptomes, whichever has higher percentage. The proportion of segments from each state that were identified in different numbers of tissues.