Background: Selenocysteine (Sec) is a rare amino acid that occurs in proteins in all three domains of life. It is encoded by TGA, which also serves as a signal for termination of translation, so available annotation tools cannot identify selenoprotein genes. Information on full sets of selenoproteins (selenoproteomes) is essential for understanding the biology of selenium. Herein, we characterized the selenoproteome of the largest microbial sequence dataset, the Sargasso Sea environmental genome project.

Results: We identified 310 selenoprotein genes that clustered into 25 families, including 101 new selenoprotein genes that belonged to 15 families. Most of these proteins were predicted to be redox proteins containing catalytic selenocysteines. Several bacterial selenoproteins belonging to families previously thought to be restricted to eukaryotes were detected by analyzing eukaryotic and bacterial SECIS elements, suggesting that eukaryotic and bacterial selenoprotein sets partially overlap. The Sargasso Sea microbial selenoproteome was rich in selenoproteins, and its composition differed from that observed in the combined set of completely sequenced genomes, suggesting that these genomes do not accurately represent the microbial selenoproteome. Most detected selenoproteins occurred sporadically, in contrast to the widespread presence of their cysteine homologs, suggesting that many selenoproteins evolved recently from cysteine-containing homologs.

Conclusions: This study yielded the largest selenoprotein dataset to date, doubled the number of prokaryotic selenoprotein families, and provided insights into the forces that drive selenocysteine evolution.
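The annotation problem described in the Background can be illustrated with a minimal Python sketch. It is not the method used in this study: the sequence, the partial codon table, and the `translate` helper are hypothetical, and the sketch ignores the SECIS element that determines whether a given TGA is actually decoded as Sec. It only shows that a translator treating every TGA as a stop truncates a selenoprotein at the Sec codon.

```python
# Toy codon table covering only the codons used below (not the full genetic code).
CODONS = {
    "ATG": "M", "GCT": "A", "GGT": "G", "AAA": "K",
    "TGA": "*", "TAA": "*", "TAG": "*",
}

def translate(dna, tga_as_sec=False):
    """Translate from the first base until the first stop codon.

    With tga_as_sec=True, an in-frame TGA is read as selenocysteine (U)
    instead of a stop; TAA and TAG still terminate translation.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon == "TGA" and tga_as_sec:
            protein.append("U")
            continue
        amino_acid = CODONS[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical open reading frame: ATG GCT TGA GGT AAA TAA
seq = "ATGGCTTGAGGTAAATAA"
standard = translate(seq)                      # TGA treated as a stop
readthrough = translate(seq, tga_as_sec=True)  # TGA treated as Sec

print(standard)
print(readthrough)
```

Under the standard reading the predicted product ends just before the TGA, while the read-through version continues to the next stop codon; a real search would additionally require evidence such as a SECIS element or a cysteine-containing homolog.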