Application of linker length and linker length dependency in identification of protein domains

Ling Zhang, University of Nebraska - Lincoln


In protein sequences, domains are identified as conserved units of structure, function, and evolution. Identifying protein domains is important for the functional analysis of proteins. To achieve more sensitive and accurate domain discovery, we developed novel probabilistic models of multi-domain protein architectures. Our hidden Markov model (HMM) and double-chain Markov model (DCMM) incorporate not only domain dependency but also inter-domain linker information. The HMM using domain dependency with linker lengths (HMM-DL) harnesses the domain dependency and inter-domain linker lengths observed in the training dataset to predict divergent, non-overlapping domains in protein sequences. We also developed a simulation procedure that allows us to estimate false discovery rates and false positive rates and thereby assess our approaches. In addition, we present a DCMM using domain dependency with linker lengths and linker-length dependency (DCMM-DLL) for domain prediction. By using the DCMM, which has not previously been used in bioinformatics, we remove the limitation imposed by the conditional independence assumption between observations and improve domain discovery performance. To increase the number of correct domain identifications, HMM-DL and DCMM-DLL were also extended to allow some overlapping domain identifications.
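As a rough illustration of how domain dependency and linker lengths can be combined when choosing non-overlapping domains, the sketch below scores chains of candidate domain hits with a simple dynamic program. It is not the HMM-DL or DCMM-DLL implementation from the dissertation: the function `best_architecture`, the hit tuple layout, the transition table, and the geometric linker-length model are all hypothetical stand-ins, and it does not model linker-length dependency.

```python
import math

def best_architecture(hits, log_trans, log_linker):
    """Pick the highest-scoring chain of non-overlapping domain hits.

    hits       : list of (start, end, family, log_score), inclusive coordinates
    log_trans  : function (prev_family, family) -> log transition probability
    log_linker : function (linker_length) -> log probability of that length
    All three inputs are illustrative assumptions, not the author's parameters.
    """
    if not hits:
        return []
    # Sort by end position so every possible predecessor precedes its successor.
    hits = sorted(hits, key=lambda h: h[1])
    n = len(hits)
    best = [0.0] * n      # best[j]: best chain score ending at hit j
    back = [None] * n     # back[j]: index of the previous hit in that chain
    for j, (s_j, e_j, fam_j, score_j) in enumerate(hits):
        best[j] = score_j                      # chain consisting of hit j alone
        for i in range(j):
            s_i, e_i, fam_i, _ = hits[i]
            if e_i >= s_j:
                continue                       # hits overlap, not allowed here
            linker_len = s_j - e_i - 1         # residues between the two domains
            cand = (best[i] + log_trans(fam_i, fam_j)
                    + log_linker(linker_len) + score_j)
            if cand > best[j]:
                best[j], back[j] = cand, i
    # Trace back from the best-scoring end point.
    j = max(range(n), key=best.__getitem__)
    chain = []
    while j is not None:
        chain.append(hits[j])
        j = back[j]
    return chain[::-1]

# Toy parameters (made up for illustration only).
TRANS = {("A", "B"): 0.5}

def toy_log_trans(prev_family, family):
    return math.log(TRANS.get((prev_family, family), 0.01))

def toy_log_linker(length, p=0.05):
    # Geometric distribution over linker lengths 0, 1, 2, ...
    return math.log(p) + length * math.log(1.0 - p)

candidates = [
    (1, 50, "A", 10.0),
    (40, 100, "C", 9.0),   # overlaps both A and B
    (60, 120, "B", 8.0),
]
result = best_architecture(candidates, toy_log_trans, toy_log_linker)
print([h[2] for h in result])
```

In this toy case the chain A followed by B is preferred over the single overlapping hit C, because the favorable A-to-B transition and the short linker add less penalty than the score gained from the second domain. Allowing limited overlap, as the extended models do, would amount to relaxing the `e_i >= s_j` check.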

Zhang, Ling, "Application of linker length and linker length dependency in identification of protein domains" (2016). ETD collection for University of Nebraska - Lincoln. AAI10247606.