A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses
An, Shaokun [1]; Ren, Jie [2]; Sun, Fengzhu [2]; Wan, Lin [1,3]
2022-04-22
Journal: JOURNAL OF COMPUTATIONAL BIOLOGY
ISSN: 1066-5277
Pages: 18
Abstract: The statistical inference of high-order Markov chains (MCs) for biological sequences is vital for molecular sequence analyses but can be hindered by the high dimensionality of free parameters. In the seminal article by Bühlmann and Wyner, the variable length Markov chain (VLMC) model was proposed to embed the full-order MC in a sparse structured context tree. In the key tree-pruning procedure of their proposed context algorithm, a word count-based statistic was defined for each branch and compared with a fixed cutoff threshold calculated from a common chi-square distribution to decide whether to prune that branch of the context tree. In this study, we find that the word counts for each branch are highly intercorrelated, resulting in non-negligible effects on the distribution of the statistic of interest. We demonstrate that the context tree inferred by the original context algorithm of Bühlmann and Wyner, which uses a fixed cutoff threshold based on a common chi-square distribution, can be systematically biased and error prone. We denote the original context algorithm as VLMC-Biased (VLMC-B). To solve this problem, we propose a new context tree inference algorithm using an adaptive tree-pruning scheme, termed VLMC-Consistent (VLMC-C). VLMC-C is founded on consistent branch-specific mixed chi-square distributions calculated from the asymptotic normal distribution of multiple word patterns. We validate our theoretical branch-specific asymptotic distribution using simulated data. We compare VLMC-C with VLMC-B on context tree inference using both simulated and real genome sequence data and demonstrate that VLMC-C outperforms VLMC-B in both context tree reconstruction accuracy and model compression capacity.
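Illustrative sketch (not taken from the article): the fixed-cutoff pruning step that the abstract attributes to the original context algorithm (VLMC-B) might look roughly like the Python below. It assumes a likelihood-ratio form of the word count-based statistic, 2 * N(context) * KL(P(.|context) || P(.|parent)), and a DNA alphabet, so the common chi-square distribution has 3 degrees of freedom. The function names and the constant `CHI2_95_DF3` are hypothetical. The branch-specific mixed chi-square cutoffs of VLMC-C are not specified in the abstract and are not implemented here.

```python
import math
from collections import Counter

# Assumption: 0.95 quantile of a chi-square distribution with 3 degrees
# of freedom (4-letter alphabet minus 1), used as the fixed cutoff.
CHI2_95_DF3 = 7.815


def next_symbol_counts(seq, context):
    """Count the symbols that immediately follow each occurrence of `context`."""
    k = len(context)
    counts = Counter()
    for i in range(k, len(seq)):
        if seq[i - k:i] == context:
            counts[seq[i]] += 1
    return counts


def pruning_statistic(seq, context):
    """Likelihood-ratio statistic comparing a context with its parent.

    The parent context drops the oldest (leftmost) symbol. This form of
    the statistic is an assumption made for illustration.
    """
    child = next_symbol_counts(seq, context)
    parent = next_symbol_counts(seq, context[1:])
    n_child = sum(child.values())
    n_parent = sum(parent.values())
    stat = 0.0
    for sym, n in child.items():
        p_child = n / n_child
        # parent[sym] >= n > 0, since every child occurrence is also a parent occurrence
        p_parent = parent[sym] / n_parent
        stat += n * math.log(p_child / p_parent)
    return 2.0 * stat


def keep_branch(seq, context, cutoff=CHI2_95_DF3):
    """Keep the branch only if its statistic exceeds the fixed cutoff."""
    return pruning_statistic(seq, context) > cutoff


seq = "ACGT" * 50
print(keep_branch(seq, "A"))   # context "A" changes the next-symbol distribution
print(keep_branch(seq, "TA"))  # "TA" adds nothing beyond its parent "A"
```

Every branch here is tested against the same cutoff, which is the practice the abstract identifies as a source of bias when word counts are intercorrelated.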
Keywords: biological sequence analyses; consistent context algorithm; variable length Markov chains; word count statistics
DOI: 10.1089/cmb.2021.0604
Indexed by: SCI
Language: English
Funding: National Key Research and Development Program of China [2019YFA0709501]; National Natural Science Foundation of China [12071466]; National Institutes of Health [R01GM120624]; National Institutes of Health [1R01GM131407]; National Science Foundation [EF-2125142]
WOS research areas: Biochemistry & Molecular Biology; Biotechnology & Applied Microbiology; Computer Science; Mathematical & Computational Biology; Mathematics
WOS categories: Biochemical Research Methods; Biotechnology & Applied Microbiology; Computer Science, Interdisciplinary Applications; Mathematical & Computational Biology; Statistics & Probability
WOS accession number: WOS:000792234400001
Publisher: MARY ANN LIEBERT, INC
Document type: Journal article
Item identifier: http://ir.amss.ac.cn/handle/2S8OKBNM/60418
Collection: Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Corresponding author: Wan, Lin
Author affiliations:
1. Chinese Acad Sci, Acad Math & Syst Sci, Beijing, Peoples R China
2. Univ Southern Calif Los Angeles, Quantitat & Computat Biol Dept, Los Angeles, CA USA
3. Chinese Acad Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Recommended citation:
GB/T 7714: An, Shaokun, Ren, Jie, Sun, Fengzhu, et al. A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses[J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 2022: 18.
APA: An, Shaokun, Ren, Jie, Sun, Fengzhu, & Wan, Lin. (2022). A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses. JOURNAL OF COMPUTATIONAL BIOLOGY, 18.
MLA: An, Shaokun, et al. "A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses". JOURNAL OF COMPUTATIONAL BIOLOGY (2022): 18.
 
