Text classification based on multi-word with support vector machine
Zhang, Wen1; Yoshida, Taketoshi1; Tang, Xijin2
2008-12-01
Journal: KNOWLEDGE-BASED SYSTEMS
ISSN: 0950-7051
Volume: 21, Issue: 8, Pages: 879-886
Abstract: One of the main themes supporting text mining is text representation, whose task is to find appropriate terms for transforming documents into numerical vectors. Recently, much effort has been invested in this topic to enrich text representation using the vector space model (VSM) and so improve the performance of text mining techniques such as text classification and text clustering. The main concern of this paper is to investigate the effectiveness of using multi-words for text representation on the performance of text classification. Firstly, a practical method is proposed to extract multi-words from documents based on syntactical structure. Secondly, two strategies, general concept representation and subtopic representation, are presented to represent documents using the extracted multi-words. In particular, dynamic k-mismatch is proposed to determine the presence of a long multi-word that is a subtopic of the content of a document. Finally, we carried out a series of experiments on classifying the Reuters-21578 documents using the multi-word representations. We used the performance of the representation in individual words, which has the largest feature set dimension for representation without linguistic preprocessing, as the baseline. Moreover, the linear kernel and the non-linear polynomial kernel in support vector machines (SVM) are compared for classification to investigate the effect of kernel type on performance. Index terms with low information gain (IG) are removed from the feature set at different percentages to observe the robustness of each classification method. Our experiments demonstrate that, within multi-word representation, subtopic representation outperforms general concept representation, and the linear kernel outperforms the non-linear kernel of SVM in classifying the Reuters data. The effect of applying different representation strategies is greater than the effect of applying different SVM kernels on classification performance. Furthermore, the representation using individual words outperforms any representation using multi-words. This is consistent with the prevailing opinions concerning the role of linguistic preprocessing of documents' features when using SVM for text classification. (C) 2008 Elsevier B.V. All rights reserved.
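The abstract mentions removing index terms with low information gain (IG) at different percentages. The following is a minimal sketch of that kind of filter, not the authors' implementation: it uses the standard definition of IG for a binary "term present / term absent" split, and the function names, the toy documents, and the labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """IG of a term: class entropy minus the entropy that remains after
    splitting the documents by presence or absence of the term."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    remaining = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remaining


def drop_low_ig_terms(docs, labels, drop_fraction):
    """Rank the vocabulary by IG and drop the lowest-scoring fraction."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return ranked[:keep]


# Hypothetical toy corpus: each document is a set of index terms.
docs = [{"crude", "oil"}, {"oil", "price"}, {"wheat", "crop"}, {"crop", "price"}]
labels = ["energy", "energy", "grain", "grain"]

kept = drop_low_ig_terms(docs, labels, drop_fraction=0.2)
print(kept)
```

In this toy corpus "price" occurs equally often in both classes, so its IG is zero and it is the term removed when 20% of the vocabulary is dropped. The surviving terms would then serve as the dimensions of the document vectors passed to a classifier such as an SVM.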
Keywords: Text classification; Multi-word; Feature selection; Information gain; Support vector machine
DOI: 10.1016/j.knosys.2008.03.044
Language: English
Funding: Ministry of Education, Culture, Sports, Science and Technology of Japan; National Natural Science Foundation of China [70571078]; National Natural Science Foundation of China [70221001]
WOS research area: Computer Science
WOS category: Computer Science, Artificial Intelligence
WOS accession number: WOS:000261736300017
Publisher: ELSEVIER SCIENCE BV
Document type: Journal article
Item identifier: http://ir.amss.ac.cn/handle/2S8OKBNM/6219
Collection: Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Corresponding author: Zhang, Wen
Affiliations:
1. Japan Adv Inst Sci & Technol, Sch Knowledge Sci, Tatsunokuchi, Ishikawa 9231292, Japan
2. Chinese Acad Sci, Inst Syst Sci, Acad Math & Syst Sci, Beijing 100190, Peoples R China
Recommended citation
GB/T 7714: Zhang, Wen, Yoshida, Taketoshi, Tang, Xijin. Text classification based on multi-word with support vector machine[J]. KNOWLEDGE-BASED SYSTEMS, 2008, 21(8): 879-886.
APA: Zhang, Wen, Yoshida, Taketoshi, & Tang, Xijin. (2008). Text classification based on multi-word with support vector machine. KNOWLEDGE-BASED SYSTEMS, 21(8), 879-886.
MLA: Zhang, Wen, et al. "Text classification based on multi-word with support vector machine". KNOWLEDGE-BASED SYSTEMS 21.8 (2008): 879-886.