A study on Document Representation for Clustering Using Similarity Rough Set Model and Semantic Similarity (類似に基づくラフ集合モデルと意味的類似を用いたクラスタリングのための文章表現に関する研究)
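Name: NGUYEN CHI THANH
Degree: Doctor of Engineering
Diploma number: 博甲第609号
Date of conferral: March 26, 2012 (Heisei 24)
Dissertation title: A study on Document Representation for Clustering Using Similarity Rough Set Model and Semantic Similarity (類似に基づくラフ集合モデルと意味的類似を用いたクラスタリングのための文章表現に関する研究)
Dissertation examination committee
Chief examiner: Professor 山田 耕一
Examiner: Professor 福村 好美
Examiner: Professor 三上 喜貴
Examiner: Associate Professor 湯川 高志
Examiner: Associate Professor マーラシンハ チャンドラジット アーシュボーダ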
Table of Contents page
Acknowledgements p.i
Abstract p.ii
Table of Contents p.iv
List of Figures p.vii
List of Tables p.viii
Chapter 1: Introduction p.1
1.1 Introduction p.1
1.2 Overview of document clustering study p.2
1.2.1 Document clustering applications p.2
1.2.2 Document clustering techniques p.6
1.3 Objective of the research p.8
1.4 Organization of the dissertation p.9
Chapter 2: Overview of document clustering methods p.11
2.1 Introduction p.11
2.2 Clustering evaluation methods p.11
2.3 Document clustering algorithms p.14
2.3.1 Hierarchical clustering methods p.14
2.3.2 Partitioning clustering methods p.15
2.4 Document representation models for document clustering p.17
2.4.1 Vector space model p.17
2.4.2 Latent semantic indexing p.20
2.4.3 Tolerance rough set model p.22
2.4.4 WordNet semantic similarity based models p.31
Chapter 3: Similarity Rough Set Model for document representation and document clustering p.37
3.1 Introduction p.37
3.2 A new definition of tolerance rough set model for document clustering p.38
3.3 Similarity rough set model for document clustering p.41
3.4 Design of the experiment p.48
3.5 Experimental results p.54
3.6 Conclusion p.64
Chapter 4: WordNet Based Similarity Rough Set Model for document representation and document clustering p.66
4.1 Introduction p.66
4.2 SRSM and WordNet semantic similarity based models p.67
4.3 WordNet based similarity rough set model for document clustering p.69
4.4 Experimental results p.71
4.5 Conclusions p.78
Chapter 5: Recommendation for future researches p.80
5.1 Two pass approach for document collection with large overlap of terms p.80
5.2 Semi-supervised learning in document clustering p.86
Chapter 6: Conclusions p.89
References p.91
Appendix p.102
List of Publications p.103
Today, with the development of the Internet, the volume of text data is growing fast. The sheer size of document collections makes it difficult to organize them efficiently and to extract useful information from them. As a useful tool for mining text data, document clustering has attracted the interest of many researchers. Document clustering is the process of grouping similar documents into classes. To apply a clustering algorithm to a document collection, the documents must first be represented as vectors of words. Most document clustering methods in use today are based on the vector space model. However, this representation model does not consider the semantic relatedness of words, which can lead to poor clustering results. Several approaches have been proposed to add semantic awareness to document representation, such as the tolerance rough set model, which extends the rough set model with a tolerance relation, and WordNet semantic similarity models, which modify the vector space model by readjusting the weights of terms in the documents. These approaches improve on the vector space model, but they still have disadvantages.
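The limitation of the vector space model noted above can be illustrated with a minimal sketch. It assumes raw term counts and cosine similarity, which is one common setup and not necessarily the weighting used in the dissertation; the function names and sample texts are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    # Bag-of-words vector: term -> raw count (no stemming or stop-word removal).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

d1 = term_vector("car engine failed")
d2 = term_vector("automobile motor broke")  # same topic, different words
d3 = term_vector("car engine repair")       # shares two surface terms with d1

print(cosine(d1, d2))  # no shared terms, so the similarity is zero
print(cosine(d1, d3))  # shared terms give a positive similarity
```

Because d1 and d2 share no surface terms, the model treats them as unrelated even though they describe the same thing, which is the gap that semantic-aware representations aim to close.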
This research proposes two new models for document representation. The first is the similarity rough set model (SRSM), which extends the vector space model using a generalized definition of the rough set model based on a similarity relation. In SRSM, the semantic relation between terms is calculated from the co-occurrence of terms, which lets us define similarity relations automatically without any knowledge base. We applied a clustering algorithm to term vectors that consist of the terms in the upper approximations of ordinary document vectors. Using upper approximations of document vectors adds awareness of semantic similarity to the representation model and reduces the sparsity of the document space. Experiments were carried out on two document collections to compare the SRSM based document clustering method with conventional methods. The results showed that the proposed method delivers better clustering quality than the others on all three quality measures in use: entropy, mutual information, and F measure.
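The idea of building a similarity relation from term co-occurrence and then taking the upper approximation of a document can be sketched as follows. This is an illustrative sketch only: the relative-threshold rule, the parameter `alpha`, and the function names are assumptions made here, not the definitions given in the dissertation.

```python
from collections import defaultdict
from itertools import permutations

def similarity_classes(docs, alpha=0.5):
    """Map each term t to the set of terms judged similar to t.

    Assumed rule (hypothetical): u is similar to t when u co-occurs with t
    in at least alpha * (number of documents containing t) documents.
    The relation is reflexive but need not be symmetric.
    """
    doc_freq = defaultdict(int)  # term -> number of documents containing it
    co_freq = defaultdict(int)   # (t, u) -> number of documents containing both
    for doc in docs:
        terms = set(doc)
        for t in terms:
            doc_freq[t] += 1
        for t, u in permutations(terms, 2):
            co_freq[(t, u)] += 1
    classes = {t: {t} for t in doc_freq}  # reflexivity: t is similar to itself
    for (t, u), n in co_freq.items():
        if n >= alpha * doc_freq[t]:
            classes[t].add(u)
    return classes

def upper_approximation(doc, classes):
    # Every term whose similarity class shares at least one term with the document.
    terms = set(doc)
    return {t for t, cls in classes.items() if cls & terms}

docs = [
    ["rough", "set", "model"],
    ["rough", "set"],
    ["set", "cluster"],
    ["cluster", "document"],
]
classes = similarity_classes(docs, alpha=0.6)
expanded = upper_approximation(["rough", "set"], classes)
```

In this toy collection, "rough" belongs to the class of "model" but "model" does not belong to the class of "rough", so the relation is not symmetric. The document ["rough", "set"] is expanded with "model", which shows how upper approximations can make document vectors less sparse.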
The second proposed model is the WordNet based SRSM, which combines SRSM with WordNet semantic similarity to integrate the advantages of both models. SRSM can automatically generate semantic relatedness information from a document collection without any knowledge base, while semantic knowledge from the WordNet ontology provides a highly reliable similarity measure between terms. The upper approximations of document vectors are calculated from both the co-occurrence of terms and WordNet based semantic similarity. The weights of terms in a document are also adjusted using WordNet based semantic similarity. Experimental results on two document collections showed that the WordNet based SRSM method improves clustering results over both the SRSM and the WordNet semantic similarity based methods.
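This dissertation, titled "A Study on Document Representation for Clustering Using Similarity Rough Set Model and Semantic Similarity", consists of six chapters. Chapter 1, "Introduction", describes the usefulness of text mining in the information society, the position of document clustering within text mining, the current state and applications of document clustering techniques, and the objective and scope of this research.
Chapter 2, "Overview of document clustering methods", surveys existing techniques and research on methods for evaluating clustering results, clustering algorithms, and document representation methods for clustering. For evaluation, it describes three representative measures that are also used in this dissertation. Clustering algorithms are broadly divided into hierarchical and partitioning methods, and the existing algorithms are summarized. For document representation, it outlines the Vector Space Model, Latent Semantic Indexing, the Tolerance Rough Set Model, and WordNet Semantic Similarity Based Models, and identifies the problems of these methods.
Chapter 3, "Similarity Rough Set Model for document representation and document clustering", proposes the Similarity Rough Set Model (SRSM) to solve the problems of VSM and TRSM. TRSM is based on a binary relation that satisfies reflexivity and symmetry (a tolerance relation), whereas SRSM is based on a binary relation that satisfies only reflexivity (a similarity relation). In this chapter, TRSM, SRSM, and a VSM based set of tools (CLUTO TOOLKIT) are applied to two document collections for clustering, and the clustering results of the proposed SRSM are shown to be the best.
Chapter 4, "WordNet Based Similarity Rough Set Model for document representation and document clustering", devises a new similarity relation that combines the co-occurrence of words in documents with word similarity in the electronic dictionary WordNet, and proposes a model based on that relation. Applying this model to the two document collections used in Chapter 3 yields clustering results that are rated better than those of SRSM.
Chapter 5, "Recommendation for future researches", discusses the remaining issues and future research, and finally Chapter 6, "Conclusions", summarizes the results of the research as a whole.
Accordingly, this dissertation makes a substantial contribution to engineering and industry, and is recognized as having sufficient value as a dissertation for the degree of Doctor of Engineering.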
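The difference between the two relations named in the Chapter 3 summary can be written out formally. The notation below is an assumption made for illustration (T for the set of terms, R for a binary relation on T) and is not taken from the dissertation.

```latex
% T: set of terms; R \subseteq T \times T: binary relation on terms (assumed notation)
\begin{align*}
\text{reflexivity:} \quad & \forall t \in T,\; t \, R \, t \\
\text{symmetry:}    \quad & \forall t, u \in T,\; t \, R \, u \;\Rightarrow\; u \, R \, t
\end{align*}
% Tolerance relation (TRSM): reflexivity and symmetry both hold.
% Similarity relation (SRSM): only reflexivity is required.
```

Dropping symmetry allows a term u to count as similar to t without t counting as similar to u, which a relative co-occurrence criterion can produce when one term is much more frequent than the other.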