Design and Implementation of Search Engine for Myanmar Language Web Documents （ミャンマー語ウェブ文書サーチエンジンの設計と開発）

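Name: PANN YU MON
Degree: Doctor of Engineering
Degree number: Haku-Kō No. 592 (博甲第592号)
Date conferred: August 31, 2011
Dissertation title: Design and Implementation of Search Engine for Myanmar Language Web Documents (ミャンマー語ウェブ文書サーチエンジンの設計と開発)
Dissertation review committee
　Chief examiner: Professor 三上　善貴
　Examiner: Professor 浅井　達雄
　Examiner: Professor 吉川　敏則
　Examiner: Associate Professor 五島　洋行
　Examiner: Associate Professor 湯川　高志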

Dissertation Table of Contents

Table of Contents　Page
List of tables　p.IV
List of figures　p.VI
Abstract　p.1
Acknowledgement　p.3
Chapter1　Introduction　p.5
　1.1　Introduction　p.5
　1.2　Characteristic of Myanmar Scripts　p.6
　1.3　Standardization of Fonts and its Encodings Issue　p.6
　1.4　Characteristic of Myanmar Web Page on the Web　p.8
　1.5　Internet Usage in Myanmar　p.8
　1.6　Specific Features of Myanmar Language　p.9
　1.6.1　Myanmar Compound Words　p.9
　1.6.2　Myanmar Stacked Words　p.10
　1.7　Research Motivation　p.13
　1.8　Background and Related Studies　p.14
　1.9　Objective of the Study　p.17
　1.10　Contributions　p.17
Chapter2　Language Specific Search Engines Architecture　p.19
　2.1　Introduction　p.19
　2.1.1　History of Search Engines （From 1945 to Today Google）　p.21
　2.1.2　Limitation of Search Engines　p.24
　2.1.3　Why Language Specific Search Engine is needed?　p.25
　2.2　Proposed Algorithm of Search Engine　p.26
　2.2.1　Language Specific Web Crawling　p.27
　2.2.2　Pre-processing tasks　p.27
　2.2.2.1　Transcoding　p.27
　2.2.2.2　Tokenization　p.30
　2.2.2.2.1　Review of tokenization methods for the Asian languages　p.30
　2.2.3　Index Term Extraction　p.33
　2.2.4　Query Processing Module　p.39
　2.2.5　Ranking Engine Module　p.41
　2.2.6　User Interface　p.42
　2.3　System Configuration　p.45
　2.4　Summary　p.45
Chapter3　Language Specific Web Crawler Architecture　p.47
　3.1　Introduction　p.47
　3.1.1　Non Focused Web Crawler （Typical Web Crawler）　p.48
　3.1.2　Focused Web Crawler　p.49
　3.1.3　Example of Web Crawlers　p.53
　3.1.4　Limitation of Focused Web Crawler　p.56
　3.1.5　Why LSC is needed?　p.58
　3.2　The different approaches on LSC　p.59
　3.3　Proposed Algorithm of Language Specific Web Crawler　p.60
　3.3.1　Crawler Algorithm　p.60
　3.3.2　Language Identification Module　p.64
　3.3.3　Preparation of Training Corpus　p.65
　3.3.4　Accuracy of Language Identifier　p.67
　3.4　System Configuration　p.69
　3.5　Summary　p.69
Chapter4　Performance Evaluation　p.71
　4.1　Introduction　p.71
　4.2　Retrieval Effectiveness of LSC　p.72
　4.2.1　Relevant page acquisition rate （Recall Rate）　p.72
　4.2.2　Crawling experience through its performance　p.77
　4.2.3　Evaluation of Crawling coverage　p.78
　4.3　Retrieval Efficiency of LSC　p.79
　4.3.1　Elapsed Crawling Time　p.79
　4.3.2　Form of Presentation　p.79
　4.4　Retrieval Effectiveness of Search Engine　p.80
　4.4.1　Recall and Precision Rate　p.80
　4.4.2　Recall Rate　p.82
　4.4.3　Retrieval Effectiveness of Search Engine for specific features of the Myanmar language　p.85
　4.5　Retrieval Efficiency of Search Engine　p.88
　4.5.1　Elapsed Indexing Time　p.88
　4.5.2　Indexer size　p.89
　4.5.3　Query throughput　p.89
　4.5.4　Query Latency　p.89
　4.5.5　Form of Presentation　p.90
　4.6　Summary　p.90
Chapter5　An analysis of Myanmar Web pages on the Web　p.92
　5.1　Introduction　p.92
　5.2　Determining the Optimum number of depth level for Crawling Process　p.92
　5.3　The distribution of the Myanmar Web Page on the Web　p.94
　5.3.1　By Top Level Domain （TLDs）　p.95
　5.3.2　By Physical Location of Web servers　p.95
　5.4　Analysis of Encodings used in downloaded Myanmar Web page　p.98
　5.4.1　Encoding vs. Number of Web Sites　p.98
　5.4.2　TLD vs. Encodings　p.99
　5.5　Summary　p.99
Chapter6　Conclusion and Future Directions　p.100
　6.1　Introduction　p.100
　6.2　Crawling Issues　p.100
　6.2.1　HTTP Error Codes　p.101
　6.2.2　Practical Web Crawling Issues　p.101
　6.3　Conclusion　p.103
　6.4　Future Directions　p.104
　6.4.1　Improvement of Ranking Algorithm　p.104
　6.4.2　More Efficient Algorithm for Text Retrieval for large volume database　p.105
　6.4.3　Collecting More Myanmar Head Words　p.105
　6.4.4　Socio-linguistic analysis of Myanmar Web, based on graph structure analysis　p.105
　6.5　Summary　p.106
References　p.107
Appendix　p.112

Dissertation Abstract

With the enormous growth of the World Wide Web, search engines play a critical role in retrieving information from the borderless Web. Although many search engines can search for content in numerous major languages, they cannot search pages in less-computerized languages such as Myanmar, because Myanmar Web pages use multiple non-standard encodings.
The reason is that those search engines do not take the specific features of those languages into account.
The Myanmar language, spoken by more than 30 million people as their first language, is the official language used in the administrative, judicial and commercial systems throughout the nation. Using Myanmar on Web sites is therefore more beneficial for Myanmar-speaking people, regardless of whether it is their first or second language. Besides, many Web users in Myanmar are not native English speakers, and some do not know English at all.
In that scenario, a search engine capable of searching Web documents written in the Myanmar language is highly needed, especially as more and more sites come up with localized content in multiple encodings. The lack of a search tool specially designed for the Myanmar language motivated me to do this research.
In this study, an attempt is made to design and implement a search engine for Myanmar language Web documents. The research produced a complete language-specific search engine comprising a Language Specific Crawler (LSC), indexer modules and a query engine module, all of which are optimized for the Myanmar language.
Since the Web is a distributed, dynamic and rapidly growing information resource, a normal Web crawler cannot download all pages from the entire Web. A language-specific search engine therefore needs an LSC to collect the targeted pages. The LSC implemented in this study consists of multi-threaded software objects that run concurrently with a language identifier. The LSC relies wholly on the technique of language identification to selectively collect the relevant documents on the Web.
The language identifier customized for this study achieved a 93% accuracy rate.
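The idea of a crawler steered by a language identifier can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record does not describe how the thesis's identifier works, so a simple share-of-characters test against the Unicode Myanmar block (U+1000 to U+109F) stands in for it, the crawler is single-threaded rather than multi-threaded, `fetch` is a hypothetical caller-supplied function, and the policy of not following links from irrelevant pages is only one possible design. A check of this kind would also miss pages written in non-standard font encodings that reuse other code points.

```python
from collections import deque

# Unicode "Myanmar" block, U+1000..U+109F (stand-in for a real language identifier).
MYANMAR_BLOCK = range(0x1000, 0x10A0)


def myanmar_ratio(text):
    """Share of non-whitespace characters that fall in the Myanmar block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(ord(c) in MYANMAR_BLOCK for c in chars) / len(chars)


def is_myanmar(text, threshold=0.5):
    """Toy language decision; the threshold is an arbitrary assumption."""
    return myanmar_ratio(text) >= threshold


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl that keeps only pages judged to be Myanmar.

    fetch(url) -> (text, links) is a hypothetical function supplied by the caller.
    Links are followed only from pages that pass the language check.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    collected = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        visited += 1
        if not is_myanmar(text):
            continue  # irrelevant page: do not expand its links
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to try on an in-memory toy Web.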
The indexer module consists of a few pre-processing steps, such as transcoding (conversion of non-standard encodings to a standard one) and word tokenization (segmentation of a sentence into individual words), together with various language resources (stop word list, synonym list, code conversion table, etc.) created in this study. The indexer built in this study is specially designed to adequately handle the specific features of the Myanmar language.
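The pre-processing chain described above (transcoding, tokenization, stop-word removal, indexing) can be sketched roughly as follows. This is not the thesis's implementation: the code conversion table maps a made-up private-use code point, the head-word and stop-word lists are toy examples, and longest-match segmentation against a head-word list is assumed here only because the record does not say which tokenization method was used.

```python
from collections import defaultdict

# Toy language resources (all hypothetical; the thesis builds its own).
MYANMAR = "\u1019\u103c\u1014\u103a\u1019\u102c"  # မြန်မာ
SA = "\u1005\u102c"                               # စာ
GENITIVE = "\u104f"                               # ၏ (MYANMAR SYMBOL GENITIVE)

HEAD_WORDS = {MYANMAR, SA}
STOP_WORDS = {GENITIVE}
# Made-up legacy code point -> Unicode; a real table would cover a whole font encoding.
LEGACY_TO_UNICODE = str.maketrans({"\ue019": "\u1019"})


def transcode(text, table=LEGACY_TO_UNICODE):
    """Convert text in a (hypothetical) non-standard encoding to Unicode."""
    return text.translate(table)


def tokenize(text, head_words=HEAD_WORDS):
    """Longest-match segmentation; unknown characters become single tokens."""
    max_len = max(map(len, head_words))
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in head_words:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, raw in docs.items():
        for term in tokenize(transcode(raw)):
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index
```

With these toy lists, `tokenize(MYANMAR + SA)` splits the compound into its two head words. A production tokenizer would also need to avoid splitting inside a syllable cluster, which this sketch does not attempt.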

The query engine module also consists of a few processing steps (parsing of the query sentence, normalization, searching, ranking, etc.) and related language resources (stop word list), and is tuned to fit the specific features of the Myanmar language.
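A rough sketch of these query-side steps is given below, under several assumptions that do not come from the thesis: the query is split on whitespace (a real Myanmar query would need word segmentation, since words are not separated by spaces), Unicode NFC normalization stands in for whatever normalization the thesis applies, the index is assumed to be normalized the same way, and documents are ranked by summed term frequency.

```python
import unicodedata
from collections import Counter

STOP_WORDS = {"\u104f"}  # ၏ (MYANMAR SYMBOL GENITIVE); toy stop-word list


def parse_query(query, stop_words=STOP_WORDS):
    """Normalize the query, split it into terms, and drop stop words."""
    normalized = unicodedata.normalize("NFC", query.strip())
    return [term for term in normalized.split() if term not in stop_words]


def search(query, index):
    """Rank documents by summed term frequency.

    index maps term -> {doc_id: term frequency} (a hypothetical layout).
    Returns (doc_id, score) pairs, best score first, ties broken by doc_id.
    """
    scores = Counter()
    for term in parse_query(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Summed term frequency is the simplest possible ranking signal; the table of contents lists improvement of the ranking algorithm as a future direction, so the actual ranking is likely different.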
The evaluation is done for the LSC and search engine separately.
Generally, a crawler must be evaluated on its ability to retrieve the "targeted pages". In this study, two aspects of crawler performance have been evaluated: retrieval effectiveness and retrieval efficiency. For retrieval effectiveness, the relevant page acquisition rate, crawling performance and crawling coverage of the LSC are measured and evaluated. For retrieval efficiency, elapsed crawling time and the controllability of the crawler are measured and/or evaluated. Results show that the implemented Language Specific Crawler algorithm collected Myanmar pages at a satisfactory level of coverage and produced a high harvest rate.
For the search engine, a series of experiments was carried out to check whether it meets the design requirements. Two kinds of evaluation were conducted: those concerned with retrieval effectiveness, and those concerned with the efficiency of the search engine.
For retrieval effectiveness, four indicators (recall rate, precision rate, accuracy rate and error rate) are measured and evaluated. For retrieval efficiency, elapsed indexing time, indexer size, query throughput, query latency and user interface are measured and evaluated.
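For reference, the four effectiveness indicators are commonly computed from the counts of true and false positives and negatives. The sketch below uses these textbook definitions; the record does not state the exact formulas used in the thesis, so treating them as identical is an assumption, and the function and parameter names are illustrative.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Textbook effectiveness indicators from confusion-matrix counts.

    tp: relevant documents retrieved      fp: irrelevant documents retrieved
    fn: relevant documents missed         tn: irrelevant documents not retrieved
    """
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "error_rate": (fp + fn) / total if total else 0.0,
    }
```

Accuracy and error rate always sum to 1 for a non-empty collection, so reporting both mainly serves readability.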
Finally, based on the downloaded Web pages, an analysis of Myanmar Web pages was conducted, including determining the physical location of the servers hosting Myanmar Web pages and analyzing the usage of different encodings. These results provide useful information for the development of better crawlers and for the sociolinguistic study of Myanmar Web documents.

Summary of the Dissertation Review Results

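This dissertation, titled "Design and Implementation of Search Engine for Myanmar Language Web Documents", consists of six chapters.
Chapter 1, Introduction, discusses the need to develop a dedicated search engine for the Myanmar language, along with the characteristics of the language and of its writing system. It sets out the problems to be solved and the research objectives: multiple encoding schemes coexist; pages written in Myanmar are only sparsely distributed across the Web and are therefore hard to collect; the same word can have different spellings; and compound words need to be segmented appropriately.
Chapter 2, Language Specific Search Engine Architecture, describes the design of a search engine that addresses these problems. The search engine consists of subsystems for query parsing, index search, ranking and display, each of which is described in detail.
Chapter 3, Language Specific Web Crawler Architecture, describes the design of a crawler that selectively collects Myanmar-language Web documents.
Chapter 4, Performance Evaluation, evaluates the designed and implemented system in terms of the coverage, efficiency and accuracy of the collected data, as well as its accuracy as a retrieval system, using indicators such as precision, recall and crawling performance. The dissertation estimates that about 1.2 million Myanmar-language pages currently exist on the Web, and confirms that the system collects them at a level adequate for practical use and that, as a search engine, it returns the results that Myanmar-language users expect.
Chapter 5, The Analysis of Myanmar Web Documents, uses the results of the dissertation to clarify how Myanmar-language Web documents are actually distributed on the Web.
Chapter 6, Conclusion and Future Work, states that the dissertation has realized, at a practical level, the first Myanmar-language search engine that meets the original research and development goals, and that an original architecture and various language resources (such as a head-word list that includes proper nouns, and a stop-word list) were developed to achieve this. As future work, it discusses topics such as developing a ranking algorithm that meets the expectations of Myanmar-language users.
The results of this dissertation improve convenience for Internet users of the Myanmar language and will also contribute substantially to future Myanmar-language information processing. The dissertation therefore makes a significant engineering and industrial contribution, and is recognized as having sufficient value as a dissertation for the degree of Doctor of Engineering.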

Nagaoka University of Technology Library