Automatic language identification for written text with byte-sequence-based N-gram algorithm (バイト列Nグラムアルゴリズムを用いたテキストの自動言語判別)
氏名 CHEW YEW CHOONG
学位の種類 博士(工学)
学位記番号 博甲第591号
学位授与の日付 平成23年6月30日
学位論文題目 Automatic language identification for written text with byte-sequence-based N-gram algorithm (バイト列Nグラムアルゴリズムを用いたテキストの自動言語判別)
論文審査委員
主査 教授 三上 喜貴
副査 教授 淺井 達雄
副査 准教授 岩橋 政宏
副査 准教授 湯川 高志
副査 産業技術大学院大学准教授 中鉢 欣秀
[平成23(2011)年度博士論文題名一覧] [博士論文題名一覧]に戻る.
Table of Contents
LIST OF TABLES p.8
LIST OF FIGURES p.10
LIST OF ABBREVIATIONS/SYMBOLS p.11
ABSTRACT p.12
CHAPTER1 INTRODUCTION p.16
DIGITAL LANGUAGE DIVIDE p.17
MEASURE LANGUAGES ON THE INTERNET p.19
LANGUAGE IDENTIFICATION p.21
ORGANIZATION OF THIS THESIS p.24
CHAPTER2 RELATED WORKS p.26
CAVNAR TRENKLE ALGORITHM p.27
SUZUKI ALGORITHM p.28
MARTIN ALGORITHM p.29
CONCERN ON EXISTING STUDIES p.31
CHAPTER3 DATA p.34
TRAINING CORPUS p.34
VALIDATION CORPUS p.36
CHAPTER4 MIXED N-GRAM ORDER MODEL p.40
N-GRAM p.40
FIXED N-GRAM ORDER p.42
MIXED N-GRAM ORDER p.42
CHAPTER5 BYTE-SEQUENCE BASED HTML PARSER p.43
UNRELIABLE HTML AND XML'S LANGUAGE ATTRIBUTES p.43
HYPER TEXT MARKUP LANGUAGE AND HTML PARSER p.44
CHAPTER6 UNICODE AND HTML CHARACTER ENTITIES p.50
CHAPTER7 N-GRAM BASED CHLASSIFIERS p.55
N-GRAM FREQUENCY STATISTICS p.55
RANK-ORDER STATISTICS (ROS) p.55
ACCUMULATIVE N-GRAM FREQUENCIES ADDITION (AFA) p.59
DISTINCT N-GRAM BOOLEAN MATCHING (DNBM) p.60
CHAPTER8 EXPERIMENTS AND RESULTS p.63
EXPERIMENT 1 EVALUATION OF LANGUAGE IDENTIFICATION PERFORMANCE BASED ON TRAINING CORPUS A p.63
EXPERIMENT 2 EVALUATION OF LANGUAGE IDENTIFICATION PERFORMANCE BASED ON TRAINING CORPUS B p.64
EXPERIMENT 3 EVALUATION OF LANGUAGE IDENTIFICATION PERFORMANCE BASED ON FIXED N-GRAM ORDERS(1-6) p.66
EXPERIMENT 4A EVALUATION OF LANGUAGE IDENTIFICATION PERFORMANCE BASED ON UNTWEAKED MIXED N-GRAM MODEL p.69
EXPERIMENT 4B EVALUATION OF LANGUAGE IDENTIFICATION PERFORMANCE BASED ON TWEAKED MIXED N-GRAM MODEL p.70
EXPERIMENT 5 EVALUATION ON CHARACTER SEQUENCE BASED HTML PARSER p.72
EXPERIMENT 6 EVALUATION ON BYTE-SEQUENCE BASED HTML PARSER p.73
EXPERIMENT 7 EVALUATION ON HTML CHARACTER ENTITIES CONVERTER p.74
EXPERIMENT 8 EVALUATION ON RANK-ORDER STATISTICS p.75
EXPERIMENT 9 EVALUATION ON ACCUMULATIVE FREQUENCIES ADDITION p.75
EXPERIMENT 10 EVALUATION ON DISTINCT N-GRAM BOOLEAN MATCHING p.76
LANGUAGE IDENTIFICATION RESULT BASED ON DIFFERENT WRITING SYSTEM p.80
CHAPTER9 CONCLUSION AND FUTURE WORKS p.81
APPENDIXE p.83
LIST OF LANGUAGE, SCRIPT, ENCODING AND FONT TRAINED IN TRAINING CORPUS p.83
LIST OF LANGUAGES AND THEIR FREQUENCY IN TRAINING CORPUS p.88
BIBLIOGRAPHY p.96
Language identification technology widely used in the domain of machine learning, text mining and information retrieval.
Many researchers have achieved excellent results on selected Latin-script-based languages.However,Past researchers often have the following limitation-(1)they were focus on Latin-scripts based languages;
(2)the performance of the algorithm is test on plain text only; and(3)many experiments utilized a same corpus for both training and validating.Besides,new challenges arise when language identification is applied to non-Latin-scripts based languages, especially for content in web page format.These new challenges include the need of effective methods to handle web pages with misleading or missing character encoding information and web pages with HTML character entities.
The objective of this research is to propose and evaluate the effectiveness of an N-gram based language identification algorithm that implemented three different types of language classifiers.The principal idea of using n-gram for language identification is that every language contains its own unique n-grams and tends to use certain n-grams more frequently than others,hence providing a clue about the language.The sub-objective included evaluation of the effectiveness of Universal Declaration of Human Rights and Biblical texts as a training corpus for language identification task; fo1low by the evaluation of performance and accuracy of fixed and mixed n-gram models on language identification task.
Moreover, two new heuristics were proposed in order to improve language identification performance on web pages, especially pages with non-Latin-Scripts based languages.Three N-gram based classifiers have been implemented to measure similarity between languages of training and target document.The first classifier is based on Rank-Order Statistics method, Which calculate the “distance” between each language; the second classifier is based on Accumulative N-gram Frequencies Addition method, Where language similarity is determine by frequency of matching n-grams; the last and newly introduced classifier is based on Distinct N-gram Boolean Matching method, Which value each n-gram evenly by its occurrence, not by its frequency. The algorithm constructs an n-grams distribution vector of byte sequences for each language’s profile during the training phase and Proclaims this vector the language model. During language identification phase, the algorithm constructs an n-grams distribution vector for target document in a similar way and then compares it against each trained language model. The language model that produces highest matching rate(when Accumulative N-gram Frequencies Addition method or Distinct N-gram Boolean Addition method is used as classifier)or smallest distance(when Rank-Order Statistics method is using)is selected as the identified language of the target document. Extension of the training corpus produced improved results.The overall performance of the algorithm was evaluated based on a written text corpus of 1,660 web pages, spanning 182 languages from Africa, the Americas, Asia, Europe and Oceania. Extension of the training corpus produced improved accuracy; overall correct classification rate of language identification was improved from 74.76% to 86.99%.
Although mixed n-gram model doesn’t improve the overall correct classification rate, it improved the overall performance of language identification by 20.39%less processing time and 5.25% 1ess memory space consumption.Improvement Was also achieved by using byte-sequence based HTML parser that is able to extract text from the web page,Without relying on character encoding information Overall correct classification rate of language identification improved from 86.99% to 89.94%% by using byte-sequence based HTML parser.Another improvement of the algorithm is a HTML character entity converter that is able to translate HTML character entities to Unicode characters.The converter further improved overall correct classification rate of language identification from 90.00% to 92.95%. Experimental result showed that, among the three classifiers, the newly introduced Distinct N-gram Boolean matching classifier achieved highest correct classification rate of 95.60%.
本論文は、「Automatic language identification for written text with byte-sequence-based N-gram algorithm(バイト列Nグラムアルゴリズムを用いたテキストの自動言語判別)」と題し、9章より構成されている。
第1章"Introduction"では、世界の言語、言語間における情報技術利用水準の格差(言語間デジタルデバイド)の現状、自動言語判別技術とは何かについて述べ、あわせて本研究の目的について述べる。
第2章は、"Related Worksと題し、既存研究をレビューする。言語自動判別技術に関する既存研究は、欧米言語一特にラテンアルファベットを用いる言語一についての研究がほとんどであること、ウェブ上の自然言語テキスト表現が持つ特殊性を考慮していないこと、判別可能な言語や文字の種類についても制約が大きいこと等が示される。
第3章"Data"と題し、本研究で用いる訓練用データと検証用データの詳細(データのソース、言語・文字・文字コードで見た種類数、地理的分布特性)について述べる。
第4章は、"Mixed N-Gram Order Model"と題し、本研究で取り組んだNグラムアルゴリズムの基本的なアルゴリズムとそのバリエーションが提示される。
第5章は、"Byte-Sequence Based HTML Parser"と題し、本研究において、バイト列ベースでNグラム判別を行う理由と、実装上の解決策が示される。
第6章は、"Unicode and HTML Character Entities"と題し、特に非ラテン文字を使用する言語ではウェブ上にHTML参照型の文字が多数存在するために言語判別に困難が伴うことが示され、これを解決するための実装上の解決策が示される。
第7章は、"N-Gram Based Classifier"と題し、本研究が提案する類似度計算アルゴリズムの特徴が既存手法との比較において提示される。
第8章は、"Experiments and Results"と題し、本論文の提案するアルゴリズムの判別精度の検証結果が示される。検証実験は第3章から第7章に述べた改善策ごとの寄与が明確となるように設計されている。最終的には、95%を超える判別精度が達成された。
第9章は、"Conclusion and Future Works"題し、全体の総括と今後の課題が行われる。自動言語判別技術は、データマイニング、情報検索、機械学習など応用範囲の広い基本技術である。ますます多言語化しつつあるウェブ上の各種アプリケーションにおいて、本研究によっては達成された判別技術は、広範囲にわたる言語/文字/文字コードのテキストを対象として実用的な精度での判別を実現するものであり、その価値はきわめて大きい。よって、本論文は工学上及び工業上貢献するところが大きく、博士(工学)の学位論文として十分な価値を有するものと認める。