Representations and Metrics in High-Dimensional Data Mining
The current "information age" presents numerous challenges in everyday life and work. One of the more serious is the sheer volume of information that must be dealt with, a consequence of rapid technological development and ubiquitous computerization. Although computers help us cope with the information we find relevant (for example, through e-mail, social networks, and search engines), they also accumulate enormous volumes of information in "raw" form, that is, in a form that is difficult to analyze, to use for purposes other than the one originally intended, or to learn from. The computer-science discipline of data mining enables the extraction of interesting patterns and useful knowledge from large volumes of raw data. This book deals with problems stemming from the large number of attributes in modern databases, often referred to as the "curse of dimensionality," and with their influence on different techniques and aspects of data mining.
CONTENTS
PREFACE, 5
I PRELIMINARIES, 7
1 INTRODUCTION, 9
1.1 Book Outline, 11
1.2 Contributions, 12
2 MACHINE LEARNING, DATA MINING, AND INFORMATION RETRIEVAL, 14
2.1 Data Representation, 17
2.2 Distance and Similarity Measures, 23
2.3 Classification, 27
2.4 Semi-Supervised Learning, 45
2.5 Clustering, 47
2.6 Outlier Detection, 55
2.7 Information Retrieval, 58
2.8 Dimensionality Reduction, 62
2.9 Summary, 72
II METRICS, 73
3 THE CONCENTRATION PHENOMENON, 75
3.1 Concentration of Distances, 75
3.2 Concentration of Cosine Similarity, 78
3.3 Proofs of Theorems 7 and 8, 81
4 THE HUBNESS PHENOMENON, 85
4.1 Related Work, 86
4.2 Observing Hubness, 87
4.3 Explaining Hubness, 90
4.4 Proof of Theorem 9, 95
4.5 Discussion, 104
5 HUBNESS AND MACHINE LEARNING, 112
5.1 Related Work, 112
5.2 Observing Hubness in Real Data, 113
5.3 Explaining Hubness in Real Data, 116
5.4 Hubs and Outliers, 117
5.5 Hubness and Dimensionality Reduction, 119
5.6 Impact of Hubness on Machine Learning, 120
5.7 Summary and Future Work, 136
6 HUBNESS AND TIME SERIES, 138
6.1 Related Work, 140
6.2 Observing Hubness in Time Series, 141
6.3 Explaining Hubness in Time Series, 141
6.4 Hubness and Dimensionality Reduction, 144
6.5 Impact of Hubness on Time-Series Classification, 146
6.6 Experimental Evaluation, 151
6.7 Summary and Future Work, 155
7 HUBNESS AND INFORMATION RETRIEVAL, 156
7.1 Observing Hubness in Text Data, 157
7.2 Explaining Hubness in Text Data, 160
7.3 Hubness and Dimensionality Reduction, 166
7.4 Impact of Hubness on Information Retrieval, 167
7.5 Summary and Future Work, 171
III DOCUMENT REPRESENTATION AND FEATURE SELECTION, 175
8 TERM WEIGHTING FOR TEXT CATEGORIZATION, 177
8.1 Related Work, 178
8.2 The Experimental Setup, 178
8.3 Results, 181
8.4 Summary and Future Work, 191
9 TERM WEIGHTING AND FEATURE SELECTION, 193
9.1 The Experimental Setup, 194
9.2 Results, 196
9.3 Summary and Future Work, 204
9.4 A Note on Hubness, Feature Selection, and Generation, 206
10 CONCLUSION, 209
A TERM WEIGHTING IN THE BOW REPRESENTATION, 213
A1 Term Weighting Without Stemming, 213
A2 Term Weighting With Stemming, 214
BIBLIOGRAPHY, 217
ABOUT THE AUTHOR, 235
O AUTORU (About the Author, in Serbian), 236
SAŽETAK (Summary, in Serbian), 237
Book details
Title: Representations and Metrics in High-Dimensional Data Mining
Publisher: Izdavačka knjižarnica Zorana Stojanovića
Pages: 243 (black and white)
Binding: paperback
Script: Latin
Format: 22.5 x 14 cm
Year of publication: 2011
ISBN: 978-86-7543-231-9