Mining Textual Data
Introduction to Structured Document Clustering & ExtMiner
A lecture for TIES447 Data and Software Mining course, 18.11.2008.
Material
Recommended reading
- M. Nurminen. Tiedonlouhinta rakenteisista dokumenteista, master's thesis (in Finnish), University of Jyväskylä, dept. of Mathematical Information Technology, 2005. http://urn.fi/URN:NBN:fi:jyu-200594 .
Information retrieval, indexing & search in general
- R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 1999. http://www.ischool.berkeley.edu/~hearst/irbook/
- C. D. Manning, P. Raghavan, and H. Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
- D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey: Scatter/gather: a cluster-based approach to browsing large document collections. In N. Belkin, P. Ingwersen, A. M. Pejtersen, and E. A. Fox (eds.): Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329. ACM Press, 1992. http://doi.acm.org/10.1145/133160.133214
- G. Salton, A. Wong, and C. S. Yang: A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975. http://doi.acm.org/10.1145/361219.361220
Text & web mining in general
- S. Chakrabarti: Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2003. http://www.cse.iitb.ac.in/~soumen/mining-the-web/
- J. Dörre, P. Gerstl, and R. Seiffert: Text mining: finding nuggets in mountains of textual data. In U. Fayyad, S. Chaudhuri, and D. Madigan (eds.): Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 398–401. ACM Press, 1999. http://doi.acm.org/10.1145/312129.312299
- M. A. Hearst: Untangling text data mining. In 37th Annual Meeting of the Association for Computational Linguistics, pp. 3–10. Morgan Kaufmann, 1999. http://acl.ldc.upenn.edu/P/P99/P99-1001.pdf
- Y. Kodratoff: Knowledge discovery in texts: A definition and applications. In Z. W. Ras and A. Skowron (eds.): Foundations of Intelligent Systems, 11th International Symposium, ISMIS ’99, Proceedings, vol. 1609 of Lecture Notes in Computer Science, pp. 16–29. Springer, 1999. http://citeseer.ist.psu.edu/kodratoff99knowledge.html
Document clustering, similarity measures
- E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín: Searching in metric spaces. ACM Comput. Surv., 33(3):273–321, 2001. http://doi.acm.org/10.1145/502807.502808
- M. Cristo, P. Calado, E. S. de Moura, N. Ziviani, and B. Ribeiro-Neto: Link information as a similarity measure in web classification. In G. Goos, J. Hartmanis, and J. van Leeuwen (eds.): String Processing and Information Retrieval, vol. 2857 of Lecture Notes in Computer Science, pp. 43–55. Springer, 2003. http://www.springerlink.com/link.asp?id=aryan7c17m1b94ta
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu: A density-based algorithm for discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han, and U. Fayyad (eds.): Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), pp. 226–231. AAAI Press, 1996. http://www.cs.ualberta.ca/~joerg/papers/KDD-96_final.pdf
- D. S. Modha and W. S. Spangler: Clustering hypertext with applications to web searching. In F. M. Shipman, III, P. J. Nürnberg, and D. L. Hicks (eds.): Proceedings of the eleventh ACM on Hypertext and hypermedia, pp. 143–152. ACM Press, 2000. http://doi.acm.org/10.1145/336296.336351
- A. Schenker, M. Last, H. Bunke, and A. Kandel: Comparison of distance measures for graph-based clustering of documents. In E. R. Hancock and M. Vento (eds.): Proceedings of the Graph Based Representations in Pattern Recognition, 4th IAPR International Workshop (GbRPR 2003), vol. 2726 of Lecture Notes in Computer Science, pp. 202–213. Springer, 2003. http://springerlink.metapress.com/link.asp?id=bxkjxj1vt9mlxylx
- A. Strehl: Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. PhD thesis, The University of Texas at Austin, 2002. http://strehl.com/download/strehl-phd.pdf
- Y. Zhao and G. Karypis: Evaluation of hierarchical clustering algorithms for document datasets. In C. Nicholas, D. Grossman, K. Kalpakis, S. Qureshi, H. van Dissel, and L. Seligman (eds.): Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 515–524. ACM Press, 2002. http://doi.acm.org/10.1145/584792.584877
Structured indexing & search
- S. Brin and L. Page: The anatomy of a large-scale hypertextual web search engine. In P. H. Enslow, Jr. and A. Ellis (eds.): Proceedings of the Seventh International Conference on World Wide Web 7, pp. 107–117. Elsevier, 1998. http://dx.doi.org/10.1016/S0169-7552(98)00110-X
- E. A. Fox, G. L. Nunn, and W. C. Lee: Coefficients for combining concept classes in a collection. In Proc. of the 11th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 291–307. ACM Press, 1988. http://doi.acm.org/10.1145/62437.62465
- M. E. Frisse: Searching for information in a hypertext medical handbook. In J. B. Smith and F. Halasz (eds.): Proceedings of the ACM Conference on Hypertext, pp. 57–66. ACM Press, 1987. http://doi.acm.org/10.1145/317426.317433
- J. M. Kleinberg: Authoritative sources in a hyperlinked environment. In H. Karloff (ed.): Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 668–677. Society for Industrial and Applied Mathematics, 1998. http://portal.acm.org/citation.cfm?id=315045
- R. W. Luk, H. Leong, T. S. Dillon, A. T. Chan, W. B. Croft, and J. Allan: A survey in indexing and searching XML documents. Journal of the American Society for Information Science and Technology, 53(6):415–437, 2002. http://dx.doi.org/10.1002/asi.10056
- M. Nurminen, A. Honkaranta, and T. Kärkkäinen: ExtMiner: Combining Multiple Ranking and Clustering Algorithms for Structured Document Retrieval. In International workshop on Integrating Data Mining, Databases and Information Retrieval (IDDI'05), 16th International Workshop on Database and Expert Systems Applications. IEEE, 2005. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1508411
- F. Weigel: A survey of indexing techniques for semistructured documents. Project thesis, Institute of Computer Science, LMU, Munich, 2002. http://www.pms.ifi.lmu.de/publikationen/#PA_Felix.Weigel
- K. Yang: Combining Text-, Link-, and Classification-based Retrieval Methods to Enhance Information Discovery on the Web. PhD thesis, University of North Carolina, 2002. http://www.webir.org/resources/phd/Yang_2002.pdf