Summary
The course covers the basic principles of text document analysis. Text documents are treated as a typical example of weakly structured data. The course presents the individual areas of text data processing, covering both documents and web pages. Topics include algorithms for pattern matching in text, the design of index systems for text data, and processing of the natural languages in which texts are written. Various approaches to searching text data, including latent semantic analysis, are also described. The course concludes with web search.
Lecturer
doc. Mgr. Jiří Dvorský, Ph.D.
Lectures
Syllabus of lectures
- Introduction to information retrieval systems (IRS). Short history of information retrieval systems. Differences between relational databases and textual databases. General model of an information retrieval system.
- Stringology. Exact pattern matching algorithms. Single and multiple pattern matching algorithms. Regular expressions and finite automata. Approximate pattern matching algorithms.
- Suffix trees. Directed acyclic word graphs. Patricia trees.
- Primary text processing. Lexical analysis. Stemming. Stopwords.
- Construction of index systems. Zipf's law and index system size estimation. Sorting-based indexing. Positional index systems. Term weighting methods. TF-IDF weighting scheme. Index system compression methods. Coding of natural numbers.
- IRS query languages. Document relevance. Document-query similarity. Relevance vs. similarity. Structure and evaluation of queries. Boolean model of IRS. Evaluation of IRS (precision, recall, F-measure).
- Signature methods. Sliced and layered signatures. Effective evaluation of signature queries.
- Latent semantics. Dimension reduction methods. Matrix factorization methods. Random projection. Vector model of IRS. Construction and evaluation of vector queries. Extended Boolean IRS.
- Web searching. Hypertext document analysis. PageRank and HITS. Metasearching and cooperative searching. Soft computing in IRS.
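As an illustration of the TF-IDF weighting scheme named in the syllabus, the following is a minimal sketch only. It assumes raw term frequency, idf = log(N / df), and whitespace tokenization; the function name `tf_idf` and the sample documents are made up for this example, and the lectures may use a different variant of the scheme.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumption: weight = tf * log(N / df), where tf is the raw count of the
    term in the document, N is the number of documents and df is the number
    of documents containing the term. Other TF-IDF variants exist.
    """
    n_docs = len(documents)
    # Naive lexical analysis: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical three-document collection.
docs = ["the cat sat", "the dog sat", "the cat ran"]
w = tf_idf(docs)
```

In this sketch a term that occurs in every document (here "the") receives weight 0, while a term that occurs in a single document (here "dog") receives the highest weight.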
Project
- You can choose a research or an implementation project.
- The project has to be submitted via Dropbox. You do not need a Dropbox account.
- The project is submitted as a PDF document (research project) or a ZIP archive with source code (implementation project). The name of the PDF document or ZIP archive should match your student ID (the so-called "login name" or simply "login"). For example, if student James Bond has the login bon007, the file will be named bon007.pdf or bon007.zip.
- The deadline for project submission is May 19, 2024.
Research project - survey report on
- Indexing and searching methods for mathematical texts
- Indexing methods for unusual data, e.g. music material
- Compression of index structures
- Text document compression
- Parallel text processing
- Open source software for text processing
- Libraries for dimensionality reduction (SVD, NMF etc.)
- Compression methods based on context-free grammar
Implementation project topics
- Implementation of an algorithm, for example approximate pattern matching; see the PDF file Approx Pattern Matching.pdf.
- General finite automata library: implementation of NFA and DFA, conversion from NFA to DFA, automata serialization, etc.
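To give an idea of the scope of the finite automata topic, the NFA to DFA conversion can be sketched with the standard subset construction. This is a hedged sketch, not a reference solution: it assumes an NFA without epsilon transitions, a transition table stored as a dictionary, and the function names `nfa_to_dfa` and `dfa_accepts` are hypothetical choices made for this example.

```python
from collections import deque


def nfa_to_dfa(alphabet, delta, start, accepting):
    """Subset construction for an NFA without epsilon transitions.

    delta maps (state, symbol) -> set of next NFA states.
    Each DFA state is a frozenset of NFA states.
    Returns (dfa_states, dfa_delta, dfa_start, dfa_accepting).
    """
    dfa_start = frozenset([start])
    dfa_delta = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of the NFA moves from every state in the current subset.
            target = frozenset(
                q for s in current for q in delta.get((s, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A subset is accepting if it contains at least one accepting NFA state.
    dfa_accepting = {subset for subset in seen if subset & set(accepting)}
    return seen, dfa_delta, dfa_start, dfa_accepting


def dfa_accepts(dfa_delta, dfa_start, dfa_accepting, word):
    """Run the DFA on word and report whether it ends in an accepting state."""
    state = dfa_start
    for ch in word:
        state = dfa_delta[(state, ch)]
    return state in dfa_accepting


# Hypothetical NFA over {a, b} accepting strings that end in "ab".
nfa_delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
states, d, s0, acc = nfa_to_dfa("ab", nfa_delta, 0, {2})
```

Only subsets reachable from the start state are generated, which in practice usually keeps the DFA much smaller than the worst-case 2^n states. A full library would also need epsilon closures, DFA minimization and serialization.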
References
- Manning, C. D.; Raghavan, P. & Schütze, H.: Introduction to Information Retrieval, Cambridge University Press, 2008.
- Witten I. H., Moffat A., Bell T. C.: Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers Inc., 1999, ISBN 1-55860-570-3
- Baeza-Yates R. A., Ribeiro-Neto B.: Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., 1999, ISBN 020139829X
- Feldman R., Sanger J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, ISBN 978-0521836579
- Berry M. W., Kogan J.: Text Mining: Applications and Theory, Wiley, 2010, ISBN 978-0470749821
- Weiss S. M., Indurkhya N., Zhang T.: Fundamentals of Predictive Text Mining, Springer, 2010, ISBN 978-1849962254
- Langville, A. N. & Meyer, C. D.: Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006
- Korfhage, R. R.: Information Storage and Retrieval, John Wiley & Sons, 1997
Supporting materials