Methods of Analysis of Textual Data

Summary

The course covers the basic principles of analysing text documents, which are treated as a typical example of weakly structured data. It presents the individual areas of text processing for documents and web pages. Topics include algorithms for pattern matching in text, the design of index systems for text data, and the processing of the natural languages in which texts are written. Various approaches to searching text data, including latent semantic analysis, are also described. The course closes with web search.

Lecturer

doc. Mgr. Jiří Dvorský, Ph.D.

Lectures

Syllabus of lectures

Introduction to information retrieval systems (IRS). Short history of information retrieval systems. Differences between relational databases and textual databases. General model of an information retrieval system.
Stringology. Exact pattern matching algorithms. Single and multiple pattern matching algorithms. Regular expressions and finite automata. Approximate pattern matching algorithms.
Suffix trees. Directed acyclic word graphs. Patricia trees.
Primary text processing. Lexical analysis. Stemming. Stopwords.
Construction of index systems. Zipf's law and index system size estimation. Sorting-based indexing. Positional index systems. Term weighting methods. TF-IDF weighting scheme. Index system compression methods. Coding of natural numbers.
IRS query languages. Document relevance. Document-query similarity. Relevance vs. similarity. Structure and evaluation of queries. Boolean model of IRS. Evaluation of IRS (precision, recall, F-measure).
Signature methods. Sliced and layered signatures. Effective evaluation of signature queries.
Latent semantics. Dimension reduction methods. Matrix factorization methods. Random projection. Vector model of IRS. Construction and evaluation of vector queries. Extended Boolean IRS.
Web searching. Hypertext document analysis. PageRank, HITS. Metasearching and cooperative searching. Soft computing in IRS.
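
As an illustration of the evaluation measures named in the syllabus (precision, recall, F-measure), the following minimal sketch computes them for a single query. The function name `evaluate` and the document identifiers are hypothetical and not part of the course materials; the F-measure shown is the balanced variant (F1), which is an assumption, since the syllabus does not say which weighting is used.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f1) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # Balanced F-measure (F1): harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = evaluate(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents are found (recall 2/3).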

Project

You can choose a research or an implementation project.
The project has to be submitted using Dropbox. You do not need a Dropbox account to submit.
The project is submitted as a PDF document (research project) or a ZIP archive with source code (implementation project). The file name of the PDF document or ZIP archive should match your student ID (the so-called "login name" or simply "login"). For example, student James Bond with login bon007 submits a file named bon007.pdf or bon007.zip.
The deadline for project submission is May 19, 2024.

Research project - survey report on

Mathematical texts indexing and searching methods
Indexing methods for unusual data, e.g. music material
Index structures compression
Text document compression
Parallel text processing
Open source software for text processing
Libraries for dimensionality reduction (SVD, NMF etc.)
Compression methods based on context-free grammar

Implementation project topics

Implementation of, for example, an approximate pattern matching algorithm; see the PDF file Approx Pattern Matching.pdf.
Finite automata general library - implementation of NFA, DFA, conversion from NFA to DFA, automata serialization etc.
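
To give an idea of the scale of the first topic, here is a minimal sketch of approximate pattern matching with at most k errors, using the classic dynamic-programming formulation over edit distance (often attributed to Sellers). It is not taken from Approx Pattern Matching.pdf, which may prescribe a different algorithm; the function name `approx_find` and its return convention are assumptions made for illustration.

```python
def approx_find(pattern, text, k):
    """Return end positions in `text` where `pattern` matches with <= k edits.

    Positions are exclusive end indices, so text[:pos] ends with an
    approximate occurrence. Edits are insertions, deletions, substitutions.
    """
    m = len(pattern)
    # column[i] = minimum edit distance between pattern[:i] and some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text, start=1):
        prev_diag = column[0]  # value for (i-1, previous text position)
        column[0] = 0          # an occurrence may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(
                prev_diag + cost,   # match or substitution
                column[i] + 1,      # extra character in the text
                column[i - 1] + 1,  # missing pattern character
            )
            prev_diag = column[i]
            column[i] = new_value
        if column[m] <= k:
            ends.append(pos)
    return ends


if __name__ == "__main__":
    print(approx_find("survey", "surgery", 2))
```

The sketch runs in O(mn) time and O(m) memory for a pattern of length m and a text of length n. A project would typically go further, for example with bit-parallel or automaton-based variants.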

References

Manning C. D., Raghavan P., Schütze H.: Introduction to Information Retrieval, Cambridge University Press, 2008
Witten I. H., Moffat A., Bell T. C.: Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers Inc., 1999, ISBN 1-55860-570-3
Baeza-Yates R. A., Ribeiro-Neto B.: Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., 1999, ISBN 020139829X
Feldman R., Sanger J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, ISBN 978-0521836579
Berry M. W., Kogan J.: Text Mining: Applications and Theory, Wiley, 2010, ISBN 978-0470749821
Weiss S. M., Indurkhya N., Zhang T.: Fundamentals of Predictive Text Mining, Springer, 2010, ISBN 978-1849962254
Langville A. N., Meyer C. D.: Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006
Korfhage R. R.: Information Storage and Retrieval, John Wiley & Sons, 1997

Supporting materials

PowerPoint slides
New updated slides: presentation or handouts