Methods of Analysis of Textual Data

English Lectures

Summary

The course covers the basic principles of analysing text documents, which are treated as a typical example of weakly structured data. It presents the individual areas of text data processing, for documents and web pages alike. Topics include algorithms for pattern matching in text, the design of index systems for text data, and work with the natural languages in which the texts are written. Various approaches to searching text data, including latent semantic analysis, are also described. The course closes with web search.

Lecturer

doc. Mgr. Jiří Dvorský, Ph.D.


Lectures

Syllabus of lectures

  1. Introduction to information retrieval systems (IRS). Short history of information retrieval systems. Differences between relational databases and textual databases. General model of an information retrieval system.
  2. Stringology. Exact pattern matching algorithms. Single and multiple pattern matching algorithms. Regular expressions and finite automata. Approximate pattern matching algorithms.
  3. Suffix trees. Directed acyclic word graphs. Patricia trees.
  4. Primary text processing. Lexical analysis. Stemming. Stopwords.
  5. Construction of index systems. Zipf's law and index system size estimation. Sorting-based indexing. Positional index systems. Term weighting methods. TF-IDF weighting scheme. Index system compression methods. Coding of natural numbers.
  6. IRS query languages. Document relevance. Document-query similarity. Relevance vs. similarity. Structure and evaluation of queries. Boolean model of IRS. Evaluation of IRS (precision, recall, F-measure).
  7. Signature methods. Sliced and layered signatures. Effective evaluation of signature queries.
  8. Latent semantics. Dimension reduction methods. Matrix factorization methods. Random projection. Vector model of IRS. Construction and evaluation of vector queries. Extended Boolean IRS.
  9. Web searching. Hypertext document analysis. PageRank, HITS. Metasearching and cooperative searching. Soft computing in IRS.
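
As an illustration of the evaluation measures named in lecture 6, the following is a minimal sketch in Python, assuming that both the retrieved set and the relevant set for a query are known as collections of document identifiers. The function name `evaluate` and the identifiers `d1`, `d2`, ... are hypothetical and are not taken from the course materials; the F-measure shown is the balanced variant (F1).

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f1) for a single query.

    retrieved: documents returned by the system
    relevant:  documents judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved

    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall (0 if both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = evaluate(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In the usage example, two of the four retrieved documents are relevant and two of the three relevant documents are found, so precision is 1/2 and recall is 2/3.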

Project

Research project - survey report on

  1. Mathematical texts indexing and searching methods
  2. Indexing methods for unusual data, e.g. music material
  3. Index structure compression
  4. Text document compression
  5. Parallel text processing
  6. Open source software for text processing
  7. Libraries for dimensionality reduction (SVD, NMF etc.)
  8. Compression methods based on context-free grammar

Implementation project topics

  1. Implementation of a selected algorithm, for example an approximate pattern matching algorithm; see the PDF file Approx Pattern Matching.pdf.
  2. General-purpose finite automata library: implementation of NFA and DFA, conversion from NFA to DFA, automata serialization, etc.
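
To give an idea of the scale of the first topic, here is a minimal sketch of approximate pattern matching by dynamic programming (the k-differences formulation usually attributed to Sellers). It is one standard technique and not necessarily the algorithm described in the referenced PDF file; the function name `approx_find` and its parameters are hypothetical.

```python
def approx_find(pattern, text, k):
    """Return the end positions (exclusive) in `text` at which some
    substring matches `pattern` with edit distance at most `k`."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]  # value of row i-1 in the previous column
        col[0] = 0          # a match may start at any text position
        for i in range(1, m + 1):
            cur = col[i]    # value of row i in the previous column
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         cur + 1,           # extra character in text
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_find("abc", "xxabcxx", 0))
    print(approx_find("abc", "xxabdxx", 1))
```

The sketch runs in O(mn) time and O(m) space for a pattern of length m and a text of length n. It reports only where matches end; recovering the start positions would require keeping more of the table.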

References

  1. Manning C. D., Raghavan P., Schütze H.: Introduction to Information Retrieval, Cambridge University Press, 2008
  2. Witten I. H., Moffat A., Bell T. C.: Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers Inc., 1999, ISBN 1-55860-570-3
  3. Baeza-Yates R. A., Ribeiro-Neto B.: Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., 1999, ISBN 020139829X
  4. Feldman R., Sanger J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, ISBN 978-0521836579
  5. Berry M. W., Kogan J.: Text Mining: Applications and Theory, Wiley, 2010, ISBN 978-0470749821
  6. Weiss S. M., Indurkhya N., Zhang T.: Fundamentals of Predictive Text Mining, Springer, 2010, ISBN 978-1849962254
  7. Langville A. N., Meyer C. D.: Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006
  8. Korfhage R. R.: Information Storage and Retrieval, John Wiley & Sons, 1997

Supporting materials