
Natural Language Processing and Text Mining (LPTM)

Most big data is unstructured, and among the many unstructured formats, text arguably holds the most potential for valuable insights. Analyzing text poses challenges beyond those handled by conventional machine learning models. Text mining draws on language rules and grammar (Natural Language Processing, NLP), text-specific preprocessing such as n-grams, stop-word filtering, and spell checking or correction, and language-specific models, including recent embedding models and deep learning.
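The preprocessing steps named above (tokenization, stop-word filtering, n-grams) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the stop-word list, the function names, and the sample sentence are assumptions made for the example, and a course exercise would more likely rely on a library such as NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence.
text = "The analysis of text data is a challenge in the mining of big data."
tokens = remove_stop_words(tokenize(text))
bigrams = ngrams(tokens, 2)

print(tokens)
print(bigrams)
```

Filtering stop words before building n-grams, as done here, is one possible ordering; building n-grams first keeps phrases such as "mining of big data" intact, so the choice depends on the task.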

Prerequisites: SCM, IMUL, ADM

Objectives/Content:

  1. Introduce comprehensive preprocessing techniques for text data, including tokenization, normalization, filtering, and autocorrection.
  2. Explore basic text analytics as a way to perform simple exploratory analysis of text data.
  3. Introduce text mining methods that produce deep insights from text data or power a text analytics engine for production use.
  4. Introduce Natural Language Processing.

References:

  1. Farzindar, A., & Inkpen, D. (2017). Natural language processing for social media. Synthesis Lectures on Human Language Technologies, 10(2), 1-195.
  2. Kao, A., & Poteet, S. R. (Eds.). (2007). Natural language processing and text mining. Springer Science & Business Media.
  3. Perkins, J. (2014). Python 3 Text Processing with NLTK 3 Cookbook. Packt Publishing Ltd.
  4. http://www.nltk.org/book/

Topic ID | Topic Title | Lessons
LPTM1 | Introduction to Natural Language Processing | Tokenization; Stemming and lemmatization; Spellcheck; Stopword filtering; WordNet; Named Entity Recognition; Tree Parser, PoS tagger
LPTM2 | Introduction to Text Mining 1 | Vector Space Model (TF-IDF, BM25, etc.); (Soft) Clustering – Topic Modelling; Document Classification; Sentiment Analysis; Document Summarization & Recommendation
LPTM3 | Text Mining via Deep Learning | Introduction to Deep Learning; Word Embedding (Word2Vec & FastText); LSTM; CNN; Multiclass & multilabel problems (soft classification); BERT
LPTM4 | Social Media Analytics | Crawling, Streaming, Scraping; Building the network (graph) + Visualizations; Centrality Analysis; Graph Partitioning; Community Detection; Combining Social Network Analysis with Text Analytics
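
As a small illustration of the Centrality Analysis lesson in LPTM4, the sketch below computes degree centrality for a toy "who mentions whom" network. The account names and edges are invented for the example, and the helper `degree_centrality` is a hypothetical function written here, not taken from the course materials; a graph library would normally be used in practice.

```python
from collections import defaultdict

# Hypothetical mention edges between made-up accounts.
edges = [("ana", "budi"), ("ana", "citra"), ("budi", "citra"), ("citra", "dewi")]

def degree_centrality(edges):
    """Return each node's degree divided by (n - 1), treating edges as undirected."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    n = len(neighbors)
    if n <= 1:
        # With fewer than two nodes the normalization is undefined; report zeros.
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

centrality = degree_centrality(edges)

# List accounts from most to least connected.
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.2f}")
```

In this toy network "citra" is linked to every other account and so receives the maximum score of 1.0. Scores like these could then be combined with text analytics, for example by examining what the most central accounts write about.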