Data Mining IF

Applied Data Mining (ADM)

In the Industry 4.0 era, data volumes have been growing rapidly. This explosive growth of stored and transient data has created an urgent need for efficient and effective techniques that help transform the data into useful information, knowledge, and insights. Data mining has emerged as a multidisciplinary field that addresses this need.

This module discusses techniques for preprocessing data before mining, business understanding, hypothesis building, building optimal models, model evaluations and interpretations, and data generalization. It presents methods for mining frequent patterns, associations, and correlations. It also presents methods for data classification and prediction, data-clustering approaches, and outlier analysis.

Prerequisites: GLM, SCM, IMUL.

Objectives/Content:

  1. Be able to approach data mining as a process, by demonstrating competency in the use of CRISP-DM, the Cross-Industry Standard Process for Data Mining, including the business understanding phase, the data understanding phase, the data preparation phase, the modeling phase, the evaluation phase, and the deployment phase.
  2. Be proficient with data mining software/tools such as Python.
  3. Understand and apply a wide range of clustering, estimation, prediction, and classification algorithms, including k-means clustering, classification and regression trees, logistic regression, k-nearest neighbor, multiple regression, and neural networks.
  4. Understand and apply the most current data mining techniques and applications, such as text mining and social media analytics.
  5. Understand the mathematical statistics foundations of the algorithms outlined above.
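
As a concrete illustration of objective 3, the following is a minimal sketch of k-means clustering (Lloyd's algorithm) in plain Python. The function names `sq_dist` and `kmeans`, the deterministic farthest-point initialisation, and the toy points are assumptions made for this illustration and are not taken from the module material; in practice a library implementation would normally be used.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Cluster `points` (a non-empty list of numeric tuples) into k groups."""
    # Assumption: deterministic farthest-point initialisation, chosen so the
    # example is reproducible. Random initialisation is more common.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The fixed iteration count keeps the sketch short; a fuller version would stop once the labels no longer change.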

Evaluations/Assignments:

  1. At the end of the fundamental lessons in this module, trainees will be given a dataset and the metadata (story) behind it. The trainees then need to form teams and apply the data mining process to find as many important insights as possible from the data. The evaluation is based on the report and presentation of the findings. The case study can be taken from a real dataset from the trainee’s division/department or from another source such as Kaggle.
  2. Evaluation of the advanced data mining topics is based on the speedup, efficiency, depth of insight, and/or data creativity shown on a more challenging data problem, such as data that is high-dimensional, multimodal, or fine-grained.
  3. Online quizzes in the eLearning platform.

Deliveries:

  1. In the online module, basic applications and the best use cases of each model are given.
  2. Data challenge project, especially using datasets that are not yet optimally explored in each division.

References:

  1. Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann.
  2. Aggarwal, C. C. (2015). Data Mining: The Textbook. Springer.
  3. Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1997). Discovering Data Mining: From Concept to Implementation. IBM.
  4. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery. AI Magazine, 17, 37-54.
  5. Berry, M. J. A., & Linoff, G. S. (2004). Data Mining Techniques. Wiley Publishing, Inc., Indianapolis. xxiii + 615 pp.
  6. Hand, D., et al. (2001). Principles of Data Mining. MIT Press, Cambridge, Massachusetts. xxvii + 467 pp.
  7. Hornick, M. F., Marcade, E., & Venkayala, S. (2007). Java Data Mining: Strategy, Standard, and Practice. Morgan Kaufmann, San Francisco. xxi + 519 pp.
  8. Tang, Z., & MacLennan, J. (2005). Data Mining with SQL Server 2005. Wiley Publishing, Inc., Indianapolis. xvii + 435 pp.
  9. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  10. Yang, X. S. (2019). Introduction to Algorithms for Data Mining and Machine Learning. Academic Press.
  11. Simovici, D. (2018). Mathematical Analysis for Machine Learning and Data Mining. World Scientific Publishing Co., Inc.
  12. Zheng, A. (2015). Evaluating Machine Learning Models: A Beginner’s Guide to Key Concepts and Pitfalls.
  13. Mitchell, T. M. (1997). Machine Learning. Burr Ridge, IL: McGraw Hill.

Topic ID, Topic Title, and Lessons:

ADM1: Overview of Data Mining

  – Data mining process (CRISP, SEMMA, CCC, etc.)
  – Inductive bias
  – Statistical challenge in data mining: Bonferroni’s Principle
  – Data types and models
  – EDA reviewed
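
The statistical challenge listed under ADM1 can be illustrated with simple multiple-testing arithmetic: when many patterns are tested on the same data, some will look significant purely by chance. The sketch below shows the closely related Bonferroni correction, assuming a hypothetical scan of 100 candidate patterns that are all truly null, each tested at a significance level of 0.05. The function names and numbers are assumptions for illustration only.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of chance 'discoveries' when every null hypothesis is true."""
    return num_tests * alpha


def prob_any_false_positive(num_tests, alpha):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha):
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    return alpha / num_tests


num_tests, alpha = 100, 0.05  # hypothetical scan of 100 candidate patterns
print(expected_false_positives(num_tests, alpha))  # about 5 spurious findings
print(prob_any_false_positive(num_tests, alpha))   # above 0.99
print(bonferroni_threshold(num_tests, alpha))      # about 0.0005 per test
```

If the number of findings expected by chance is comparable to the number of findings actually reported, most of the reported findings are likely to be spurious.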

ADM2: Predictive Data Mining

  – Focus on case studies (applications) (Classification, Clustering, Time Series, etc.)
  – Data Mining in Marketing and Customer Relationship Management
  – Hazard Functions and Survival Analysis in Marketing and Health
  – Data Mining throughout the Customer Life Cycle

ADM5: Introduction to Recommendation Models

  – Association Rules (Market Basket Analysis)
  – Efficient Frequent Mining
  – Evaluation metrics for recommendation engines: Recall & Precision, RMSE, Mean Reciprocal Rank, MAP at k, NDCG
  – Implicit-Explicit ratings
  – Content Based
  – Collaborative and memory model
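
The association rules item under ADM5 rests on three standard measures: support, confidence, and lift. The following is a minimal sketch under stated assumptions: the basket data and the function names are hypothetical, and the antecedent and consequent are assumed to occur at least once so that no division by zero arises.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency (1 = independent)."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

rule = ({"diapers"}, {"beer"})
print(support(transactions, rule[0] | rule[1]))  # 3 of 5 baskets
print(confidence(transactions, *rule))           # 3 of the 4 diaper baskets
print(lift(transactions, *rule))                 # greater than 1: positive association
```

Scanning every itemset this way is exponential in the number of items, which is why efficient frequent-itemset mining appears as its own lesson.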

ADM6: Feature Engineering

  – Feature Selection
  – Feature Extraction
  – Feature engineering on unstructured data

ADM7: Web Mining

  – Introduction to Log Analytics
  – Anomaly detection
  – Intrusion Detection
  – Fraud Detection
  – Web mining to improve search function
  – Landing page optimization
  – User behaviour prediction

ADM9: Advanced Recommendation Models

  – Ensemble recommendation
  – Learning to Rank
  – Contextual Bandit
  – Cold Start recommendation
  – Cascade recommendation models
  – Deep Generative models

ADM10: Mining on Time Data
* Special topic for trainees who have already taken TSA and SDA

  – Trend Analysis
  – Outlier/anomaly detection
  – Concept Drift**
  – Temporal clustering**

ADM11: Link Analysis

  – Basic Graph Theory
  – Kleinberg Algorithm
  – Finding Hubs and Authorities
  – Case study on Link Analysis (customer segmentation)
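
The hubs-and-authorities idea listed under ADM11 (Kleinberg's HITS algorithm) can be sketched as a short power iteration. This is a minimal sketch, not a reference implementation: the function name `hits`, the edge-list representation, the iteration count, and the toy graph are all assumptions made for illustration.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A node is a good authority if good hubs point to it.
        auth = {n: sum(hub[src] for src, dst in edges if dst == n) for n in nodes}
        # A node is a good hub if it points to good authorities.
        hub = {n: sum(auth[dst] for src, dst in edges if src == n) for n in nodes}
        # Rescale both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical link graph: A links to C and D, B links to C.
edges = [("A", "C"), ("B", "C"), ("A", "D")]
hub, auth = hits(edges)
print(max(auth, key=auth.get))  # C, the node with the most incoming hub weight
print(max(hub, key=hub.get))    # A, the node linking to both authorities
```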

ADM12: Privacy-Preserving Data Mining

  – Introduction to Privacy Preserving Models
  – Privacy during Data Collection
  – Reconstructing Aggregate Distributions
  – Leveraging Aggregate Distributions for Data Mining
  – Privacy-Preserving Data Publishing
  – The k-anonymity Model
  – Samarati’s Algorithm & Incognito
  – Mondrian Multidimensional k-Anonymity
  – Synthetic Data Generation: Condensation-based Approach
  – The ℓ-diversity Model
  – The t-closeness Model
  – Output Privacy
  – Distributed Privacy
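
For the k-anonymity model listed under ADM12: a table is k-anonymous with respect to a set of quasi-identifier attributes if every combination of their values appears in at least k records. The sketch below measures that level for a small table. The function name `anonymity_level`, the column names, and the records are hypothetical, and the table is assumed to be non-empty.

```python
from collections import Counter


def anonymity_level(records, quasi_identifiers):
    """Largest k for which `records` is k-anonymous on the given attributes.

    This equals the size of the smallest group of records sharing the same
    quasi-identifier values.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Hypothetical, already generalised table (age ranges, masked postal codes).
records = [
    {"age": "20-29", "zip": "476**", "disease": "flu"},
    {"age": "20-29", "zip": "476**", "disease": "cold"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
    {"age": "30-39", "zip": "479**", "disease": "asthma"},
    {"age": "30-39", "zip": "479**", "disease": "flu"},
]

k = anonymity_level(records, ["age", "zip"])
print(k)  # the smallest (age, zip) group has two records
```

A check like this only measures the level. Reaching a target k requires generalising or suppressing values, which is what the algorithms named in this topic address.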
