Learning from data usually starts by investigating whether dependencies exist between the attributes (variables or features) in the data. For example, users might want to learn how a form of cancer can be diagnosed from certain attributes in the data set. In this case, an attribute that labels each data point with the diagnosis is usually available in the data. This process is normally referred to as classification, or supervised learning. In other cases, however, users do not know the relationships between the attributes. They want to learn the underlying grouping structure without any prior knowledge, that is, without a reference against which the resulting clusters can be assessed. In other words, the data are available to them only in unlabeled form. This type of learning is referred to as unsupervised learning, or clustering.
Finding partitional information in the data (i.e., clustering analysis) is generally equated with finding the natural groupings (or structures) in a collection of entities or objects. The clustering results can then be used to understand how the data are grouped, and they are expected to be meaningful or to carry valuable information. In some cases, the clustering result is used to produce a visualization that gives an overview of the information in the data. Clustering analysis plays an important role in understanding data and is one of the most important tools in big data analytics.
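As a minimal sketch of what finding natural groupings in unlabeled data can look like, the pure-Python example below implements Lloyd's k-means algorithm, one of the centroid-based methods listed in the module topics. The function names (`kmeans`, `farthest_first`), the toy points, and the deterministic farthest-first initialization are illustrative assumptions rather than part of the course material. In practice a library implementation would normally be preferred.

```python
import math


def farthest_first(points, k):
    """Assumed deterministic init: start at points[0], then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on a list of equal-length tuples.

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to points[i]. No class labels are used at any step.
    """
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        # Stop once neither the assignment nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return centroids, labels


# Toy, unlabeled 2-D data with two visually obvious groups (made up for illustration).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The algorithm only ever sees the coordinates. The grouping it returns has no inherent meaning until it is evaluated and interpreted, which is why evaluation and interpretation receive so much attention in this module.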
This module discusses various approaches to clustering. The discussion centers on evaluation, use cases, and interpretation. Efficient implementation in real-world (production-level) scenarios is also discussed, as is clustering as part of hybrid methods or as one of several tools in data analysis. Going beyond clustering, the module also covers dimension reduction techniques for special purposes in data analysis.
Prerequisites: MFDS, SFDS, ADSP, DFDS
- Use clustering as part of exploratory data analysis to gain deeper insight into the data.
- Correctly evaluate and interpret clustering results.
- Determine how and when to apply different methods of clustering analysis.
- Find latent structure or information in the data via clustering analysis.
- Use clustering analysis or dimension reduction techniques as tools for other models or complex visualizations.
- Aggarwal, C.C.: Data Mining: The Textbook. Springer (2015).
- Everitt, B., et al.: Cluster Analysis. Wiley Series in Probability and Statistics. Wiley, Chichester (2011).
- Berkhin, P.: A survey of clustering data mining techniques. In: Grouping Multidimensional Data: Recent Advances in Clustering (2006). https://doi.org/10.1007/3-540-28349-8_2.
- Aggarwal, C.C., Reddy, C.K.: Data Clustering: Algorithms and Applications. (2013).
- Basu, S., Davidson, I., Wagstaff, K.L.: Constrained clustering: Advances in algorithms, theory, and applications. (2008).
- Haroon, D.: Python Machine Learning Case Studies. (2017). https://doi.org/10.1007/978-1-4842-2823-4.
- Myatt, G.J., Johnson, W.P.: Making Sense of Data I & II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications. (2008). https://doi.org/10.1002/9780470417409.
- Bonaccorso, G.: Hands-On Unsupervised Learning with Python: Implement machine learning and deep learning models using Scikit-Learn, TensorFlow, and more. Packt Publishing (2019).
- Raschka, S., Julian, D., Hearty, J.: Python: Deeper Insights into Machine Learning. Packt Publishing (2016).
|Topic ID|Topic Title|Lessons|
|---|---|---|
|ULIM1|Introduction to Clustering Analysis 1|– Introduction to clustering (segmentation): what is clustering?|
| | |– Various objectives and applications of clustering|
| | |– Feature selection for clustering|
| | |– Evaluation (internal, external, and applications)|
| | |– Elbow method (selecting the number of clusters)|
| | |– Centroid-based clustering (k-Means/k-Medoids, k-Means++, mini-batch k-Means, MeanShift, etc.)|
| | |– Hierarchical clustering (discussion of various linkages, Ward's method, etc.)|
| | |– Visualization and interpretation: some case studies|
|ULIM2|Introduction to Clustering Analysis 2|– Density-based clustering: DBSCAN, OPTICS, DENCLUE, etc.|
| | |– Distribution-based clustering: EM, SOM, etc.|
| | |– Spectral clustering|
| | |– BIRCH and other tree-based clustering|
| | |– Selecting clustering method(s)|
|ULIM3|ULIM Grouped Discussions|– Recap discussions of ULIM1 and ULIM2|
| | |– Focus on building a sequence of preprocessing, hypothesis building, modelling, evaluation, and interpretation|
| | |– Include reporting|
|ULIM4|Dimensional Reduction Methods|– SVD: Singular Value Decomposition|
| | |– Random Indexing|
| | |– PCA & FA: Principal Component Analysis and Factor Analysis|
| | |– Multidimensional Scaling|
| | |– t-SNE & UMAP|
|ULIM5|Soft Clustering|– LSA: Latent Semantic Analysis (SVD)|
| | |– NMF: Non-Negative Matrix Factorization|
| | |– LDA: Latent Dirichlet Allocation|
| | |– Network-based models|
|ULIM6|Semi-Supervised Clustering|– Introduction to semi-supervised clustering|
| | |– Constraint-based SSC (must-link, not-link, family-link, etc.)|
| | |– Similarity-based SSC|
| | |– Hybrid methods for SSC|
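
To illustrate the internal evaluation topic listed under ULIM1, the sketch below computes the mean silhouette coefficient of a clustering from nothing but the points and their assigned cluster labels. The function name `silhouette_score`, the toy points, and the two candidate labelings are assumptions made for this example. The score for each point is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (made up): two compact groups, scored under two candidate labelings.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
separated = [0, 0, 0, 1, 1, 1]   # labels follow the visible groups
mixed = [0, 1, 0, 1, 0, 1]       # labels cut across the groups

print(silhouette_score(points, separated))
print(silhouette_score(points, mixed))
```

Values lie between -1 and 1, and the labeling that follows the visible groups scores higher than the one that cuts across them. Because the measure needs no ground-truth labels, it is an internal criterion. External evaluation, by contrast, compares a clustering against known reference labels.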