Exploratory Data Analysis (EDA)

EDA is at the heart of any data analysis. The EDA process consists of preprocessing, understanding the data, forming initial hypotheses, checking the basic assumptions of candidate models, and visualizing the data to find meaningful information. Proper EDA is an important step toward building a well-performing model. When communicated effectively through strong data storytelling, EDA can lead to impactful, data-driven decisions or policies.

Prerequisites                          : SFDS, DFDS

Objectives/Content               :

  1. Develop familiarity with Python for data preprocessing and visualization.
  2. Perform data preprocessing proficiently and efficiently, according to the data structures and advanced models that are going to be used.
  3. Import, export, and parse data from local files and databases.
  4. Apply numerical and visual summarization of data.
  5. Illustrate the importance of EDA before embarking on sophisticated model building.
  6. Properly use, interpret, and communicate basic statistics and visualizations.
  7. Create creative or customized visualizations according to the data themes and applications.
  8. Perform causality analysis from the data proficiently.
  9. Develop a guideline for doing proper EDA in the participant's own division.
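
Objectives 3 and 4 can be pictured with a minimal sketch. It assumes pandas is installed. The CSV text, column names, and values are all made up for illustration, and an in-memory string stands in for a local file or database table.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a local file; every value is made up.
csv_text = """region,channel,sales
North,online,120.0
North,store,95.5
South,online,130.0
South,store,80.0
East,online,110.5
"""

# Import and parse: read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Numerical summary of a numeric column (count, mean, std, quartiles, ...).
sales_summary = df["sales"].describe()

# Frequency distribution of a categorical column.
channel_counts = df["channel"].value_counts()

# Grouped summary: mean sales per region.
mean_by_region = df.groupby("region")["sales"].mean()

# Export: with no path given, to_csv returns the CSV text as a string.
exported = df.to_csv(index=False)

print(sales_summary)
print(channel_counts)
print(mean_by_region)
```

Replacing the `io.StringIO` object with a file path would read from disk instead; the summaries themselves would be computed the same way.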

References                              :

  1. Cox, V. (2017). Exploratory data analysis. In Translating Statistics to Make Decisions (pp. 47-74). Apress, Berkeley, CA.
  2. DuToit, S. H., Steyn, A. G. W., & Stumpf, R. H. (2012). Graphical exploratory data analysis. Springer Science & Business Media.
  3. Bock, H. H., & Diday, E. (Eds.). (2012). Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data. Springer Science & Business Media.
  4. Cleveland, W. S. (1993). Visualizing data. Hobart Press.
  5. Cleveland, W. S. (1994). The elements of graphing data. Hobart Press.
  6. Few, S. (2009). Now you see it. Analytics Press.
  7. Harris, R. L. (1999). Information graphics. Oxford University Press.
  8. Healy, K. (2018). Data visualization: A practical introduction. Princeton University Press.
  9. Knaflic, C. N. (2015). Storytelling with data. Wiley.
  10. Robbins, N. B. (2005). Creating more effective graphs. Wiley.
  11. Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Graphics Press, Cheshire, CT.
  12. Tufte, E. R. (1997). Visual explanations. Graphics Press, Cheshire, CT.
  13. Tufte, E. R. (2006). Beautiful evidence. Graphics Press, Cheshire, CT.
  14. Wainer, H. (2009). Picturing the uncertain world. Princeton University Press.
  15. Yau, N. (2013). Data points: Visualization that means something. Wiley.
  16. Huff, D. (1993). How to lie with statistics. WW Norton & Company.
  17. Reinhart, A. (2015). Statistics done wrong: The woefully complete guide. No Starch Press.

Topics (Topic ID, Topic Title, Lessons) :

  EDA1. Data Source and Understanding
    – Effect of different data sources on inference and interpretation
    – Data types (dos and don'ts)
    – Scales of measurement
    – Variable roles: descriptor, label, response, confounding, etc.
    – Frequency distributions

  EDA2. Missing Value Analysis
    – Variations of missing values
    – Data science models and missing values
    – Missing at random
    – Imputation methods

  EDA3. Outlier Analysis and Treatment
    – Univariate outlier detection
    – One-class classification
    – Multivariate outlier detection
    – Distribution-based outlier detection
    – Time-series-based outliers
    – Outlier treatment

  EDA4. EDA Grouped Discussions
    – Recap discussions of EDA2, EDA3, and EDA6

  EDA5. Data Visualization
    – Basic charts and best practices
    – Visualization with different modules (Matplotlib, seaborn, Bokeh, Plotly)
    – Visualization customization (advanced Matplotlib and DataFrame manipulation)
    – Infographics and visualization creativity
    – Visualization of high-dimensional data (manifolds, UMAP, t-SNE)

  EDA6. Statistical Fallacies
    – p-values
    – Underpowered statistics
    – Built-in sample bias
    – Regression to the mean
    – Red herrings
    – Robust statistics
    – Chart fallacies

  EDA7. EDA Best Practice
    – Introduction to efficient EDA
    – General EDA guideline
    – Pandas Profiling to boost EDA
    – Using interactivity and instant dashboards to speed up analysis
    – Examples of best (and substandard) EDA practices
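
As a taste of the EDA2 and EDA3 lessons, the following minimal sketch counts missing values, imputes them, and flags univariate outliers. It assumes pandas and NumPy are installed. The data are made up, and median imputation and the 1.5 × IQR rule are simply common choices picked for illustration, not methods prescribed by the course.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: two missing values and one extreme value (all made up).
df = pd.DataFrame({"value": [10.0, 12.0, np.nan, 11.0, 13.0, 95.0, np.nan, 14.0]})

# Missing value analysis: count missing entries per column.
missing_counts = df.isna().sum()

# Simple imputation: fill missing entries with the column median
# (median() skips NaN values by default).
median_value = df["value"].median()
df["value_imputed"] = df["value"].fillna(median_value)

# Univariate outlier detection with the 1.5 * IQR rule:
# anything outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] is flagged.
q1 = df["value_imputed"].quantile(0.25)
q3 = df["value_imputed"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["value_imputed"].between(lower, upper)

print(missing_counts)
print(df[df["is_outlier"]])
```

The order of the steps matters: imputing with a central value before computing the quartiles narrows the IQR, so the same data could yield different flags if outliers were detected first. Trade-offs of this kind are what the missing-value and outlier lessons are meant to examine.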