Statistical Foundation for Data Science (SFDS)

The SFDS module introduces statistical techniques from the very basics and shows how each one is applied to a real-world data set to analyze it and draw conclusions. Statistical methods are the backbone of Data Science, used to “understand, analyze and predict actual phenomena”. Machine learning likewise draws on techniques and theory from statistics and probability.

Prerequisites: some MFDS & DSBD

Objectives/Content:

  1. Use statistical software (Python) to summarize data numerically and visually, and to perform data analysis.
  2. Develop a conceptual understanding of the unified nature of statistical inference.
  3. Apply estimation and testing methods (confidence intervals and hypothesis tests) to analyze single variables and the relationship between two variables in order to understand natural phenomena and make data-based decisions.
  4. Model and investigate relationships between two or more variables within a regression framework.
  5. Interpret results correctly, effectively, and in context without relying on statistical jargon.
  6. Critique data-based claims and evaluate data-based decisions.
  7. Understand some key differences when applying basic statistics to large data sets.
  8. Complete a research project that employs simple statistical inference and modeling techniques.
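
As a rough illustration of objectives 1 and 3, the sketch below computes a normal-approximation 95% confidence interval and a two-sided z-test for a sample mean using only Python's standard library. The data values, the null value of 5.0, and the function names are hypothetical and chosen for illustration; for a sample this small, a t-based interval would usually be preferred.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    se = statistics.stdev(sample) / math.sqrt(n)
    # Critical value of the standard normal for the requested level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se


def z_test_mean(sample, mu0):
    """Two-sided z-test of H0: the population mean equals mu0."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - mu0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical measurements, made up for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.0, 5.7, 5.4, 5.2, 5.5]
low, high = mean_confidence_interval(sample)
z, p = z_test_mean(sample, mu0=5.0)
print(f"95% CI: ({low:.3f}, {high:.3f}); z = {z:.3f}, p = {p:.4f}")
```

The interval and the test use the same standard error, so the test rejects at the 5% level exactly when 5.0 falls outside the 95% interval.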

References:

  1. Montgomery, D.C., Peck, E.A. and Vining, G.G., 2012. Introduction to Linear Regression Analysis (Vol. 821). John Wiley & Sons.
  2. Agresti, A., 2019. An Introduction to Categorical Data Analysis, 3rd ed. Wiley.
  3. Upton, G.J.G., 2017. Categorical Data Analysis by Example. Wiley.
  4. Good, P.I., 2013. Introduction to Statistics Through Resampling Methods and R, 2nd ed. Wiley.
  5. Good, P.I. and Hardin, J.W., 2012. Common Errors in Statistics (and How to Avoid Them). Wiley.
Topic ID | Topic Title | Lessons
SFDS1 | Data Types | – Categorical (nominal, ordinal) and Numerical (interval, ratio)
– Dos and Don’ts on variable types
– Time series, spatial, sequential, etc.
– Rules of thumb (baselining) for time series data
– Unstructured data and simple data representations.
– Panel data
SFDS2 | Descriptive Statistics and Basic Visualizations | – Central tendency, dispersion
– Basic visualizations
– Data transformation
– Skewness and kurtosis
– Entropy, cross-tabulation, correlation
– Robust statistics
SFDS4 | Basic Probability and Empirical Distributions | – Sample and population
– Perspective data
– Discrete probability mass functions (uniform, Poisson, binomial, etc.)
– Continuous probability density functions (uniform, Gaussian, exponential, etc.)
– Simulations
– Test of normality & heteroscedasticity (univariate – Normal Distribution)
– Empirical distribution and kernel density estimation
SFDS5 | Statistical Inference 1 | – Sampling distribution
– Confidence Interval
– Bias and Variance, and the Cramér-Rao bound.
– Hypothesis Testing (Goodness of fit testing)
– Correlation and Basic (logistic) regression
– Method of moments and least-squares estimation (LSE); distribution tests such as the Kolmogorov-Smirnov test
– Central Limit Theorem, correlation vs. causation, the hot-hand phenomenon
– Fisher & Neyman-Pearson paradigms, and flaws in NHST
SFDS6 | SFDS Grouped Discussions | – Recap discussions of SFDS4 & 5
SFDS7 | Statistical Inference 2 | – (various) Correlation analysis
– Simple & Multiple regression
– Regression assumptions (heteroscedasticity, normality, independence, multicollinearity)
– Robust Regression
– Logistic regression
– Bayesian thinking
SFDS8 | Categorical Data Analysis | – Encoding-Decoding (dummy, one-hot, etc.)
– Contingency tables
– Proportions: testing and power
– Categorical distance and correlations
– Ordinal data analysis
SFDS9 | Fundamental Statistics for Big Data | – Correlation and significance testing in large data
– Visualization in large and high-dimensional data
– Basic inference in big data
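
One key difference in large data is that significance tests flag even negligible effects. The sketch below is a minimal illustration, assuming the standard Fisher z-transformation test of zero correlation; the correlation value 0.01, the sample sizes, and the function name are hypothetical.

```python
import math


def correlation_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z-transformation."""
    # atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc keeps precision for very small p-values: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


r = 0.01  # hypothetical, practically negligible correlation
for n in (1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,}  p = {correlation_p_value(r, n):.3g}")
```

The same correlation moves from clearly non-significant to overwhelmingly significant as n grows, which is why effect sizes matter more than p-values at this scale.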

SFDS10 | Statistical Learning Theory | – Learning Problems
– Empirical Risk Minimization
– Consistency of the learning process
– Bounds on the rate of convergence of the learning process (VC dimension)
– Structural risk minimization
– Hyperparameter Optimization
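
As a small illustration of empirical risk minimization from SFDS10, the sketch below picks the one-dimensional threshold rule with the lowest empirical 0-1 risk. The toy data, the hypothesis class ("predict 1 if x >= threshold"), and the function names are assumptions made for illustration, not course material.

```python
def empirical_risk(threshold, xs, ys):
    """Fraction of points misclassified by the rule: predict 1 if x >= threshold."""
    errors = sum((x >= threshold) != bool(y) for x, y in zip(xs, ys))
    return errors / len(xs)


def erm_threshold(xs, ys):
    """Return the candidate threshold with the smallest empirical 0-1 risk."""
    # Candidates: every observed x, plus +infinity (predict 0 everywhere).
    candidates = sorted(set(xs)) + [float("inf")]
    return min(candidates, key=lambda t: empirical_risk(t, xs, ys))


# Hypothetical toy data with one noisy label (x = 2.5 is labelled 0).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

best = erm_threshold(xs, ys)
print(best, empirical_risk(best, xs, ys))
```

This class of one-sided threshold rules has VC dimension 1, so it is a convenient smallest case for the convergence bounds listed above.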
