MAT-32806 Statistics for Data Scientists


Credits 6.00

Course coordinator(s): prof. dr. FA van Eeuwijk, ir. SLGE Burgers
Lecturer(s): dr. B Engel, dr. JA Hageman

Language of instruction:


Assumed knowledge on:

MAT-20306 Advanced Statistics or MAT-22306 Quantitative Research Methodology and Statistics or MAT-24306 Advanced Statistics for Nutritionists


In many areas of biological, environmental and social science, new tools and strategies are being developed to measure multiple features of subjects and objects of interest. These new types of data typically occur in large volumes, are high-dimensional and occur at various levels in a hierarchy of data types. For example, in genetics (humans, animals, plants) data can be available at the levels of DNA, RNA, proteins, metabolites and all kinds of phenotypes. In food science, the effects of diet and lifestyle variables can be investigated with respect to physical and mental indicators of performance and well-being. In social science, indicators of economic success can be studied in relation to education, socio-economic variables, psychological and lifestyle variables, and social media behavior. These new data require new techniques for analysis and visualization, which are provided by a new science that combines elements of statistics, mathematics, computer science and substantive knowledge: Data Science.

Statistics takes a central place in Data Science as it offers a general framework for model building, inference and evaluation in a wide range of data science applications. Statistics provides strategies for evaluating the reliability of the outcomes of data analyses, even when these results are obtained by techniques outside the classical statistical domain of, e.g., regression and analysis of variance. Modern statistics offers powerful techniques for the analysis of contemporary data, such as penalized regressions (e.g. ridge, lasso, elastic net) and Bayesian hierarchical methods (e.g. horseshoe) for addressing high-dimensional data.

The main objective of the course Statistics for Data Scientists is to develop the skills to build and evaluate modelling strategies for big and/or high-dimensional data (including spatial and temporal data) as they occur in the application areas relevant to the Wageningen disciplines. Case studies will serve to illustrate strategies for model building and evaluation and to operationalize the concepts of observational versus experimental data, causal and network modelling, dimension reduction, sparsity, hierarchical modelling, and penalization.

Models, model classes and their mutual connections that will be presented and applied include: mixed modelling, generalized linear and additive modelling, Bayesian modelling, graphical modelling, lasso, ridge, elastic net, and support vector machines. All analyses will be performed with the statistical programming language R.

Learning outcomes:

After successful completion of this course students are expected to be able to:

- explain and compare a broad range of modern statistical methods in data science

- select an appropriate data analysis method based on the characteristics of the data

- apply data analysis methods for data science (in R)

- evaluate the reliability of the outcomes of an analysis

- interpret, visualize and communicate results from data analysis to a multidisciplinary data science team


Examination:

Written exam and an assignment


To be announced

Restricted Optional for:

MBI  Biology                         MSc  5AF
MFN  Forest and Nature Conservation  MSc  5AF
MPS  Plant Sciences                  MSc  5AF
MPB  Plant Biotechnology             MSc  5AF
MNH  Nutrition and Health            MSc  A: Nutritional and Public Health Epidemiology  5AF
MNH  Nutrition and Health            MSc  C: Molecular Nutrition and Toxicology  5AF
MNH  Nutrition and Health            MSc  D: Sensory Science  5AF