REG-33806 Data Science for Ecology

Course

Credits 6.00

Teaching method    Contact hours
Lecture            18
Tutorial           20
Group work         5
Course coordinator(s): HJ de Knegt
Lecturer(s): J Eikelboom; dr. ir. PA Jansen; HJ de Knegt; dr. D Rozendaal

Language of instruction:

English

Assumed knowledge on:

We assume knowledge of INF-3AD06 Data Science Concepts and experience with programming in the R programming environment. Students without prior experience with programming in R are advised to take the online course https://www.coursera.org/learn/r-programming. We assume students have a general understanding of mathematics and statistics. Familiarity with the application of statistical methods to ecological data (e.g., REG-31806 Ecological Methods I; CSA-34306 Ecological Modelling and Data Analysis in R) and with algorithms used in data science (e.g., MAT-3AA06 Statistics for Data Scientists; FTE-35306 Machine Learning; GRS-3AB06 Deep Learning in Data Science) is helpful.

Continuation courses:

MSc thesis / internship

Contents:

Advancements in technology and information processing are rapidly changing many fields of plant sciences, animal sciences and ecology, including research, agriculture and conservation. For example, distributed sensor networks currently allow for the acquisition of huge volumes of data on many relevant aspects, ranging from soil and vegetation characteristics and abiotic conditions like weather to the behaviour of animals. The availability of unprecedented amounts of data is unlocking potential; however, it also creates a major challenge: the ability to effectively process and analyse these data. In the current data-centered digital era that is driven by technological change, the volume of data will continue to skyrocket due to decreasing costs of data collection, storage and processing. Fostered by these technological developments, researchers and various branches of business are increasingly embracing data science: a concept to unify data processing, statistics, artificial intelligence and their related algorithms to extract knowledge from data. Hence, data science is increasingly becoming an integral part of decision making in many fields, including precision agriculture, livestock management and nature conservation, as it fosters automated prediction and classification (e.g., is this animal ill? Is this plant a weed? Is this apple ready to pick? When should we harvest?).

To keep up with these technological developments, students need to become acquainted with the terms, concepts and methodology accompanying these developments. This is especially important since these developments can require a different approach to using data and conducting science than the approaches students are familiar with. Specifically, the large volumes of data usually come from various sources, each with their own characteristics, uncertainties and measurement errors. The data from these different sources need to be integrated, and the inherent heterogeneity should be accounted for. Moreover, the collected sensor data are generally not immediately fit for analysis, so pre-processing of the raw data is needed. After initial data pre-processing, the engineering of informative and discriminating features (i.e., measurable properties of the phenomenon being observed) is a crucial step for creating effective algorithms. Furthermore, the collection of large volumes of data leads to a shift away from frequentist hypothesis testing towards analytics that is more focussed on prediction, classification, pattern recognition or anomaly detection. To this end, machine learning techniques are often used, usually supported by high-performance computing.

This course covers the main elements of using a data science approach to solving agricultural or ecological problems. The students will be guided through the main concepts and skills that are required to become a successful data scientist working in ecology. These skills relate to three pillars of data science expertise: (1) mathematics and statistics; (2) computer science and programming; and (3) domain knowledge, i.e., the understanding of patterns and processes governing (agro-)ecological systems. Hence, this course builds upon, and expands, the understanding and skills generated in other courses, and focuses on combining these in an interdisciplinary way to solve (agro-)ecological problems as effectively as possible with a data-driven approach. Approaches to solving common (agro-)ecological problems will be discussed, as well as the problems common to the associated data: the usually large degrees of spatial-temporal (auto)correlation and the non-independence between individuals. Methods to deal with these issues will be discussed, including algorithms that specifically account for them.

During the course, students will increase their knowledge and skills via hands-on experience where the taught principles and methods are put into practice. Using large datasets from current cutting-edge science projects (e.g., data gathered about animal behaviour via wearable sensors such as GPS and inertial measurement units, or data about vegetation via airborne or ground-based spectral sensors), different steps in the data science lifecycle will be covered and practiced: from problem definition; data management, cleaning and pre-processing; data exploration; feature engineering; selecting and training algorithms; optimizing hyperparameters; validating algorithms; testing predictions; to visualization and communication of results. The students will be trained to apply different machine learning techniques, and critically evaluate their merits. During the course, students will acquire and expand data science skills that will prepare them for a quantitative MSc thesis, and that will benefit their future career in academia or business.

Learning outcomes:

After successful completion of this course students are expected to be able to:
- understand important concepts in data science needed to solve typical ecological problems;
- understand how key features of ecological data (e.g., spatial-temporal (auto)correlation and non-independence between individuals) influence the selection, training, validation and evaluation of algorithms;
- identify and select machine learning algorithms appropriate to specific ecological problems;
- apply data science skills (data processing techniques, feature engineering, and machine learning algorithms) to analyse ecological datasets;
- evaluate the results and performance of trained algorithms, and critically assess the reliability and adequacy of trained algorithms in predicting ecological phenomena;
- create ecological insight from data using a data science approach.

Activities:

The course contains three parts:
- lectures that cover the essential theory and concepts of data science for ecology;
- tutorials where the students put the theory and concepts into practice, through supervised exercises, thereby expanding their understanding and skills (e.g. programming, data processing, feature engineering, application of machine learning algorithms) while working on typical ecological applications of data science (animal tracking, camera-trap analysis, pattern recognition on crops);
- group work where small groups of students will work together on a data challenge with a real ecological dataset. The challenge is to use the acquired knowledge and skills to produce a workflow (e.g., process data, select an algorithm, train and test the algorithm, communicate methods and results) that achieves the highest possible predictive performance and leads to ecological insight.

Examination:

Examination will consist of three parts that will be marked separately (each part should be completed with a minimum mark of 5.5):
- an individual written examination with open and multiple-choice questions on general principles in data science for ecological applications as covered in the lectures (25%);
- an individual computer-aided examination through assignments that test the acquired skills regarding the application of data science methods to solving ecological problems (25%);
- a group-based examination based on the group work (execution of the project, data analysis, and presentation) (50%).

Restricted Optional for:

Programme                             Phase   Specialization   Period
MBI Biology                           MSc                      5MO
MFN Forest and Nature Conservation    MSc                      5MO
MPS Plant Sciences                    MSc                      5MO