Technical course: Statistical methods for correcting selection bias

About
Just as questionnaires are the means for observing reality through surveys, electronic platforms have the same role for big data. Most of the big data sources offer a non-probabilistic sample of the population of study, where several errors are induced by self-selection of individuals present on the sample, targeting decisions from the owners of the electronic platform and limitations of the coverage of said platform.

In this course you will learn about the principal techniques for correcting bias through a statistical approach. This work is based on previous research work from Data-Pop Alliance on correcting bias on mobile network data and a fundamental book published by Eurostat.

Subject matter
The use of Big Data is accelerating within the development and humanitarian practice. If used right, its implementation can foster inclusion, efficiency, and lower project costs which may benefit public and private organizations involved on development programs. Therefore, our technical courses cover different aspects on data science and data engineering relevant for the context of official statistics and sustainable development.

Methodology
All the programing material is provided in Python using the conventional Open Source libraries for Data Science. Most of the sessions are interactive and on a Jupyter Notebook (.ipynb). A practical exercise is completed at the end of each session.

Format and instructors
This course is offered face-to-face (or via videoconference if necessary), it has a duration of 18 hours ideally distributed along 3 days, and is designed for 20 participants. Each course is delivered by a team of 2 training specialists.

Requirements
Some programming experience is required; Python is preferable though not necessary.

Syllabus

A. Main challenges of big data as an statistical source

We define a statistical approach to big data and highlight the main challenges and opportunities it presents when trying to use it as a statistical source. We go through the specific difficulties of different big data sources of interest such as mobile network data, bank transactional data and social media among others.

B. Unit-level methods for correcting bias

Correcting selection bias in big data can be analogue to procedures used in other data sources which have the same problem of non-random selection and had been studied for a while: web or telephone opt-in surveys.

We will go through different techniques for correcting selectivity bias at the unit of observation level, which most often will be individuals. Even if methods are analogue to those used in opt-in surveys, what you will learn in this course is how to use those procedures in massive data, by leveraging big data frameworks such as Spark, through its easy-to-use Python API: PySpark.

C. Domain-level methods for correcting bias

We will go through different techniques for correcting selectivity bias at the domain level, which most often will be individuals. Even if methods are analogue to those used in opt-in surveys, what you will learn in this course is how to use those procedures in massive data, by leveraging big data frameworks such as Spark, through its easy-to-use Python API: PySpark.