Technical course: Statistical methods for correcting selection bias

This course is designed for professionals who help define, from a technical perspective, how data can drive social progress. You will learn the principal statistical techniques for correcting selection bias.

Topics

SDG
SDG 17: Partnerships for the goals
SDG 17: Systemic Issues
Subject
Data Science for statistical teams
Statistics
Methodology and statistical processes
Big data and geospatial information
Data processing and analysis
Keywords
big data
data collection and manipulation
data ethics
data interpretation
data science

About
Just as questionnaires are the means of observing reality through surveys, electronic platforms play the same role for big data. Most big data sources offer a non-probabilistic sample of the study population, with errors induced by the self-selection of the individuals in the sample, by targeting decisions made by the owners of the electronic platform, and by the limited coverage of that platform.

In this course you will learn about the principal techniques for correcting bias through a statistical approach. The course builds on previous research by Data-Pop Alliance on correcting bias in mobile network data and on a fundamental book published by Eurostat.

Subject matter
The use of big data is accelerating in development and humanitarian practice. Used well, it can foster inclusion and efficiency and lower project costs, which may benefit the public and private organizations involved in development programs. Our technical courses therefore cover different aspects of data science and data engineering relevant to official statistics and sustainable development.

Methodology
All the programming material is provided in Python using the conventional open-source libraries for data science. Most of the sessions are interactive and run in a Jupyter Notebook (.ipynb). A practical exercise is completed at the end of each session.

Format and instructors
This course is offered face-to-face (or via videoconference if necessary). It lasts 18 hours, ideally spread over 3 days, and is designed for 20 participants. Each course is delivered by a team of 2 training specialists.

Requirements
Some programming experience is required; Python is preferable though not necessary.

Syllabus

A. Main challenges of big data as a statistical source

We define a statistical approach to big data and highlight the main challenges and opportunities that arise when using it as a statistical source. We go through the specific difficulties of several big data sources of interest, such as mobile network data, bank transaction data and social media.

B. Unit-level methods for correcting bias

Correcting selection bias in big data can be analogous to the procedures used for other data sources that share the problem of non-random selection and have been studied for some time: web or telephone opt-in surveys.

We will go through different techniques for correcting selectivity bias at the level of the unit of observation, which will most often be the individual. Although the methods are analogous to those used in opt-in surveys, in this course you will learn how to apply them to massive data by leveraging big data frameworks such as Spark through its easy-to-use Python API, PySpark.
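As an illustration of what a unit-level correction can look like, the following minimal sketch applies post-stratification weighting, a technique commonly used for opt-in surveys. It is not taken from the course material: the function names, the strata and the toy data are all hypothetical, and plain Python is used instead of PySpark so that the example runs anywhere. On massive data the same per-stratum counts would typically come from a grouped aggregation.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_counts):
    """Return one weight per sampled unit: N_h / n_h for the unit's stratum h.

    sample_strata: stratum label of each unit in the (biased) sample.
    population_counts: known population size of each stratum (assumed to come
    from an external source such as a census).
    """
    sample_counts = Counter(sample_strata)
    return [population_counts[h] / sample_counts[h] for h in sample_strata]


def weighted_mean(values, weights):
    """Weighted mean of values using the unit-level weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: the platform over-represents young users.
strata = ["young", "young", "young", "old"]
values = [1.0, 1.0, 1.0, 0.0]            # outcome observed for each unit
population = {"young": 50, "old": 50}    # assumed known population totals

weights = poststratification_weights(strata, population)
naive = sum(values) / len(values)          # ignores the selection bias
corrected = weighted_mean(values, weights)  # reweighted to the population

print(f"naive={naive:.3f} corrected={corrected:.3f}")
```

In this toy case the naive mean is pulled toward the over-represented stratum, while the weighted mean gives each stratum its population share. The correction only removes bias to the extent that selection depends on the stratification variables.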

C. Domain-level methods for correcting bias

Correcting selection bias in big data can be analogous to the procedures used for other data sources that share the problem of non-random selection and have been studied for some time: web or telephone opt-in surveys.

We will go through different techniques for correcting selectivity bias at the domain level, that is, for aggregates such as geographic areas or population subgroups rather than individual units. Although the methods are analogous to those used in opt-in surveys, in this course you will learn how to apply them to massive data by leveraging big data frameworks such as Spark through its easy-to-use Python API, PySpark.
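One simple form a domain-level correction can take is to scale the count observed in each domain by the share of that domain's population the platform is assumed to cover. The sketch below is only an illustration under that assumption and is not taken from the course material: the function name, the domain labels and the coverage rates are hypothetical, and plain Python is used instead of PySpark to keep the example self-contained.

```python
def scale_domain_counts(observed_counts, coverage_rates):
    """Estimate population counts per domain as observed count / coverage rate.

    observed_counts: units seen on the platform in each domain.
    coverage_rates: assumed fraction (0 < rate <= 1) of each domain's
    population that is present on the platform.
    """
    estimates = {}
    for domain, count in observed_counts.items():
        rate = coverage_rates[domain]
        if not 0 < rate <= 1:
            raise ValueError(f"coverage rate for {domain!r} must be in (0, 1]")
        estimates[domain] = count / rate
    return estimates


# Hypothetical toy data: two regions with different platform coverage.
observed = {"region_a": 400, "region_b": 150}
coverage = {"region_a": 0.8, "region_b": 0.5}

estimates = scale_domain_counts(observed, coverage)
for domain, estimate in estimates.items():
    print(domain, round(estimate))
```

The raw counts suggest region_a is far larger than region_b, but part of that gap comes from the difference in coverage. The quality of the estimate depends entirely on how well the coverage rates are known.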

Target Audience

This course is aimed at professionals for whom programming is part of their daily activities or who lead a technical team.

Learning Objectives

Upon completion of the workshop you will be able to:

  1. Understand the challenges big data presents as a statistical source.
  2. Put into practice techniques for addressing selection bias in big data sources.
