Technical course: Web data collection and analysis

About
Whether a website has interactive components or a fully fledged API, this course teaches you how to programmatically extract data from the web and which models can be applied to draw conclusions from these data. As case studies, we develop a PPP exchange rate using only real-estate rent prices in some Latin American countries, and we study migration flows using the Facebook Marketing API.

Subject matter
The use of Big Data is accelerating within development and humanitarian practice. Used well, it can foster inclusion, improve efficiency, and lower project costs, benefiting the public and private organizations involved in development programs. Our technical courses therefore cover different aspects of data science and data engineering relevant to official statistics and sustainable development.

Methodology
All the programming material is provided in Python, using the conventional open-source libraries for data science. Most sessions are interactive and run in a Jupyter Notebook (.ipynb). Participants complete a practical exercise at the end of each session.

Format and instructors
This course is offered face-to-face (or via videoconference if necessary). It lasts 18 hours, ideally spread over 3 days, and is designed for 20 participants. Each course is delivered by a team of 2 training specialists.

Requirements
Some programming experience is required; Python is preferable though not necessary.

Testimonial from a participant
“The whole course was excellent. It is great having the opportunity to participate in qualifications on modern issues relevant to our work.”

Syllabus

A. Collecting through a web browser emulator

Modern websites usually have interactive components. We focus on using web browser emulators, namely Selenium WebDriver, to operate those components programmatically.

The use case for this module is a real-estate rental platform, from which we collect rent prices.

The collection methods used for this platform also apply to many e-commerce websites that present their products in a similar catalogue structure.
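
As an illustration of the catalogue-style collection described above, here is a minimal sketch using Selenium WebDriver. The URL and the CSS selectors passed to `collect_rents` are hypothetical placeholders, not those of any real platform, and `parse_rent` assumes prices are shown as whole numbers with no decimal part. It is not the course's own notebook code.

```python
import re


def parse_rent(text):
    """Extract a whole-number rent from a price label such as 'USD 1,200 / month'.

    Assumption: prices have no decimal part, so every digit belongs to the amount.
    Returns None when the label contains no digits (e.g. 'Price on request').
    """
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def collect_rents(url, card_selector, price_selector, timeout=10):
    """Open a catalogue page in a browser and return the rents it lists.

    `card_selector` and `price_selector` are CSS selectors that depend on the
    target website; they are placeholders to be adapted per platform.
    """
    # Imported here so that parse_rent can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Interactive pages often render listings after load, so wait for them.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, card_selector))
        )
        rents = []
        for card in driver.find_elements(By.CSS_SELECTOR, card_selector):
            label = card.find_element(By.CSS_SELECTOR, price_selector).text
            rent = parse_rent(label)
            if rent is not None:
                rents.append(rent)
        return rents
    finally:
        # Always close the browser, even if the page structure was unexpected.
        driver.quit()
```

A call might look like `collect_rents("https://rentals.example.com/listings", "div.listing-card", "span.price")`, with all three arguments replaced by values found by inspecting the target page.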

B. Collecting through APIs

Large and medium-sized web platforms commonly expose their data through web Application Programming Interfaces (APIs). We learn how to work with them through general open-source libraries such as requests or, for large platforms such as Facebook, through dedicated libraries such as the official Facebook Marketing SDK.
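
To make the general-library approach concrete, the sketch below collects every page of a paginated JSON API through a `requests`-style session. The response layout (a `data` list plus a `paging.next` link) is an assumption chosen for illustration, and the endpoint in the usage note is hypothetical; real APIs differ and usually also require authentication.

```python
def fetch_all_pages(session, url, params=None, timeout=30):
    """Follow 'next' links until the API reports no further page.

    `session` is any object with a requests.Session-like `get` method.
    Assumed response layout (illustrative only):
        {"data": [...], "paging": {"next": "<url of the next page>"}}
    """
    records = []
    while url:
        response = session.get(url, params=params, timeout=timeout)
        response.raise_for_status()  # fail loudly on HTTP errors
        payload = response.json()
        records.extend(payload.get("data", []))
        url = payload.get("paging", {}).get("next")
        params = None  # the 'next' link is assumed to carry its own query string
    return records
```

With `requests` installed, a call could look like `fetch_all_pages(requests.Session(), "https://api.example.com/items", params={"limit": 100})`. Passing the session in, rather than creating it inside the function, makes it easy to reuse connections and to substitute a stub when testing.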

The Facebook modules were inspired by the amazing work done by our colleagues from Qatar Computing Research Institute at HBKU, UNICEF, MIT Media Lab, iMMAP Colombia and the Global Protection Cluster from UNHCR, entitled “Real-Time Monitoring of the Venezuelan Exodus through Facebook’s Advertising Platform” (see publication here).

C. Analyzing and visualizing web-derived data

Analyzing web-collected data is challenging because such data are frequently noisy and biased. We study how adequate modeling techniques can address these challenges. Visualization, as a means of extracting knowledge from data, is also fundamental when dealing with web-collected data.

We cover some basic cleaning procedures, different Machine Learning models, and both static and interactive visualizations, all using the most popular open-source libraries from the Python data-science stack.
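
As a small example of a basic cleaning procedure applied to the rent-price case study, the sketch below removes outliers with a median-based rule and then compares typical rents between two markets. It is a deliberate simplification, not the model developed in the course: the function names, the 3.5 threshold, and the idea of using a ratio of median rents as an implied exchange rate are assumptions made for illustration. It uses only the standard library so that it runs anywhere.

```python
from statistics import median


def drop_outliers(values, threshold=3.5):
    """Remove values whose modified z-score exceeds `threshold`.

    The modified z-score uses the median and the median absolute deviation
    (MAD), which are less sensitive to extreme listings than mean and
    standard deviation. `values` must be a list (it is read more than once).
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be flagged reliably.
        return list(values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - center) / mad <= threshold]


def implied_rent_rate(local_rents, reference_rents):
    """Ratio of typical rents: local-currency units per reference-currency unit.

    Assumes both samples describe comparable dwellings, which real
    web-collected listings only approximate.
    """
    return median(drop_outliers(local_rents)) / median(drop_outliers(reference_rents))
```

For instance, `drop_outliers([900, 950, 1000, 1050, 1100, 50000])` discards the 50000 listing, which is more likely a sale price or a typing error than a monthly rent.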