Center for Public Policy Studies

Lukasz Szymula participated in the 27th Summer School in Social Science Methods at USI University in Lugano, taking the course Machine Learning for Social Sciences (August 2022).

This introductory course provides an overview of some of the most important machine learning techniques and their social science applications. Those applications can be grouped into several sub-categories:

(1) Preparing data for statistical analysis: Sometimes data are so voluminous that hand-coding them is near-impossible. We can leverage clever computer algorithms to do the coding for us. For example, we could use an artificial neural network to detect whether tweets, of which there are millions, come from a social bot or from a legitimate source.

(2) Doing statistical analysis: As social scientists we are used to building models with numerous parametric assumptions. What if we let algorithms leverage the data to obtain the model for us? That way, we may detect complex contingencies not previously theorized.

(3) Pattern recognition: How do variables hang together and what groups do our cases form in terms of those variables? For example, political parties take positions on numerous issues. Can we group those issues into ideologies? Based on the issues, can we place the parties into clusters?

(4) Anomaly detection: Some phenomena such as war are fortunately rare. However, this makes analyzing them challenging. A whole subfield of machine learning is dedicated to the detection of such rare events or anomalies.
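
To make the pattern recognition question about parties concrete, here is a minimal sketch of clustering parties by their issue positions with a plain k-means procedure. It is written in Python for illustration only (the course itself works in R), and everything in it is an assumption: the party names, the two issue scores, the choice of two clusters, and the fixed starting centroids are all invented for the example.

```python
from math import dist

# Hypothetical party positions on two issues, each scored from -1 (left) to +1 (right).
# These numbers are invented for illustration only.
positions = {
    "Party A": (-0.9, -0.7),
    "Party B": (-0.8, -0.9),
    "Party C": (-0.6, -0.8),
    "Party D": (0.7, 0.9),
    "Party E": (0.9, 0.6),
    "Party F": (0.8, 0.8),
}


def k_means(points, centroids, n_iter=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids


names = list(positions)
points = [positions[n] for n in names]
# Fixed starting centroids (the first and last party) keep the sketch deterministic.
labels, centroids = k_means(points, [points[0], points[-1]])

clusters = {}
for name, lab in zip(names, labels):
    clusters.setdefault(lab, []).append(name)
print(clusters)
```

With real data, the number of clusters and the starting centroids would have to be chosen and justified rather than fixed by hand as they are here.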

Through lectures and group exercises, the course shows applications in each area. After discussing the general principles of machine learning, the course spends three days on supervised machine learning techniques (relevant for application areas 1 and 2), one day on pattern recognition (relevant for application area 3), and one day on anomaly detection (relevant for application area 4). Each day, students learn the intuition behind the techniques, how they can be implemented in R, how their results should be interpreted, and how they can be applied in the social sciences. The course is designed to minimize the level of mathematical complexity, although students can always look up the details in vignettes made available for the course. Classification as well as regression tasks are considered. In the former, we seek to predict class membership; in the latter, we predict a numeric score. Interpretation is key and we spend a great deal of time on various metrics and their implementations.
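
The difference between classification and regression can be illustrated with k-nearest neighbors, which handles both tasks with the same neighbor search. The sketch below is in Python purely for illustration (the course implements the techniques in R). The toy accounts, the two features, the "automation score", and the function names `nearest`, `knn_classify`, and `knn_regress` are all hypothetical and do not come from the course materials.

```python
from collections import Counter
from math import dist


def nearest(train, query, k):
    """Return the targets of the k training rows closest to the query point."""
    ranked = sorted(train, key=lambda row: dist(row[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(train, query, k=3):
    """Classification: predict the most common class among the k nearest neighbors."""
    return Counter(nearest(train, query, k)).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: predict the mean numeric score of the k nearest neighbors."""
    neighbors = nearest(train, query, k)
    return sum(neighbors) / len(neighbors)


# Invented toy data: (features, target). The features stand for an account's
# posts per day and its share of reposts; none of these values are real.
accounts = [
    ((2.0, 0.10), "human"),
    ((3.0, 0.20), "human"),
    ((1.0, 0.15), "human"),
    ((40.0, 0.90), "bot"),
    ((55.0, 0.95), "bot"),
    ((48.0, 0.85), "bot"),
]
# Same features with a numeric target instead (a made-up 0-100 "automation score").
scores = [(features, 90.0 if label == "bot" else 10.0) for features, label in accounts]

query = (45.0, 0.80)
print(knn_classify(accounts, query))  # predicted class membership
print(knn_regress(scores, query))     # predicted numeric score
```

The two features here are on very different scales, so distances are dominated by posts per day. In an actual analysis the features would normally be standardized before computing distances.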

The course covers the following algorithms/techniques: (1) k-nearest neighbors; (2) probabilistic learning (including naïve Bayes and linear and quadratic discriminant analysis); (3) classification and regression trees, random forests, and model trees; (4) regression with regularization; (5) artificial neural networks; (6) boosting; (7) principal component analysis; (8) cluster analysis; (9) SMOTE; and (10) support vector machines.