This ESSLLI 2017 course covers introductory topics in unsupervised machine learning with specific focus on the analysis of experimental linguistics and corpus data. All sessions will be in an interactive tutorial format, using the Jupyter Notebook platform, the Python SciPy stack, and curated datasets.

What is unsupervised machine learning?

In most introductory statistics courses, students are introduced to some form of supervised machine learning, generally through the framework of (generalized) linear models. In supervised machine learning, we are given both inputs (independent variables) and outputs (dependent variables), and we are interested in learning some function that allows us to predict future outputs given future inputs.

For instance, suppose we are interested in trying to predict whether a noun is count or mass based on how often that noun is found in its plural form. In the supervised learning paradigm, our learning algorithm might take as input the proportion of times each noun is found in the plural and an indicator of whether it is count or mass. It would then output a function from that proportion to that indicator that we could use to classify future nouns as count or mass based on the proportion of times they are observed in the plural.
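As a minimal sketch of this setup (the plural proportions and labels below are fabricated purely for illustration), scikit-learn's LogisticRegression could be fit to such data:

```python
# Sketch of the supervised setup: predict count (1) vs. mass (0)
# from a noun's proportion of plural tokens. Data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical plural proportions for six nouns
X = np.array([[0.45], [0.52], [0.38], [0.02], [0.01], [0.05]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = count, 0 = mass

model = LogisticRegression()
model.fit(X, y)

# classify a new noun observed in the plural 40% of the time;
# with these data it falls on the count-noun side of the boundary
print(model.predict([[0.40]]))
```

The learned function here is exactly the mapping described above: from plural proportion to a count/mass indicator.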

The supervised learning paradigm is extremely useful when we have access to both inputs and outputs, but getting data labeled with outputs is not always feasible. In these cases, it can often be useful to turn to unsupervised machine learning. In unsupervised machine learning, we are given only a set of inputs, and we aim to find patterns in those inputs that we have reason to believe correspond to some distinction of interest.

Returning to the count noun vs. mass noun example, suppose we have access only to how often various nouns are found in their plural form, but not whether they are count or mass. In the unsupervised learning paradigm, we might provide as input to a learning algorithm the proportion of times each noun is found in the plural and ask it to cluster nouns into two categories, with the aim of inducing the count-mass distinction. The result in this case would again be a function from inputs to outputs, but instead of our supervising the algorithm with labels, the algorithm must induce those labels from the patterns it finds in the data.
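The unsupervised version of the same example can be sketched with scikit-learn's KMeans (again on fabricated plural proportions):

```python
# Sketch of the unsupervised setup: cluster nouns into two groups
# using only their plural proportions. Data are invented.
import numpy as np
from sklearn.cluster import KMeans

# same hypothetical plural proportions as before, but no labels
X = np.array([[0.45], [0.52], [0.38], [0.02], [0.01], [0.05]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# nouns with similar plural proportions land in the same cluster;
# which cluster receives which integer label is arbitrary
print(labels)
```

Note that the algorithm recovers the grouping without ever seeing count/mass labels; interpreting the two clusters as "count" and "mass" is left to the analyst.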

Who might take this course?

This course is targeted at students and practitioners interested in learning techniques (i) that allow them to more effectively explore their experimental and/or corpus data and (ii) that they can build on when learning more advanced techniques currently used in computational linguistics and natural language processing. Familiarity with some high-level programming language (e.g. Python, R, Ruby, Java) will be useful but not strictly necessary. For those already proficient in the methods covered in this course, I am also co-teaching an ESSLLI 2017 course with Kyle Rawlins, entitled Computational lexical semantics, which covers more advanced topics.


Aaron Steven White
Science of Learning Institute
Department of Cognitive Science
Center for Language and Speech Processing
Johns Hopkins University


Dates July 23-28, 2017
Time 11:00-12:30
Place University of Toulouse
Room TBA


Day 1

Data manipulation using the SciPy Stack

This session introduces participants to the Jupyter Notebook platform and the Python SciPy stack. Particular focus will be given to basic data manipulation using NumPy, SciPy, and pandas, and to basic data visualization using matplotlib and ggplot2 (via the rpy2 IPython interface). A brief introduction will also be given to the scikit-learn API, which will be used heavily over the remainder of the course.
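A small taste of the kind of data manipulation covered in this session, using pandas and NumPy (the nouns and counts below are invented for illustration):

```python
# Build a small table of hypothetical noun counts, derive a new
# column, and sort by it -- typical pandas workflow steps.
import pandas as pd

df = pd.DataFrame({
    "noun": ["dog", "cat", "water", "sand"],
    "plural_count": [45, 52, 2, 1],
    "total_count": [100, 100, 100, 100],
})

# derive the proportion of plural tokens for each noun
df["plural_prop"] = df["plural_count"] / df["total_count"]

print(df.sort_values("plural_prop", ascending=False))
```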

Day 2

Category Induction I (Hard Clustering)

This session introduces participants to basic methods for partitioning a set of objects into discrete categories — i.e. hard clustering methods — including centroid-based methods (e.g. k-means), linkage-based methods (e.g. various forms of hierarchical clustering), and density-based methods (e.g. unsupervised nearest neighbors, DBSCAN).

In this session, as in all remaining sessions, the focus will be on giving participants an intuition for what these methods aim to do and when to use them, rather than on a mathematically rigorous treatment.
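The three families of hard clustering methods named above can be contrasted in a few lines of scikit-learn; the sketch below runs one representative of each on synthetic two-dimensional data (three well-separated blobs), where all three should recover the same partition:

```python
# Compare centroid-based (k-means), linkage-based (agglomerative),
# and density-based (DBSCAN) clustering on easy synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5,
                  random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10,
                       random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# k-means and agglomerative clustering are told to find 3 clusters;
# DBSCAN infers the number from density (label -1 marks noise points)
print(len(set(kmeans_labels)),
      len(set(agglo_labels)),
      len(set(dbscan_labels) - {-1}))
```

On data this clean the methods agree; their differences emerge with non-convex cluster shapes, unequal densities, or an unknown number of categories.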

Day 3

Category Induction II (Soft Clustering with Mixture Models)

This session introduces participants to basic methods for associating objects with prototype-theoretic categories — i.e. soft clustering methods. Particular focus will be given to the concept of a mixture model, though the relationship to non-probabilistic methods will also be discussed.
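To preview the idea, here is a minimal mixture-model sketch using scikit-learn's GaussianMixture on synthetic one-dimensional data (not part of the course materials): unlike a hard clustering method, it returns for each object a probability of membership in each component.

```python
# Fit a two-component Gaussian mixture to synthetic 1-D data and
# inspect the soft (probabilistic) cluster assignments.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.5, size=(50, 1)),
                    rng.normal(5.0, 0.5, size=(50, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# soft assignments: one row per object, one column per component,
# each row summing to 1
probs = gmm.predict_proba(X)
print(probs.shape)
```

Points near a component's mean get membership probabilities close to 1 for that component; points between components get graded memberships, which is what makes the categories prototype-theoretic.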

Day 4

Feature Induction I (Matrix Factorization)

This session introduces participants to basic methods for learning unobserved features of objects, including factor analysis methods (e.g. principal component analysis, independent component analysis) and non-negative matrix factorization. Focus will be given to the concept of sparsity and its importance in learning good linguistic representations.
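As a brief sketch of two of the factorization methods named above, applied to a small synthetic count matrix (rows as objects, columns as observed features; the data are invented):

```python
# PCA projects objects onto orthogonal components that maximize
# variance; NMF factors the matrix into two non-negative matrices
# whose product approximates it (W @ H close to X).
import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(20, 6)).astype(float)

pca_embedding = PCA(n_components=2).fit_transform(X)

nmf = NMF(n_components=2, init="nndsvda", random_state=0,
          max_iter=500)
W = nmf.fit_transform(X)  # object-by-component loadings
H = nmf.components_       # component-by-feature weights

print(pca_embedding.shape, W.shape, H.shape)
```

The non-negativity constraint in NMF is one route to the sparse, parts-based representations discussed in this session.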

Day 5

Feature Induction II (Matrix Factorization and Manifold Learning)

This session introduces participants to more advanced methods for learning unobserved features of objects, including (nonmetric) multidimensional scaling, kernelized principal component analysis, isometric mapping (Isomap), and (t-distributed) stochastic neighbor embedding (t-SNE).
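Two of these methods, Isomap and t-SNE, can be previewed on scikit-learn's built-in digits data (a subset is used here purely to keep the sketch fast; the course's own datasets may differ):

```python
# Reduce 64-dimensional digit images to 2 dimensions with two
# manifold learning methods: Isomap and t-SNE.
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]  # subset to keep the example quick

iso_embedding = Isomap(n_components=2).fit_transform(X)
tsne_embedding = TSNE(n_components=2,
                      random_state=0).fit_transform(X)

print(iso_embedding.shape, tsne_embedding.shape)
```

Both produce two-dimensional embeddings suitable for plotting; Isomap tries to preserve geodesic distances along the data manifold, while t-SNE prioritizes preserving local neighborhood structure.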