Overview

This ESSLLI 2017 course covers introductory topics in unsupervised machine learning, with a specific focus on the analysis of data from linguistic experiments and corpora. All sessions will be in an interactive tutorial format, using the Jupyter Notebook platform, the Python SciPy stack, and curated datasets.

What is unsupervised machine learning?

In most introductory statistics courses, students are introduced to some form of supervised machine learning, generally through the framework of (generalized) linear models. In supervised machine learning, we are given both inputs (independent variables) and outputs (dependent variables), and we are interested in learning some function that allows us to predict future outputs given future inputs.

For instance, suppose we are interested in trying to predict whether a noun is count or mass based on how often that noun is found in its plural form. In the supervised learning paradigm, our learning algorithm might take as input the proportion of times each noun is found in the plural and an indicator of whether it is count or mass. It would then output a function from that proportion to that indicator that we could use to classify future nouns as count or mass based on the proportion of times they are observed in the plural.
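
For concreteness, here is a minimal sketch of this setup using scikit-learn’s logistic regression. The nouns’ plural proportions and labels below are invented purely for illustration; they are not data from the course.

import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical inputs: proportion of plural occurrences for each noun
X = np.array([[0.45], [0.38], [0.52], [0.02], [0.01], [0.04]])
# hypothetical outputs: 1 = count, 0 = mass
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# classify a new noun observed in the plural 40% of the time
print(model.predict([[0.40]]))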

The supervised learning paradigm is extremely useful when we have access to both inputs and outputs, but getting data labeled with outputs is not always feasible. In these cases, it can often be useful to turn to unsupervised machine learning. In unsupervised machine learning, we are given only a set of inputs, and we aim to find patterns in those inputs that we have reason to believe correspond to some distinction of interest.

Returning to the count vs. mass noun example, suppose we have access only to how often various nouns are found in their plural form, but not to whether they are count or mass. In the unsupervised learning paradigm, we might provide as input to a learning algorithm the proportion of times each noun is found in the plural and ask it to cluster nouns into two categories, with the aim of inducing the count-mass distinction. The result would again be a function from inputs to outputs, but instead of being supervised with labels, the algorithm must induce those labels from the patterns it finds in the data.
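
Here is the same toy data in the unsupervised setting, sketched with a two-component Gaussian mixture in scikit-learn (again, the values are invented). Note that the induced cluster labels are arbitrary: which cluster corresponds to count and which to mass must be determined by inspection.

import numpy as np
from sklearn.mixture import GaussianMixture

# hypothetical inputs: proportion of plural occurrences, no labels provided
X = np.array([[0.45], [0.38], [0.52], [0.02], [0.01], [0.04]])

# ask for two clusters, hoping they track the count-mass distinction
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))  # e.g. [0 0 0 1 1 1]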

Who might take this course?

This course is targeted at students and practitioners interested in learning techniques (i) that allow them to explore their experimental and/or corpus data more effectively and (ii) that they can build on when learning more advanced techniques currently used in computational linguistics and natural language processing. Familiarity with some high-level programming language (e.g. Python, R, Ruby, Java) will be useful but not strictly necessary. For those already proficient in the methods covered in this course, I am also co-teaching an ESSLLI 2017 course with Kyle Rawlins, entitled Computational lexical semantics, which covers more advanced topics.

What you’ll need

Software

In this course, we’ll be using Python 3 and, to a lesser extent, R on the Jupyter Notebook platform. It’s possible to follow along by viewing the GitHub-rendered versions of these notebooks (available once the course begins), but you’ll get the most out of the hands-on aspects of the course by actually interacting with the notebooks and associated data. So if you don’t already have Python 3 and Jupyter installed, I’d encourage you to do that now.

Installing Python 3

If you’re new to Python and aren’t using a Linux system, I’d suggest installing Anaconda, which comes with all the software packages you’ll need for the course (listed below) preinstalled. If you’re already running a Linux system, your system almost certainly came with at least a Python 2 distribution, and probably a Python 3 distribution as well (Debian and its derivatives ship with both), so you’ll likely only need to install the packages required for the course (including Jupyter).

Installing the SciPy stack

We’ll be making heavy use of the SciPy stack in this course. If you decided to use Anaconda, you’re all set in this regard, since the SciPy stack comes preinstalled. Otherwise, follow the installation instructions on the SciPy website or run the following commands in your terminal.

python3 -m pip install --upgrade pip
pip3 install --user numpy scipy matplotlib ipython jupyter pandas sympy nose

Installing other python packages

Beyond the scipy stack, you’ll also need some packages built on top of it—in particular: nltk, scikit-learn, seaborn, bokeh, rpy2, and (for a very small portion of the class) pydotplus.

If you went the Anaconda route, all of the above except rpy2 and pydotplus are already installed. To install those two using Anaconda’s conda package manager, run the following in a terminal.

conda install -c r rpy2=2.8.5
conda install -c conda-forge pydotplus=2.0.2

If you’re installing everything from scratch, the most straightforward way to install these is to use pip3.

pip3 install --user nltk scikit-learn seaborn bokeh rpy2 pydotplus

Installing R

We’ll also be using some R in this course. If you installed Anaconda, you can install R, plus a bunch of useful packages, using conda by running the following in a terminal.

conda install -c r r-essentials

Otherwise, you should install R as well as the tidyverse suite of packages, the devtools and ellipse packages, and gganimate. To install tidyverse, devtools, and ellipse, open an interactive R session by calling R in a terminal.

R

Then run the install.packages() command in that session.

install.packages(c("tidyverse", "devtools", "ellipse"))

To install gganimate, run the following in that same R session.

devtools::install_github("dgrtwo/gganimate")

Installing other software

For a very small portion of the class, it will be useful to have Graphviz and ImageMagick installed.

Data

We’ll be using a variety of datasets and digital lexicons in this course. Here’s a list with links. Version numbers link directly to the dataset itself.

The JHU Decompositional Semantics Initiative datasets and the experimental datasets will be packaged with the Jupyter notebook for the relevant day.

I’d suggest downloading WordNet and VerbNet through NLTK’s interface. First, open an interactive Python session.

ipython3

Then run the following in that session.

import nltk
nltk.download()

This will open a window where you can select the datasets to download. WordNet and VerbNet can be found under the Corpora tab.
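
Once the download completes, you can verify that everything is in place by querying both lexicons from the same session; “dog” and “give” below are just arbitrary sanity checks.

from nltk.corpus import wordnet as wn
from nltk.corpus import verbnet as vn

print(wn.synsets('dog')[:3])  # first few WordNet synsets for "dog"
print(vn.classids('give'))    # VerbNet class IDs containing "give"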

Instructor

Aaron Steven White
Department of Linguistics
Goergen Institute for Data Science
Department of Computer Science
Department of Brain & Cognitive Sciences
University of Rochester
aaron.white@rochester.edu

Logistics

Dates July 24-28, 2017
Time 11:00-12:30
Place Université Toulouse (Arsenal campus)
Room Amphithéâtre B Pierre-Hébraud

Schedule

Day 1

Topic
Data analysis and machine learning in Jupyter

Content
This session introduces participants to unsupervised learning through the Jupyter Notebook platform and the Python SciPy stack. Particular focus will be given to basic data manipulation using NumPy, SciPy, and pandas, and to basic data visualization using matplotlib and ggplot2 (via the rpy2 IPython interface). A brief introduction will also be given to the scikit-learn API, which will be used heavily over the remainder of the course.

Notebook
Notebook and datasets (.zip) and online viewer
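
For a flavor of the data manipulation this session covers, here is a minimal pandas sketch; the verbs, frames, and ratings are invented for illustration and are not from the course datasets.

import pandas as pd

# hypothetical acceptability ratings for verb-frame pairs
df = pd.DataFrame({'verb': ['think', 'want', 'know', 'hope'],
                   'frame': ['that-S', 'to-VP', 'that-S', 'to-VP'],
                   'rating': [6.2, 5.8, 6.5, 5.1]})

# mean rating by syntactic frame
print(df.groupby('frame').rating.mean())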

Day 2

Topic
Category induction

Content
This session introduces participants to basic methods for assigning a set of objects to mutually exclusive categories. Particular focus will be given to simple mixture models and associated methods for analyzing these models.

Notebook
Notebook and datasets (.zip) and online viewer
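
As a preview of this session’s focus on mixture models, here is a minimal scikit-learn sketch on synthetic data, showing the kind of soft category memberships we will be analyzing.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# synthetic one-dimensional data drawn from two latent categories
X = np.concatenate([rng.normal(0.1, 0.05, 50),
                    rng.normal(0.5, 0.10, 50)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)                # estimated category means
print(gmm.predict_proba(X[:3]))  # soft category memberships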

Day 3

Topic
Basic feature induction

Content
This session introduces participants to basic methods for inducing objects’ unobserved features. Particular focus will be given to principal component analysis and non-negative matrix factorization, along with associated methods for analyzing these models. A major goal of this session will be to show that category induction methods can be viewed as feature induction under a particular set of constraints and thereby motivate a focus on feature induction for the remainder of the course.

Notebook
Notebook and datasets (.zip) and online viewer
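
As a preview of this session’s two main techniques, here is a minimal scikit-learn sketch on a synthetic nonnegative matrix (the data are random and purely illustrative).

import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.RandomState(0)
# synthetic count-like matrix: 20 objects by 10 observed dimensions
X = rng.poisson(3.0, size=(20, 10)).astype(float)

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # variance captured per component

W = NMF(n_components=2, random_state=0).fit_transform(X)
print(W.shape)  # induced nonnegative features: (20, 2)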

Day 4

Topic
Multiview feature induction

Content
This session introduces participants to more advanced methods for learning unobserved features of objects by combining multiple data sources into a single representation. Particular focus will be given to canonical correlation analysis and related methods.

Notebook
Notebook and datasets (.zip) and online viewer
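
As a preview of this session, here is a minimal canonical correlation analysis sketch in scikit-learn, with two synthetic “views” of the same objects that share a single latent signal.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
# one latent signal observed noisily in two different views
latent = rng.normal(size=(100, 1))
X = np.hstack([latent + 0.5 * rng.normal(size=(100, 1)) for _ in range(4)])
Y = np.hstack([latent + 0.5 * rng.normal(size=(100, 1)) for _ in range(3)])

X_c, Y_c = CCA(n_components=1).fit_transform(X, Y)
# the two views' projections onto the shared space should correlate highly
print(np.corrcoef(X_c.T, Y_c.T)[0, 1])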

Day 5

Topic
Structured feature induction

Content
This session introduces participants to methods for learning unobserved features of objects where either the objects, the features, or both have some structure. Particular focus will be given to kernel methods and deep autoencoders.

Notebook
Notebook and datasets (.zip) and online viewer
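
As a preview of the kernel methods this session covers, here is a minimal kernel PCA sketch in scikit-learn on synthetic data with nonlinear (concentric-circle) structure, the kind of case where kernels pay off.

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(0)
# two concentric circles: structure a linear method cannot separate
theta = rng.uniform(0, 2 * np.pi, 200)
radius = rng.choice([1.0, 3.0], size=200)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# an RBF kernel lets the induced features capture the circular structure
Z = KernelPCA(n_components=2, kernel='rbf', gamma=1.0).fit_transform(X)
print(Z[:3])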

Acknowledgments

Materials in this course are based on research funded in part by DARPA LORELEI, NSF INSPIRE BCS-1344269 (Gradient Symbolic Computation), NSF DDRIG BCS-1456013 (Doctoral Dissertation Research: Learning Attitude Verb Meanings), and the JHU Science of Learning Institute.