This ESSLLI 2017 course covers introductory topics in unsupervised machine learning with specific focus on the analysis of experimental linguistics and corpus data. All sessions will be in an interactive tutorial format, using the Jupyter Notebook platform, the Python SciPy stack, and curated datasets.
What is unsupervised machine learning?
In most introductory statistics courses, students are introduced to some form of supervised machine learning, generally through the framework of (generalized) linear models. In supervised machine learning, we are given both inputs (independent variables) and outputs (dependent variables), and we are interested in learning some function that allows us to predict future outputs given future inputs.
For instance, suppose we are interested in trying to predict whether a noun is count or mass based on how often that noun is found in its plural form. In the supervised learning paradigm, our learning algorithm might take as input the proportion of times each noun is found in the plural and an indicator of whether it is count or mass. It would then output a function from that proportion to that indicator that we could use to classify future nouns as count or mass based on the proportion of times they are observed in the plural.
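As a concrete sketch of this setup (the plural proportions and labels below are made up for illustration), a logistic regression from scikit-learn plays the role of the supervised learner:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each noun's proportion of plural occurrences
# (input) and an indicator of whether it is count (1) or mass (0) (output).
plural_prop = np.array([[0.81], [0.74], [0.66], [0.09], [0.15], [0.03]])
is_count = np.array([1, 1, 1, 0, 0, 0])

# Learn a function from plural proportion to the count/mass indicator.
model = LogisticRegression().fit(plural_prop, is_count)

# Classify unseen nouns by the proportion of times they appear in the plural.
predictions = model.predict([[0.7], [0.1]])
```

Here the learner is handed both inputs and outputs, which is exactly what distinguishes the supervised setting from the unsupervised one discussed next.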
The supervised learning paradigm is extremely useful when we have access to both inputs and outputs, but getting data labeled with outputs is not always feasible. In these cases, it can often be useful to turn to unsupervised machine learning. In unsupervised machine learning, we are given only a set of inputs, and we aim to find patterns in those inputs that we have reason to believe correspond to some distinction of interest.
Returning to the count vs. mass noun example, suppose we have access only to how often various nouns are found in their plural form, but not to whether they are count or mass. In the unsupervised learning paradigm, we might provide as input to a learning algorithm the proportion of times each noun is found in the plural and ask it to cluster nouns into two categories, with the aim of inducing the count-mass distinction. The result in this case would again be a function from inputs to outputs, but instead of supervising the algorithm by giving it labels, the algorithm must induce those labels from the patterns it finds in the data.
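A minimal sketch of this unsupervised setup, using k-means clustering from scikit-learn on simulated plural proportions (the data here are synthetic, generated purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Simulated plural proportions: count nouns (e.g. "dog") tend to appear in
# the plural often; mass nouns (e.g. "mud") rarely do. No labels are given
# to the learner -- only the proportions themselves.
rng = np.random.RandomState(0)
count_props = rng.beta(8, 2, size=20)  # skewed toward 1
mass_props = rng.beta(2, 8, size=20)   # skewed toward 0
props = np.concatenate([count_props, mass_props]).reshape(-1, 1)

# Ask for two clusters; the algorithm induces the two categories itself.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(props)
```

The induced cluster labels are arbitrary (the algorithm does not know which cluster is "count"), but the partition it finds tracks the underlying distinction.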
Who might take this course?
This course is targeted at students and practitioners interested in learning techniques (i) that allow them to more effectively explore their experimental and/or corpus data and (ii) that they can build on for learning more advanced techniques currently used in computational linguistics and natural language processing. A familiarity with some high-level programming language (e.g. Python, R, Ruby, Java, etc.) will be useful but not strictly necessary. For those already proficient in the methods covered in this course, I am also co-teaching an ESSLLI 2017 course with Kyle Rawlins, entitled Computational lexical semantics, which covers more advanced topics.
In this course, we’ll be using Python 3 and, to a lesser extent, R on the Jupyter notebook platform. It’s possible to follow along by viewing the GitHub-rendered version of these notebooks (available once the course begins), but you’ll get the most out of the hands-on aspects of the course by actually interacting with the notebooks and associated data, so if you don’t already have Python 3 and Jupyter installed, I’d encourage you to do that now.
Installing Python 3
If you’re new to Python and aren’t using a Linux system, I’d suggest installing Anaconda, which comes preinstalled with all the software packages you’ll need for the course (listed below). If you’re already running a Linux system, there’s a very high probability that your system came preinstalled with at least a Python 2 distribution and probably also a Python 3 distribution—Debian and derivatives come with both—so you’ll likely only need to install the packages required for the course (including Jupyter).
Installing the SciPy stack
We’ll be making heavy use of the SciPy stack in this course. If you decided to use Anaconda, you’re all set in this regard, since the SciPy stack comes preinstalled. Otherwise, follow the installation instructions on the scipy website or run the following commands in your terminal.
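With pip, the commands look something like the following (the exact package list is an assumption, based on the scipy site’s suggested installation):

```shell
# Install the core SciPy-stack packages, plus Jupyter, with pip
python -m pip install --user numpy scipy matplotlib pandas sympy jupyter
```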
Installing other Python packages
If you went the Anaconda route, all of the above except pydotplus are already installed. To install these using Anaconda’s conda package manager, run the following in a terminal.
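For instance (the package list here is an assumption based on the tools used later in the course, and pydotplus may require the conda-forge channel):

```shell
# Package names assumed from the tools used in the course; adjust as needed
conda install scikit-learn nltk rpy2
conda install -c conda-forge pydotplus
```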
If you’re installing everything from scratch, the most straightforward way to install these is to use pip, running the following in a terminal.
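A sketch of the pip route (again, the package list is an assumption based on the tools used in the course):

```shell
# Install the remaining Python packages with pip
pip install --user scikit-learn nltk rpy2 pydotplus
```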
We’ll also be using some R in this course. If you installed Anaconda, you can install R, plus a bunch of useful packages, using conda by running the following in a terminal.
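One option is Anaconda’s r-essentials bundle, which packages R together with many commonly used packages:

```shell
# Installs R plus a bundle of common R packages via conda's r channel
conda install -c r r-essentials
```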
Otherwise, you should install R as well as the tidyverse suite of packages, the devtools package, ellipse, and gganimate. To install the first three, open an R interactive session by calling R in a terminal, then run the install.packages() command in that session. To install gganimate, run the following in that same R session.
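A sketch of that R session (the GitHub location of the then pre-CRAN gganimate is an assumption):

```r
# In an interactive R session: install the CRAN packages first
install.packages(c("tidyverse", "devtools", "ellipse"))

# gganimate was not yet on CRAN, so install it from GitHub with devtools
devtools::install_github("dgrtwo/gganimate")
```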
Installing other software
We’ll be using a variety of datasets and digital lexicons in this course. Here’s a list with links. Version numbers link directly to the dataset itself.
- WordNet v3.1
- VerbNet v3.2
- English Universal Dependencies v2
- JHU Decompositional Semantics Initiative datasets
- Experimental datasets
The JHU Decompositional Semantics Initiative datasets and the experimental datasets will be packaged with the Jupyter notebook for the relevant day.
I’d suggest downloading WordNet and VerbNet through NLTK’s interface. First, open an interactive Python session.
Then run the following in that session.
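The downloader is launched from NLTK itself (this call opens an interactive window, so it is meant to be run in a live session):

```python
import nltk

# Opens the NLTK downloader window for selecting corpora and lexicons
nltk.download()
```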
This will open a window where you can select the datasets to download. WordNet and VerbNet can be found in the Corpora tab.
Aaron Steven White
Department of Linguistics
Goergen Institute for Data Science
Department of Computer Science
Department of Brain & Cognitive Sciences
University of Rochester
- Dates: July 24-28, 2017
- Place: Université Toulouse (Arsenal campus)
- Room: Amphithéâtre B Pierre-Hébraud
Data analysis and machine learning in Jupyter
This session introduces participants to unsupervised learning through the Jupyter Notebook platform and the Python SciPy stack. Particular focus will be given to basic data manipulation using NumPy, SciPy, and pandas and basic data visualization using matplotlib and ggplot2 (via the rpy2 IPython interface). A brief introduction will also be given to the scikit-learn API, which will be used heavily over the remainder of the course.
Basic category induction
This session introduces participants to basic methods for assigning a set of objects to mutually exclusive categories. Particular focus will be given to simple mixture models and associated methods for analyzing these models.
Basic feature induction
This session introduces participants to basic methods for inducing objects’ unobserved features. Particular focus will be given to principal component analysis and non-negative matrix factorization, along with associated methods for analyzing these models. A major goal of this session will be to show that category induction methods can be viewed as feature induction under a particular set of constraints and thereby motivate a focus on feature induction for the remainder of the course.
Multiview feature induction
This session introduces participants to more advanced methods for learning unobserved features of objects by combining multiple data sources into a single representation. Particular focus will be given to canonical correlation analysis and related methods.
Structured feature induction
This session introduces participants to methods for learning unobserved features of objects where either the objects, the features, or both have some structure. Particular focus will be given to kernel methods and deep autoencoders.
Materials in this course are based on research funded in part by DARPA LORELEI, NSF INSPIRE BCS-1344269 (Gradient Symbolic Computation), NSF DDRIG BCS-1456013 (Doctoral Dissertation Research: Learning Attitude Verb Meanings), and the JHU Science of Learning Institute.