Overview

This ESSLLI 2017 course covers advanced topics in computational lexical semantics, with a specific focus on using theoretically informed computational models to combine data from experimental linguistics with corpus data. All sessions will be in an interactive tutorial format, using the Jupyter Notebook platform, the Python SciPy stack, and curated datasets.

Who might take this course?

This course is targeted at students and practitioners interested in learning how to apply machine learning techniques to lexical semantic analysis. Familiarity with a high-level programming language (e.g. Python, R, Ruby, or Java) will be useful but is not strictly necessary.

Aaron is concurrently teaching another ESSLLI 2017 course, entitled Unsupervised methods for linguistic data, which covers introductory topics in unsupervised machine learning. We strongly recommend taking that course alongside this one if you have little or no prior experience applying machine learning techniques, especially unsupervised learning techniques, to linguistic data.

What you’ll need

Software

In this course, we’ll be using Python 3 and, to a lesser extent, R on the Jupyter Notebook platform. It’s possible to follow along by viewing the GitHub-rendered versions of these notebooks (available once the course begins), but you’ll get the most out of the hands-on aspects of the course by actually interacting with the notebooks and associated data. So if you don’t already have Python 3 and Jupyter installed, we’d encourage you to install them now.

Installing Python 3

If you’re new to Python and aren’t using a Linux system, we’d suggest installing Anaconda, which bundles all the software packages you’ll need for the course (listed below). If you’re already running a Linux system, it almost certainly came with at least a Python 2 distribution and probably a Python 3 distribution as well (Debian and its derivatives ship with both), so you’ll likely only need to install the packages required for the course (including Jupyter).
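
Not sure which Python your system provides? You can check from inside the interpreter itself. A minimal sketch:

import sys

# The course materials assume Python 3; this prints the running version
print(sys.version)
assert sys.version_info.major == 3, 'install Python 3 before the course'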

Installing the scipy stack

We’ll be making heavy use of the scipy stack for this course. If you decided to use Anaconda, you’re all set in this regard, since the scipy stack comes preinstalled. Otherwise, follow the installation instructions on the scipy website or run the following commands in your terminal.

python3 -m pip install --upgrade pip
pip3 install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
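
To confirm that the installation succeeded, you can import the core packages and print their version numbers from a Python session. A quick sanity check (the exact versions you see will differ):

import numpy, scipy, matplotlib, pandas

# each of these packages exposes its version string as __version__
for package in (numpy, scipy, matplotlib, pandas):
    print(package.__name__, package.__version__)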

Installing other Python packages

Beyond the scipy stack, you’ll also need some packages built on top of it—in particular: nltk, scikit-learn, seaborn, bokeh, and rpy2.

If you went the Anaconda route, all of the above except rpy2 are already installed. To install rpy2 using Anaconda’s conda package manager, run the following in a terminal.

conda install -c r rpy2=2.8.5

If you’re installing everything from scratch, the most straightforward way to install these is to use pip3.

pip3 install --user nltk scikit-learn seaborn bokeh rpy2
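
Whichever route you took, a quick import check in a Python session will confirm that these packages are in place (again, the version numbers will vary):

import nltk, sklearn, seaborn, bokeh, rpy2

# a failed import here means the corresponding package didn't install
for package in (nltk, sklearn, seaborn, bokeh, rpy2):
    print(package.__name__, package.__version__)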

Installing R

We’ll also be using some R in this course. If you installed Anaconda, you can install R, plus a bunch of useful packages, using conda by running the following in a terminal.

conda install -c r r-essentials

Otherwise, you should install R as well as the tidyverse suite of packages. To install tidyverse, open an R interactive session by calling R in a terminal.

R

Then run the install.packages() command in that session.

install.packages("tidyverse")
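
Since we’ll be calling R from Python through rpy2, it’s worth verifying that both the bridge and the tidyverse installation work. A minimal sketch from a Python session:

import rpy2.robjects as ro

# evaluate R expressions through the rpy2 bridge; each call returns an R vector
print(ro.r('R.version.string')[0])
print(ro.r('as.character(packageVersion("tidyverse"))')[0])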

Data

We’ll be using a variety of datasets and digital lexicons in this course. Here’s a list with links; the version numbers link directly to the datasets themselves.

We’d suggest downloading WordNet and VerbNet through NLTK’s interface. First, open an interactive Python session.

ipython3

Then run the following in that session.

import nltk
nltk.download()

This will open a window where you can select the datasets to download. WordNet and VerbNet can both be found in the Corpora tab.
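
If you’d rather not use the GUI, the same lexicons can be fetched programmatically; wordnet and verbnet are the identifiers NLTK uses for these corpora:

import nltk

# download the two lexicons directly, without the selection window
nltk.download('wordnet')
nltk.download('verbnet')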

Instructor

Aaron Steven White
Department of Linguistics
Goergen Institute for Data Science
Department of Computer Science
Department of Brain & Cognitive Sciences
University of Rochester
aaron.white@rochester.edu

Kyle Rawlins
Department of Cognitive Science
Johns Hopkins University
kgr@jhu.edu

Logistics

Dates: July 24–28, 2017
Time: 17:00–18:30
Place: Université Toulouse (Arsenal campus)
Room: Amphithéâtre C Montané-de-la-Roque

Schedule

Day 1

Topic
Computational lexical semantics in Jupyter

Content
This session gives a broad overview of computational lexical semantics and introduces participants to the Jupyter Notebook platform and the Python SciPy stack. Particular focus will be given to basic manipulation of experimental and corpus data using NumPy, SciPy, and pandas, and to basic data visualization using matplotlib and ggplot2 via the rpy2 IPython interface (a small sketch of that interface appears below).

Notebook
Notebook and datasets (.zip), online viewer, and slides
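
As a taste of the rpy2 IPython interface, here’s a minimal sketch of how a pandas DataFrame can be handed to ggplot2 from inside a notebook (the data frame and column names are invented for illustration):

# in one notebook cell: load the rpy2 magic and build a toy DataFrame
%load_ext rpy2.ipython
import pandas as pd
df = pd.DataFrame({'frequency': [1, 5, 10], 'acceptability': [2.1, 4.3, 5.9]})

# in a separate cell: -i pushes df into R, where ggplot2 can plot it
%%R -i df
library(ggplot2)
ggplot(df, aes(x = frequency, y = acceptability)) + geom_point()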

Day 2

Topic
Computational models of syntactic distribution (part 1)

Content
This session and the following introduce participants to the computational analysis of syntactic distribution. Particular focus will be given to supervised learning methods for relating acceptability and corpus frequency and to unsupervised methods for constructing abstractions of syntactic distribution in the form of vector-based representations (a toy sketch of the supervised approach appears below).

Notebook
Notebook and datasets (.zip), online viewer, and slides
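
As a preview of the supervised side, here’s a toy scikit-learn sketch regressing acceptability ratings on log corpus frequency; all of the numbers are invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# invented data: per-verb corpus frequencies and mean acceptability ratings
frequencies = np.array([3, 10, 50, 200, 1000])
acceptability = np.array([2.0, 2.8, 3.9, 4.5, 5.2])

# regress acceptability on log frequency
X = np.log(frequencies).reshape(-1, 1)
model = LinearRegression().fit(X, acceptability)
print(model.coef_[0], model.intercept_)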

Day 3

Topic
Computational models of syntactic distribution (part 2)

Content
This session and the previous introduce participants to the computational analysis of syntactic distribution. Particular focus will be given to supervised learning methods for relating acceptability and corpus frequency and to unsupervised methods for constructing abstractions of syntactic distribution in the form of vector-based representations (a toy sketch of the unsupervised approach appears below).

Notebook
Notebook and datasets (.zip) and slides
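
As a preview of the unsupervised side, here’s a toy sketch that compresses an invented verb-by-frame count matrix into low-dimensional verb vectors with a truncated SVD; the counts and frame labels are made up:

import numpy as np
from sklearn.decomposition import TruncatedSVD

# invented verb-by-frame count matrix: rows are verbs, columns are
# syntactic frames (say, NP, that-S, and to-VP)
counts = np.array([[20,  1,  0],
                   [15,  2,  1],
                   [ 0, 12,  9],
                   [ 1, 10,  7]])

# compress each verb's frame distribution into a 2-dimensional vector
svd = TruncatedSVD(n_components=2)
verb_vectors = svd.fit_transform(counts)
print(verb_vectors)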

Day 4

Topic
Computational models of argument linking

Content
This session introduces participants to the computational analysis of argument linking. Particular focus will be given to supervised learning methods for relating thematic role annotations to syntactic annotations and to unsupervised methods for inducing thematic roles (a toy sketch of role induction appears below).

Notebook
Notebook and datasets (.zip) and slides
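
As a preview, here’s a toy sketch of role induction as clustering: grouping invented argument feature vectors with k-means. The features, values, and two-role setup are all made up for illustration; the session covers more sophisticated models:

import numpy as np
from sklearn.cluster import KMeans

# invented feature vectors for verb arguments; columns might encode
# properties like volitionality, sentience, and change of state
arguments = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 0, 1],
                      [0, 0, 0]])

# cluster the arguments into two induced "roles"
roles = KMeans(n_clusters=2, n_init=10).fit_predict(arguments)
print(roles)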

Day 5

Topic
Computational models of entailment and inference

Content
This session introduces participants to modern models of entailment and inference. Particular focus will be given to assessing the predictability of veridicality and factivity given syntactic features (a toy sketch of this kind of prediction appears below).

Notebook
Notebook and datasets (.zip), online viewer, and slides (lecture, wrap-up)
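
As a preview, here’s a toy sketch that frames the predictability question as classification: a logistic regression predicting factivity labels from binary syntactic features, with all data invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# invented binary syntactic features for clause-embedding verbs; columns
# might encode, say, whether the verb takes finite and NP complements
features = np.array([[1, 1], [1, 0], [1, 1], [0, 1], [1, 0], [0, 0]])
factive = np.array([1, 0, 1, 0, 0, 0])  # invented factivity labels

clf = LogisticRegression().fit(features, factive)
print(clf.predict_proba(features)[:, 1])  # predicted probability of factivity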

Acknowledgments

Materials in this course are based on research funded in part by DARPA LORELEI, NSF INSPIRE BCS-1344269 (Gradient Symbolic Computation), NSF DDRIG BCS-1456013 (Doctoral Dissertation Research: Learning Attitude Verb Meanings), and the JHU Science of Learning Institute.