This ESSLLI 2017 course covers advanced topics in computational lexical semantics, with a specific focus on using theoretically informed computational models to combine data from experimental linguistics with corpus data. All sessions will be in an interactive tutorial format, using the Jupyter Notebook platform, the Python SciPy stack, and curated datasets.
Who might take this course?
This course is targeted at students and practitioners interested in learning how to apply machine learning techniques to lexical semantic analysis. Familiarity with some high-level programming language (e.g. Python, R, Ruby, Java, etc.) will be useful but not strictly necessary.
Aaron is teaching an ESSLLI 2017 course concurrently, entitled Unsupervised methods for linguistic data, which covers introductory topics in unsupervised machine learning. We strongly recommend taking that course concurrently if you have little or no prior experience in applying machine learning techniques – especially unsupervised learning techniques – to linguistic data.
In this course, we’ll be using Python 3 and, to a smaller extent, R on the Jupyter notebook platform. It’s possible to follow along by viewing the GitHub-rendered version of these notebooks (available once the course begins), but you’ll get the most out of the hands-on aspects of the course by actually interacting with the notebooks and associated data, so if you don’t already have Python 3 and Jupyter installed, I’d encourage you to do that now.
Installing Python 3
If you’re new to Python and aren’t using a Linux system, I’d suggest installing Anaconda, which comes preinstalled with all the software packages you’ll need for the course (listed below). If you’re already running a Linux system, there’s a very high probability that your system came preinstalled with at least a Python 2 distribution and probably also a Python 3 distribution—Debian and derivatives come with both—so you’ll likely only need to install the packages required for the course (including Jupyter).
Installing the scipy stack
We’ll be making heavy use of the scipy stack for this course. If you decided to use Anaconda, you’re all set in this regard, since the scipy stack comes preinstalled. Otherwise, follow the installation instructions on the scipy website or run the following commands in your terminal.
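The exact commands vary by platform; as a sketch, on a system where Python 3 and pip are already available, a typical pip-based route is:

```shell
# Install the core scipy stack packages for the current user.
pip3 install --user numpy scipy matplotlib pandas
```
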
Installing other Python packages
If you went the Anaconda route, all the above except rpy2 are already installed. To install rpy2 using Anaconda’s conda package manager, run the following in a terminal.
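For instance (depending on your conda setup, you may need to specify a channel such as r):

```shell
# Install rpy2 via conda.
conda install rpy2
```
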
If you’re installing everything from scratch, the most straightforward way to install these is to use pip.
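As a sketch, assuming the packages named elsewhere on this page (the scipy stack, Jupyter, rpy2, and NLTK), a pip invocation might look like the following; substitute the actual package list for the course.

```shell
# Illustrative only: install the course's Python dependencies with pip.
pip3 install --user jupyter numpy scipy matplotlib pandas rpy2 nltk
```
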
We’ll also be using some R in this course. If you installed Anaconda, you can install R, plus a bunch of useful packages, using conda by running the following in a terminal.
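For example, Anaconda’s r-essentials bundle installs R together with a collection of commonly used packages (the channel name here follows Anaconda’s R documentation at the time and may differ in your setup):

```shell
# Install R plus common packages from the r channel.
conda install -c r r-essentials
```
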
Then start an R session and run the install.packages() command in that session to add any further packages you need.
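For example, to install ggplot2 (used in the visualization session) from inside R; the package choice here is illustrative:

```r
# Install ggplot2 from CRAN; swap in whatever packages you need.
install.packages("ggplot2")
```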
We’ll be using a variety of datasets and digital lexicons in this course. Here’s a list with links. Version numbers link directly to the datasets themselves.
- Word embeddings
- Unified PropBank
- English Universal Dependencies v2
- JHU Decompositional Semantics Initiative datasets
We’d suggest downloading WordNet and VerbNet through NLTK’s interface. First, open an interactive Python session.
Then run the following in that session.
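The nltk.download() call opens NLTK’s interactive downloader:

```python
import nltk

# Opens the NLTK downloader window described below.
nltk.download()
```
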
This will open a window where you can select the datasets to download. WordNet and VerbNet can be found in the Corpora tab.
Aaron Steven White
Department of Linguistics
Goergen Institute for Data Science
Department of Computer Science
Department of Brain & Cognitive Sciences
University of Rochester
- Dates: July 24–28, 2017
- Place: Université Toulouse (Arsenal campus)
- Room: Amphithéâtre C Montané-de-la-Roque
Computational lexical semantics in Jupyter
This session gives a broad overview of computational lexical semantics and introduces participants to the Jupyter Notebook platform and the Python SciPy stack. Particular focus will be given to basic manipulation of experimental and corpus data using NumPy, SciPy, and pandas and basic data visualization using matplotlib and ggplot2 (using the rpy2 IPython interface).
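As a taste of the kind of data manipulation covered, here is a minimal pandas sketch; the ratings, verbs, and column names are made up for illustration and are not drawn from the course datasets.

```python
import pandas as pd

# Hypothetical acceptability ratings (illustrative data only).
ratings = pd.DataFrame({
    "verb": ["know", "know", "think", "think"],
    "frame": ["NP that S", "NP to VP", "NP that S", "NP to VP"],
    "rating": [6.8, 3.1, 6.5, 2.2],
})

# Mean rating per verb: the sort of aggregation pandas makes trivial.
means = ratings.groupby("verb")["rating"].mean()
print(means)
```
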
Computational models of syntactic distribution (part 1)
This session and the following introduce participants to the computational analysis of syntactic distribution. Particular focus will be given to supervised learning methods for relating acceptability and corpus frequency and unsupervised methods for constructing abstractions of syntactic distribution in the form of vector-based representations.
Computational models of syntactic distribution (part 2)
This session and the previous introduce participants to the computational analysis of syntactic distribution. Particular focus will be given to supervised learning methods for relating acceptability and corpus frequency and unsupervised methods for constructing abstractions of syntactic distribution in the form of vector-based representations.
Computational models of argument linking
This session introduces participants to the computational analysis of argument linking. Particular focus will be given to supervised learning methods for relating thematic role annotations to syntactic annotations and unsupervised methods for inducing thematic roles.
Computational models of entailment and inference
This session introduces participants to modern models of entailment and inference. Particular focus will be given to assessing the predictability of veridicality and factivity given syntactic features.
Materials in this course are based on research funded in part by DARPA LORELEI, NSF INSPIRE BCS-1344269 (Gradient Symbolic Computation), NSF DDRIG BCS-1456013 (Doctoral Dissertation Research: Learning Attitude Verb Meanings), and the JHU Science of Learning Institute.