Abstract: The Python ecosystem has many libraries for natural language processing (NLP), which can make it confusing to get started analyzing text as data. This workshop will introduce spaCy as a powerful, opinionated library for NLP that facilitates analysis of text data, along with textacy, a library that adds information retrieval and corpus analysis features.
By completing this workshop, you will develop core skills in asking questions of text and identifying interesting features through spaCy's tokenization, part-of-speech tagging, and named entity recognition. You will also learn to expand that analysis and scale it to many documents through textacy.
Lesson 1: Working with a single document, perform word and sentence tokenization, part-of-speech tagging, and named entity recognition while forming analytical questions.
Lesson 2: Working with a small set of documents and the textacy library, learn to extract information at the corpus level based on the same grammatical features identified in Lesson 1.
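As a rough preview of the kind of code the two lessons build toward, the sketch below runs the three steps named in Lesson 1 on a single text and then counts entity labels across a few texts. It is not the workshop's own material. It assumes spaCy v3 and the `en_core_web_sm` model; the helper names `load_pipeline`, `analyze`, and `entity_label_counts` are made up for illustration. The multi-document step uses spaCy's `nlp.pipe` with a plain `Counter` as a stand-in, not textacy's corpus features.

```python
from collections import Counter

import spacy


def load_pipeline(model_name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English pipeline.

    Assumption: spaCy v3. The model name is the small English model; any
    installed English pipeline should work the same way.
    """
    try:
        return spacy.load(model_name)
    except OSError:
        # Model not installed: tokenization and sentence splitting still work,
        # but part-of-speech tags and entities will be empty.
        nlp = spacy.blank("en")
        nlp.add_pipe("sentencizer")
        return nlp


def analyze(nlp, text):
    """Collect tokens, sentences, POS tags, and named entities for one text."""
    doc = nlp(text)
    return {
        "tokens": [token.text for token in doc],
        "sentences": [sent.text for sent in doc.sents],
        "pos": [(token.text, token.pos_) for token in doc],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    }


def entity_label_counts(nlp, texts):
    """Count entity labels across several texts (a simple corpus-level view)."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(ent.label_ for ent in doc.ents)
    return counts


if __name__ == "__main__":
    nlp = load_pipeline()
    sample = (
        "Scott Bailey teaches Python workshops. "
        "He works at NC State University Libraries."
    )
    result = analyze(nlp, sample)
    print(result["sentences"])
    print(result["pos"])
    print(result["entities"])
    print(entity_label_counts(nlp, [sample, "Stanford Libraries hosts CIDR."]))
```

The tags and entities printed depend on which model is installed, so no expected output is shown here. With the blank fallback pipeline, the part-of-speech and entity lists come back empty.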
Participants should have a reasonable grasp of basic Python syntax, including control flow, functions, and list operations. Knowledge of English syntax, such as parts of speech, will be helpful but not necessary for successful participation.
Bio: Scott Bailey is the Digital Research and Scholarship Librarian at the NC State University Libraries, where he collaborates with faculty and other scholars in applying digital and computational tools and methods to open new possibilities in research and learning. He regularly teaches workshops using programming languages like Python and R to introduce data analysis and visualization, machine learning, and computational approaches to text data. He was previously the Head of Social Science Data and Software in the Center for Interdisciplinary Digital Research (CIDR) at Stanford Libraries, where he oversaw a group of Ph.D. students in delivering expert consultation on statistical computing, organized and taught in the CIDR workshop series, and collaborated with colleagues across Stanford University to provide better access to data and support for data-driven research.