Abstract: In this tutorial I will show you how to use Snakemake to create a scalable and reproducible data analysis pipeline. Snakemake is a workflow management system that uses a set of rules to define the steps of an analysis, and it integrates smoothly with server, cluster, and cloud environments so that pipelines scale easily. Each rule defines its input files, its output files, and the steps to get from input to output; wildcards let a single rule apply to many files. Snakemake is a very general framework for creating pipelines, and I have used it for bioinformatics applications. I will walk through an example pipeline that analyzes RNA-sequencing data and discuss a program for which I have used Snakemake. The program’s goal is to identify genetically engineered organisms in complex samples. Its pipeline includes aligning DNA sequences, assembling sequences that contain signs of engineering into larger constructs, and creating visualizations.
Part 1: The foundations of Snakemake
We will begin by covering when a data analysis pipeline is useful and when Snakemake in particular is the right tool for the job, then move into the mechanics of Snakemake itself. I’ll cover the basics of setting up a pipeline, along with several useful features that increase flexibility and scalability.
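To give a feel for the rule-and-wildcard idea before the session, here is a toy sketch in plain Python. It is not Snakemake syntax and not how Snakemake is implemented; the file names, the `aligner` command, and the helper functions `match_wildcards` and `plan` are all hypothetical, and the sketch assumes each wildcard appears only once in a pattern.

```python
import re

# A hypothetical rule: input pattern, output pattern, and the command that
# turns one into the other. "{sample}" is the wildcard.
rule = {
    "input": "reads/{sample}.fastq",
    "output": "aligned/{sample}.bam",
    "shell": "aligner {input} > {output}",  # "aligner" is a made-up command
}


def match_wildcards(pattern, target):
    """Return the wildcard values if target fits pattern, else None."""
    # Splitting on "{name}" yields literal text at even indices and
    # wildcard names at odd indices.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+)"      # a wildcard matches any non-empty text
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


def plan(rule, target):
    """Work out the input file and command needed to produce target."""
    wildcards = match_wildcards(rule["output"], target)
    if wildcards is None:
        return None                         # this rule cannot make the target
    input_file = rule["input"].format(**wildcards)
    command = rule["shell"].format(input=input_file, output=target)
    return input_file, command


if __name__ == "__main__":
    # The same rule serves any number of samples.
    for target in ["aligned/sampleA.bam", "aligned/sampleB.bam"]:
        print(plan(rule, target))
```

The point of the sketch is the direction of reasoning: you ask for an output file, the wildcard values are inferred from its name, and the matching input and command follow from the one rule.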
Part 2: An example pipeline to analyze RNA-seq data
I have published a pipeline of intermediate complexity that performs RNA-seq analysis (https://github.com/laseaman/odsc_snakemake_tutorial). I will walk through the pipeline, focusing on the syntax and problem formulation.
Part 3: A real-life example for identifying genetic engineering
FELIX is a program I have worked on for the last two and a half years, with the goal of identifying genetically engineered organisms in complex samples. For FELIX, I use Snakemake to create pipelines that have been deployed on clusters to process 100 samples (hundreds of Gb of data), performing over a thousand hours of compute in about two days. The pipeline includes aligning DNA sequences, assembling sequences that contain signs of engineering into larger constructs, identifying the makeup of those constructs, and creating visualizations so that subject matter experts can quickly determine whether a sample has been genetically engineered.
A background in Python, Anaconda, and/or bioinformatics would be helpful but is definitely not required!
Bio: Laura Seaman is a Senior Machine Intelligence Scientist at Draper, where she applies machine learning and bioinformatics algorithms to a variety of applications. Dr. Seaman’s graduate work focused on using genetic data to identify alterations to genomic structure in cancer. She currently uses data science for many applications, including analysis of financial networks and identification of genetically engineered organisms. Dr. Seaman has a Bachelor of Science in Biological Engineering from the Massachusetts Institute of Technology, a Master of Arts in Statistics from the University of Michigan, and a Doctor of Philosophy in Bioinformatics from the University of Michigan.
Laura A. Seaman, PhD
Machine Intelligence Scientist | Draper