
Abstract: Accelerating progress in personalized healthcare requires learning the causal relationships between diseases, genes, treatments, medications, labs, and other clinical information – at scale over a large population and time range. More than half of the clinically relevant data in oncology is found only in free-text pathology reports, radiology reports, sequencing reports, and progress notes.
Extracting and normalizing these facts requires training oncology-specific models that can extract them accurately from a variety of clinical documents. This talk describes results and lessons learned from a real-world project doing this at scale, in three areas:
1. Applying state-of-the-art deep-learning-based NLP models for entity recognition, entity resolution, negation detection, and document segmentation. This is one of the first projects outside a research setting to apply BioBERT; we’ll compare it with “vanilla” BERT, share tricks for improving embeddings using vocabularies, and discuss the impact of this form of transfer learning on the ability to learn from small labeled datasets.
2. Using Spark NLP for training and inference of these NLP pipelines. This unifies processing, from loading documents to generating final results, in a way that runs well locally and scales natively on a Spark cluster. We’ll share benchmarks from optimized builds on Intel and Nvidia hardware.
3. Considerations for architecting an AI platform that can process protected health information (PHI) in a secure and compliant way, for both training and inference. This involves operating the whole process – data integration, experimentation, scaling to a cluster, versioning, reproducibility, model deployment – within an air-gapped environment without Internet access.
Bio: David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and with Amazon in both Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a Ph.D. in computer science and master’s degrees in both computer science and business administration.