Where can I see the schedule in brief?

The ODSC schedule overview is available on this page.

What sessions are included in my pass?

  • ODSC Talks/Keynotes run from Tuesday, April 23 to Thursday, April 25. In-person sessions are available to Silver, Gold, Platinum, Mini-Bootcamp, and VIP Pass holders. Business talks are available to Ai X Pass holders. Virtual sessions are available to Virtual Premium, Virtual Platinum, and Virtual Mini-Bootcamp pass holders.
  • ODSC Trainings run from Monday, April 22 to Wednesday, April 24. In-person sessions are available to Platinum, Mini-Bootcamp, and VIP Pass holders. Virtual sessions are available to Virtual Platinum and Virtual Mini-Bootcamp pass holders.
  • ODSC Workshops/Tutorials run from Tuesday, April 23 to Thursday, April 25. All in-person sessions are available to VIP, Platinum, Mini-Bootcamp, and Gold pass holders. Silver Pass holders can attend only on Wednesday and Thursday. Virtual sessions are available to Virtual Premium, Virtual Platinum, and Virtual Mini-Bootcamp pass holders.
  • ODSC Bootcamp Sessions are held virtually on Monday, April 22, as pre-conference training. They are available only to Mini-Bootcamp, VIP, and Virtual Mini-Bootcamp pass holders.
Please note: In-person attendees will have access to virtual sessions. All sessions fill up on a first-come, first-served basis.
If you have a virtual pass, please remember that we will not live-stream any in-person sessions. Only virtual sessions will be recorded.
Conference Time: ET (UTC – 4)

70+ more sessions coming soon!


All conference times are Eastern Daylight Time (ET, UTC – 4).




9:00 am - 9:25 am Responsible AI

AI and Society

Alex Pentland, PhD Professor at MIT | Founder and Director at MIT Connection Science

Ai X Keynote: AI is changing the way governments, companies, and individuals behave. By choosing the right sort of architectures for AI and for data, we can make this transition much safer and more likely to produce a healthy society. I will describe how our community transformers approach to AI (http://transformers.mit.edu) combines privacy-preserving AI that allows communities to derive insights without sharing data, zero-knowledge proofs that allow verification of data and inference, and governance mechanisms that minimize the potential for liability, and how these can come together to produce “human centric” AI.

9:30 am - 10:00 am Data Engineering

The 12 Factor App for Data

James Bowkett Technical Delivery Director at OpenCredo

DE Summit: Data is everywhere, and so too are data-centric applications. As the world becomes increasingly data-centric, and the volumes of that data increase over time, data engineering will become more and more important. If we're going to be dealing with petabytes of data, it is better to get the fundamentals in place before you start than to retrofit best practices onto mountains of data, which only makes a difficult job harder. The 12-factor app helped to define how we think about and design cloud native applications. In this presentation, I will discuss 12 principles of designing data-centric applications that have helped me over the years, across 4 categories: Architecture & Design, Quality & Validation (Observability), Audit & Explainability, and Consumption. These principles have led to our teams delivering data platforms that are both testable and well-tested. The 12 factors also enable platforms to be upgraded in a safe and controlled manner and help them get deployed quickly, safely and repeatedly. This talk will be filled with examples and counterexamples from the course of my career and the projects that my teams have seen over the years. It will incorporate software engineering best practices and how these apply to data-centric engineering. We hope that you can benefit from some of our experience to create higher quality data-centric applications that scale better and get into production quicker.

9:30 am - 10:00 am Data Engineering

Breaking the ice: How Apache Iceberg is Revolutionizing the Modern Data & AI Stack

Roy Hasson VP of Product at Upsolver

DE Summit: The days of the modern data & AI stack are over. Single-vendor solutions to store, process, train and serve analytics and AI are often easier to get started with, but they quickly become expensive and limit your ability to experiment with new tools, models and technologies. Apache Iceberg is an open table format that decouples the storage and data maintenance functions of a warehouse, creating an open, scalable and cost-effective "shared storage" on top of object stores. This shared storage is the basis for the new Lakehouse architecture, enabling a wide range of tools to discover, transform and access data. In this session, you'll learn why creating a shared storage using Iceberg is the future and how it fits into the Lakehouse architecture of your dreams. We’ll go under the hood of Iceberg to teach you how it enables distributed transactions and multiple concurrent writers, and how it manages physical data and metadata. We’ll also discuss the various tools supporting Iceberg and how to get started building your new, open, scalable and flexible data stack. I hope you join me on this cool journey, breaking through the ice to find the future of data and AI platforms. Session Outline: Iceberg internals (distributed transactions, multiple concurrent writers, management of physical data and metadata), followed by the tools that support Iceberg and how to get started. Tools covered: Apache Iceberg itself and tools that use Iceberg (Spark, Trino, DuckDB, ClickHouse, PyIceberg).

9:30 am - 10:00 am

Identifying AI Use Cases for Maximum Business Value

Rudina Seseri Founder and Managing Partner at Glasswing Ventures

Ai X Track Keynote: Enterprises want to “use AI,” but what exactly does that mean? In 2023, it seemed like AI was everywhere; but was it really? In 2024, we move beyond the realm of experimentation and into the world of foundational business transformation. This keynote features Rudina Seseri, Co-Founder and Managing Partner of Glasswing Ventures, for a dive into how the emerging world of “ambient” AI drives comprehensive and proactive advances across a business. From manufacturing to sales and the life sciences, the emerging ubiquity of AI technologies leverages data not just to deliver optimization and human augmentation at unprecedented scale, but also to unlock entirely new capabilities for end users. Finally, the session explores how to design a practical AI system that drives ultimate business value in a world where HOW leaders utilize AI can be make-or-break in a changing ecosystem of competitors and peers. Attendees will learn how to identify AI techniques and architectures, including and beyond Generative AI, and understand how these elements can be combined into effective business applications.

10:00 am - 10:30 am

Retrieval Augmented Generation and the Need for Private AI

Amanda Saunders Director, Generative AI Product Marketing at NVIDIA Corporation
Hunter Almgren Distinguished Technologist at Hewlett Packard Enterprise

Ai X Talk: Session Abstract Coming Soon!

10:00 am - 10:30 am Data Engineering

Designing ETL Pipelines with Delta Lake and Structured Streaming — How to Architect Things Right

Tathagata Das Staff Software Engineer at Databricks

DE Summit: Structured Streaming has proven to be the best framework for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you to architect a pipeline that solves your business needs in the most resource-efficient manner. In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions:
  • WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
  • WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
  • HOW are you going to architect the solution? And how much are you willing to pay for it?

11:00 am - 11:30 am Data Engineering

Data Infrastructure through the Lens of Scale, Performance and Usability

Ryan Boyd Co-founder at MotherDuck

DE Summit: Silicon Valley engineers and engineering challenges have ruled the data world for the last 20 years. The net result is data infrastructure companies focusing on being the highest-scale, fastest systems to process enormous amounts of data, usability be damned. We don’t all have movie libraries the size of Netflix, search indexes the size of Google or social graphs the size of Meta. This talk explores the changes in hardware and mindsets enabling a new breed of software that is optimized for the 95% of us who do not have petabytes to process daily. I worked on Google BigQuery in 2012. At the time, the max size of memory on an EC2 machine was 60.5GB. Today, we have EC2 machines with 25TB of RAM. Our software design for data services, focused on distributed architectures, hasn’t taken into account that massive 400x change in the amount of memory available. At the same time, our laptops have gotten so much more powerful, with 16x the amount of RAM available in today’s MacBook Pro vs the ones offered in 2012. Shouldn’t our data infrastructure be adapted to take advantage of this local compute? What does this change in hardware and software mean for the user experience? Instead of focusing on consensus algorithms for large-scale distributed compute, can our engineers focus on making data more accessible and more usable, and on reducing the time between “problem statement” and “answer”? That’s the dream that I’m exploring and where I want to push our industry over the next 5 years.

11:00 am - 11:30 am

How to Scale Trustworthy AI

Paul Hake Principal AI Engineer at IBM

Solution Showcase: Deriving value from AI starts with rigorous data procurement for your models and ends with control points for documenting the basis for responsible deployments of models and applications, with a focus on governance, risk assessment, bias mitigation, and compliance. To benefit from generative AI, companies need a unified AI and data platform that infuses intelligence into business operations while offering the flexibility and scalability to meet specific business use cases and requirements. Attend the IBM session to learn how to scale generative AI deployments with trusted data and governance, spanning the entire AI model lifecycle.

11:00 am - 11:30 am

From Pilot to Scale: How Companies are Deploying Generative AI in 2024

Carlo Appugliese Director, Generative AI, Machine Learning and AI Engineering at IBM

There’s a lot of interest and buzz around generative AI; however, over 90% of companies are still in the experimental phase. So how do you take those pilots to production and drive ROI for your business? Join Carlo Appugliese, Program Director, Client Engineering, IBM watsonx, as he shares top insights from IBM’s generative AI client engagements, including how to develop a model strategy to keep costs down and improve performance while using AI responsibly.

11:00 am - 11:30 am Generative AI

From Gen AI to Digital Twins: Powering the Next AI Revolution

Santosh Kumar Radha Head of Product/Research at Agnostiq.ai

Ai X Talk: In the dawn of the AI industrial revolution, advanced technologies like digital twins, generative AI and multi-agent systems are revolutionizing industries: predicting weather, training humanoid robots, accelerating drug discovery, and redefining design and manufacturing processes. The vision of multi-agent systems that can mobilize an army of specialized agents to work hand in hand with humans is rapidly becoming a reality, promising enhanced efficiency and innovation. Yet these transformative advancements entail a significant surge in computational requirements, straining our existing infrastructures to their limits and necessitating a shift towards a novel infrastructure paradigm. Unlike traditional cloud applications, the compute-intensive AI landscape requires a fundamental rethinking of compute interaction. AI applications necessitate a flexible, hybrid infrastructure to handle their dynamic training and inference needs, diverse resource requirements, and complex queue management stemming from limited, finite resources. This shift introduces challenges for organizations. Custom scripts, disjointed team efforts, and unpredictable cloud expenses can lead to inefficient compute management, increased costs, delayed time-to-market, and missed opportunities. Organizations struggle to choose between hyperscale clouds, specialized microclouds, or dedicated AI-centric compute clusters, all while navigating the complexities of a new infrastructure software stack. To address these challenges, the new era of AI needs a unified platform. This requires infrastructure capable of orchestrating high-performance computing, specialized hardware, and cloud resources.
Join Santosh as he navigates the complexities of the AI compute landscape, offering practical strategies for success amidst evolving technological frontiers.

11:00 am - 11:30 am

How the Kansas City Chiefs Champion Digital Transformation with Python Analytics

Andrew Schutte Senior Data Scientist at Kansas City Chiefs
Michael Ragsdale Vice President of Finance, Strategy, and Analytics at Kansas City Chiefs

As consecutive Super Bowl champions in 2023 and 2024, the Kansas City Chiefs have demonstrated immense success both on and off the field. With one of the highest home attendance rates in the NFL, their front office prioritized data analytics as the center of business and game-day stadium operations. They needed a flexible, scalable solution that could support a continuous stream of live data, so they turned to Dash Enterprise to create secure Python data applications. Join this special presentation from Michael Ragsdale and Andrew Schutte to learn how the Chiefs achieved the following:
  • Live data processing in game-day apps to increase understanding of crowd flows, parking logistics, and staffing
  • 32 person-hours saved every week from automating manual workflows
  • Reduced software licensing costs while delivering high-value business analytics

11:20 am - 11:50 am Generative AI

Generative AI for Social Good

Colleen Molloy Farrelly Chief Mathematician at Post Urban Ventures

This talk will focus on current generative AI methods, including image and text generation, with an emphasis on social good applications, including medical imaging applications, diversity training applications, public health initiatives, and underrepresented language applications. We'll start with an overview of common generative AI algorithms for image and text generation before launching into a series of case studies with more specific algorithm overviews and their successes on social good projects. We'll explore an algorithm called TopoGAN that is being used to augment medical image samples. We'll look at GPT-4 and open-source large language models (LLMs) that can generate cases of bias and fairness. We'll consider how language translation and image generators such as Stable Diffusion can quickly produce public health campaign material. Finally, we'll explore language generation with low-resource languages like Hausa and Swahili, highlighting the potential for language applications in the developing world to aid businesses, governments, and non-profits communicating with local populations. We'll end the talk with a discussion of ethical generative AI and potential for misuse. Learning outcomes include familiarity with common generative AI algorithms and sources, their uses in a variety of settings, and ethical considerations when developing generative AI algorithms. This will equip programming-oriented data scientists with a background to implement algorithms themselves and business-focused analytics professionals with a background to consider strategic initiatives that might benefit from generative AI.

11:35 am - 12:05 pm

The Mesholith: Practical Tips to Building a Semi-Centralized Data Platform

Stephen Bailey, PhD Engineering Manager at Whatnot

DE Summit: Abstract Coming Soon!

11:35 am - 12:05 pm

How to Level-up your Creativity and Projects with On-demand Analytics

Joe Madden Senior Product Manager at SAS

Solution Showcase: Often, data scientists build their models on an entirely different stack from ML engineers, creating inefficiencies and silos and leading to frustration. In this demo, we will show how a streamlined handoff between a data scientist and their ML engineer counterpart helps to get the model into production faster. The scenario covers a data scientist or developer, using Python or SAS, creating a model that can easily be shared, published, and registered to their production system of choice.

11:35 am - 12:05 pm Machine Learning

Unlocking Hidden Value: Data Science + Optimization (Real-world Examples & Practical Coaching)

Rob Stevens Vice President at First Analytics

Data science and optimization tools naturally complement each other: each can solve problems that the other can’t. Yet optimization and data science are typically taught in different degree programs and practitioners of one discipline often know less than they should about the other. This session will explain how the toolsets differ and what each brings to the table. We’ll show real-world examples of optimization and predictive models complementing each other, point out resources where you can learn more, and give advice on when you need one vs. the other. This session is aimed at both leaders and hands-on analysts who are interested in the big picture, but who also may want some very practical coaching.

11:35 am - 12:35 pm Generative AI

Panel: Data Analytics in the Age of AI

Peter Luo, PhD Co-Founder, CEO & Chief AI Officer at DTonomy
Kirsten Stone Head of Quantitative Manager Research at BNY Mellon
Derya Isler Vice President, Search and Personalization at Sirius XM
Aleksandar Tomic, PhD Associate Dean, Strategy, Innovation, and Technology at Boston College

Abstract Coming Soon!

12:10 pm - 12:40 pm LLMs

CodeLlama: Open Foundation Models for Code

Baptiste Roziere Research Scientist at Meta

In this session, we will present the methods used to train Code Llama, the performance we obtained, and show how you could use Code Llama in practice for many software development use cases. Code Llama is a family of open large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B, and now 70B parameters each. Code Llama reaches state-of-the-art performance among open models on several code benchmarks. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other open model on MultiPL-E. Code Llama was released under a permissive license that allows for both research and commercial use.

12:10 pm - 12:40 pm Data Engineering

Automating Data Engineering Workloads Using Generative AI

Ashish Mrig Director of Data Services at HealthEdge

A typical data organization spends a lot of time building the following artifacts, among other things:
  • Ingesting customer files
  • Generating business metrics
  • Building analytics reports
  • Publishing data extracts
  • Creating feature stores for ML
A lot of these tasks are repetitive and time-consuming, and they create complexity for the data team, which leads to technical debt and long delivery cycles. Welcome to the world of Generative AI, where we will use commercially available Large Language Models (LLMs), such as those from OpenAI, to demonstrate how to solve these problems using innovative design patterns. In this session, we will be talking about Generative Data Engineering, starting from basic concepts and contexts and progressing to how to generate your data pipeline and analytics code rather than writing the code manually. These design patterns go a long way in shortening the lead time, automating undifferentiated work, and allowing data teams to move up the data value chain. They can be learned relatively quickly and can be implemented either as a simple one-off solution or deployed as an enterprise framework.

12:10 pm - 12:40 pm

Dive into the Lightning AI Open Source Stack and Lightning Studios to Unlock Reproducible AI Development on the Cloud

Luca Antiga CTO at Lightning AI

PyTorch Lightning is a leading open source framework that was used to train several of the best generative AI models. With over 100 million downloads, it is the framework of choice for researchers and companies worldwide to train and fine-tune AI models. PyTorch Lightning and the rest of the Lightning AI stack, which includes Fabric, TorchMetrics, litgpt, litdata, and Thunder, provide a cutting-edge open-source foundation for practitioners. Just as the Lightning AI open source stack is democratizing access to cutting-edge AI research and engineering, Lightning Studios democratize access to cloud computing resources and solve the challenges of reproducibility and collaboration. Lightning Studios offer a laptop-like cloud experience, enabling seamless accelerated computing for individuals and organizations. Studios provide reproducible environments for development, training, and hosting applications across diverse hardware setups, with support for multi-node training and parallel data processing. Attendees will gain firsthand experience and access to datasets, models, and source code.

12:40 pm - 1:10 pm Large Language Models

LLM-native Products: Industry Best Practices and What's Ahead

Ivan Lee CEO / Founder at Datasaur.ai

Ai X Talk: The next generation of LLM-powered products will not look and feel like ChatGPT. ChatGPT captured everyone's imagination because it could do everything - suggest international travel plans, generate ideas for dates, and analyze lengthy legal contracts. Now companies are building AI models that are more narrowly focused and must avoid the recurring issues of hallucination and non-deterministic answers. The mistakes made by ChatGPT and its successors pose much greater business risks for the enterprise. An incorrect answer to a simple math problem or directions to a cafe that doesn't exist might make the news as a silly, laughable mistake. But a customer socially engineering a model to ask about confidential pricing documents, or an employee poking around HR performance reviews could result in a PR fiasco. In August 2023, a generative AI for recipes mistakenly produced a recipe that was poisonous to humans. As an NLP practitioner of over a decade, I have experience working with Fortune 100 companies and actively observe trends in how they are seeking to deploy LLM/GenAI technologies. This talk will explore the business objectives LLMs are most suited to address across all industries, and how organizations ranging from healthcare to eCommerce are leveraging this new technology to increase revenue or cut costs. We will also discuss how to avoid common pitfalls organizations make in scoping and building their projects, how to set reasonable goals and milestones for executive sponsors, and look ahead to a future where LLMs ubiquitously power all our day-to-day applications behind the scenes.

12:50 pm - 1:20 pm Data Visualization

Build-a-Byte: Constructing Your Data Science Toolkit

Jarai Carter, PhD Senior Manager at John Deere

Ai X Panel: Abstract Coming Soon!

1:10 pm - 1:40 pm Data Engineering

From Research to the Enterprise: Leveraging Large Language Models for Enhanced ETL, Analytics, and Deployment

Ines Chami Co-founder and Chief Scientist at NUMBERS STATION AI

As Foundation Models (FMs) continue to grow in size and capability, data is often left behind in the rush toward solving problems involving documents, images, and videos. This talk will describe our research at Stanford University and Numbers Station AI on applying FMs to structured data and their applications in the modern data stack. Starting with ETL/ELT, we'll discuss our 2022 VLDB paper "Can Foundation Models wrangle your data?", the first line of work to use FMs to accelerate tasks like data extraction, cleaning and integration. We'll then move up the stack and discuss our work at Numbers Station to use FMs to accelerate data analytics workflows by automating tasks like text-to-SQL generation, semantic catalog curation and data visualizations. We will conclude by discussing challenges and solutions for production deployment in the modern data stack.

1:30 pm - 2:00 pm Generative AI

Programming LLMs for Business Application is Way Better Than 'Tuning' Them

Tsvi Lev Managing Director of the NEC Research Center in Israel at NEC Corporation

In modern enterprises many employees routinely have to perform tasks related to text understanding and basic manipulation. These include, for example, classification, routing, data extraction into spreadsheets/databases, and formatted summarization into a report. Typically, LLM-based approaches to these tasks aim to 'train' or 'fine tune' the LLM to do the same, based on labeled data curated by the organisation itself and/or external open source datasets from the relevant industry. However, the preparation of such datasets is an expensive, time-consuming process. Another issue is that when the tuned LLM makes mistakes, it is not clear how much new labeled data, and of which kind, is required to solve the issue. This makes actual commercial use difficult for many use cases. Humans, of course, can accomplish the same tasks by being given a written or oral policy. In the NEC research labs, we have created an LLM-based process of converting policies into 'prompt ensembles' that are used to effectively 'program' the existing, untuned or lightly tuned LLM to give discrete, constrained-by-prompt answers, which are then aggregated and filtered to yield the final result. Constraining the results by breaking up the policy into a 'prompt ensemble' prevents the problem of hallucinations, as the LLM answers are discrete or very short. At the same time, the accuracy level is increased by using partially overlapping/repeating prompts. When the resulting system does not implement the policy as desired, or the corporate policy itself is changed, it is easy to locate and modify/add the relevant prompts and verify that the mistake will not be repeated. We will show sample cases where we have successfully applied this method to several common problems, e.g. fine-grained text classification of emails on the Enron dataset, and compliance verification on contracts and budget-related corporate texts. The process is not strongly coupled to the use of a specific LLM, nor to a specific language.
As LLMs become stronger, our method directly benefits from that.
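
For readers unfamiliar with the idea, the following is a minimal sketch of how a "prompt ensemble" with discrete answers and vote aggregation might look. It is an illustration only and not the NEC implementation: the labels, the prompts, the `classify` function, the `min_share` threshold, and the keyword-matching `fake_llm` stand-in are all hypothetical, and in practice `ask` would call a real LLM.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical prompt ensemble: each policy label is broken into several
# partially overlapping yes/no questions (assumed wording, for illustration).
PROMPT_ENSEMBLE: Dict[str, List[str]] = {
    "legal": [
        'Does the text mention a "contract"? Answer yes or no.',
        'Does the text mention a "lawsuit"? Answer yes or no.',
        'Does the text mention "compliance"? Answer yes or no.',
    ],
    "scheduling": [
        'Does the text mention a "meeting"? Answer yes or no.',
        'Does the text mention a "calendar"? Answer yes or no.',
        'Does the text mention a "reschedule"? Answer yes or no.',
    ],
}


def fake_llm(prompt: str, text: str) -> str:
    """Stand-in for a real LLM call: answers 'yes' if the quoted term
    in the prompt occurs in the text. Purely an assumption for the demo."""
    term = re.search(r'"([^"]+)"', prompt).group(1)
    return "yes" if term in text.lower() else "no"


def classify(
    text: str,
    ask: Callable[[str, str], str],
    min_share: float = 0.5,
) -> Optional[str]:
    """Ask every prompt, keep only discrete yes/no answers, and return the
    label whose share of 'yes' votes is highest and above min_share."""
    scores: Dict[str, float] = {}
    for label, prompts in PROMPT_ENSEMBLE.items():
        answers = [ask(p, text).strip().lower() for p in prompts]
        valid = [a for a in answers if a in ("yes", "no")]  # filter step
        if valid:
            scores[label] = valid.count("yes") / len(valid)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > min_share else None


if __name__ == "__main__":
    print(classify("Please review the contract before the compliance audit.", fake_llm))
    print(classify("Lunch was great.", fake_llm))
```

Because each prompt yields only "yes" or "no", a wrong label can be traced to the individual prompts that voted for it, which is the property the abstract highlights when it says prompts are easy to locate and modify.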

2:00 pm - 2:10 pm Responsible AI

Language Modeling, Ethical Considerations of Generative AI, and Responsible AI

Madiha Shakil Mirza NLP Engineer at Avanade

Women in Data Science Ignite: My session will focus on:
  • Technological evolution in Artificial Intelligence and Natural Language Processing which led to Generative AI
  • What is Generative AI
  • How Generative AI differs from Machine Learning and Deep Learning
  • How Large Language Models are built
  • The difference between Statistical Language Models and Neural Language Models
  • Ethical considerations for Generative AI such as Bias, Privacy, Copyright, Intellectual Property Rights, Misinformation, and Environmental Impact
  • Responsible AI and how we can play our part to ensure that Large Language Models are developed and used responsibly

2:00 pm - 2:30 pm Data Engineering

Why the Hype Around dbt is Justified

Dustin Dorsey Sr. Cloud Data Architect at Onix

DE Summit: The hype for dbt is everywhere you look, but is it really justified? Why do you need a tool to just run SQL when most data stores support SQL themselves? In this 30-minute session I am going to break down what dbt really is, what makes it unique, and show you why it is so much more than just SQL. We will look at what makes it so popular (and unpopular) as a data transformation tool and the driving factors behind those opinions, dispelling some mistruths along the way. If you are new to dbt and trying to wrap your head around this tool, then this is the session for you! Come find out if the hype is real!

2:00 pm - 2:30 pm Generative AI

Designing AI Systems for Trust

Cal Al-Dhubaib Head of AI and Data Science at Further

Ai X Talk: As AI becomes integral to business strategy, many organizations are navigating the complex interplay between technical innovation, creating business value, and managing risk. Despite the progress, fewer than 10% of organizations have successfully deployed generative AI solutions (as of Jan 2024). In many cases challenges arise with human adoption, alignment with business values, risk management processes, and unexpectedly costly data curation efforts. With a focus on business and technical leaders responsible for bringing AI solutions to life, we will draw from best practices in designing and deploying AI solutions across mission-critical sectors such as healthcare, energy, and financial services, where trust is critical. Participants will walk away with some practical tools to lead their organizations in developing and deploying AI solutions that are not only technically sound but also widely trusted.

2:00 pm - 2:30 pm

Unblock Data Science Teams with Prefect: Self-serve Python Scripting to Deployment

Mike Grabbe Principal Data Engineer at Prefect.io
Taylor Curran Senior Sales Engineer & EF Educational Tours at Prefect.io

2:00 pm - 3:00 pm

Women in Data Science Ignite: Sharing Insights and Networking Session

Roopa Vasan Chief AI Architect at Leidos
Madiha Shakil Mirza NLP Engineer at Avanade

  • Hema Seshadri - Moderator
  • Madiha Shakil Mirza - Language Modeling, Ethical Considerations of Generative AI, and Responsible AI
  • Roopa Vasan - My AI: Awareness in Evolving AI Technologies
  • DeAnna Duval - EmpowerHer Talks: Humanizing Leadership
  • Yi-Chun Lai - Utilizing XGBoost to Predict High-Performance Microbial Communities in Wastewater Treatment Plants

2:10 pm - 2:20 pm Responsible AI

My AI: Awareness in Evolving AI Technologies

Roopa Vasan Chief AI Architect at Leidos

Women in Data Science Ignite: AI permeates and has transformed daily routines, whether it's interacting with virtual assistants, personalized recommendations for user preferences, or smart home devices. As technology grows, so does the need to educate professionals outside of the traditional technical fields. Building an AI-literate community means developing people to ask the right questions: How do I approach or communicate about AI? What are the strengths and limitations of AI? How is the public impacted by AI implementations? What tools are needed to evaluate AI? Beyond the technical challenges, recommended regulations, such as Executive Order 14110, and program stakeholders impose increasing demand and requirements for AI explainability, fairness/equity, and formal governance, while also introducing confusion around what this actually means for people. Fundamental to this approach is a governance framework combining people, processes, and technology to ensure performance and enable trust in AI solutions. Join Leidos Chief AI Architect Roopa Vasan to discuss the implications and importance of having an AI-literate community that questions data responsibly and aligns with AI governance. By the end of the presentation, you will walk away with an appreciation for data literacy, the stakeholders involved in responsible AI development, and first steps on implementing AI guardrails.

2:10 pm - 2:40 pm Data Engineering

Mastering Real-time Processing While Effectively Scaling Data Quality with Lambda or Kappa Architecture

Sreyashi Das Senior Data Engineer at Netflix
Vipul Bharat Marlecha Senior Big Data Engineer at Netflix

In a world that creates 2.5 quintillion bytes of data every day, auditing data at scale becomes a challenge of unprecedented magnitude. ‘Mastering Real-time Processing while effectively scaling Data Quality with Lambda or Kappa Architecture’ provides a deep-dive into powerful methodologies, revealing design patterns that turn this challenge into an opportunity for businesses. Join us as we navigate the complexities of data audits and discover how leveraging these techniques can drive efficiency, reduce latency, and deliver actionable insights from your data - at any scale.

2:20 pm - 2:30 pm

EmpowerHer Talks: Humanizing Leadership

DeAnna Duval Senior Manager, Intelligence Engagement at CompTIA

Session Abstract Coming Soon!

2:30 pm - 2:40 pm

Utilizing XGBoost to Predict High-Performance Microbial Communities in Wastewater Treatment Plants

Yi-Chun Lai Student at North Carolina State University

For a full-scale treatment plant, it is hard to identify the optimal microbial community assembly (MCA) for treating wastewater. Therefore, leveraging the power of the XGBoost model can help unlock the process performance. We used real-world data from a wastewater treatment plant in North Carolina to evaluate MCA and its treatment quality. The results revealed the relationships between alpha diversity, beta diversity, process performance, and MCA. The model can be applied to regional North Carolina wastewater treatment plants to help inform and monitor the changes in MCA and identify the optimal MCA for the process.

2:35 pm - 3:05 pm Data Engineering

Building Data Contracts with Open Source Tools

Jean-Georges Perrin CIO at AbeaData

DE Summit: It's less complicated than it seems. In this session, you will build your first data contracts. I will first set the stage: * What is a data contract? * What is its purpose? * Why does it simplify data engineers' lives? Then we will jump into the hands-on part, which you will be able to run in your environment. I will use some (as of now, experimental) open-source tools to generate a skeleton of a data contract, and we will add information to it. Once you have created a data contract, you will learn more about its life cycle. Join me for this fun and fast-paced session, filled with extremely relevant information.

2:35 pm - 3:05 pm

Mastering Complexity: Optimize your Decision Making for 500% ROI

Jennifer Locke Manager – Technical Account Management, Americas at Gurobi

Solution Showcase: Businesses focus a lot on forecasts and predictions – trying to get a clearer picture of the future. But even if you had perfect information, the most sprawling and impactful business decisions are much too complex to guarantee optimal outcomes, with millions, billions, or even trillions of trade-offs to consider. See why leading companies use mathematical optimization to solve their most complex real-world business problems.

2:35 pm - 3:05 pm Generative AI

How to Defend Against Weaponized Generative AI

Jacob Seidman Head of AI at Reality Defender

Ai X Talk: As generative AI advances at a lightning-fast pace, there is an ever-growing concern over how this technology can be and is being misused to spread misinformation and erode public trust. Deepfakes — highly realistic fake videos, images, and audio generated by AI — pose a major threat as potential tools of deception and propaganda across a variety of mediums. Defending against weaponized uses of generative AI is thus an urgent challenge, one that is necessary for the preservation of a harmonious society. This session will provide an overview of the deepfake landscape and discuss emerging techniques for detection and prevention. We will explore leading edge tools and research on deepfake detection and prevention, including methods that analyze artifacts and inconsistencies introduced during generation to identify manipulation. Going beyond detection, we will also cover proactive technical and policy interventions, such as digital provenance techniques and content authentication frameworks. Broader societal resilience strategies will also be discussed. With democratization of AI lowering barriers and cost to creation of convincing deepfakes, responsible governance of generative technologies is needed more than ever. Participants will leave this session with an understanding of deepfake risks, leading edge detection methods in development, and a roadmap toward fostering public awareness and policy solutions. By highlighting countermeasures across technology, media literacy, and regulation, this talk will provide actionable intelligence for defending against the misuse of AI to corrupt information ecosystems.

2:50 pm - 3:20 pm LLMs

Reasoning in Large Language Models

Maryam Fazel-Zarandi, PhD Researcher Engineering Manager, FAIR at Meta

Scaling language models has improved state-of-the-art performance on nearly every NLP benchmark, with large language models (LLMs) performing impressively as few-shot learners. Despite these achievements, even the largest of these models still struggle with tasks that require reasoning. Recent work has shown that prompting or fine-tuning LLMs to generate step-by-step rationales, or asking them to verify their final answer can lead to improvements on reasoning tasks. While these methods have proven successful in specific domains, there is still no general framework for LLMs to be capable of reasoning in a wide range of situations. In this talk, I will give an overview of some of the existing methods used for improving and eliciting reasoning in large language models, methods for evaluating reasoning in these models, and discuss limitations and challenges.

3:30 pm - 4:00 pm Data Engineering

Clean as You Go: Basic Hygiene in the Modern Data Stack

Eric Callahan Principal, Data Solutions at Pickaxe Foundry

DE Summit: When my children walk around the house, they generally leave a trail of mess behind them. They sometimes realize that they shouldn't be doing this, but they’re so excited to move on to the next thing that catches their eye that they’ll say “Oh, I’ll clean it up later.” As grown adults with wisdom gained from experience, my wife and I know that this means either that they’ve just signed themselves up for a massive future cleaning job, or that someone else will have to clean up after them. We know that this is not good behavior for a child, so why do we so often do this as Data Engineers? The culture of “Move Fast and Break Things” has pressured us into closing tickets as quickly as possible, frequently pushing us towards the “Oh, I’ll clean it up later” mindset. While this may save us a few minutes in the short term, we are creating long-term headaches such as: piles of small cleanup tasks for later; confusion among peers who try to use incomplete data assets; and a lack of metadata to activate throughout the Modern Data Stack.

3:30 pm - 4:00 pm Responsible AI

Resisting AI

Dr. Dan McQuillan Lecturer in Creative and Social Computing at Goldsmiths, University of London

This session will introduce the arguments set out in the book 'Resisting AI'. The objective is to reframe the operations of AI as social and political as well as technical, to highlight their potential for amplifying social harms, and to equip participants with alternative ways to assess the social purpose of their work.

3:30 pm - 4:00 pm Data Engineering

The Future of the Single Source of Truth is an Open Data Lake

Christina Taylor Data Engineering Lead at Abridge AI

Join me on an exciting journey where we build a centralized data repository that seamlessly ingests data from a wide range of sources, including service databases, SaaS applications, files, and conversational data. Witness how implementing an open format can substantially reduce cloud costs and prevent vendor lock-in. Discover how cloud file targets can decouple compute and storage, streamlining data pipeline efficiency. I will delve into the EL (extract/load) process and provide a quantitative approach that will help you choose the most appropriate technology. Whether you seek to scale analytics, machine learning, or product use cases, I will guide you toward a future-proof data strategy. I hope the audience is empowered to choose open source technology and an open table format. See https://delta.io/ for the UniForm project.

3:30 pm - 4:00 pm

Visualizing the Evolution of ML Models: Insights and Tools for Enhanced Understanding

Rajat Arya Co-founder at XetHub

Solution Showcase: The increasing size and complexity of machine learning (ML) models make understanding their evolution crucial. This talk will demonstrate how to track ML model changes using visualizations, from classical tree models to complex deep learning architectures. We will showcase XetHub (and other tools) for building these visualizations and reasoning about model changes. The talk will conclude with extending this framework for adding visual context at any stage of an ML project.

3:30 pm - 4:30 pm Generative AI

Panel: Generative AI in Finance

Sammy Assefa SVP, Head of AI and ML, Enterprise Innovation at U.S. Bank
Usama Fayyad, PhD Chairman & Founder at Open Insights | Inaugural Exec Director, Institute for Experiential AI & CS Professor at Northeastern University
Dr. Anju Kambadur Head of AI Engineering at Bloomberg
Roger Burkhardt Capital Markets Chief Technology Officer and Co-head of AI at Broadridge Financial Solutions

Ai X Panel: Abstract Coming Soon!

4:05 pm - 4:35 pm Generative AI

The Value of A Semantic Layer for GenAI

Krishna Srihasam Consultant Data Scientist at AtScale
Jeff Curran Senior Data Scientist at AtScale

By empowering an LLM with the logical context of a semantic layer, we can incorporate business terminology and logic into its responses and enable queries to the database using natural language (instead of SQL). Coupling this with AtScale’s query engine results in increased performance in the face of natural language prompts from business users. In this talk we will explore an application of such a model, in the form of a chatbot backed by an LLM and a semantic layer. Attendees will gain insights into: the improvements seen against existing LLM benchmarks when coupling the model with a semantic layer; how the LLM translates natural language requests into efficient SQL; and how to gain value from an LLM with a chat interface. Who should attend this session? Data Analysts and Data Scientists, Business Intelligence (BI) Professionals, Data Product Managers, Executives and Decision-Makers.

4:35 pm - 5:35 pm Generative AI

Ai X Panel: Applied AI in action: Getting mission critical Novel AI into production

Edward (Ted) Kwartler Managing Director, North America Responsible AI Lead at Accenture | Adjunct Professor at Harvard Extension School
Andrew Smeaton Global Chief Information Security Officer at Afiniti
Stephen (Steffin) Harris Former Corporate VP at Microsoft
Nick King Founder/CEO at Data Kinetic
David Gonzalez Co-Founder/CEO at energy.work

Hear from experts breaking down key topics and challenges in getting AI into production environments, along with interesting use cases ranging from anomaly detection on offshore oil rigs to human trafficking, supply chain disruption, and more. We’ll also go into some of the techniques used, ethical boundaries, AI capabilities, and how to blend unique concepts into production applications.

4:40 pm - 5:10 pm Data Engineering

Unlock Safety & Savings: Mastering a Secure, Cost-Effective Cloud Data Lake

Johnathan Azaria Data Science Tech Lead at Imperva
Ori Nakar Principal Engineer, Threat Research at Imperva

DE Summit: Have you ever experienced a surge in your cloud data lake expenses? Does this surge indicate malicious activity or a legitimate operation? Data lakes have become a cornerstone of the digital age, prized for their flexibility and cost-effectiveness. Yet, as they expand, they bring forth challenges in security, access control, cost management, and monitoring. The stakes are high: unauthorized access can lead to data breaches, while even legitimate users can inadvertently drive up costs. With the growth in usage comes far more complexity. The size of the data, together with the number of objects, is growing rapidly. A growing number of users, both human and application, are performing constant operations on the data lake. The large number of operations makes access and cost control a hard and ongoing task. Monitoring is also a complex task, since there are many access options, and all should be monitored. Traditional monitoring methods often fall short. Tracking object store access can be overwhelming, with a single query generating thousands of log records. Monitoring at the query engine level demands a unique solution for each engine, adding complexity. Join us as we unveil two novel techniques for data lake monitoring, leveraging both object store logs and query engine logs. Dive deep into our aggregation strategies and discover how anomaly detection can be applied to this consolidated data. We'll explain how enhanced access control mechanisms can fortify your data lake's security, mitigating the risk of data leaks and data corruption. Additionally, we'll shed light on how to harness these insights to minimize the attack surface and to identify and fix cost anomalies and system glitches. Embark on this journey with us and uncover the secrets to optimizing the security and cost-efficiency of your data lake operations.

9:00 am - 9:25 am

ODSC Keynote: Trust, Transparency & Secured Generative AI

Kate Soule Program Director for Generative AI Research at IBM

Public adoption of generative AI has shown everyone how transformative the technology can be. It holds the potential to create competitive advantage by driving the speed and efficiency of operations ranging from IT automation and digital labor to customer care use cases. But not all AI models are created equal: the best models will depend on your industry, domain, and use case. That’s why having options is essential for successfully adopting AI within your business in a way that’s grounded in trust and transparency and ensures the deployment of responsible AI. Learn about IBM’s approach to selecting the right AI foundation model for the right task to accelerate your generative AI strategy in this interactive session.

9:30 am - 9:55 am

ODSC Keynote: Learning from Mistakes: Empowering Humans to Use AI the Right Way in High-Stakes Decision Making

Hilke Schellmann Author of "The Algorithm", Assistant Professor of Journalism at New York University

Session Abstract Coming Soon!

9:30 am - 9:55 am LLMs

ODSC Keynote: Setting Up Text Processing Models for Success: Formal Representations versus Large Language Models

Carolyn Rosé, PhD Professor, Program Director for the Masters of Computational Data Science at Carnegie Mellon University

With increasingly vast storehouses of textual data readily available, the field of Natural Language Processing offers the potential to extract, organize, and repackage knowledge revealed either directly or indirectly. Though for decades one of the holy grails of the field has been the vision of accomplishing these tasks with minimal human knowledge engineering through machine learning, with each new wave of machine learning research, the same tensions are experienced between investment in knowledge engineering and integration know-how on the one hand and production of knowledge/insight on the other hand. This talk explores techniques for injecting insight into data representations to increase effectiveness in model performance, especially in a cross-domain setting. Recent work in neural-symbolic approaches to NLP is one such approach, in some cases reporting advances from incorporation of formal representations of language and knowledge and in other cases revealing challenges in identifying high utility abstractions and strategic exceptions that frequently require exogenous data sources and the interplay between these formal representations and bottom-up generalities that are apparent from endogenous sources. More recently, Large Language Models (LLMs) have been used to produce textual augmentations to data representations, with more success. Couched within these tensions, this talk reports on recent work towards increased availability of both formal and informal representations of language and knowledge as well as explorations within the space of tensions to use this knowledge in effective ways.

10:00 am - 10:30 am

How to Scale Trustworthy AI

Paul Hake Principal AI Engineer | IBM

Session Abstract Coming Soon!

10:00 am - 10:30 am Generative AI

Build GenAI Systems, Not Models

Hugo Bowne-Anderson, PhD Head of Developer Relations at Outerbounds

This talk explores a framework for how data scientists can deliver value with Generative AI: How can you embed LLMs and foundation models into your pre-existing software stack? How can you do so using Open Source Python? What changes about the production machine learning stack and what remains the same? We motivate the concepts through generative AI examples in domains such as text-to-image (Stable Diffusion) and speech-to-text (Whisper) applications. Moreover, we’ll demonstrate how workflow orchestration provides a common scaffolding to ensure that your Generative AI and classical Machine Learning workflows alike are robust and ready to move safely into production systems. This talk is aimed squarely at (data) scientists and ML engineers who want to focus on the science, data, and modeling, but want to be able to access all their infrastructural, platform, and software needs with ease!

10:00 am - 10:30 am Responsible AI

From Code to Trust: Embedding Trustworthy Practices Across the AI Lifecycle

Vrushali Sawant Data Scientist, Data Ethics Practice at SAS

As AI systems permeate our lives, ensuring their robustness, fairness, and security is non-negotiable. But what does it mean for a technical person sitting behind the scenes writing code? How do you incorporate these practices while cleaning data, developing models, deploying, and managing AI systems? This session will delve into steps to get you started on your journey to building trustworthy AI systems, from a technical perspective. We will showcase how to operationalize the trustworthy AI principles into practice and share a starter guide for implementing trustworthy AI across the AI lifecycle.

10:40 am - 11:10 am Responsible AI

Advancing Ethical Natural Language Processing: Towards Culture-Sensitive Language Models

Gopalan Oppiliappan Head, AI Centre of Excellence at Intel India

Natural Language Processing (NLP) systems play a pivotal role in various applications, from virtual assistants to content generation. However, the potential for biases and insensitivity in language models has raised concerns about equitable representation and cultural understanding. This talk explores the development of Culture-Sensitive Language Models (LLMs) as a progressive step towards addressing these issues. The core principles involve diversifying training data to encompass a wide range of cultures, implementing bias detection and mitigation strategies, and fostering collaboration with cultural experts to enhance contextual understanding. Our approach emphasizes the importance of ethical guidelines that guide the development and deployment of LLMs, focusing on principles such as avoiding stereotypes, respecting cultural diversity, and handling sensitive topics responsibly. The models are designed to be customizable, allowing users to fine-tune them according to specific cultural requirements, fostering inclusivity and adaptability. The incorporation of multilingual capabilities ensures that the models cater to global linguistic diversity, acknowledging the richness of different languages and cultural expressions. Moreover, we propose a feedback mechanism where users can report instances of cultural insensitivity, establishing a continuous improvement loop. Transparency and explainability are prioritized to enable users to comprehend the decision-making process of the models, promoting accountability. Through this multidimensional approach, we aim to advance the field of NLP by developing culture-sensitive LLMs that not only understand and respect diverse cultural nuances but also contribute to a more inclusive and ethical use of language technology.

11:00 am - 11:30 am Machine Learning

Who Wants to Live Forever? Reliability Engineering and Mortality

Allen Downey, PhD Curriculum Designer at Brilliant.org | Professor Emeritus at Olin College

Reliability engineering is the study of survival and failure in engineered systems, but its methods can be applied as well in natural and social sciences, and business. It reveals surprising patterns in the world, including many examples where used is better than new -- that is, we expect a used part to last longer than a new one. In this talk, I'll present tools of reliability engineering including survival curves, hazard functions, and expected remaining lifetimes. And we'll consider examples from a variety of domains, including light bulbs, computer systems, and life expectancy for humans and institutions. Intuitively, we expect things to wear out over time: a new car is expected to last longer than a used one, and a young person is expected to live longer than an old person. But many natural and engineered systems defy this intuition. For example, in the last weeks of pregnancy, the process becomes almost memoryless: the expected remaining duration levels off at four days, and stays there for almost four weeks. Other examples entirely invert our expectations, so the longer something has survived, the longer we expect it to survive. Until recently, nearly every baby born had this property, due to high rates of infant mortality. Computer programs, data transfers, and freight trains have it, too. Understanding this behavior is important for designing computer systems, interpreting a medical prognosis, and maybe finding the key to immortality.
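
To make the three tools named in this abstract concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the two lifetime distributions (a geometric one and a hypothetical half-fragile, half-robust mixture) and all function names are assumptions chosen only for illustration.

```python
def geometric_pmf(p, horizon):
    """P(T = t) for t = 0..horizon, with failures possible from t = 1 onward.
    The distribution is truncated at `horizon` (an illustrative assumption)."""
    return [0.0] + [p * (1 - p) ** (t - 1) for t in range(1, horizon + 1)]


def survival_curve(pmf):
    """S[t] = P(T > t), computed as a suffix sum for numerical stability."""
    total = sum(pmf)
    curve, tail = [], 0.0
    for prob in reversed(pmf):
        curve.append(tail / total)
        tail += prob
    return curve[::-1]


def hazard(pmf):
    """h[t] = P(T = t | T >= t). The final entry is 1.0 only because of truncation."""
    rates, at_risk = [], 0.0
    for prob in reversed(pmf):
        at_risk += prob
        rates.append(prob / at_risk if at_risk > 0 else 0.0)
    return rates[::-1]


def expected_remaining(pmf, age):
    """E[T - age | T > age]: the expected remaining lifetime at a given age."""
    mass = sum(prob for t, prob in enumerate(pmf) if t > age)
    weighted = sum((t - age) * prob for t, prob in enumerate(pmf) if t > age)
    return weighted / mass


HORIZON = 2000

# Memoryless case: constant hazard, so age tells you nothing about remaining life.
memoryless = geometric_pmf(0.1, HORIZON)

# Hypothetical population: half the parts are fragile, half are robust.
fragile = geometric_pmf(0.5, HORIZON)
robust = geometric_pmf(0.02, HORIZON)
mixture = [0.5 * f + 0.5 * r for f, r in zip(fragile, robust)]

for age in (0, 10, 50):
    print(age, expected_remaining(memoryless, age), expected_remaining(mixture, age))
```

In the memoryless case the expected remaining lifetime stays flat as age increases. In the mixture it rises with age, because a part that has already survived is more likely to be one of the robust ones, which is one way "used is better than new" can arise.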

11:00 am - 11:30 am Generative AI

Imputation of Financial Data Using Collaborative Filtering and Generative Machine Learning

Arun Verma, PhD Head of Quant Research Solutions Team, CTO Office at Bloomberg

Quant traders and data scientists regularly use automated ML & AI technologies to extract a variety of information from large datasets, e.g. sentiment from news data, or scoring methods for complex data sets like Supply Chain and ESG. ML methods are also being used for imputation of financial data as well as prediction of asset prices. This talk will provide a brief overview of the following topics: the broad application of machine learning in finance, with its opportunities and challenges; machine learning techniques for imputation, e.g. estimating the granular geographical exposure of companies given partial and high-level disclosure from the company financial statement; and collaborative filtering techniques for illiquid asset pricing, using data-driven methods to inform price movements in a target instrument from observations on related liquid instruments.

11:00 am - 11:30 am Data Engineering

Guardrails for Data Teams: Embracing a Platform Approach for Workflow Management

Jeff Hale Head of Developer Education at Prefect.io
Bill Palombi VP of Product at Prefect.io

Session Abstract Coming Soon!

11:00 am - 11:30 am

HPCC Systems – The Definitive Big Data Open-Source Platform

Bob Foreman Senior Software Engineer at LexisNexis Risk Solutions

Solution Showcase: Learn why the completely free and open source HPCC Systems platform is better at Big Data and offers an end-to-end solution for Developers and Data Scientists. Learn how ECL can empower you to build powerful data queries with ease. HPCC Systems, a comprehensive and dedicated data lake platform, makes combining different types of data easier and faster than competing platforms (even data stored in massive, mixed schema data lakes), and it scales very quickly as your data needs grow. Topics include HPCC Architecture, Embedded Languages and external datastores, Machine Learning Library, Visualization, Application Security and more.

11:20 am - 11:50 am Generative AI

Beyond Theory: Effective Strategies for Bringing Generative AI into Production

Heiko Hotz Generative AI Global Blackbelt | Google

In the rapidly evolving and constantly advancing landscape of artificial intelligence, foundation models like GPT-4 and DALL-E 3 and the broader world of generative AI have emerged as potential game-changers, offering unprecedented and previously unimagined capabilities across a wide variety of domains and use cases. However, while these theoretical models showcase promising capabilities, the practical challenge of transitioning from conceptual research to full-scale production-level applications remains a major obstacle that many organizations and teams continue to face. This keynote presentation aims to help bridge this gap by taking a deep dive into exploring pragmatic and actionable strategies and best practices for successfully integrating these cutting-edge AI technologies into real-world business environments. We will closely examine the critical concepts surrounding Foundation Model Operations and Large Language Model Operations (FMOps/LLMOps), delving into the practical intricacies and challenges involved in deploying, monitoring, maintaining and scaling generative AI models in enterprise production systems. The discussion will comprehensively cover several critical topics such as optimal model selection, rigorous testing and evaluation, efficient training and fine-tuning techniques, retrieval augmented generation (RAG) architectures, and effective deployment strategies required for operationalization. Attendees will gain crucial and applicable insights into overcoming common obstacles frequently faced when attempting to deploy AI in live systems, including recommendations around managing resource-intensive models, ensuring ongoing model fairness and transparency, and strategically adapting to the continuously fast evolving AI landscape. To provide full perspective, the talk will also highlight relevant real-world examples and case studies, providing a comprehensive end-to-end view of the demanding practical requirements for true AI deployment. 
This presentation has been tailored for a wide audience encompassing AI and machine learning professionals, technology leaders, IT and DevOps teams, and anyone generally interested in better understanding the operational side of taking AI technology live. Whether you're looking to implement generative AI capabilities in your own organisation or working to enhance existing AI operations, this discussion will equip you with directly actionable knowledge and tools to successfully meet the challenges in navigating the world of FMOps/LLMOps.

11:35 am - 12:05 pm LLMOps

Accelerating the LLM Lifecycle on the Cloud

Luca Antiga CTO at Lightning AI

Session Abstract Coming Soon!

11:35 am - 12:05 pm MLOps

Beyond MLOps: Building AI systems with Metaflow

Ville Tuulos Co-founder and CEO at Outerbounds

Open-source Metaflow has been powering ML systems at companies like Netflix, 23andMe, and Goldman Sachs for years. With the advent of Generative AI, companies are starting to consider how the new techniques can be embedded in existing applications, and how they can power wholly new product experiences. This requires new engineering and infrastructure, in particular if the company wants to own the models and the user experience, integrating AI tightly into their business and systems, going beyond widely available commercial APIs. In this talk, we will provide an overview of how Metaflow helps you build novel, differentiated AI-powered systems that require large-scale data engineering, model training, content embedding, inference, and more. We will cover changes compared to earlier ML stacks, focusing on the quickly growing compute needs in particular, and share our recent experiences from real-life large-scale AI use cases, and how they interoperate with existing data and ML systems.

11:35 am - 12:05 pm Data Engineering

Data Pipeline Architecture - Stop Building Monoliths

Elliott Cordo Founder, Architect, Builder at Datafutures

DE Summit: In modern software development we have fully embraced microservice architecture, for good reason, but in data, monoliths are accepted despite their pitfalls. Even when using the latest tooling associated with the “modern data stack” we very often end up creating monoliths, and almost always live to regret it. In small organizations, with small central teams, we can get away with this architecture with limited discomfort for some time. In fact, as when developing any small software project, the monolith seems to save time and gives the impression of higher productivity. But as complexity increases, developer experience and productivity drop, and our system begins to get more brittle, frustrating both our engineering teams and stakeholders. Monolithic architecture is even more cumbersome in larger teams, especially in organizations that allow for federated data product development. So what’s the answer? How can we take inspiration from what’s been done in Microservices and Event Based Architecture? How can we apply some of the concepts of Data Mesh architecture? In this talk we will review how, and to what extent, these patterns and technologies can apply, starting from first principles and then working through the implementation patterns to common open source frameworks. This will include multi-Airflow infrastructure, micro-DAG packing and deployment, DBT multi-project implementation, rational use of containers, and data sharing/publication strategies. We will review some approaches for decomposing existing data monoliths, using a real world scenario.

11:35 am - 12:05 pm

The Unreasonable Effectiveness of an Asset Graph

Sean Lopp Sales Engineer at Dagster Labs

From hobbyist ML developers to platform architects at Fortune 100 firms, most data professionals spend the majority of their time answering simple questions: “If I update x, what else needs to be updated?” or “If y breaks, what else will be broken?”. In this demo, Sean Lopp, sales engineer at Dagster Labs will show how a global data lineage graph can answer all of these questions and become the highest leverage piece of a data platform. Best of all, he’ll show how organizations adopting Dagster as an orchestrator get this global lineage graph for free.
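
The two questions quoted in this abstract both reduce to finding everything downstream of a node in a lineage graph. The sketch below illustrates that idea in plain Python; it does not use or represent Dagster's API, and the asset names and the `DOWNSTREAM` mapping are hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "raw_orders": ["clean_orders"],
    "raw_customers": ["clean_customers"],
    "clean_orders": ["order_features", "daily_revenue"],
    "clean_customers": ["order_features"],
    "order_features": ["churn_model"],
    "daily_revenue": ["exec_dashboard"],
    "churn_model": ["exec_dashboard"],
    "exec_dashboard": [],
}


def downstream_of(asset, graph):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:  # visit each asset once, even with shared parents
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# "If I update clean_orders, what else needs to be updated?"
print(downstream_of("clean_orders", DOWNSTREAM))
```

The same traversal answers "If y breaks, what else will be broken?"; only the starting asset changes.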

12:10 pm - 12:40 pm Data Engineering

Experimentation Platform at DoorDash

Yixin Tang Engineering Manager at DoorDash

DE Summit: The experimentation platform at DoorDash leverages multiple big data tools to support thousands of decisions every day. In this talk we will cover how the company leverages the platform to make decisions about business strategies, machine learning models, optimization algorithms and infrastructure changes. We will also cover how the platform leverages Dagster to orchestrate metrics and analysis jobs; how data storage and data fetching are done with a data lake; how we enable exploratory analysis with Databricks notebooks; and how we integrate with the machine learning platform to make automated decisions.

12:10 pm - 12:40 pm Machine Learning

Flyte: A Production-Ready Open Source AI Platform

Thomas J. Fan Senior Machine Learning Engineer at Union.ai

12:10 pm - 12:40 pm

LLM Finetuning for Mere Mortals

Kevin Musgrave ML Developer Advocate at HPE

Solution Showcase: Everyone wants to use LLMs, and for good reason. With applications ranging from content creation to automated software development, LLMs have the potential to transform nearly every industry. How can we make the most of this technology when applying it to our own use cases? Finetuning is one highly effective approach, but it can be challenging to implement correctly. In this talk, you'll hear about what these challenges are, and how we mere mortals can tackle them using HPE's new software that leverages the open-source ML ecosystem.

12:10 pm - 12:40 pm

Empowering Analysts in Financial Services to Move Away From Excel: Venerable Case Study

Alexandria Morales-Garcia Investment Risk Analyst at Venerable
Steven Skarupa Sr Enterprise Architect at Venerable

Many businesses find themselves in a rut, with multiple, complex, often manual analytical processes inevitably built using Excel. These legacy processes are typically woven together into a spiderweb of files, spreadsheets, and databases. Often, this tangle is nearly impossible to unwind. Join us to discuss our journey moving our Excel-centric analytics to streamlined solutions using sophisticated approaches with SQL, Python, and Jupyter Notebooks. Learn how we started to transition our data team from spreadsheet ninjas to citizen developers. You will gain insights into how to facilitate your team's migration from Excel-centric analysis to more efficient and reliable analytical methodologies. This approach aims to foster a productive and collaborative work environment with minimal resistance.

12:10 pm - 12:40 pm

AI as an Engineering Discipline

Yucheng Low, PhD Co-founder & CEO at XetHub

The field of Artificial Intelligence (AI) has transformed over the last few decades, and has evolved from a deeply mathematical and theoretical discipline into a software engineering discipline. For the first time in history, AI is truly accessible. However, an open question remains: what is the right way to use AI? What are the engineering best practices around AI? In this talk we first briefly discuss how modern AI came about and how it has changed the rules of Machine Learning development. Then we will try to establish some new guidelines and engineering principles that will allow you to cut through the noise of AI tooling, and assist in determining what is most effective for your tasks.

12:30 pm - 1:00 pm LLMs

Moving Beyond Statistical Parrots - Large Language Models and their Tooling

Ben Auffarth, PhD Author: Generative AI with LangChain | Lead Data Scientist at Hastings Direct

Large language models like GPT-4 and Codex have demonstrated immense capabilities in generating fluent text. However, simply scaling up data and compute results in statistical parroting without true intelligence. This talk explores frameworks and techniques to move beyond statistical mimicry. We discuss leveraging tools to retrieve knowledge, prompt engineering to steer models, monitoring systems to detect biases, and cloud offerings to deploy conversational agents. We also survey the emerging ecosystem of frameworks, services, and tooling that propel large language models and enable developers to build impactful applications on top of them. Beyond complex mechanisms like function calling and Retrieval Augmented Generation, navigating towards meaningful outputs and applications requires an overarching focus on strong model governance frameworks that can ensure that biases and harmful ideologies embedded in the training data are duly mitigated, paving the way towards beneficial application development. Developers play a crucial role in this process and should be empowered with tools and knowledge to steer these models appropriately. Intentional use of these elements not only optimizes model governance but also enriches the experience for developers, allowing them to dig deeper and create substantial applications that do not merely parrot but deliver genuine value. From deploying conversational agents to crafting impactful applications across a swath of industries, such as healthcare and education, a comprehensive understanding and use of the vast array of LLM mechanisms can truly push the boundaries of NLP and AI, helping to usher in the age of AI in everyday life.

1:10 pm - 1:40 pm Deep Learning

AI Resilience: Upskilling in an AI Dominant Environment

Leondra Gonzalez Senior Data & Applied Scientist at Microsoft

The boom of generative AI and LLMs has taken the world by storm. This development has already disrupted various industries and roles, and data science is no exception to that rule. In a world of embeddings and transfer learning, one might ask "What should I learn next?" and "Where should I spend my time and energy for deep dives?". This talk aims to guide existing AI practitioners on how to maintain relevant skills in an increasingly automated world, and how to stand out in an oversaturated job market.

1:50 pm - 2:20 pm Data Engineering

Unlocking the Unstructured with Generative AI: Trends, Models, and Future Directions

Jay Mishra Chief Operating Officer at Astera

DE Summit: The exponential growth in computational power, alongside the advent of powerful GPUs and advancements in cloud computing, has ushered in a new era of generative artificial intelligence (AI), transforming the landscape of unstructured data extraction. Traditional methods such as text pattern matching, optical character recognition (OCR), and named entity recognition (NER) have been plagued by challenges related to data quality, process inefficiency, and scalability. However, the emergence of large language models (LLMs) has provided a groundbreaking solution, enabling the automated, intelligent, and context-aware extraction of structured information from the vast oceans of unstructured data that dominate the digital world. This talk delves into the innovative applications of generative AI in natural language processing and computer vision, highlighting the technologies driving this evolution, including transformer architectures, attention mechanisms, and the integration of OCR for processing scanned documents. We will also talk about the future of generative AI in handling complex datasets. Participants will gain insights into: the fundamental challenges and solutions in unstructured data extraction; the operational dynamics of generative AI in extracting structured information; the future of generative AI in unstructured data extraction; and practical ways to leverage these technologies for real-world applications. Designed for data scientists, AI researchers, and industry professionals, the talk aims to equip attendees with the knowledge to harness the power of generative AI in transforming unstructured data into actionable insights, thereby driving innovation and efficiency across industries.

2:00 pm - 2:30 pm Data Engineering

Data Mesh: where are we now?

Colleen Tartow, Ph.D. Field CTO and Head of Strategy at VAST Data

Data Mesh, the idea of a decentralized, domain-driven data architecture that puts data products at the forefront of an organization’s data strategy, took the data world by storm when it was introduced a few years ago. The idea of a prescriptive-yet-flexible methodology and people-first approach to data infrastructure design holds promise as businesses look to adopt a technology stack that will help them govern and serve data to consumers, with the ultimate goal of monetizing proprietary data. This talk will delve into the current state of Data Mesh adoption and implementation across various industries. It will explore the evolution of the Data Mesh concept since its inception, examining both its successes and challenges. Drawing from real-world case studies and industry trends, we will assess the practical implications of adopting a Data Mesh architecture, including its impact on organizational structure, technology stack, and data governance. Furthermore, the talk will address key questions and considerations for organizations looking to embark on their Data Mesh journey. Topics to be discussed include identifying suitable use cases for Data Mesh, overcoming cultural barriers to adoption, selecting appropriate tools and technologies, and ensuring data quality and security within a decentralized data ecosystem. With the recent addition of AI - specifically Deep Learning and Large Language Models (LLMs) - clear data strategies like Data Mesh become even more pertinent, as organizations seek scalable and efficient ways to leverage the truly vast amounts of data required for training and deploying such models. The emergence of LLMs emphasizes the importance of treating data as a product, as data products evolve into the foundational elements for training and refining these advanced algorithms. Additionally, in the context of stringent regulations and increasing demands for auditability, a cohesive data strategy is paramount. 
Organizations must ensure that data is not only accessible and actionable but also protected and compliant with regulatory requirements, by being subject to rigorous governance and auditing processes.

2:00 pm - 2:30 pm ML Safety & Security

Overcoming the Limitations of LLM Safety Parameters with Human Testing and Monitoring

Peter Pham Senior Program Manager at Applause
Josh Poduska AI Advisor at Applause

Ensuring safety, fairness, and responsibility has become a critical challenge in the rapidly evolving landscape of Large Language Models (LLMs). This talk delves into a new approach to address these concerns by leveraging the power of human testing and monitoring from a diverse global population. We present a comprehensive strategy employing a combination of crowd-sourced and professional testers from various locations, countries, cultures, and life experiences. Our approach thoroughly scrutinizes LLM and LLM application input and output spaces. It ensures responsible and safe product delivery. The presentation centers on functional performance, usability, accessibility, and bug testing. We share our research into these approaches and include recommendations for building test plans, adversarial testing approaches, and real-world usage scenarios. This diverse, global, human-based testing approach is a direct solution to the issues raised in recent papers highlighting the limited effectiveness of RLHF-created safety parameters against fine-tuning and prompt injection. Experts are calling for LLMs that inject safety parameters at the base parameter level, but, to date, this has resulted in a significant drop in LLM efficacy. Additionally, building safety directly into the pre-trained model is prohibitively expensive. Our approach overcomes these technical and financial limitations and is applicable now. Results point to a paradigm shift in LLM safety practices, yielding models and applications that remain helpful and harmless throughout their lifecycle.

2:00 pm - 2:30 pm Responsible AI

How AI Impacts the Online Information Ecosystem

Noah Giansiracusa, PhD Associate Professor of Mathematics and Data Science at Bentley University

Through concrete examples and a high-level conceptual overview, I'll discuss the various ways, both good and bad, that AI is impacting our online information ecosystem. This includes the creation of mis/disinformation (LLMs, deepfake video/audio), the propagation of mis/disinformation (search rankings, social media algorithms), the funding of disinformation (the targeted advertising industry), and AI-assisted fact-checking and bot detection/deletion.

2:00 pm - 2:30 pm

Building production-grade ML/AI Systems with Outerbounds Platform

Ville Tuulos Co-founder and CEO at Outerbounds

Building real-life, production-grade ML/AI systems is not easy. To build systems powered by RAG, custom LLMs, other GenAI patterns, or systems powered by classic predictive ML, you need access to data, compute, orchestration, versioning, modeling, and deployment tools. And you need these easily accessible for data scientists and ML Engineers, all while making sure your platform and infrastructure engineers have their needs met. In this demo, we will provide an overview of how to build novel, differentiated AI-powered systems that require large-scale data engineering, model training, content embedding, inference, and more, all using open-source Metaflow, originally developed at Netflix, and running on the secure, managed platform provided by Outerbounds.

2:35 pm - 3:05 pm MLOps & Data Engineering

Highly Scalable Inference Platform for Models of Any Size

Yuan Tang Principal Software Engineer at Red Hat

In recent years, ML/AI has made tremendous progress, yet designing large-scale data science and machine learning applications remains challenging. The variety of machine learning frameworks, hardware accelerators, and cloud vendors, as well as the complexity of data science workflows, brings new challenges to MLOps. One particular challenge is that it’s non-trivial to build an inference system that’s suitable for models of different sizes, especially for LLMs or large models in general. This talk presents various best practices and challenges in building large, efficient, scalable, and reliable AI/ML model inference platforms using cloud-native technologies such as Kubernetes and KServe that are production-ready for models of any size.

2:35 pm - 3:05 pm LLMs

Tracing In LLM Applications

Amber Roberts ML Growth Lead at Arize AI

According to a recent survey, 61.7% of enterprise engineering teams now have or are planning to have an LLM app in production within a year, and over one in ten (14.7%) are already in production, compared to 8.3% in April. With a record pace of adoption, the practice of troubleshooting and observing LLM apps takes on elevated importance. For software engineers who work with distributed systems, terms like “spans,” “traces,” and “calls” are well known. But what might these terms mean in a world where foundation models dominate? Since LLM observability isn’t just about tracking API calls, but about evaluating the LLM’s performance on specific tasks, there are a variety of span kinds and attributes that can be filtered on in order to troubleshoot an LLM’s performance. Hosted by Amber Roberts, a data scientist, ML engineer and astrophysicist and former Carnegie Fellow, this session will focus on best practices for tracing calls in a given LLM application by providing the terminology, skills and knowledge needed to dissect various span kinds. Informed by work with dozens of enterprises with LLM apps in production and research on what works, attendees can learn span types, how to view traces from an LLM callback system, and how to establish troubleshooting workflows to break down each call an application is making to an LLM. The session will explain and dive into both top-down workflows (starting with the big picture of the LLM use case and then getting into specifics of the execution if the performance is not satisfactory) and bottom-up workflows (a discovery workflow where you start at the local level and filter on individual spans).
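The top-down and bottom-up workflows described above can be sketched over a toy trace. This is an illustrative data model only, not the API of any tracing library; the `Span` fields, the span kinds, and the latency numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    kind: str                 # illustrative kinds: "chain", "retriever", "llm"
    latency_ms: float


# Hypothetical trace of one request: a chain that retrieves context, then calls an LLM.
TRACE = [
    Span("s1", None, "chain", 1250.0),
    Span("s2", "s1", "retriever", 180.0),
    Span("s3", "s1", "llm", 1040.0),
]


def root_span(trace):
    """Return the span that has no parent (the entry point of the trace)."""
    return next(s for s in trace if s.parent_id is None)


def slowest_child(trace, span_id):
    """Top-down step: among the direct children of a span, find the slowest one."""
    kids = [s for s in trace if s.parent_id == span_id]
    return max(kids, key=lambda s: s.latency_ms) if kids else None


def spans_of_kind(trace, kind):
    """Bottom-up step: filter individual spans by kind, ignoring the hierarchy."""
    return [s for s in trace if s.kind == kind]


if __name__ == "__main__":
    root = root_span(TRACE)
    print("slowest step under root:", slowest_child(TRACE, root.span_id).kind)
    print("retriever spans:", [s.span_id for s in spans_of_kind(TRACE, "retriever")])
```

Top-down starts from the root and drills into whichever child dominates; bottom-up skips the hierarchy and inspects all spans of one kind at once.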

2:35 pm - 3:05 pm

Low-Code, High Impact: Kickstart Your Data Analytics Journey with KNIME

Roberto Cadili Data Scientist on the Evangelism Team at KNIME

Solution Showcase: The success of data science teams heavily relies on their chosen tools. While algorithmic expertise and domain wisdom are vital, the success of a data science project depends on additional contingent factors linked to the tool, such as costs, ease and time of learning, rapid prototyping, robust debugging and testing, flexibility, effective support, automation and security. In this talk, we’ll introduce you to KNIME Analytics Platform, the free and open-source data analytics software that relies on a low-code/no-code visual interface to enable professionals from any field to make sense of data. It features extensive data access & blending, wrangling, modeling and visualization capabilities, making it comprehensive and versatile for all stages of the data science life cycle. KNIME’s free and open-source nature eliminates licensing and budget concerns, and favors smooth integration with other technologies and scripting languages. The platform’s visual GUI allows quick prototyping and easy implementation of analytics pipelines via drag-and-drop data operation blocks. Together, we’ll experience KNIME Analytics Platform in action and build a simple AI-driven application, leveraging the generative power of API-based and local LLMs. Designed for simplicity, scalability and to address data science needs of any level of complexity, KNIME aims to drive open innovation and empower users with cutting-edge technologies in the evolving data tools landscape.

2:35 pm - 3:05 pm Data Engineering

The Future is Composable: How to Build a Data Lakehouse from Open Source and Survive

Ciro Greco CEO and Founder at Bauplan

DE Summit: The design of data stacks has begun to move from monolithic applications to more composable systems. Over the past decade, data lakes became ubiquitous in enterprises, and an increasing number of them chose to use cloud object storage (e.g. AWS S3). This trend favored the rise of more flexible architectures built on the separation of storage from compute, such as the Data Lakehouse (DLH). The DLH aims to provide the user experience of monolithic OLAP systems with the flexibility of the data lake. It also promises to go beyond analytics and BI, enabling a first-class developer experience for data transformation, data science and ML as well. In this talk, we describe how we built a serverless DLH by leveraging open standards and existing open source frameworks. In recent years, open source tools for data systems became more reliable and the industry started to think in terms of composability of data systems. Instead of simply trying to improve the usability of all-purpose Big Data technologies, such as Spark, we chose to repurpose open components to support different use cases. This approach allowed us to build a more flexible system and to focus on developer experience. We built upon the following foundational principles. Storage is built on cloud object storage and open formats (Parquet and Iceberg), to ensure interoperability with other systems. Different computational engines should be available based on the use case: for instance, it is desirable to avoid the JVM in medium-sized workloads. User interfaces vary depending on users: for instance, developers should be able to use either SQL or Python or both, depending on what they are trying to accomplish. We were able to leverage open standards, such as Apache Arrow for in-memory columnar representation, Iceberg for metadata representation and Parquet for storage, and open source projects such as DuckDB for the query engine and SQL support.
While building an entire DLH remains extremely hard, the progress made by open formats and open source frameworks allowed us to build faster, narrow down the problem scope and focus our resources on differentiating features and components that truly needed to be built from scratch.

2:50 pm - 3:20 pm Data Engineering

Data Engineering in the Age of Data Regulations

Alex Gorelik Distinguished Engineer at LinkedIn

DE Summit: Continuous data regulations like GDPR, CCPA, DMA and many others are giving users control over how their data is used and imposing restrictions on what companies can do with user data. This talk will focus on LinkedIn's approach to converting these regulations into policies and integrating policy enforcement into data engineering practices using our Policy Based Access Control (PBAC) system. It will cover how to annotate data, features, pipelines and models; how to integrate model training and inference with the PBAC system; and how to enforce policies. It will describe the architecture and components of LinkedIn's governance system and various tools used to automate the annotation and enforcement process.

3:10 pm - 3:40 pm Data Engineering

Deciphering Data Architectures (choosing between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh)

James Serra Data & AI architect at Microsoft

DE Summit: Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they’re also surrounded by a lot of hyperbole and confusion. In this presentation I will give you a guided tour of each architecture to help you understand its pros and cons. I will also examine common data architecture concepts, including data warehouses and data lakes. You’ll learn what data lakehouses can help you achieve, and how to distinguish data mesh hype from reality. Best of all, you’ll be able to determine the most appropriate data architecture for your needs. The content is derived from my book Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh.

3:30 pm - 3:40 pm ML Safety & Security

I don't always secure my ML models, but when I do...

Hailey Buckingham Director of Data Science at HiddenLayer

Cyber attacks against ML and AI systems are becoming more and more frequent. Public, open source ML models are essentially code-as-data, which puts our organizations at risk. But whose responsibility is it to secure these systems? ML Operations and Engineering teams already split their time between operationalizing ML systems and researcher enablement. Adding security workloads might seem like a step too far. However, there are many benefits that ML teams can yield by taking part in security concerns which may make the effort well worth it, not only for the overall organization but for the ML team themselves. Spending cycles on security hardening is far from desirable for most ML Operations and Engineering teams. At first glance, engaging in an entirely new discipline would seem like folly given the already diverse set of disciplines ML and AI projects require. Furthermore, shifting security operations responsibilities onto teams which likely have little or no security training should reasonably raise at least one eyebrow. But looking a little deeper, it turns out that there are a lot of good reasons for ML Engineers, ML Ops teams, and even Data Scientists to participate in security thinking and planning. From deeper understanding of the ML systems themselves, to insights into user behavior, to reinforcing good operational habits, the benefits to the ML-based teams are plentiful. And that’s even before an ML-based security event comes into play. In this talk we’ll dive into each of these areas in detail. We’ll discuss how using security tools specifically designed for AI can precipitate a number of additional benefits which are likely already on the ML teams’ wishlist. These same tools will also help increase collaboration with security teams and improve the organization’s security posture.

3:30 pm - 4:00 pm Machine Learning

Machine Learning Across Multiple Imaging and Biomarker Modalities in the UK Biobank Improves Genetic Discovery for Liver Fat Accumulation

Sumit Mukherjee Staff Machine Learning Scientist at Insitro

Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD), a condition where the liver contains more than 5.5% fat, is a major risk factor for chronic liver disease, affecting an estimated 30% of people worldwide. Although MASLD is a genetically complex disease, large-scale case-control cohort studies based on MASLD diagnosis have shown only limited success in discovering genes responsible for MASLD. This is largely due to the challenges in accurately and efficiently measuring the disease characteristics, which is often expensive, time-consuming, and inconsistent. In this study, we showcase the power of machine learning (ML) in addressing these challenges. We used ML to predict the amount of fat in the liver using three different types of data from the UK Biobank: body composition data from dual-energy X-ray absorptiometry (DXA), plasma metabolites, and a combination of anthropometric and blood-based biochemical markers (biomarkers). For DXA-based predictions, we used deep learning models, specifically EfficientNet-B0, to predict fat content from DXA scans. For predictions based on metabolites and biomarkers, we used a gradient boosting model, XGBoost. Our ML models estimated that up to 29% of participants in the UK Biobank met the criteria for MASLD, while less than 10% received the clinical diagnosis. We then used these estimates to identify regions of the genome associated with liver fat, finding a total of 321 unique regions, including 312 new ones, significantly expanding our understanding of the genetic determinants of liver fat accumulation. Our ML-based genetic findings showed a high genetic correlation with clinically diagnosed MASLD, suggesting that the genetic regions we identified are also likely to be relevant for understanding and diagnosing the disease in a clinical setting. This strong correlation underscores the potential of our approach to contribute to real-world medical applications.
Our findings highlight the value of ML in identifying disease-related genes and predicting disease risk, demonstrating its potential to enhance our understanding of complex diseases like MASLD. This study highlights the potential of data science to help transform healthcare research and improve patient outcomes.

3:30 pm - 4:00 pm Data Engineering

Data Engineering in the Era of Gen AI

Anindita Mahapatra Solutions Architect at Databricks

In the era of Gen AI, the landscape of data engineering is undergoing a transformative evolution, and this talk delves into the pivotal role it plays in harnessing the power of artificial intelligence. The session explores the dynamic interplay between data engineering and the emerging generation of AI technologies, highlighting key strategies to adapt and thrive in this data-driven era. The discussion begins by examining the unique challenges and opportunities posed by Gen AI, where advanced machine learning algorithms and neural networks demand a sophisticated and scalable data infrastructure. The speaker emphasizes the importance of building resilient pipelines that can seamlessly integrate diverse and massive datasets, ensuring a robust foundation for training and deploying AI models. The talk also delves into the crucial aspect of data quality and governance in the context of Gen AI, emphasizing the need for meticulous data engineering practices to mitigate biases and ensure ethical AI development. Furthermore, the session explores cutting-edge technologies and best practices, such as real-time data processing and federated learning, that empower data engineers to stay at the forefront of innovation. Ultimately, this talk serves as a comprehensive guide for data engineers navigating the complexities of Gen AI, offering insights, strategies, and real-world examples to inspire and equip professionals in the rapidly evolving field of data engineering.

3:30 pm - 4:00 pm Machine Learning

Metrics & Visualizations for Evaluating Synthetic Data Quality

Neha Patki Co-Founder & Head of Product at DataCebo
Srini Kadamati Staff Developer Advocate at DataCebo

Synthetic data has shown great promise for solving a variety of problems like addressing data scarcity for AI and overcoming barriers to data access. But the field of synthetic data generation is still extremely nascent and we haven’t converged on a set of common benchmarks for evaluating the quality of synthetic data. Our team originally came from MIT’s Data-to-AI Lab and we’ve spent years researching and collecting the best metrics for evaluating synthetic data quality like CategoricalCAP, Boundary Adherence, and more.
Learning Objectives: Learn the basic approach of evaluating synthetic data by comparing columns with your original data. Most of the data in organizations and business is structured, relational, and tabular. Learn about the unique problems that synthetic data generation can solve based on our experience helping thousands of individuals work with synthetic data. Choosing the right synthetic data quality metrics isn’t easy and is tied closely to the goal of your project. We’ll showcase our recommended framework, which incorporates the context & expertise of domain experts and specific, interpretable statistical measures. Learn which metrics and visualizations you should use for each data type. What are the most common pitfalls and mistakes people make when generating synthetic data?
Takeaways: Statistical measures are necessary but insufficient for evaluating synthetic data. Domain expertise is important for defining business rules that your data should follow, independent of just the quality score itself. Using side-by-side visualizations of quality scores can help communicate synthetic data quality to your stakeholders and collaborators. The goals of a project play a big factor in how you evaluate the quality of synthetic data. When evaluating synthetic data, avoid common statistical pitfalls. For example, it’s tempting to rely on correlation between columns in the original data and synthetic data but often the linearity assumption is violated.
Tools: Plotly and SDMetrics, both completely open source (MIT licensed). Examples of visualizations we’ll showcase are here and here.
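The correlation pitfall mentioned in the takeaways can be shown with a few lines of arithmetic. The sketch below uses only the Python standard library, not SDMetrics, and the two toy columns are invented for illustration: one column is fully determined by the other, yet the Pearson coefficient is zero because the relationship is not linear.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy columns (made up): `squared` depends entirely on `values`, but not linearly.
values = [-3, -2, -1, 0, 1, 2, 3]
squared = [v * v for v in values]
linear = [2 * v + 1 for v in values]

if __name__ == "__main__":
    print("linear relationship:", pearson(values, linear))
    print("quadratic relationship:", pearson(values, squared))
```

A pairwise correlation score would therefore say nothing about whether synthetic data preserved the quadratic dependency, which is why a side-by-side plot of the two columns is a useful complement.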

3:40 pm - 3:50 pm Big Data Analytics

Building Knowledge Graphs

Sumit Pal Strategic Technology Director at Ontotext

Knowledge graphs are all around us and we use them every day. Many emerging data management products, such as Data Catalogs, Data Fabric, and MDM products, use Knowledge Graphs as their engines. A knowledge graph (KG) is not a one-off engineering project. Building a KG requires collaboration between functional domain experts, data engineers, data modelers and key sponsors. It also combines technology, strategy and organizational aspects; focusing only on technology leads to a high risk of a KG’s failure. KGs are effective tools for capturing and structuring a large amount of structured, unstructured and semistructured data. As such, KGs are becoming the backbone of different systems, including semantic search engines, recommendation systems, conversational bots and data fabric. This session shows data and analytics professionals the value of Knowledge Graphs and how to build semantic applications.

3:30 pm - 4:00 pm ML Safety & Security

How to Preserve Exact Attribution through Inference in AI: Get the Correct Explanations and Preserve Privacy via Instance-Based Learning

Chris Hazard, PhD CTO and Co-founder at Howso

Most forms of machine learning explainability are ex-post; they attempt to create an approximate model of a model in order to try to understand why a prediction was made. For data scientists working with AI models today, that won’t cut it. There is an increasing need for full data transparency and explainability to mitigate against bias, incorrect information, and hallucinations — as well as increasing demands for privacy. In this session, hear from noted computer scientist and AI expert, and founder of a leading explainable AI company, Dr. Chris Hazard. He will show data practitioners how to leverage cutting-edge instance-based learning (IBL) to solve these problems. Most AI today is black box. IBL offers a fully explainable AI alternative, having a precise on/off switch for data provenance and lineage through inference. With IBL, the derivation of each inference can be easily understood from the data. Having worked with IBL for over a decade, Chris will explain how modern IBL techniques, built around information theory, have modern model performance characteristics. He will also show how IBL techniques have extremely strong robustness to adversarial attacks and are automatically calibrated. Attendees will learn how the same mechanisms that yield this performance are closely related to differentially private mechanisms, and how to deploy them to generate strongly private synthetic data at scale. Hearing practical examples, attendees will learn why attribution through inference is vitally important for data-centric AI, how to debug data and understand outcomes, and how to protect privacy and anonymity when it matters.
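As a rough illustration of why instance-based learning allows exact attribution, the sketch below uses a plain k-nearest-neighbors vote, where the prediction is derived directly from specific training cases that can be listed. This is a generic textbook technique, not Howso's method; the training cases, labels, and the function name `predict_with_attribution` are hypothetical.

```python
import math

# Hypothetical training cases: (feature vector, label).
CASES = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.9), "low"),
    ((4.0, 4.2), "high"),
    ((4.1, 3.8), "high"),
    ((3.9, 4.0), "high"),
]


def predict_with_attribution(cases, query, k=3):
    """Majority vote over the k nearest cases.

    Returns the predicted label together with the indices of the exact
    training cases that produced it, so the inference can be traced to data.
    """
    ranked = sorted(range(len(cases)), key=lambda i: math.dist(cases[i][0], query))
    nearest = ranked[:k]
    votes = {}
    for i in nearest:
        label = cases[i][1]
        votes[label] = votes.get(label, 0) + 1
    prediction = max(votes, key=votes.get)
    return prediction, nearest


if __name__ == "__main__":
    label, used = predict_with_attribution(CASES, (4.0, 4.0))
    print("prediction:", label)
    print("derived from cases:", [CASES[i] for i in used])
```

Because the prediction depends only on the listed cases, removing one of them from `CASES` and re-running the query shows exactly how that record influenced the outcome.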

3:50 pm - 4:00 pm Machine Learning

Integrating Data Science and MLOps: How to structure a collaboration and handoff process

Thomas Loeber Senior Machine Learning Engineer at Logic20/20

While there has been immense progress in developing increasingly powerful ML and AI models, many organizations still struggle with productionalizing even basic models. Most of the recent progress in how to reliably operate ML models in production has come from the emerging field of MLOps. However, this is giving rise to the new challenge of how to integrate MLOps into the traditional data science workflow. This presentation starts by framing the problem as an inherent conflict that arises from the fundamentally different needs of the explorative and interactive workflow in data science, compared to the *engineering* mindset required to manage the complexity of software systems running in production. From this perspective, it becomes clear that this dilemma can’t simply be solved by imposing a *common* set of best practices. Instead, we need to define a different set of quality standards for each side, and then find a good process for handing off work from data scientists to MLOps engineers. There are three main categories of work that need to be handed over: code, models, and data. For each, I discuss the specific challenges involved, and suggest concrete strategies to overcome these. The final section delves into general recommendations for structuring a successful handoff process. A particular focus is on how to reduce the gap between data scientists and MLOps engineers in the first place by building in mutual collaboration throughout the ML lifecycle. Most importantly, I suggest locating both sides on the same team, and identify specific points in the workflow where collaboration is most beneficial.

4:05 pm - 4:20 pm

The Diagram-based Deep Learning Framework that Brings Visibility to Model Architectures

David Winer Co-founder at Cerbrec

There is no doubt that the transformer model architecture has revolutionized AI, but working with these models brings an ever-expanding set of technical challenges and long learning curves. In this case study, we will navigate the Llama-2 model architecture and steer fine-tuning for customized document summarization. We will use Graphbook, the deep learning framework that brings visibility and intelligent guidance to researchers focusing on model architectures.

4:05 pm - 4:35 pm Deep Learning

Preserving Digital Privacy: AI-powered Tools for Authorship Anonymization

Hemanth Kandula Research Engineer II at Raytheon BBN Technologies

Many people are concerned about leaving digital traces online that might be attributed to them or used against them in ways they didn't intend. Authorship of anonymous online texts has until recently not been a primary concern, despite some high-profile "sock puppet" cases. Recently, authorship attribution, the process of identifying the author of a text, has gained significant attention due to its implications in various domains, from forensic analysis to literary studies. The authorship of texts like the QAnon messages still seems to be a matter of great speculation. With the advent of data science technologies and powerful GPUs, authorship attribution tools are now being scaled up in ways that had not previously been feasible. These tools, powered by advanced natural language processing (NLP) techniques, can potentially identify the authors of anonymous texts, even when efforts are made to obscure stylistic elements. In this talk, I will outline the current capabilities of web-scale authorship attribution and present our recent research on innovative methods for authorship obfuscation. These methods protect user privacy by concealing stylistic writing traits without altering the underlying message. Using the latest in NLP techniques, a combination of large language models (LLMs), reinforcement learning, and unique orthography, our research showcases how current authorship attribution tools can be eluded by authorship obfuscation tools that eliminate stylistic traits while preserving meaning. I will present ways to measure obfuscation, meaning, and fluency preservation to evaluate these novel applications of AI. As we navigate the fine line between technological advancement and ethical considerations, the ramifications of these developments on privacy, security, and freedom of expression are profound.
I will conclude with a forward-looking discussion on the future of authorship attribution and obfuscation technologies, highlighting the need for a balanced approach that safeguards individual privacy while maintaining the integrity of online content. This talk is designed for data science professionals interested in the intersection of AI, linguistics, and cybersecurity, offering insights into the latest research and practical applications in safeguarding digital anonymity.

4:10 pm - 4:20 pm Generative AI

How to Rigorously Evaluate GenAI Applications

Pasquale Antonante, PhD Co-Founder and CTO at Relari AI

Evaluating GenAI applications is hard. The non-deterministic nature of LLMs and other components makes them difficult to measure and improve. In this session, we will walk through state-of-the-art techniques for LLM evaluation and industry best practices.

4:05 pm - 4:35 pm Machine Learning

Fallacy of Scale

Trevor Back Chief Product Officer at Speechmatics

The leaps in AI made over the last few years – particularly LLMs – have been achieved with scale, i.e., training models on increasingly large datasets. Whether LLMs, like GPT-4, can be improved with another run is a question of scale. The current approach – train models on huge datasets – has reliably delivered impressive results. But the quality of any LLM is determined by the size of the dataset it’s trained on and there’s a limit to those datasets, even if they are the size of the internet. The wider industry has been calibrated to believe that more data equals improved models, so we chase bigger and bigger runs and will likely see the first $1bn run within a year. But there’s a limitation to this approach – the data itself. Many communities lack datasets of a comparable size to those in English, and even that has its limitations. The day is approaching when scale alone won’t be enough to deliver meaningful advances. Efficient learning is a key component to true intelligence. Therefore, a focus on efficiency – learning deeper understanding from smaller data – is becoming increasingly important as scale reaches its limits to growth. With increased efficiency, it will be possible to continue the rapid advancement of AI, and potentially even more capable and intelligent models as higher levels of abstraction and representations can be learned. To create the next generation of intelligent algorithms that can deliver for less well-represented communities beyond English speakers, we will need continued progress in efficient learning mechanisms and methods.
Learning outcomes:
· Why a focus on algorithmic sample efficiency is required to enable further advancement in AI.
· The advantages that efficient learning models can provide, from removing blockers to the development of theory of mind to increasing the representation of less well-represented communities in speech tech.
· Why LLMs are just the beginning, and the applications for speech technology once models can truly understand intent.
· What the next generation of intelligent systems looks like and what it’ll take to get us there.

4:20 pm - 4:35 pm

Generative AI Applications for Pharmaceutical Companies ~From R&D to Quality Assurance and Marketing DX~

Rei Araki CEO at GenerativeX

Using generative AI, pharma companies can efficiently manage extensive clinical data, ensure quality assurance, and support medical representatives. GenerativeX creates pharma company-specific AI to quickly achieve valuable digital transformations.

4:40 pm - 5:10 pm Machine Learning

Vector Embeddings: The Emerging Language of AI

Pawel Zimoch CTO at Featrix

In the rapidly evolving landscape of machine learning and artificial intelligence, a foundational element has emerged as a cornerstone for advancements in processing data of all types: vector embeddings. This talk demystifies vector embeddings for developers and data scientists. Vector embeddings are the intricate mathematical representations that enable machines to interpret and manipulate data – from text to digital images and beyond. Vector embeddings are increasingly recognized for their critical role in the development of generative AI and sophisticated machine learning models. Our discussion will explore the universal appeal of vector embeddings as a 'language' of AI, delve into their construction, and highlight their diverse applications. Attendees will gain insights into the creation of embeddings and learn the processes that translate complex data into formats suitable for machine learning. We will discuss the significance of embeddings in familiar domains such as natural language processing and computer vision, exemplified by technologies like Google's BERT and OpenAI's image generation models. Moreover, the session will cover various uses of embeddings beyond increasingly common LLM applications, offering a glimpse into their potential to revolutionize all AI-driven solutions. A key focus will be on practical knowledge, including where to source embeddings, the distinctions between different models and approaches, and strategies for leveraging embeddings in machine learning projects. We will introduce four mental models to aid in conceptualizing embeddings and maximizing their utility: as interfaces for neural network composition, tools for dimensionality reduction, trainable abstractions, and search indices for databases. This talk will equip the audience with an understanding of vector embeddings, empowering them to harness this technology in crafting innovative AI and machine learning projects. 
Join us to explore how vector embeddings are shaping the future of AI and how you can leverage them to unlock new possibilities in your work.
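To make the "search index" mental model concrete, the sketch below ranks items by cosine similarity between vectors. The three-dimensional vectors are hand-made, hypothetical values chosen for illustration (real embeddings come from a trained model and have hundreds of dimensions), and `cosine_similarity` and `nearest` are names invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(index, query_vec, exclude=None):
    """Return the key in index whose vector is most similar to query_vec."""
    candidates = [key for key in index if key != exclude]
    return max(candidates, key=lambda key: cosine_similarity(index[key], query_vec))


# Hypothetical toy "embeddings"; not produced by any real model.
index = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
}

match = nearest(index, index["cat"], exclude="cat")
print(match)
```

A production system would typically replace the linear scan in `nearest` with an approximate nearest-neighbor index, but the similarity measure plays the same role.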

4:40 pm - 5:10 pm Data Engineering

Record Level Data Lineage with Trino

Pankaj Yawale Senior Principal Architect at Zoominfo
Ethan Peck, PhD Director, Data Engineering at Zoominfo

In a world with Gen-AI, it is becoming increasingly necessary to be able to track the lineage of data. A crucial aspect of data lineage is tracking the movement and transformation of data records. Data lineage is necessary for Data Operations and Governance that support incident response, legal investigations, and privacy and compliance standards. However, things can go wrong due to the proprietary hand-coded business logic that alters the data and obfuscates provenance. When that happens, the current data lineage systems that operate at the dataset/table level are not very helpful. They require additional analysis effort that can be expensive and time-consuming. In today's data landscape, we need record-level lineage that can pinpoint the exact source and cause of data issues with minimal manual intervention. This problem has long been neglected due to its complexity, but we have a solution to propose. In this presentation, we will introduce a novel concept and its reference implementation using Starburst Enterprise Platform, which is built on Trino. Attendees will learn about some scenarios where record-level lineage can be important, as well as a methodology to track record-level lineage using Trino.

4:40 pm - 5:10 pm Data Visualization

Practical Strategies for Data Storytelling

Ryan Harter Senior Staff Data Scientist at Shopify

How can we influence decisions when our stakeholders are tired, busy, and distracted? In this talk I'll share the strategies I've developed to build tight data stories that actually influence decisions. When I led the Executive Insights team at Shopify, I quickly learned that attention is scarce. This is a human limitation - even the most engaged audiences can only hold focus for about twenty-five minutes. More often, I try to land my story in less than five minutes. With practice, I developed tools to quickly build data stories that are easy to read and easy to share. This allows me to influence decisions at the executive level and throughout the company. For the last two years, I've honed these tools by running a Data Storytelling workshop for my peers at Shopify. Now, I'll share these tools with you. Here's a sample of some of the strategies we'll discuss:
* The **1/5/15 minute rule**: build trust with your reader by making it clear whether your analysis is relevant to _them_.
* **Don't show your work**: data scientists want the data to do the talking, but end up weighing down their presentation with too much detail. I'll show you how to build a rigorous presentation without sacrificing readability.
* **Avoid the academic style** - a presentation is not a white paper. Understand your genre to write a great story.
* **Make a mess, clean it up** - writing is a two-stage process. Trying to do it all at once leads to writer's block and anxiety. Following this pattern will make writing easy (almost).
Telling a good story is critical to _impact_. Join me for thirty minutes to build your influence through better data storytelling.

9:00 am - 9:25 am

Social and Ethical Implications of Generative AI

Abeba Birhane Senior Fellow in Trustworthy AI at Mozilla Foundation | Adjunct Lecturer/Assistant Professor at Trinity College Dublin

As Artificial Intelligence systems pervade day-to-day life, the need for these systems to be robust, fair, accurate, and just has become urgent. As the foundational backbone of AI systems, large scale datasets play a crucial role in the performance, accuracy, robustness, fairness and trustworthiness of AI systems. In this talk, I: a) present work that highlights numerous concerns arising from large scale datasets, b) discuss the downstream impact of such datasets on models (including the exacerbation of societal biases and negative stereotypes) and c) review some approaches to both incremental improvements as well as shepherding broader structural change.

9:00 am - 9:25 am

Algorithmic Auditing

Cathy O’Neil, Ph.D. Data Scientist and Author of the New York Times Best-seller Weapons of Math Destruction, CEO at ORCAA

Session Abstract Coming Soon!

9:30 am - 9:55 am LLMs

Deep Reinforcement Learning in the Real World: From Chip Design to LLMs

Anna Goldie Senior Staff Research Scientist | Google DeepMind

ODSC Keynote: Reinforcement learning (RL) is famously powerful but difficult to wield, and until recently, had demonstrated impressive results on games, but little real world impact. I will start the talk with a discussion of RL for Large Language Models (LLMs), including scalable supervision techniques to better align models with human preferences (Constitutional AI / RLAIF). Next, I will discuss RL for chip floorplanning, one of the first examples of RL solving a real world engineering problem. This learning-based method can generate placements that are superhuman or comparable on modern accelerator chips in a matter of hours, whereas the strongest baselines require human experts in the loop and can take several weeks. This method was published in Nature and used in production to generate superhuman chip layouts for the last four generations of Google’s flagship AI accelerator (TPU). Session Outline: deep reinforcement learning, RLHF, RLAIF, constitutional AI

9:30 am - 9:55 am

ODSC Keynote: Accelerating AI Adoption for DoD Decision Advantage

Dr. William (Bill) W. Streilein Chief Technology Officer at DoD Chief Digital and AI Office

This talk will provide an overview of the DoD's Chief Digital and AI Office and the strategic approach being pursued to accelerate the adoption of data, analytics, and AI for decision advantage. Aspects of global competition will be discussed in the context of rapid technology innovation and challenges and approaches to leveraging AI in the department will be considered. A model of a digital ecosystem that includes partners from industry, academia, and across the government, leverages agile experimentation, capability assurance and operational monitoring to chart a path forward and address current impediments to modernization, including the "valley of death." Special focus is paid to Generative AI and its implications for MLOps pipelines and department innovation uptake. In particular, CDAO's LLM Maturity Model will be discussed as a way to align department needs in generative AI with Industry capabilities and drive assured use forward.

10:00 am - 10:30 am LLMs

Large Language Models as Building Blocks

Jay Alammar Director, Engineering Fellow (NLP) at Cohere

Abstract Coming Soon!

10:00 am - 10:30 am Multimodal and Deep Learning

End-to-End Speech Recognition: The Journey from Research to Production

Tara Sainath, PhD Principal Research Scientist at Google DeepMind

End-to-end (E2E) speech recognition has become a popular research paradigm in recent years, allowing the modular components of a conventional speech recognition system (acoustic model, pronunciation model, language model), to be replaced by one neural network. In this talk, we will discuss a multi-year research journey of E2E modeling for speech recognition at Google. This journey has resulted in E2E models that can surpass the performance of conventional models across many different quality and latency metrics, as well as the productionization of E2E models for Pixel 4, 5 and 6 phones. We will also touch upon future research efforts with E2E models, including multi-lingual speech recognition.

10:00 am - 10:30 am

Being, Training, and Employing Data Scientists: Wisdoms and Warnings from Harvard Data Science Review

Dr. Xiao-Li Meng Founding Editor-in-Chief at Harvard Data Science Review | Whipple V. N. Jones Professor of Statistics

Session Abstract Coming Soon!

10:00 am - 10:30 am

All Models Great and Small: It’s 2024, You Don’t Have to Use GPT-4 for Everything Anymore

John Dickerson, PhD Co-founder and Chief Scientist at Arthur

Last year, enterprises and hobbyists alike experimented with the most recent easy-to-use, broadly-available, general-purpose foundation models to solve any task at hand. In our experience, 99% of those enterprises and hobbyists were (and are) experimenting with GPT-3.5 and GPT-4. This year, as we move from Notebook demo GenAI-based applications into full production applications, those same users are realizing that this stuff is expensive, it’s slow, and that smaller, open-source, fine-tuned models can compete with or entirely outperform the bigger models. In 2023, you didn’t need to prove positive ROI on a GenAI application; in 2024, you absolutely do. It’s time to think about model selection. In this talk, we cover the pros, the cons, and the open questions around selecting a bleeding edge foundation model over one of many task-specific, smaller models. We’ll cover the cloud versus edge discussion that is happening across all enterprises, touching on latency vs accuracy tradeoffs and then diving deeper into how that intersects with fine-tuning task-specific models for your particular use case. A spoiler: we’ll conclude with a recommendation that sometimes the expensive OpenAI/Anthropic solution is worth it, but that (a personal prediction!) offerings from Mistral, Nomic, Hugging Face, AI2, and others will rear their heads to an astounding degree in the coming year.

10:10 am - 10:40 am Data Visualization

How to Become a True Dataviz Pro

Nick Desbarats Globally recognized educator and best-selling author at Practical Reporting Inc.

Many analytics, data science and AI initiatives involve presenting data to stakeholders in charts. However, many charts are poorly designed and leave audiences confused, unmoved, or misled, even when the chart creator wasn’t trying to confuse or mislead anyone. Even charts from high tech companies, universities, government agencies and major news media outlets often suffer these fates. Why do charts so frequently flop with audiences? Often, the reasons are surprisingly mundane: poor chart type choices, poor scale formatting choices, poor color choices, and a host of other basic design problems. These charts are like potentially great documents that are ruined by basic spelling and vocabulary mistakes. Like any other language, the language of data visualization has a “spelling and vocabulary,” that is, a set of skills and best practices that must be learned in order to communicate effectively in that language. The “spelling and vocabulary of data visualization” includes knowing how to choose chart types, scale ranges, and colors (among many other design choices), as well as knowing how to make charts obvious by highlighting key elements, adding key insights as chart titles and annotations, and adding comparison or reference values. If a chart creator hasn’t learned these basic skills, they’re at high risk of producing ineffective (and potentially misleading) charts. In this eye-opening talk, globally recognized data visualization educator and best-selling author Nick Desbarats explains exactly what it takes to learn the basic “spelling and vocabulary of data visualization,” and how to become a true data visualization pro, able to design clear, compelling charts every time. Attendees should have experience creating basic charts in a data visualization application (Excel, Tableau, Qlik, etc.).

10:50 am - 11:20 am LLMs

Data Automation with LLM

Rami Krispin Senior Manager - Data Science and Engineering at Apple

In today's business environment, data plays a crucial role in decision-making. However, obtaining the required data can be challenging due to data engineering or data science resource constraints, leading to delays, inefficiency, and potential losses. This talk will focus on creating a self-serve bot (e.g., Slack bot) that can serve data requests and support ad-hoc requests by leveraging LLM applications. This involves building a natural language to SQL engine using tools such as OpenAI API or open-source models that leverage the Hugging Face API.

11:00 am - 11:30 am Data Visualization

Data Morph: A Cautionary Tale of Summary Statistics

Stefanie Molin Data Scientist, Software Engineer, Author of Hands-On Data Analysis with Pandas at Bloomberg

Statistics do not come intuitively to humans; they always try to find simple ways to describe complex things. Given a complex dataset, they may feel tempted to use simple summary statistics like the mean, median, or standard deviation to describe it. However, these numbers are not a replacement for visualizing the distribution. To illustrate this fact, researchers have generated many datasets that are very different visually, but share the same summary statistics. In this talk, I will discuss "Data Morph" (https://github.com/stefmolin/data-morph), an open source package that builds on previous research from Autodesk (the "Datasaurus Dozen" (https://damassets.autodesk.net/content/dam/autodesk/research/publications-assets/pdf/same-stats-different-graphs.pdf)) using simulated annealing to perturb an arbitrary input dataset into a variety of shapes, while preserving the mean, standard deviation, and correlation to multiple decimal points. I will showcase how it works, discuss the challenges faced during development, and explore the limitations of this approach.
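The core point, that different datasets can share the same summary statistics, can be checked in a few lines. This sketch does not use Data Morph or simulated annealing; as a stand-in it uses the first two datasets of Anscombe's quartet, a classic example of the same phenomenon, and the helper name `pearson` is an assumption of this example.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Anscombe's quartet, datasets I and II (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("I", y1), ("II", y2)):
    print(
        f"dataset {name}: mean={statistics.mean(y):.2f} "
        f"stdev={statistics.stdev(y):.2f} corr={pearson(x, y):.2f}"
    )
```

The two printed rows agree to two decimal places, yet plotting `y1` against `x` gives a noisy line while `y2` traces a smooth curve, which is why the summary numbers cannot replace a plot.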

11:00 am - 11:30 am LLMs

Training an OpenAI Quality Text Embedding Model from Scratch

Andriy Mulyar Founder & CTO at Nomic AI

Text embeddings are an integral component of modern NLP applications, powering retrieval-augmented generation (RAG) for LLMs and semantic search. High-quality text embedding models are closed source, and access to them is gated via the APIs of leading AI companies. This talk describes how Nomic AI trained nomic-embed-text-v1, the first fully auditable open-data, open-weights and open-training code text embedding model that outperforms OpenAI Ada-002. You will learn how text embedding models are trained, the various training decisions that impact model capabilities and tips for successfully using them in your production applications.

11:00 am - 11:30 am Deep Learning

Trial, Error, Triumph: Lessons Learned using LLMs for Creating Machine Learning Training Data

Matt Dzugan Director of Data at Muck Rack

We've all been in situations where we'd like to build a model but lack the labeled training data to do so. I plan to discuss how the advent of Large Language Models (LLMs) like GPT-4 has opened new avenues for generating training data. Traditionally, the creation of NLP datasets relied heavily on manual, crowdsourced hand-labeling, often resorting to platforms like Mechanical Turk. This approach, while effective, presented significant challenges in terms of cost, time, and scalability. In this talk, I will share a comprehensive narrative of our journey from initial trials and errors to eventual triumphs in using LLMs for NLP data generation. The shift from manual to AI-assisted data creation marks a pivotal change in how we approach NLP model training. My team and I navigated through various challenges, experimenting with different strategies and learning valuable lessons along the way. I will discuss how we harnessed the power of LLMs to generate vast amounts of diverse, nuanced data, significantly reducing the time and cost compared to traditional methods. The talk will cover practical insights into fine-tuning these models for specific domains, ensuring data quality, and avoiding common pitfalls such as biases and overfitting. Moreover, I will highlight how LLMs can be creatively used to simulate real-world scenarios, providing richer and more contextually relevant training data. This not only improves the performance of traditional NLP models but also opens up possibilities for exploring new problem spaces within NLP. Attendees will leave with a deeper understanding of the potential and limitations of using LLMs in NLP data generation. They will gain actionable insights and strategies that can be applied in their own NLP projects, accelerating their journey from trial to triumph in the realm of AI-powered data science.

11:00 am - 11:30 am

10 Quick Wins To Expedite Your Job Search

Adam Ross Nelson Data Scientist + Career Coach at Up Level Data, LLC

Career Talk: In today's competitive job market, efficiency is paramount in finding your next opportunity. This high-speed, fast-paced talk provides attendees with concise, actionable strategies that can help job-hunters press fast forward. The presentation distills 10 specific action items. We begin by understanding the power of reconnecting with former bosses, co-workers, and supervisees. We explore the value of former classmates and schoolmates. We look at the power of updating friends and family about your career goals, which can uncover hidden opportunities. A strategic approach to job searching includes strategically sharing job opportunities with others. The goal is to attract offers. If you’re tired of “personal branding” and “networking” this talk is for you. Each strategy is designed to be a quick win; simple to implement but with the potential for substantial impact. Job seekers from all backgrounds will leave this talk with a toolkit of techniques to not only expedite their job search but to do so with a targeted and effective approach. Whether you're a recent graduate or in the midst of a career transition, these insights are tailored to help you navigate the complexities of the job market and emerge successfully. Join us to transform your job search into a dynamic and results-driven journey.

11:30 am - 12:00 pm NLP

Applying Responsible Generative AI in Healthcare

David Talby, PhD Chief Technology Officer at John Snow Labs

The past year has been filled with frameworks, tools, libraries, and services that aim to simplify and accelerate the development of Generative AI applications. However, a lot of them do not work in practice, on real use cases and datasets. This session surveys lessons learned from real-world projects in healthcare that created a compelling POC and only then uncovered major gaps from what a production-grade system will require: 1. Fragility and sensitivity of current LLMs to minor changes in both datasets and prompts, and the resulting accuracy impact. 2. Where guardrails and prompt engineering fall short in addressing critical bias, sycophancy, and stereotype risks. 3. The vulnerability of current LLMs to known medical cognitive biases such as anchoring, ordering, and attention bias. This session is intended for practitioners who are building Generative AI systems in Healthcare and need to be aware of the legal and reputation risks involved and what can be done to mitigate them.

11:35 am - 12:05 pm ML for Biotech and Pharma

Harnessing Machine Learning to Understand SARS-CoV-2 Variants and Hospitalization Risk

Tomasz Adamusiak, MD, PhD Chief Scientist, Clinical Insights & Innovation Cell at MITRE

In this session, we will delve deep into the transformative potential of Machine Learning (ML) in the BioTech and Pharma industry. This talk will provide a comprehensive overview of how ML can be harnessed to accelerate drug discovery, enhance personalized medicine, improve patient outcomes, and drive innovation. We will explore real-world applications, focusing on a case study that involves the analysis of SARS-CoV-2 genetic variants and their association with hospitalization risk. This will provide attendees with a practical understanding of how ML can be applied to complex biological and medical data to derive actionable insights. The session will provide a detailed walkthrough of the use of ML models like XGBoost and analytical techniques like SHapley Additive exPlanations (SHAP) analysis. In addition to exploring these tools and techniques, we will also discuss the challenges that come with integrating ML into existing bioinformatics workflows.
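For readers unfamiliar with the idea behind SHAP, the sketch below computes exact Shapley values by brute force for a tiny hand-written function. It is an illustration of the underlying attribution concept only: it does not use the SHAP library or XGBoost, the `toy_model` and its inputs are hypothetical, and it uses a single all-zeros baseline where practical tools average over background data and use faster approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of each feature for one prediction.

    Features outside a coalition are set to their baseline value.
    Cost grows exponentially with the number of features.
    """
    n = len(instance)

    def value(subset):
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(x):
    # Hypothetical model: two additive terms plus one interaction.
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for `instance` and the prediction for `baseline`, and the interaction term `x[0] * x[2]` is split evenly between the two features involved.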

11:35 am - 12:05 pm Data Engineering

What it Takes to Stabilize a GenAI-first, Modern Data Lake in a Big Company: Provision 20,000 Ephemeral Data Lakes Annually

Moses Lee Staff Software Engineer at LinkedIn

LinkedIn, having joined the exabyte-scale data lake club in 2021, has been at the forefront of data and AI innovations. The year 2023 brought significant challenges and milestones, including the introduction of GenAI LLMs, completion of the Iceberg migration, initiation of the object storage journey, and a renewed focus on data privacy and security. This session delves into the strategies and lessons learned during this transformative period, with a specific focus on stabilizing platforms without compromising advancements in AI, security, and unified SQL.
Overview:
Challenges Faced:
  • Connectivity issues leading to 11 days of GenAI training losses.
  • Live production failures in interactive Darwin notebook queries.
  • Infrastructure development hesitations and trust issues in staging environments.
Approaches and Learnings:
  • Development of a high-throughput system for auto-building lightweight, production data lakes on Kubernetes (K8s) for every code commit and pull request (PR).
  • Scaling flow failure insights using Prometheus, OpenTelemetry, and the Java Virtual Machine (JVM).
Key Discussion Points:
  • Recognizing symptoms indicating the need for reinvestment in foundational infrastructure.
  • Strategies for stabilizing platforms while accommodating rapid innovation in AI, security, and unified SQL.
  • User experience enhancements and architectural iterations implemented during the stabilization process.
  • The journey of productionizing OpenTelemetry at LinkedIn and its impact on observability.
  • Unanticipated challenges faced and successful resolutions encountered along the way.
Results:
  • Currently, the system supports over 20,000 ephemeral data lakes annually.
  • Detection and resolution of 2.1K platform issues each year.
By sharing LinkedIn's experiences and solutions, this session aims to provide valuable insights into managing large-scale data lakes, ensuring stability, and fostering continuous innovation. The discussion will be particularly relevant for data scientists, machine learning engineers, and infrastructure developers seeking to strike a balance between technological advancements and a robust foundation.

11:35 am - 12:05 pm Data Visualization

Practical Applications of Bayesian Statistics for Business Data Science Teams

Matt DiNauta Principal Applied Scientist at Zillow Group

Abstract Coming Soon!

11:35 am - 12:05 pm Generative AI

Generative AI Guardrails for Enterprise LLM Solutions

Preethi Raghavan Vice President, NLP and Machine learning at Fidelity

Generative LLMs have transformed consumer interactions with AI, lowering entry barriers and increasing accessibility to AI-powered solutions. However, this widespread adoption comes with potential unintended consequences, as LLM-generated content may not always be accurate or appropriate. As excitement grows around both vendor-based LLMs like ChatGPT and open-source ones like LLaMa, along with similar generative AI solutions, organizations must remain attentive to the potential risks associated with their use. Ignoring these risks could lead to significant negative impacts on a brand or business. To address these risks and promote responsible generative AI usage in business contexts, it is crucial to implement guardrails on both the input to the LLM and the text generated by the LLM. In this talk, I will both present existing models in the responsible AI landscape and propose a system for content regulation.

12:10 pm - 12:20 pm Generative AI

The Evolution of Professional Assistance: From AI Assistant to AI Agents

Jin Kim Co-Founder and Business Lead at Linq

Generative Artificial Intelligence (GenAI) is set to transform knowledge work worldwide. In this talk, Jin from Linq aims to share insights from deploying GenAI productivity services globally, from the Americas to the Asia-Pacific (APAC) region, offering lessons for integrating GenAI into diverse business processes by discussing:
1. AI Assistants vs. AI Agents: AI Assistants, like ChatGPT, enhance worker capabilities by performing tasks or providing information. They represent the current AI support tier. AI Agents advance this by autonomously planning and optimizing workflows. These agents proactively manage complex work processes.
2. The Impact of AI Agents on Workflow Planning: AI Agents transform workflow management by not only assisting but adapting and actively participating in decision-making, serving as strategic partners.
3. Current Landscape: AI Assistants: Despite their potential, AI Assistants have limitations, primarily in strategic areas. They serve as precursors to the more autonomous AI Agents and the concept of AI multi-agents, mirroring a team structure.
4. Transitioning to AI Agents: Moving from Assistants to Agents is crucial, addressing the gaps in handling complex workflows and offering strategic benefits, especially in fields like consulting and legal services.
5. Data Requirements for the AI Evolution from Assistant to Agent: Key data include user-customized service data, understanding and execution tasks, adapting to changes, complex decision-making, and managing interactions.
6. The Future of AI Agents: Collaborative Multi-Agent Decision-Making: Envision AI Agents working as a team, leveraging varied expertise for superior problem-solving. This includes assembling the right mix of agents for a task, collaborative strategy formulation, and refining approaches based on feedback.
This session aims to elucidate the path from current AI Assistants to future collaborative AI Agent ecosystems, highlighting the strategic adoption of GenAI in professional settings.

12:10 pm - 12:40 pm LLMs

Model Evaluation in LLM-enhanced Products

Sebastian Gehrmann, PhD Head of NLP, Office of the CTO at Bloomberg

Evaluation in machine learning (ML) product development is a rich topic with a long history. However, large language models (LLMs) represent a significant deviation from the known path and introduce a lot of unknowns. Since the same LLM can be flexibly applied in a wide range of contexts both with and without additional tuning, its evaluation must reflect this increased scope. Moreover, since LLMs output natural language instead of discrete classes, we must shift our evaluation focus from classic metrics like accuracy and F1 scores to complex concepts like usefulness, attribution, factuality, and safety. Given this new paradigm, how can we build on long-standing best practices of evaluation, learn from academic research, and build solid evaluation pipelines for LLMs? Furthermore, we must consider the important role that humans play in model evaluations and determine what can be automated -- and whether it should be. In this talk, I will discuss these questions alongside common pitfalls, opportunities, and best practices related to including large language models as an additional ingredient in product development.

12:10 pm - 12:40 pm Machine Learning

Optimizing Workplace with AI and Generative Bots

Aleksandra Przegalinska Vice President at Kozminski University
Tamilla Triantoro, PhD Associate Professor, Business Analytics at Quinnipiac University

This study investigates the interplay between artificial intelligence, human skills, and task characteristics, and their impact on organizational performance. Applying the Resource-Based View and Task Technology Fit theories, we explored how generative AI designed for collaboration, as both a firm resource and a capability, can enhance task execution across different dimensions - routine/creative tasks and easy/complex tasks. We conducted an experimental study involving the development of a marketing campaign with distinct subtasks reflecting these dimensions. Our findings show that firms can gain substantial benefits from integrating AI and that AI improves task outputs in automation, support, creation, and innovation. Our study also suggests a nuanced relationship between humans and AI in creative tasks with humans outperforming AI. The study highlights the value of upskilling and reskilling in AI, and proposes a strategic blend of AI and human creativity for optimal results. These findings have implications for understanding the role of AI in organizational tasks and formulating effective strategies for AI integration in business and beyond. Our exploration includes the innovative use of GPT models as decision-support tools, integrating diverse theoretical perspectives and a clear task division between humans and AI, to enhance both the efficiency and effectiveness of AI-human interactions in various decision-making contexts.

12:20 pm - 12:50 pm MLOps

Shifting Gears to LLMOps: Understanding the Challenges in MLOps for LLMs

Noel Konagai UX Researcher at Google

With the rise of Generative AI, we are increasingly confronted with a pertinent question: what about our MLOps (Machine Learning Operations) needs to change to accommodate LLMs (Large Language Models)? We argue that fundamentally the principles of MLOps are still applicable to LLMs, but the “how” of MLOps changes with LLMs. While LLMs can be used in Classical ML tasks (e.g. sentiment analysis), what complicates MLOps for LLMs is that we see a shift from model-centric thinking to application-centric thinking. A chatbot application may not only contain the LLM itself but it might use Retrieval Augmented Generation (RAG) with a knowledge base to reduce hallucinations, use a fine-tuning process to adjust the tone of the chatbot, and use plug-ins to execute tasks on a third-party platform. Challenges in LLM evaluation ensue: while in Classical ML we had industry standard quantitative metrics such as root-mean-square error that help assess the model performance, with LLMs we enter an ambiguous space with new methods emerging to evaluate the end-user experience. All these additional components complicate running, tracking and evaluating experiments with LLMs. In this talk, we present a five-step process that compares each step of MLOps (discovery, development, evaluation, deployment, and monitoring) for Classical ML with the new challenges of operationalizing LLMs for generative applications. We focus on LLMs used for generative purposes, such as chatbots. Attendees can walk away with an increased understanding of the methods and frameworks to understand their LLM productionization process, better equipped to tackle the challenges of MLOps for LLMs.

12:20 pm - 12:30 pm Responsible AI

Nurturing Responsibility in our AI Endeavors

Rishu Gandhi Senior Data Engineer in Cybersecurity at Wells Fargo

In an era where AI is becoming the norm, embracing responsible and accountable AI practices has never been more crucial. Through real-world examples and actionable insights, we will explore what Responsible AI is and why integrating it in industry practices is not just a choice but a necessity. Together, we will navigate the ethical landscape that defines our role as data enthusiasts in the data science community. Join me in this significant discussion to build sustainable and accountable AI products.

12:30 pm - 12:40 pm Machine Learning

Spark, Dask, DuckDB, Polars: TPC-H Benchmark Results

David Chudzicki Director of Product Engineering at Coiled

Large scale dataframe computations are critical for efficient and friendly data manipulation at scale. This space has blown up recently and there are many new choices. In this talk we run major contenders (Spark, Dask, DuckDB, Polars) through the TPC-H benchmarks both locally and on the cloud at various scales ranging from 10GB to 10TB and see how they perform. This will teach us both about these specific libraries and also about how to measure and think through performance on the cloud. We'll think through topics like IO bandwidth, CPU saturation, memory constraints, as well as challenges in deployment and hardware selection. We'll bring in hardware and networking costs to get a sense for overall cost efficiency in computation. The presenters are biased towards Dask, so we'll use that project to dive a bit deeper into tuning and what's critical, but the overall results should be broadly interesting to anyone in the data infrastructure space.

12:50 pm - 1:00 pm Data Analytics

Using Data Analytics to Promote Efficiency and Consumerism in Healthcare

Shan Xiao Founder at NeoSteed, LLC

I will draw upon over 8 years of experience as a data analyst in the healthcare industry to share insights from my industrial journey. Additionally, I'll discuss the innovative project I'm currently spearheading, which focuses on leveraging novel approaches to transform data into actionable insights and enhance patient engagement.

1:00 pm - 1:10 pm ML Safety & Security

Practical Considerations for Machine Learning in Fraud Prevention Programs

Kwan Lin Principal Data Scientist at MoonPay

In this presentation, we will walk through in general terms how machine learning has been utilized in a financial technology company in the web3 space to detect and prevent fraud. We will cover the particular considerations of fraud as it relates to building and deploying effective machine learning models. We will lightly delve into the code and tooling to illustrate from a practical standpoint how the machine learning system can be constructed. At the end of the presentation, the audience should have a conceptual understanding of how a machine learning program can be implemented to prevent fraud. The effective application of machine learning to detecting fraud does require significant nuance and measured consideration as there are unique attributes to fraud that are not present in other domains, including: the delayed availability of credible labels for fraud (sometimes fraud labels are not available for months after particular events, if ever), variations of the types of fraud that might manifest (Is it first party fraud or third party fraud? Is it truly fraud, or is it merely unfavorable customer behavior?), and considerations around timeliness to be able to prevent fraud rather than to merely react to fraud after the fact, at which point recourse options for a business might be limited. In addition to the domain concerns of how machine learning fits into a broader fraud prevention program, there are practical considerations extending from how fraud detection models are trained, to the machine learning ops considerations that are necessary to serve and maintain timely and reliable models, and through to the support mechanisms necessary to ensure that machine learning is persistently available as a business-critical function.

1:00 pm - 1:30 pm MLOps

Abstracting ARM/x86 CPUs and NVIDIA/Neuron Hardware Accelerator Allocation for Containerized ML App

Yahav Biran, PhD Principal Architect at Amazon Web Services

The shortage of hardware accelerators delays model training for customers with computationally intensive and parallel processing capabilities. Moreover, the lack of applications’ flexibility to support both general-purpose compute and high availability accelerators makes training jobs rigid and difficult to resume after unexpected host interruptions. Also, customers cannot deploy flexible inference services that enable cost, availability, latency, and performance tradeoffs, e.g., by defining compute priorities for inferences with different CPU and HW accelerator prices and locations. Until today, customers who trained models and offered model inference services had to manually configure compute infrastructure requirements that matched their application. If these resources could not be allocated, the job was delayed. Cube-scheduler allows more flexibility for machine learning jobs by automatically detecting and matching job specification to processor and hardware accelerator. Cube-scheduler seamlessly invokes ML software packages on optimal resources by abstracting the underlying runtime packages such as Linux and Python.

1:40 pm - 2:10 pm ML Safety & Security

Navigating the Landscape of Responsible AI: Principles, Practices, and Real-World Applications

Rajiv Avacharmal Corporate Vice President at New York Life Insurance

As Artificial Intelligence (AI) becomes increasingly integrated into our daily lives and business, it is imperative that we develop and deploy AI systems responsibly. The rapid advancement of AI technologies presents both immense opportunities and significant challenges, particularly in ensuring that AI systems are ethical, transparent, and accountable. This session will delve into the critical aspects of Responsible AI: the principles, practices, and real-world applications of this essential field. We will begin by exploring the fundamental principles of Responsible AI, including fairness, transparency, accountability, and privacy. These principles serve as the foundation for developing AI systems that are unbiased, explainable, and aligned with societal values. We will discuss the ethical considerations that must be taken into account throughout the AI lifecycle, from data collection and model training to deployment and monitoring. The session will then focus on the practical strategies and tools for implementing Responsible AI. We will cover techniques for mitigating bias in AI models, such as diverse and inclusive datasets, algorithmic fairness metrics, and continuous testing and monitoring. Attendees will learn about the importance of transparency and explainability in AI, and how to incorporate these principles into the design and development of AI systems. We will also address the critical role of governance and regulation in ensuring Responsible AI. This includes discussing the current landscape of AI regulations and guidelines, such as the EU Ethics Guidelines for Trustworthy AI and the IEEE Ethically Aligned Design framework. We will explore how organizations can establish robust governance frameworks that ensure AI systems meet ethical standards and comply with legal requirements.

2:00 pm - 2:30 pm LLMs

TorchTune: Making LLM Finetuning Accessible for Developers

Suraj Subramanian Machine Learning Advocate at Facebook

Finetuning pre-trained large language models (LLMs) has become a crucial step in many natural language processing (NLP) tasks. However, the process of finetuning can be daunting, especially for developers who are new to the field. TorchTune is a tool that aims to make finetuning LLMs more accessible and straightforward. In this talk, we will explore how TorchTune can help developers demystify the process of finetuning language models. We will discuss the tool's features, its ease of use, and how it can be integrated with various libraries and platforms. Join us to learn how TorchTune can help you streamline your LLM adaptation workflows and take your models to the next level.

2:00 pm - 2:30 pm Big Data Analytics

Conversational Data Intelligence: Transforming Data Interaction and Analysis

Kevin Rohling Head of AI Engineering at Presence Product Group

In our data-rich world, the capacity for efficient and intuitive interaction with massive data sets is more crucial than ever. Kevin Rohling, Head of AI at Presence Product Group, presents an exploration into Conversational Data Intelligence (CDI), a fusion of advanced AI technologies and data analytics that is redefining our engagement with data. CDI emerges as a pivotal innovation, leveraging Large Language Models and Semantic Search to facilitate natural, conversational interactions with complex data. This novel approach simplifies data navigation, breaking down barriers to data literacy and enabling professionals across various disciplines to access and interpret data without specialized data science expertise. The presentation will venture into the practical applications of CDI, showcasing its transformative impact across multiple sectors, including healthcare, legal, finance, and government. By illustrating real-world scenarios, the talk will demonstrate how CDI empowers professionals to make data-driven decisions more efficiently and accurately. Key to this discussion are the core technologies that fuel CDI. We will delve into the integration of Large Language Models, Semantic Search, and Natural Language to Query mechanisms, offering insights into their functionality and role in enhancing data interaction. However, the journey with CDI is not without its challenges. The talk will also address critical concerns such as data privacy, security, and the potential for bias. These aspects are integral to the responsible adoption and evolution of CDI technologies. Attendees will leave with a comprehensive understanding of CDI's capabilities and applications, equipped with insights into how CDI can be effectively integrated into their own industries. 
This talk is more than a presentation; it's an invitation to envision a future where data interaction is more accessible, insightful, and influential in driving innovation and efficiency across various professional landscapes.

2:00 pm - 2:30 pm LLMs

RAG, the bad parts (and the good!): building a deeper understanding of this hot LLM paradigm’s weaknesses, strengths, and limitations

Sara Zanzottera AI Engineer at Kwal

Off-the-shelf Large Language Models (LLMs) such as GPT-4 have already proven their versatility in numerous tasks and are revolutionizing entire industries. However, achieving exceptional performance in highly specific domains can be challenging, and traditional fine-tuning is often not accessible, because its extensive demands in terms of data, finances, and expertise exceed the means of most organizations. Retrieval-Augmented Generation (RAG) is a widely adopted technique to augment the knowledge of LLMs within very specific domains while mitigating hallucinations. RAG achieves this by shifting the burden of information retrieval from the LLM's internal knowledge to external retrieval systems, often more specialized in this task due to their focused scope. However, RAG is not a silver bullet. Getting it to perform effectively can be far from trivial, and for some use cases it’s not applicable at all. In this talk we will first understand what RAG is, where it shines and why it works so well in these applications. Then we are going to see the most common failure modes and walk through a few of them to evaluate whether RAG is a suitable solution at all, how to fix it or alternatively what approaches could be a better fit for the specific use case.

2:00 pm - 2:30 pm ML for Biotech and Pharma

Bringing Precision Medicine to the Field of Mental Healthcare through Large Language Models, AI, and Psychedelics

Gregory Ryslik, PhD Chief Technology Officer at Compass Pathways

In the United States, there are approximately 132 suicides per day (American Foundation for Suicide Prevention, 2023). In fact, suicide numbers have been rising consistently for the past two decades and 2022 recorded the highest number ever in the US with data suggesting that suicide is more common now than any time since the start of WWII (Centers for Disease Control and Prevention, 2023). This panel will focus on the relative lack of progress in the field of mental health and its root causes ranging from a lack of new and novel drugs to an insufficiency of precision diagnostic tools when compared to the rest of the medical field. It will also focus on a variety of solutions that are currently being developed to help reduce this deficit – from new psychedelic-based treatments to novel precision tools based on wearables, vision, and voice measurement. As an example, we’ll dive into how Generative AI and Large Language Models can be used to get measurements of how treatment is progressing from the provider’s and patient’s viewpoints to get to novel object features that could be used to potentially help predict outcomes or improve care.

2:00 pm - 2:30 pm LLMs

Copilot: Generative AI and Large Language Models.

Arushi Jain Applied Scientist (AI & ML) at Microsoft Corporation

Copilots and AI agents are becoming more useful day by day, enhancing our productivity and creativity in mundane tasks. I will start the talk with the evolution of Large Language Models (LLMs), covering different techniques and historical benchmarks in the NLP field. Next, I will dive deeper into the concept of Copilot, how it's useful to us, and different applications for Copilot today. The core of any AI agent system is Language Understanding, so I will explain what Language Understanding is through an example and how it works in M365 Copilot today. Lastly, I will cover some strategies for using LLMs for Language Understanding, like Dynamic Prompting and Fine Tuning, which help in building a robust domain tagging mechanism to improve Search quality in Copilots/AI agents, along with some limitations of Fine Tuning LLMs.

2:20 pm - 2:50 pm MLOps

Cost Containment: A Critical Piece of your Data Team's ROI

Lindsay Murphy Head of Data at Secoda

Data teams spend a lot of time measuring and optimizing the effectiveness of other teams. Unfortunately, we're not so great at doing this for ourselves. In this talk, we will dive into a big blind spot that a lot of data teams operate with–not knowing how much they are costing their business (now and in the future). Given how easy it is to rack up expensive bills in pay-as-you-go tools across data stacks, this can become a big problem for data teams, very fast. We'll discuss tactical methods for building cost monitoring metrics and reporting (and why you should make this a priority), some of the challenges you will face along the way, and suggest ways to implement cost containment best practices into your workflows to drive cost accountability across your data team and company.

2:20 pm - 2:50 pm Machine Learning

The Promise of Edge ML: Bringing Your Model to Your Data

David Aronchick CEO at Expanso

In the intersection of machine learning (ML) and edge computing, this talk will explore the new opportunity in processing data with ML where it's generated. We'll discuss the advantages of edge ML, including immediate insights, privacy preservation, and reduced network demands. Challenges like resource constraints and the need for efficient model management will be addressed, emphasizing solutions such as lightweight architectures and robust MLOps practices. The session will briefly highlight the impact on industries like autonomous vehicles and smart manufacturing, and the environmental benefits of localized data processing. Attendees will understand how edge ML is a strategic necessity for harnessing data's full potential, ensuring privacy, and enhancing operational efficiency. Join us to discover how ML at the edge is driving the next wave of digital innovation.

2:35 pm - 3:05 pm LLMs

LangChain on Kubernetes: Cloud-Native LLM Deployment Made Easy & Efficient

Ezequiel Lanza AI Open source Evangelist at Intel

Deploying large language model (LLM) architectures with billions of parameters can pose significant challenges. Creating generative AI interfaces is difficult enough on its own, but add to that the complexity of managing a complex architecture while juggling computational requirements and ensuring efficient resource utilization, and you’ve got a potential recipe for disaster when transitioning your training models to a real-world scenario. LangChain, an open source framework for developing applications powered by LLMs, aims to simplify creating these interfaces by streamlining the use of several natural language processing (NLP) components into easily deployable chains. At the same time, Kubernetes can help manage the underlying infrastructure. This talk walks you through how to smoothly and efficiently transition your trained models to working applications by deploying an end-to-end LLM containerized application built with LangChain in a cloud-native environment using open-source tools like Kubernetes, LangServe, and FastAPI. You'll learn how to quickly and easily deploy a trained model in a setup designed for scalability, flexibility, and seamless orchestration.

2:35 pm - 3:05 pm Machine Learning

Harmony in Complexity: Unveiling Mathematical Unity Across Logistic Regression, Artificial Neural Networks, and Computer Vision

Dr. Liliang Chen Financial Analytics Manager at Freddie Mac

This presentation embarks on an exploration of the intricate interconnections that bind logistic regression, neural networks, and computer vision, unveiling their shared foundational principles through the lens of linear algebra. The main focus of this exploration is to highlight how abstract mathematical concepts play a crucial role in shaping and bringing together these different methodologies. By drawing meaningful parallels between the construction of logistic regression functions and their mathematical representations, we create a path to understanding the intrinsic relationship between these two entities. In logistic regression, the linear function, dynamically molded by a combination of various features, emerges as a visual metaphor—a plane in the mathematical fabric. This illustration sets the stage for the intricate processes happening in neural networks. In the realm of neural networks, the combination of weights and nodes takes center stage as space surrounded by multi-dimensional planes. The alignment of these planes with linear algebra principles becomes apparent, highlighting the basic math that shapes how neural networks work. Despite their outward dissimilarity, an underlying mathematical structure binds these models together, with the singular differentiator residing in the activation function. Logistic regression leans on the sigmoid function, while neural networks embrace the ReLU function, showcasing the versatile adaptability of these mathematical tools. The widespread use of the ReLU activation function in neural networks and convolutional neural networks (CNNs) reveals a shared common mathematical foundation. CNNs are widely employed in computer vision algorithms. This consistency across architectures underscores the universality of the principles derived from linear algebra. Transitioning into the realm of computer vision, we explore the application of filters as weighted combinations of pixel features. 
This extends the linear algebraic concept to image processing, demonstrating the versatility and applicability of these mathematical principles across diverse domains. In essence, this presentation seeks to illuminate the profound harmony and shared essence of mathematical principles that transcend traditional disciplinary boundaries. It underscores the unifying influence of linear algebra in unraveling the core relationships defining the evolution of machine learning and computer vision paradigms, providing a holistic perspective for researchers.

2:35 pm - 3:05 pm Responsible AI

Making Open Source AI Safe

Richard Mallah Principal AI Safety Strategist at Future of Life Institute

As practical AI systems become more capable and more general-purpose, aligning them with the intent of developers, users, and other stakeholders becomes increasingly important. The attempt to align a system is the attempt to prevent or mitigate potential harms the system might cause or assist, and insufficient alignment can lead to negative consequences of various sizes that depend on the capabilities of the system. Properly aligned systems prevent both system-initiated unintended behaviors and system abuse by users. There are many different aspects to be aligned, and many different pitfalls to address, from disinformation to active manipulation to self-proliferation, and many more. In this talk, we address risk management methods for mapping and prioritizing harms to mitigate from the system or model, and ways to start considering how to perform those harm mitigations, framed from a context of implementing the NIST AI Risk Management Framework. We will review taxonomies of risks from sources including DeepMind and OpenAI, considerations for detecting and measuring risks, methods to prioritize risk mitigation efforts, and approaches to properly aligning a system so as to significantly mitigate those risks. We will particularly drill down on a prioritization analysis looking at the offensive capability of a system within a given domain versus its defensive capability in that domain.

3:00 pm - 3:30 pm Machine Learning

Leveraging Predictive Models and Data Science to Optimize Information Retrieval Systems

Vidhya Suresh Senior Software Engineer at Atlassian
Hareen Venigalla Applied Science Manager at Uber Inc

This presentation explores how data science and predictive modeling optimize the performance and scalability of information retrieval (IR) systems. We'll examine the impact of query analysis, document ranking, and result aggregation on user satisfaction. Our research demonstrates that techniques like keyword extraction, intent analysis, and custom deep ranking models can reduce irrelevant results by up to 26% while decreasing computing costs by more than 39%. We'll address the challenges of scaling IR systems to handle massive datasets and high query volumes, highlighting how predictive models streamline resource-intensive processes. Finally, we'll present optimization strategies leveraging distributed computing, multi-stage caching, and predictive ranking models to enhance throughput, reduce latency, and minimize computational overhead. This presentation offers valuable insights for those interested in the intersection of data science and information retrieval.

3:10 pm - 3:40 pm ML for Biotech and Pharma

Machine Learning in Drug Discovery: How Not to Lie with Computational Models?

Srijit Seal Researcher at Broad Institute of MIT and Harvard

In recent years, predictive toxicity models in drug discovery have seen remarkable progress, driven by the availability of extensive molecular data and the rapid evolution of machine learning (ML) techniques. However, the lack of a comprehensive benchmark tailored to the unique complexities of conditional parameters in toxicity, such as concentration and pharmacokinetics, has hindered the advancement and effective comparison of novel ML algorithms. In response to this challenge, we present a versatile and machine learning-ready benchmark dataset curated from diverse sources, including ChEMBL, PubChem, FDA datasets, and other scientific publications. The challenges we present are designed for machine learning researchers aiming to make impactful contributions to real-world drug discovery. We present a diverse array of predictive tasks relevant to real-world drug discovery, along with parameters such as human pharmacokinetics, dose, concentration, and cell line data. It integrates in vitro data essential for toxicity prediction, such as hERG inhibition and microsomal stability, and delves deeper into in vivo outcomes, such as cardiotoxicity labels encompassing both arrhythmia and structural heart damage. It offers curated datasets for protein target prediction, emphasizing diverse protein functions beyond just inhibition, and covers pharmacokinetics data such as plasma concentration. It also incorporates environmental toxicity data, covering the ecological footprint of drug compounds, and a dataset on natural compounds’ protein binding. We set out specific challenges for classification and regression tasks, as well as multitask and transfer learning models, along with recommended dataset splits for validation that cover various random splits as well as out-of-distribution splits. Each task is tailored to mirror real-world drug discovery challenges and aims to bridge the gap between machine learning predictions and practical drug development outcomes. 
We provide preprocessed molecular features from a wide range of modalities, such as structural features, cell imaging, and gene expression, which can be used as input features for models. This presentation is a collaborative endeavor, pooling insights from both industry and academia, designed to offer ML researchers a benchmark dataset that can be used to make meaningful contributions to real-world drug discovery.

3:10 pm - 3:40 pm MLOps

Beyond Simple A/B Testing: Advanced Experimentation Tactics

Liz Obermaier Data Scientist at Statsig

A/B testing is rapidly establishing itself as a core tool in product development. In this talk, we will start with a recap of standard A/B testing, including best practices. We'll also explore cutting-edge, less familiar but powerful methodologies that address well-known limitations of standard A/B testing. These include Sequential Testing, Multi-Armed Bandits, Switchback Experiments, Stratified Sampling, Heterogeneous Effects Detection, and Experimental Meta-Analysis. Designed for data professionals and product builders, this presentation aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation.
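For readers who want a concrete picture of the "standard A/B testing" baseline mentioned above, here is a minimal sketch of a two-proportion z-test using only the Python standard library. It is an illustration, not material from the talk: the function name `two_proportion_z_test` and the conversion counts are assumptions made up for this example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 200/2000 conversions in control, 250/2000 in treatment.
z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A fixed-horizon test like this assumes the sample size is chosen in advance and the result is checked once; repeatedly peeking at it inflates false positives, which is one of the limitations that methods such as sequential testing are meant to address.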

9:30 am - 10:30 am Generative AI

Multimodal Retrieval Augmented Generation

Valentina Alto Azure Specialist - Data and Artificial Intelligence at Microsoft

Retrieval augmented generation (RAG) has quickly become established as the reference architecture whenever we want to inject custom knowledge into our LLM-powered applications. So far, RAG has been applied mainly to text data. Nevertheless, with the launch of GPT-4-turbo with vision, we can extend the same concept to data other than text, such as images. In this workshop, we are going to cover the architecture behind a typical RAG application and how to incorporate images within this architecture, leveraging GPT-4-turbo with vision. To do so, we will see a practical implementation with Python and LangChain, consuming the model API from Azure OpenAI service.

9:30 am - 10:30 am LLMs

Data Synthesis, Augmentation, and NLP Insights with LLMs

Tamilla Triantoro, PhD Associate Professor, Business Analytics at Quinnipiac University

Data synthesis, augmentation, and NLP insights with LLMs offer a foundational approach to understanding and utilizing artificial intelligence in data science. This workshop is designed to guide participants through the process of creating synthetic data, enhancing datasets through augmentation, and applying NLP techniques to extract valuable insights. These skills are essential in various fields such as social media analysis, customer behavior studies, content generation, and more. By participating in this workshop, you will learn how to generate realistic and functional synthetic data using LLMs. You will also explore methods to enrich this data and make it more applicable for real-world scenarios. Additionally, you will apply NLP techniques to synthesized and augmented data to uncover patterns, sentiments, and trends.

9:30 am - 10:30 am Machine Learning

Introduction to Apache Arrow and Apache Parquet, using Python and Pyarrow

Andrew Lamb Chair of the Apache Arrow Program Management Committee | Staff Software Engineer at InfluxData

DE Tutorial: This workshop will cover the basics of Apache Arrow and Apache Parquet, how to load data to/from pyarrow arrays, CSV and Parquet files, and how to use pyarrow to quickly perform analytic operations such as filtering, aggregation, joining and sorting. In addition, you will experience the benefits of the open Arrow ecosystem and see how Arrow allows fast and efficient interoperability with pandas, Polars, DataFusion, DuckDB and other technologies that support the Arrow memory format.

9:30 am - 10:30 am Generative AI

Open-source AI with Hugging Face

Julien Simon Chief Evangelist at Hugging Face

In this session, you’ll learn how open-source models can help you build high-quality AI applications, generative or not, while giving you more flexibility, control, and ROI than closed-model APIs. We’ll highlight the latest and greatest models, and show you how to get started with them in minutes. Along the way, you’ll also learn about the technical ecosystem that Hugging Face is fostering, from models and datasets, to cloud integrations and hardware acceleration.

9:30 am - 10:30 am

How to Build an Interactive Front End for Your Python Data Science Models

Mingo Sanchez Senior Sales Engineer at Plotly

From natural language processing to spatial analysis, data science models are becoming more advanced every day. For those models to translate into actionable insights, the data needs to be visualized in an interactive, accessible fashion. That’s where front-end analytics tools like Plotly and Dash come in! Plotly is a globally known data visualization library used by millions of people worldwide. Built on top of Plotly is Dash, an open-source framework that translates individual graphs or charts into interactive Python data apps. These web-based data apps expand far beyond what is possible with a traditional “dashboard”: they are interactive, customizable, and can incorporate all types of AI/ML libraries like ChatGPT, LangChain, and TensorFlow. Additionally, data scientists can be autonomous and own the development to deployment cycle of these data apps without the need for IT or full-stack knowledge. In this workshop, Plotly Solutions Engineer Mingo Sanchez will provide a step-by-step tutorial on how to build an interactive front end for your Python data science models. He will cover the following: ● Introduction to the fundamentals of Plotly graphing libraries ● Overview of Dash basic architecture like layouts and callback functions ● Walkthrough of an interactive AI/ML data app ● Examples of tools to help geospatial data use cases ● Showcase other real-world data applications

9:30 am - 11:30 am LLMs

LLM Best Practises: Training, Fine-Tuning and Cutting Edge Tricks from Research

Sanyam Bhutani Sr. Data Scientist and Kaggle Grandmaster

Large Language Models (LLMs) are still relatively new compared to "Traditional ML" techniques, and they come with many new ideas and best practises that differ from training classical ML models. Fine-tuning models can be really powerful for unlocking use cases in your domain, and AI Agents can unlock previously impossible ideas. In this workshop, you will learn the tips and tricks of creating and fine-tuning LLMs, along with implementing cutting-edge ideas for building these systems from the best research papers. We will start by learning the foundations behind what makes an LLM, quickly move into fine-tuning our own GPT, and finally implement some of the cutting-edge tricks for building these models. There is a lot of noise and signal in this domain right now; we will focus on understanding the ideas that have been tried and tested. The workshop will also cover case studies spanning ideas that have worked in practise, and we will dive deep into the art and science of working with LLMs.

10:05 am - 11:05 am Generative AI

Mastering PrivateGPT: Tailoring GenAI for your unique applications

Dr. Daniel Gallego Vico Co-Founder at PrivateGPT | Zylon
Iván Martínez Toro Creator and Main Contributor | Co-founder at PrivateGPT

Tutorial: PrivateGPT, a well-recognized open-source project with 48K GitHub stars and a Discord community of more than 3K supporters, offers a robust framework for developing Private Context-aware GenAI applications. Tailored to support real-world production scenarios, it provides a default set of functions that efficiently handle common tasks like ingestion of documents, contextual chat and completions, as well as embeddings generation. However, its true strength lies in its adaptability, enabling customization for specific applications. This tutorial guides you through various configuration options and extensions of PrivateGPT. You'll begin by gaining hands-on experience with its default API and functionalities using its Python SDK. Subsequently, you'll explore tweaking settings to adapt it to different setups, ranging from fully local, where everything runs on your computer, to multi-service, where the LLM, embedding model or vector database can be served by different services. The final segment of the tutorial will lead you through PrivateGPT's internal AI logic and architecture to learn how to extend its basic RAG functionalities. Upon completing this tutorial, you'll acquire the skills to customize PrivateGPT for any scenario, whether it be for personal use, intra-company initiatives, or as part of innovative commercial production setups.

11:00 am - 12:00 pm Generative AI

Intro to the ChatGPT API

Andras Zsom, PhD Assistant Professor of the Practice, Director of Graduate Studies at Data Science Institute, Brown University

Tutorial: Conversational AI, especially ChatGPT, has become extremely popular over the past year. By January 2023, ChatGPT was the fastest-growing consumer software application in history. While many are familiar with and frequently use its web interface, we will explore its API (Application Programming Interface). The API access allows you to interact with ChatGPT in a Jupyter Notebook or any other coding environment and use it as a developer tool. It radically speeds up the development and deployment of many natural language processing tasks such as text summarization, sentiment analysis, topic modeling, text transformations (such as translation, grammar correction, and style adjustments), and chatbot development. I will show how to perform these tasks during the tutorial. I hope that by the end, you will be well-equipped to start innovating with ChatGPT and develop your own applications. I assume attendees have standard Python knowledge and know how to work with container types (such as lists and dictionaries), control flow (like for loops and if statements), and functions.

11:00 am - 12:00 pm Generative AI

GenAI Assisted Feature Engineering

Sergey Yurgenson Head of Semantic Data Science at Featurebyte

Session Abstract Coming Soon!

12:00 pm - 1:00 pm Deep Learning

How to Practice Data-Centric AI and Have AI Improve its Own Dataset

Jonas Mueller Chief Scientist and Co-Founder at Cleanlab

DE Tutorial: In Machine Learning projects, one starts by exploring the data and training an initial baseline model. While it’s tempting to experiment with different modeling techniques right after that, an emerging science of data-centric AI introduces systematic techniques to utilize the baseline model to find and fix dataset issues. Improving the dataset in this manner, one can drastically improve the initial model’s performance without any change to the modeling code at all! These techniques work with any ML model and the improved dataset can be used to train any type of model (allowing modeling improvements to be stacked on top of dataset improvements). Such automated data curation has been instrumental to the success of AI organizations like OpenAI and Tesla. While data scientists have long been improving data through manual labor, data-centric AI studies algorithms to do this automatically. This tutorial will teach you how to operationalize fundamental ideas from data-centric AI across a wide variety of datasets (image, text, tabular, etc). We will cover recent algorithms to automatically identify common issues in real-world data (label errors, bad data annotators, outliers, low-quality examples, and other dataset problems that once identified can be easily addressed to significantly improve trained models). Open-source code to easily run these algorithms within end-to-end Data Science projects will also be demonstrated. After this tutorial, you will know how to use models to improve your data, in order to immediately retrain better models (and iterate this data/model improvement in a virtuous cycle).
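To make the idea of "using the baseline model to find dataset issues" more concrete, here is a simplified sketch in plain Python. It is not the open-source code demonstrated in the tutorial; it is a toy heuristic in the spirit of confidence-based label-error detection. The function name `find_label_issues`, the thresholding rule, and the data are assumptions for illustration, and the sketch assumes every class has at least one labeled example.

```python
def find_label_issues(labels, pred_probs):
    """Flag examples whose given label disagrees with a confident model prediction.

    labels:     list of integer class labels as they appear in the dataset
    pred_probs: list of per-class probability lists from a baseline model
                (ideally out-of-sample, e.g. from cross-validation)
    """
    n_classes = len(pred_probs[0])
    # Per-class threshold: the model's average confidence in class k
    # over the examples that are labeled k.
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            guess = max(confident, key=lambda k: p[k])
            if guess != y:
                issues.append(i)
    return issues

# Hypothetical data: example 2 is labeled 0, but the model strongly predicts 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(find_label_issues(labels, pred_probs))
```

Flagged examples are candidates for review or relabeling, after which the same modeling code can be retrained on the improved dataset.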

12:00 pm - 1:00 pm NLP

Build AI Assistants with Large Language Models

Rafael Vasquez Software Developer at IBM
James Busche Senior Software Developer at IBM

Over the past year, there has been a surge in the popularity of Large Language Models (LLMs). However, how can we effectively leverage LLMs to augment our businesses? One example would be the integration of LLMs into existing business frameworks through the deployment of AI Assistants. These assistants serve as invaluable tools in addressing customer inquiries and minimizing the demand for technical support within organizations. In this session, we will dive into the practicalities of utilizing LLM-powered AI Assistants and seamlessly integrating them into established systems. This workshop provides an easy-to-follow guide on how to use LLMs, configure the settings for your first AI Assistant with LLMs, and seamlessly integrate the AI Assistant into an established system. Session Outline: 1. Learn about LLM basics: We will be using LLMs hosted on IBM Digital Self-Serve Co-Create Experience (DSCE), but you can also use models that are hosted on other platforms such as Hugging Face. 2. Configure the settings for your first AI Assistant with LLMs: Learn the basics of watsonx Assistant and create the first AI conversation with LLMs. Then apply this chatbot to an established system. Background Knowledge: The attendees will learn about the concept of building a chatbot, create an AI conversation, and integrate it into production.

12:05 pm - 1:05 pm ML for Biotech and Pharma

Introduction to Protein Language Models for Synthetic Biology

Etienne Goffinet, PhD Senior Researcher at Technology Innovation Institute

Protein language models are Transformer-like models that are trained on massive sets of protein sequences (represented as text) in an attempt to learn the biological 'grammar' of proteins. These models have a broad range of applications, thanks to their generative and embedding abilities. In this workshop, we will get more familiar with this type of model, how they differ from their NLP counterparts, and the tasks they can address. We will also get a short overview of the existing open-source models and datasets. During the hands-on session, we will start from a pre-trained language model and develop a basic example of a multi-label protein function classifier. We will then develop, compare, and benchmark different classification approaches, including a simple retrieval-augmented enhancement and fine-tuning.

12:05 pm - 1:05 pm ML for Biotech and Pharma

Data Science in the Biotech/Pharma Research Organization

Eric Ma, PhD Author of nxviz Package | Principal Data Scientist at Moderna

Tutorial: In this hands-off tutorial, I will provide a framework for thinking about, and hence organizing, data science in biotech and pharmaceutical research organizations. Together, we will cover: (1) what the core mission of a data science team should be, (2) the ways a data science team can deliver value to the research organization, (3) major classes of problems and methods, and (4) challenges that are unique to a data science organization in the _research_ space, as contrasted to clinical development, manufacturing, and commercial organizations. By the end of this session, data science leaders at biotech and pharma companies will be equipped with frameworks for thinking about data science problems in the biotech and pharma research space. Executives who are unfamiliar with the research space of data science problems will walk away with a broad, high-level overview of data science problems in research and how to frame and understand their value.

1:10 pm - 2:10 pm Generative AI

Tutorial: Deploying Trustworthy Generative AI

Krishnaram Kenthapadi Chief AI Officer & Chief Scientist at Fiddler AI

Tutorial: Generative AI models and applications are being rapidly deployed across several industries, but there are several ethical and social considerations that need to be addressed. These concerns include lack of interpretability, bias and discrimination, privacy, lack of model robustness, fake and misleading content, copyright implications, plagiarism, and environmental impact associated with training and inference of generative AI models. In this talk, we first motivate the need for adopting responsible AI principles when developing and deploying large language models (LLMs) and other generative AI models, and provide a roadmap for thinking about responsible AI for generative AI in practice. Focusing on real-world LLM use cases (e.g. evaluating LLMs for robustness, security, etc. using https://github.com/fiddler-labs/fiddler-auditor), we present practical solution approaches / guidelines for applying responsible AI techniques effectively and discuss lessons learned from deploying responsible AI approaches for generative AI applications in practice. By providing real-world generative AI use cases, lessons learned, and best practices, this talk will enable researchers & practitioners to build more reliable and trustworthy generative AI applications. Please take a look at our recent ICML/KDD/FAccT tutorial (https://sites.google.com/view/responsible-gen-ai-tutorial) for an expanded version of this talk.

2:00 pm - 3:00 pm Machine Learning

Workflow-based GeoAI Analysis with No/Low-Code Visual Programming

Lingbo Liu, PhD Postdoctoral Research Fellow at Harvard University

In this training session, we will explore the utilization of low-code/no-code visual programming platforms to effectively integrate geospatial analysis with a variety of AI algorithms, including machine learning, deep learning, and Explainable AI. Designed primarily for data science novices, this training enables participants to easily embark on their journey without needing extensive programming expertise. They will learn to harness the platform for advanced spatial analysis and the development of sophisticated AI models. The training is structured into four comprehensive sections: Introduction to the Visual Programming Platform: We will begin by introducing the open-source KNIME Analytics Platform (AP), detailing its basic features and user interface. Participants will become familiar with its intuitive visual programming environment. AI Functions in KNIME AP: This segment will cover the platform's advanced AI functionalities, providing insights into the range and capabilities of its AI tools. Extension on Geospatial Analysis for KNIME AP: Participants will delve into specific geospatial analysis applications, learning how to manage spatial data and execute spatial analyses within KNIME. Case Demonstration: The final part will focus on constructing AI models using the KNIME platform, with a special emphasis on deep learning and explainable AI models. A practical case study will be presented to demonstrate these models' application in geospatial analysis. Through this training, participants, irrespective of their data science background, will gain essential skills to employ the KNIME platform for both geospatial analysis and AI model applications. This will lay a solid foundation for their continued learning and practice in this evolving field.

2:00 pm - 3:00 pm Machine Learning

Idiomatic Pandas

Matt Harrison Python & Data Science Corporate Trainer | Consultant at MetaSnake

Pandas can be tricky, and there is a lot of bad advice floating around. This tutorial will cut through some of the biggest issues I've seen with Pandas code after working with the library for a while and writing three books on it. We will discuss: * Proper types * Chaining * Aggregation * Debugging

2:20 pm - 3:20 pm Generative AI

Stable Diffusion: Advancing the Text-to-Image Paradigm

Sandeep Singh Head of Applied AI/Computer Vision at Beans.ai

This session will introduce attendees to Stable Diffusion, a new text-to-image generation model that is more stable and efficient than previous models. Stable Diffusion is able to generate high-quality images from text descriptions, and it is well-suited for a variety of applications, such as creative content generation, product design, and marketing. Learning Outcomes: By the end of this session, attendees will be able to: - Understand the basics of Stable Diffusion and how it works. - Know the whole landscape of tools and libraries in the Stable Diffusion domain. - Generate images from text descriptions using Stable Diffusion. - Apply Stable Diffusion to their own projects and workflows. - Understand the process of fine-tuning open source models to achieve tasks at hand. This session is relevant to practitioners in a variety of industries, including: Creative industries: Stable Diffusion can be used to generate images for marketing materials, product designs, and other creative projects. Technology industries: Stable Diffusion can be used to develop new applications for text-to-image generation, such as chatbots and virtual assistants. Research industries: Stable Diffusion can be used to conduct research on text-to-image generation and its applications.

2:20 pm - 3:20 pm Machine Learning

No-Code and Low-Code AI: A Practical Project Driven Approach to ML

Gwendolyn D. Stripling, PhD Lead AI & ML Content Developer at Google Cloud

Tutorial: No-code machine learning (ML) is a way to build and deploy ML models without having to write any code. Low-code ML is a way to build and deploy ML models with minimal coding. Both methods can be valuable for businesses and individuals who do not have the skills or resources to develop ML models themselves. By completing this workshop, you will develop an understanding of no-code and low-code frameworks, how they are used in the ML workflow, how they can be used for data ingestion and analysis, and for building, training, and deploying ML models. You will become familiar with Google’s Vertex AI for both no-code and low-code ML model training, and Google’s Colab, a free Jupyter Notebook service for running Python and the Keras Sequential API, a simple and easy-to-use API that is well-suited for beginners. You will also become familiar with how to assess when to use low-code, no-code, and custom ML training frameworks. The primary audience for this workshop is aspiring citizen data scientists, business analysts, data analysts, students, and data scientists who seek to learn how to very quickly experiment, build, train, and deploy ML models.

3:30 pm - 4:30 pm LLMs

Practical Challenges in LLM Evaluation

Hailey Schoelkopf Research Scientist at EleutherAI

Tutorial: Evaluation is critical in both the development and successful downstream deployment of LLMs. It lets you compare models for your use case and assess whether model quality is improving during development or compares favorably with other models. However, many common pitfalls can plague evaluation, causing results to be subtly incorrect, unfair across models, or otherwise misleading or inaccurate. As a developer and researcher studying LLMs who maintains an open source tool for orchestrating and standardizing LLM evaluation best practices, I'll be discussing many of the common pitfalls and challenges to be aware of when evaluating models, ranging from inconsistencies in prompting, tokenization failures, overfitting to benchmarks, data contamination, poor benchmark quality and more. We'll also go over the basic building blocks of evaluation, ranging from free-form text generation, to perplexity, to loglikelihood-based multiple choice evaluation, as well as other evaluation methods including model-based evaluation (LLM-as-a-judge) or preference-based evaluations, adversarial testing, and multimodal and agent or environment-based evals. You should leave this talk thoroughly aware of just how many things in evaluation can go wrong, and how to set yourself up for success in avoiding these common mistakes when evaluating your LLMs during development or for production use cases.
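As a rough illustration of loglikelihood-based multiple choice evaluation, and of how a small scoring decision can change the result, here is a minimal sketch in plain Python. It is not the speaker's tool: the function name `pick_answer` and the per-token log-probabilities are made up for this example, whereas in practice they would come from scoring each answer continuation with the model under test.

```python
def pick_answer(choice_logprobs, normalize_by_length=True):
    """Return the index of the highest-scoring answer choice.

    choice_logprobs: one list of per-token log-probabilities per answer choice.
    normalize_by_length: if True, score by mean log-prob per token instead of
                         the summed log-prob of the whole continuation.
    """
    scores = []
    for token_logprobs in choice_logprobs:
        total = sum(token_logprobs)
        if normalize_by_length:
            total /= len(token_logprobs)
        scores.append(total)
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical log-probs: choice 0 is short, choice 1 is longer but more
# likely per token.
choices = [
    [-0.5, -0.6],
    [-0.3, -0.3, -0.3, -0.3],
]
print(pick_answer(choices, normalize_by_length=False))
print(pick_answer(choices, normalize_by_length=True))
```

With these numbers the summed score favors the shorter choice and the length-normalized score favors the longer one, so two evaluations of the same model can disagree unless the scoring convention is reported and held fixed.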

3:30 pm - 4:30 pm Data Engineering

Engineering Knowledge Graph Data for a Semantic Recommendation AI System

Ethan Hamilton Data Engineer at Enterprise Knowledge

Semantic recommendation systems are a type of AI system that can help surface content in vast repositories by representing the data as a knowledge graph and implementing graph traversal algorithms that return relevant content to end users. These systems can be very useful for clients across industries, and plenty of fun for the data engineers on-board, requiring skills such as auto-tagging, ETL pipeline construction and orchestration, and graph algorithm design and implementation. Learn how to design such a system in this in-depth tutorial.
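A toy sketch of the graph-traversal idea described above, using only the Python standard library: content items are linked to topic tags, and recommendations are found by walking from a seed item to its tags and on to other items that share them. The item names, tags, and the `recommend` function are hypothetical and are not taken from the tutorial.

```python
from collections import Counter

# Toy knowledge graph: each content item points to the topic tags assigned to it
# (in a real system these edges might come from an auto-tagging step).
item_tags = {
    "doc_a": {"graphs", "etl"},
    "doc_b": {"graphs", "etl", "python"},
    "doc_c": {"python"},
    "doc_d": {"graphs"},
}

def recommend(seed, item_tags, top_k=2):
    # Invert the item -> tag edges so we can walk tag -> item.
    tag_items = {}
    for item, tags in item_tags.items():
        for tag in tags:
            tag_items.setdefault(tag, set()).add(item)
    # Two-hop traversal: seed -> its tags -> other items sharing those tags.
    shared = Counter()
    for tag in item_tags[seed]:
        for other in tag_items[tag]:
            if other != seed:
                shared[other] += 1
    # Rank by number of shared tags; break ties by name for determinism.
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_k]]

print(recommend("doc_a", item_tags))
```

A production system would typically store the graph in a dedicated graph database and weight edges by tag relevance; the dictionary here only stands in for that structure.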

4:35 pm - 5:35 pm Machine Learning

Using Graphs for Large Feature Engineering Pipelines

Wes Madrigal CEO / Co-Founder at Kurve, Inc.

Graph data structures provide a versatile and extensible data structure to represent arbitrary data. Data entities and their associated relations fit nicely into graph data structures. We will discuss GraphReduce, an abstraction layer for computing features over large graphs of data entities. This talk will outline the complexity of feature engineering from raw entity-level data, the reduction in complexity that comes with composable compute graphs, and an example of the working solution. We will also discuss a case study of the impact on a logistics & supply chain machine learning problem. If you work on large scale MLOps projects, this talk may be of interest.

4:35 pm - 5:35 pm MLOps

How to Migrate from Batch to Realtime Streaming Data

Christina Lin Developer Advocate at Redpanda

In this hands-on workshop, I’ll step you through how to address the migration from traditional batch processes to real-time data pipelines. Participants will gain insights into the differences between batch and real-time paradigms, learn practical steps, and receive hands-on guidance for shifting from batch to stream. * Replace batch data ingestion pipelines with CDC * Build a streaming ETL pipeline to replace Apache Spark * Build a stateless pipeline with a WASM engine * Build a stateful pipeline with Apache Flink

4:35 pm - 5:35 pm

Going From Unstructured Data to Vector Similarity Search

Steve Pousty Founder at Tech Raven Consulting

One of the key concepts used in AI modeling is the storage and querying of vectors. This workshop will start with two examples of unstructured data: images and journal abstracts. Participants will then work this data all the way through to a usable data store with an application on top of it. We will cover things such as transformers, embeddings, HuggingFace, choosing a vectorization model, and effective query composition. The Python code needed to do this is quite easy to understand. If you want to work on your own laptop, there will be prerequisites published beforehand. If not, we will use a cloud-hosted environment for you to do all your work. After this workshop you should be able to understand some of the concepts being used in these new AI architectures and have a better grasp of the work involved with building “AI” applications.
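To give a feel for what "querying vectors" means, here is a minimal sketch of cosine-similarity search in plain Python. The three-dimensional vectors and document IDs are invented for illustration; real embeddings would come from a vectorization model and have hundreds of dimensions, and a real application would use a vector database rather than a dictionary.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, store, top_k=2):
    # Brute-force search: score every stored vector, keep the best top_k.
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in store.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical embeddings for two journal abstracts and one image.
store = {
    "abstract_1": [0.9, 0.1, 0.0],
    "abstract_2": [0.0, 1.0, 0.2],
    "image_7": [0.8, 0.3, 0.1],
}
print(search([1.0, 0.2, 0.0], store))
```

Brute-force scoring like this is fine for a few thousand vectors; larger collections usually rely on approximate nearest-neighbor indexes.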

4:35 pm - 5:35 pm LLMs

Pink Elephants and Direct Principle Feedback

Louis Castricato Research Scientist at EleutherAI

Tutorial: This tutorial presents Direct Principle Feedback (DPF), a novel approach for fine-tuning large language models (LLMs) to dynamically obey new behavioral constraints at inference time. DPF addresses the Pink Elephant Problem, enabling models to avoid discussing specified unwanted topics ("Pink Elephants") while focusing on desired ones ("Grey Elephants"). By applying DPF with high-quality synthetic data, we teach models to effectively navigate complex content guidelines across multiple contexts, offering a significant advancement over traditional reinforcement learning methods for LLM control. Targeting professionals in fields requiring dynamic content control, such as edtech and social media, this session elucidates the process of generating synthetic preference data, the mechanics of DPF, and its application for enhancing LLM controllability. Participants will acquire the expertise to deploy LLMs capable of adapting to specific content guidelines, ensuring relevance and compliance in diverse deployment scenarios. Through this tutorial, attendees will gain insights into leveraging DPF for addressing not only the Pink Elephant Problem but also broader challenges in LLM behavior control, marking a step forward in the development of adaptable, context-aware AI systems.

10:00 am - 11:00 am Machine Learning

Causal AI: from Data to Action

Dr. Andre Franca CTO at connectedFlow

In this talk, we will explore and demystify the world of Causal AI for data science practitioners, with a focus on understanding cause-and-effect relationships within data to drive optimal decisions. We will focus on: * from Shapley to DAGs: the dangers of using post-hoc explainability methods as tools for decision making, and why traditional ML isn't suited to situations where we want to perform interventions on the system. * discovering causality: how do we figure out what is causal and what isn't, with a brief introduction to methods of structure learning and causal discovery * optimal decision making: by understanding causality, we can accurately estimate the impact we can make on our system. How do we use this knowledge to derive the best possible actions to take? This talk is aimed at both data scientists and industry practitioners who have a working knowledge of traditional statistics and basic ML. This talk will also be practical: we will provide you with guidance to immediately start implementing some of these concepts in your daily work.

10:00 am - 11:00 am LLMs

Operationalizing Local LLMs Responsibly for MLOps

Noah Gift Pioneering MLOps Leader & Author, Veteran Startup CTO, Duke Data Science & AI EIR

I. Introduction to LLMs (5 mins): Defining foundation of large language models; Use cases like search, content generation, programming
II. Architecting High-Performance LLM Pipelines (15 mins): Storing training data efficiently at scale; Leveraging specialized hardware accelerators; Optimizing hyperparameters for cost/accuracy; Serving inferences with low latency
III. Monitoring and Maintaining LLMs (10 mins): Tracking model accuracy and performance; Retraining triggers to stay performant; Evaluating inferences for bias indicators; Adding human oversight loops
IV. Building Ethical Guardrails for Local LLMs (10 mins): Auditing training data composition; Establishing process transparency; Benchmarking rigorously on safety; Implementing accountability for production systems
V. The Future of Responsible Local LLMs (5 mins): Advances that build trust and mitigate harms; Policy considerations around generative models; Promoting democratization through education

11:00 am - 12:00 pm Responsible AI

Strategies for Implementing Responsible AI Governance and Risk Management

Beatrice Botti VP – Chief Privacy Officer at DoubleVerify

Harnessing the transformative power of AI requires more than just algorithms. Join us for a deep dive into the essential principles and practices that mitigate risk, ensure compliance, and build trust in your AI initiatives. This session transcends the hype and cuts through the fear to equip you with practical tools and frameworks for responsible AI implementation. Bias, data privacy, copyright, and risk management loom large, often outpacing enterprises' ability to navigate them effectively. This session will empower data scientists and business leaders to build trustworthy AI systems by exploring the true harms and risks of AI systems, unpacking the principles of trustworthy AI, and discovering the key elements for robust governance and risk management practices. Learn how understanding data and AI governance can empower data scientists by streamlining workflows, minimizing rework due to data issues, and fostering a culture of responsible innovation. You will also gain the knowledge to effectively navigate the regulatory landscape and demystify the legal frameworks impacting AI, including emerging laws and regulations, and understand their implications for your data science team. Turn data and AI governance into your secret weapon. Build solutions that not only excel but also stand out in the marketplace, building competitive advantage, fostering trust and loyalty with every interaction, and differentiating your products and brand.

11:10 am - 12:10 pm Generative AI

Everything About Large Language Models: Pre-training, Fine-tuning, RLHF & State of the Art

Chandra Khatri VP, Head of AI at Krutrim

Generative Large Language Models like GPT-4 have revolutionized the entire tech ecosystem. But what makes them so powerful? What are the secret components that make them generalize to a variety of tasks? In this talk, I will present how these foundation models are trained. What are the steps and core components behind these LLMs? I will also cover how smaller, domain-specific models can outperform general-purpose foundation models like ChatGPT on target use cases.

11:10 am - 12:10 pm NLP

Machine Learning using PySpark for Text Data Analysis

Bharti Motwani Clinical Associate Professor at University of Maryland, USA

In this session, unsupervised machine learning algorithms such as cluster analysis and recommendation systems, and supervised machine learning algorithms such as Random Forest, Decision Tree, Bagging, and Boosting, will be discussed for analysis using PySpark. The main feature of this workshop will be the implementation of these algorithms on text data. Given the volume of reviews and other text data available on social media platforms, the availability and importance of text data analysis have grown manyfold. The session will be particularly helpful for startups and existing businesses that want to use AI to improve performance.

12:05 pm - 1:05 pm Machine Learning

Machine Learning with XGBoost

Matt Harrison Python & Data Science Corporate Trainer | Consultant at MetaSnake

This workshop will show how to use XGBoost. It will demonstrate model creation, model tuning, model evaluation, and model interpretation.
Session Outline: The XGBoost library is one of the most popular libraries with data scientists for creating predictive models with structured (or tabular) data. This workshop will cover the library, tuning it, evaluating models created by it, and understanding predictions from it. Attendees will have the chance to try it out with the labs.
* Installation and Jupyter - 10 min
* Creating Models - 30 min
* Lab - 20 min
* Model Evaluation - 15 min
* Model Tuning - 15 min
* Lab - 30 min
* Model Interpretation - 20 min
* Lab - 20 min

12:30 pm - 1:30 pm LLMs

Prompt Engineering with Llama 3

Amit Sangani Director of Partner Engineering at Meta

This session aims to provide hands-on, engaging content that gives developers a basic understanding of Llama 3 models: how to access and use them, how their architecture works, and how to build an AI chatbot using LangChain and Tools. The audience will also learn core concepts around Prompt Engineering and Fine-Tuning and programmatically implement them using Responsible AI principles. Lastly, we will conclude the talk by explaining how attendees can leverage this powerful tech, the different use cases, and what the future looks like.
Section 1: Understanding Llama 3
Familiarize yourself with Llama 3 models and architecture, how to download, install and access them, and the basic use cases they can accomplish. Additionally, we will review basic completion, system prompts and responses in different formats.
Section 2: Prompt Engineering and Chatbot
We will walk through the concepts of Prompt Engineering and chatbot architecture, including implementing single-turn and multi-turn chat requests, hallucinations and how to prevent them, augmenting external data using Retrieval Augmented Generation (RAG) principles and implementing all of this using LangChain. We will also review advanced concepts around Fine-Tuning.
Section 3: Responsible AI and Future
We will discuss the basic Responsible AI considerations as you build your Generative AI strategy and applications, safety measures to address context-specific risks, best practices for mitigating potential risks and more. We will also discuss what the future holds in the Generative AI space and give you a glimpse of what to expect from Llama offerings.
Prerequisites: Basic knowledge of Python and LLMs.

12:30 pm - 1:30 pm Machine Learning

Introduction to Linear Regression using Spreadsheets with Real Estate Data

Roberto Reif CEO and Founder at ScholarU

Over the course of this session, we'll embark on a deep dive into the foundational principles of linear regression, a statistical machine learning model that aids in unraveling the intricate relationships between two or more variables. Our unique focus centers on the practical application of linear regression using real-world real estate data, offering a concrete context that will undoubtedly resonate with participants. The workshop kicks off with a thorough overview of linear regression concepts, ensuring a collective understanding of the fundamentals. As we progress, we transition into the practical realm, employing popular spreadsheet tools like Excel or Google Sheets to conduct insightful real estate data analyses. Participants will master the art of data input, application of regression formulas, model building, and interpretation of results, enriching their analytical toolkit. The workshop's core revolves around a hands-on exploration of a real-world scenario. Together, we'll dissect a data set featuring crucial real estate variables such as property prices, square footage, number of bedrooms and bathrooms, and location. This pragmatic approach empowers participants to directly apply linear regression concepts to authentic situations commonly encountered in the dynamic field of real estate. Engagement is key throughout our workshop, featuring interactive exercises, group discussions, and dedicated Q&A sessions to reinforce comprehension. By the workshop's conclusion, participants will wield the skills to adeptly leverage the fundamental machine learning model of linear regression for making informed and predictive decisions in the realm of real estate. Whether you're a novice seeking an introduction to regression analysis or a seasoned analyst aiming to refine your skills, this workshop guarantees a stimulating and enlightening experience.
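For readers who want to see what a spreadsheet's regression formulas compute, here is a minimal sketch of a one-variable least-squares fit in plain Python. The square-footage and price figures are hypothetical toy values chosen for illustration, not data from the workshop, and `fit_line` is an illustrative helper name. The slope and intercept it returns are the same quantities that the `SLOPE` and `INTERCEPT` functions in Excel or Google Sheets report for a single predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: square footage and sale price.
sqft = [1000, 1500, 2000, 2500]
price = [200000, 275000, 350000, 425000]

slope, intercept = fit_line(sqft, price)
predicted = intercept + slope * 1800  # estimated price of an 1,800 sq ft home
print(slope, intercept, predicted)
```

With several predictors (bedrooms, bathrooms, location), the same idea generalizes to multiple regression, which spreadsheets expose through functions such as `LINEST`.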

1:40 pm - 2:40 pm Machine Learning

Feature Stores in Practice: Build and Deploy a Model with Featureform, Redis, Databricks, and Sagemaker

Simba Khadder Founder & CEO at Featureform

The term "Feature Store" often conjures a simplistic idea of a storage place for features. In reality, however, feature stores serve as robust frameworks and orchestrators for defining, managing, and deploying feature pipelines. The veneer of simplicity often masks the significant operational gains organizations can achieve by integrating the right feature store into their ML platform. This session is designed to peel back the layers of ambiguity surrounding feature stores, delineating the three distinct types and their alignment within a broader ML ecosystem. Diving into a hands-on section, we will walk through the process of training and deploying an end-to-end fraud detection model utilizing Featureform, Redis, Databricks, and Sagemaker. The emphasis will be on real-world, applicable examples, moving beyond concepts and marketing talk. This session aims to do more than just explain the mechanics of feature stores. It provides a practical blueprint to efficiently harness feature stores within ML workflows, effectively bridging the chasm between theoretical understanding and actionable implementation. Participants will walk away with a solid grasp of feature stores, equipped with the knowledge to drive meaningful insights and enhancements in their real-world ML platforms and projects.

1:40 pm - 2:40 pm Generative AI

Graphs: The Next Frontier of GenAI Explainability

Michelle Yi Board Member at Women In Data
Amy Hodler Founder, Consultant at GraphGeeks.org

In a world obsessed with making predictions and generative AI, we often overlook the crucial task of making sense of these predictions and understanding results. If we have no understanding of how and why recommendations are made, if we can’t explain predictions, we can’t trust our resulting decisions and policies. In the realm of predictions, explainability, and causality, graphs have emerged as a powerful model that has recently yielded remarkable breakthroughs. Graphs are purposefully designed to capture and represent the intricate connections between entities, offering a comprehensive framework for understanding complex systems. Leading teams use this framework today to surface directional patterns, compute complex logic, and as a basis for causal inference. This talk will examine the implications of incorporating graphs into the realm of generative AI, exploring the potential for even greater advancements. Learn about foundational concepts such as directed acyclic graphs (DAGs), Judea Pearl’s “do” operator, and keeping domain expertise in the loop. You’ll hear how the explainability landscape is evolving, comparisons of graph-based models to other methods, and how we can evaluate the different fairness models available. We’ll look into the open source PyWhy project for causal inference and the DoWhy method for modeling a problem as a causal graph with industry examples. By identifying the assumptions and constraints up front as a graph and applying that through each phase (modeling mechanisms, identifying targets, estimating causal effects, and refuting these with each inference), we can improve the validity of our predictions. We’ll also explore other open source packages that use graphs for counterfactual approaches, such as GeCo and Omega. Join us as we unravel the transformative potential of graphs and their impact on predictive modeling, explainability, and causality in the era of generative AI.

2:00 pm - 3:00 pm LLMs

Enabling Complex Reasoning and Action with ReAct, LLMs, and LangChain

Shelbee Eigenbrode Principal Machine Learning Specialist Solutions Architect at AWS
Giuseppe Zappia Principal Solutions Architect at AWS

ReAct is an approach that uses human reasoning traces to create action plans which determine the best action to take from a selection of available tools that are external to the LLM. This methodology mimics human chain-of-thought processes combined with the ability to engage with an external environment to solve problems and reduce the likelihood of hallucinations and reasoning errors. In this workshop, you will learn how to employ the ReAct technique to allow an LLM to determine where to find information to service different types of user queries, using LangChain to orchestrate the process. You’ll see how it uses Retrieval Augmented Generation (RAG) to answer questions based on external data, as well as other tools for performing more specialized tasks to enrich the output of your LLM. All demo code and presentation material will be provided, as well as a temporary Amazon SageMaker Studio environment to build and deploy in.

2:00 pm - 3:00 pm Machine Learning

From Chaos to Control: Mastering Machine Learning Reproducibility at Scale

Amit Kesarwani Director of Solution Architecture at lakeFS

Machine learning workflows are not linear: experimentation is an iterative, back-and-forth process between different components. This often involves experimenting with different data labeling techniques, data cleaning, preprocessing and feature selection methods during model training, just to arrive at an accurate model. Quality ML at scale is only possible when we can reproduce a specific iteration of the ML experiment, and this is where data is key. This means capturing the version of training data, ML code and model artifacts at each iteration is mandatory. However, to efficiently version ML experiments without duplicating code, data and models, data versioning tools are critical. Open-source tools like lakeFS make it possible to version all components of ML experiments without the need to keep multiple copies, and as an added benefit, save you storage costs as well. In this one-hour workshop, you'll achieve the following:
- Master ML Reproducibility: Gain practical experience to achieve full reproducibility for your ML experiments. Learn how to track changes to data, code, and models, allowing you to easily revisit and refine past experiments. This ensures you can recreate past successes and identify potential issues.
- Enhance Your Existing Stack: Discover how to integrate ML reproducibility seamlessly into your current ML experimentation tools, such as MLflow. Leverage open-source software to create a holistic version control experience, streamlining your workflow and ensuring the reliability of your ML pipelines.

2:50 pm - 3:50 pm Deep Learning

Topological Deep Learning: Going Beyond Graph Data

Dr. Mustafa Hajij Assistant Professor at University of San Francisco

Over the past decade, deep learning has been remarkably successful at solving a massive set of problems on datatypes including images and sequential data. This success drove the extension of deep learning to other discrete domains such as sets, point clouds, graphs, 3D shapes, and discrete manifolds. While many of the extended schemes have successfully tackled notable challenges in each domain, the plethora of fragmented frameworks has created or resurfaced many long-standing problems in deep learning such as explainability, expressiveness and generalizability. Moreover, theoretical development proven over one discrete domain does not naturally apply to the other domains. Finally, the lack of a cohesive mathematical framework has created many ad hoc and inorganic implementations and ultimately limited the set of practitioners that can potentially benefit from deep learning technologies. This talk introduces the foundation of topological deep learning, a rapidly growing field that is concerned with the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations including images and sequence data. It introduces the main notions while maintaining intuitive conceptualization, implementation and relevance to a wide range of practical applications. It also demonstrates the relevance of this framework with applications ranging from drug discovery to mesh and image segmentation.

3:30 pm - 4:30 pm ML Safety

Trojan Model Hubs: Hacking ML Supply Chains & Defending Yourself from Threats

Sam Washko Software Engineer at Protect AI
William Armiros Senior Software Engineer at Protect AI

Increasingly, ML practitioners have become reliant on public model hubs like Hugging Face for downloading foundation models to fine-tune. However, due to the open nature of model hubs, compromised artifacts are very easy to share and distribute. Most ML model formats are inherently vulnerable to Model Serialization Attacks (MSA), the injection of malicious code that will execute automatically when the model file is deserialized. MSAs are the Trojan horses of ML, capable of turning a seemingly innocuous model into a backdoor to your whole system. An attacker could download a popular model, inject malicious code, and upload it under a similar name to trick consumers. This problem is not purely theoretical: 3,354 public models on Hugging Face today are capable of arbitrary code execution upon deserialization, 41% of which are not flagged as unsafe by Hugging Face. Even beyond the risk of public registries, privately created models can also be subject to MSAs if their storage system is infiltrated and a safe model is swapped out for one that makes identical inferences but also executes malicious code. So, what can you do about it? In this talk, we will explore two strategies to mitigate the risk of MSAs and other attacks involving compromised artifacts: model scanning and cryptographic signing. Model scanning is our window into the black boxes that are model files. By scanning the model before deserialization, we can examine the operators and layers it uses to determine whether it contains suspicious code, without actually unpacking it and becoming vulnerable to the attack. In addition, cryptographic attestation can link an artifact to a source’s identity, backed up by a trusted authority. A model can be signed on creation, then whenever it’s used, users can verify the signature to establish integrity and authenticity. Scan results can also be signed, verifying that the creator ensured the model was safe from malicious code at the time of signing.
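To make the idea of scanning before deserialization concrete, here is a minimal sketch using only Python's standard library. It is not the speakers' tool or method: it simply walks the opcodes of a pickle byte stream with `pickletools.genops`, which parses the stream without executing it, and reports opcodes that import or call objects. The `scan_pickle` function, the opcode list, and the `Payload` class are illustrative assumptions.

```python
import pickle
import pickletools

# Opcodes that import objects or invoke callables during unpickling.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def scan_pickle(data: bytes) -> list:
    """Return suspicious opcode names found in the stream, without unpickling it."""
    found = []
    for opcode, _arg, _pos in pickletools.genops(data):
        if opcode.name in SUSPICIOUS_OPCODES:
            found.append(opcode.name)
    return found


class Payload:
    """Hypothetical stand-in for a tampered model object."""

    def __reduce__(self):
        # Unpickling this object would call print("pwned"); a real attack
        # would substitute a far more dangerous callable.
        return (print, ("pwned",))


safe_bytes = pickle.dumps({"weights": [0.1, 0.2, 0.3]})
unsafe_bytes = pickle.dumps(Payload())  # serializing does not run the payload

print(scan_pickle(safe_bytes))
print(scan_pickle(unsafe_bytes))
```

The plain dictionary produces no flagged opcodes, while the payload does. Legitimate model files also import classes when loaded, so a practical scanner would compare the imported names against an allowlist rather than flag every such opcode.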

3:30 pm - 5:30 pm

Lights, Camera, AI Action: Building a Movie Pitch by Combining Generative and Predictive AI with DataRobot

Luke Shulman Lead Data Scientist at DataRobot

Session Abstract Coming Soon!

4:35 pm - 5:35 pm Data Visualization

Unlocking Insights in Home Values: A Multimillion-Row Journey with Polars

Gleb Drobkov Co-Founder & Head of Strategy at Charles River Data
Mike Dezube Founder and CEO at Charles River Data

Join us for a hands-on data adventure exploring hidden insights and nuances across all home and building values in Massachusetts. With a dataset containing 2.5 million rows, this workshop will showcase the incredible capabilities of Polars, a data manipulation library which partners well with Pandas, in handling extensive data with a clean API, high performance and a low memory footprint, all on your local machine. Throughout the session, we'll demonstrate how Polars empowers users to perform nuanced analyses, such as pinpointing the most expensive homes in every town and on every street in Massachusetts, or unraveling the factors influencing home prices such as style, location, acreage, year built, and square footage. Whether you're a data enthusiast, analyst, or someone intrigued by the power of data analysis, this interactive workshop will leave you equipped to harness Polars' full potential for your own data exploration endeavors. Plus you’ll have a fun dataset of all home and building values (per tax assessment) at your fingertips. Time permitting, we’ll also do some GIS analysis on the dataset. Don't miss this opportunity to discover the stories hidden within the numbers and elevate your data analysis skills to new heights. Come prepared to write code in a Jupyter Notebook or JupyterLab, and leave with a working model and the full dataset.

4:35 pm - 5:35 pm

Introduction to Containers for Data Science / Data Engineering

Michael A Fudge Professor of Practice, MSIS Program Director at Syracuse University’s iSchool

In this hands-on session, participants will learn how to leverage containers for data science / data engineering workflows. Containers allow us to bundle our application dependencies and configuration into an image, which can be more easily shared with others and deployed to the cloud. The session will explain how to use and build images, configure and run them, and handle inter-dependencies between the product you're building and other services such as databases. Specifics covered:
- how containers work
- what their advantages are in data science / data engineering
- finding images on repositories (Docker Hub / quay.io)
- creating containers from an image and running them
- exposing resources like ports and volumes
- orchestration with docker-compose
- building / configuring / customizing images to include your specific project dependencies
- integrating your container with the Visual Studio Code editor
- containerizing dependent services like databases and integrating them with your project
Source code from the workshop will be available for attendees on GitHub.

4:35 pm - 5:35 pm

HPCC Systems® for Social Good – Safe Havens!

Bob Foreman Senior Software Engineer at LexisNexis Risk Solutions

Session Abstract Coming Soon!

11:00 am - 12:00 pm Deep Learning

End-to-End Deep Learning for Time Series Forecasting and Analysis

Isaac Godfried Senior Data Scientist at SimSpace

Deep learning has made major strides in computer vision and NLP. Transformers and LLMs now dominate most NLP applications both in industry and academia. Although not quite with the volume or hype of NLP, a large variety of academic research has also shown the utility of deep learning models such as the transformer or LSTMs for time series forecasting and classification. However, many companies still prefer to use simple methods like Linear Regression or ARIMA. Moreover, training and generalizing deep time series models to forecast/classify real-world business data still has a steep learning curve and on the surface is out of reach for many businesses. In this talk I will dive into several open source deep learning frameworks for time series (Flow Forecast, PyTorch Geometric Temporal, HuggingFace, etc.) that aim to ease the process of model selection, fine-tuning, feature selection and evaluation. We will walk through a real-world time series forecasting problem step-by-step with data collection, pre-processing, model training, hyper-parameter tuning, validation, and deployment. We will also discuss the tradeoffs of deep learning over traditional methods for time series forecasting and analysis. Finally, we will touch upon some open areas of groundbreaking research such as multi-modal time series forecasting, generalized pre-training of time series models (similar to NLP), and utilizing neural ordinary differential equations. Participants will leave with a solid understanding of how to get started utilizing OSS frameworks to forecast and analyze their time series data. They will also gain knowledge of what areas of research to monitor in the time series space and how these techniques will be applicable to their datasets.

12:05 pm - 1:05 pm Generative AI

Harnessing GPT Assistants for Superior Model Ensembles: A Beginner's Guide to AI Stacked Classifiers

Jason Merwin, PhD Data Scientist II | Western Governors University

Tutorial: OpenAI’s API allows users to programmatically create custom GPTs, referred to as Assistants, which can be instructed to write and execute code on provided data. This opens many exciting possibilities in data science, in particular the use of multiple Assistants to help build large scale, powerful machine learning ensemble methods that might otherwise be unfeasible. Model stacking is an advanced machine learning technique where multiple base models, typically of different types, are trained on the same data and their predictions used as input for a final "meta-model". While it is a powerful technique, stacking is generally impractical for most data scientists due to its heavy resource requirements and time-consuming architecture. However, by creating multiple AI Assistants through the API, these types of multi-model ensembles can be easily and quickly created. In this presentation, I will show how a single user with a beginner-level knowledge of Python can create a “swarm” of AI Assistants that train a series of models for use in a model-stacking ensemble classifier that outperforms traditional ML models on the same data. We will go over each step, from getting set up with the API, to orchestrating an AI swarm, to collecting their output for the final meta-model predictions.

9:00 am - 11:00 am Machine Learning

Virtual conference

Introduction to Math for Data Science

Thomas Nield Instructor at University of Southern California, Founder | Nield Consulting Group and Yawman Flight

With the availability of data, there is a growing demand for talent who can analyze and make sense of it. This makes practical math all the more important because it helps infer insights from data. However, mathematics comprises many topics, and it is hard to identify which ones are applicable and relevant for a data science career. Knowing these essential math topics is key to integrating knowledge across data science, statistics, and machine learning. It has become even more important with the prevalence of libraries like PyTorch and scikit-learn, which can create "black box" approaches where data science professionals use these libraries but do not fully understand how they work. In this training, Thomas Nield (author of the O'Reilly book "Essential Math for Data Science") will provide a crash course of carefully curated topics to jumpstart proficiency in key areas of mathematics. This includes probability, statistics, hypothesis testing, and linear algebra. Along the way you’ll integrate what you’ve learned and see practical applications for real-world problems. These examples include how statistical concepts apply to machine learning, and how linear algebra is used to fit a linear regression. We will also use Python to explore ideas in calculus and model-fitting, using a combination of libraries and from-scratch approaches.

9:00 am - 11:00 am Machine Learning

Data Wrangling with Python

Sheamus McGovern CEO and Software Architect, Data Engineer, and AI expert at ODSC

Data wrangling is the cornerstone of any data-driven project, and Python stands as one of the most powerful tools in this domain. In preparation for the ODSC conference, our specially designed course on “Data Wrangling with Python” offers attendees a hands-on experience to master the essential techniques. From cleaning and transforming raw data to making it ready for analysis, this course will equip you with the skills needed to handle real-world data challenges. As part of a comprehensive series leading up to the conference, this course not only lays the foundation for more advanced AI topics but also aligns with the industry’s most popular coding language. Upon completion of this short course attendees will be fully equipped with the knowledge and skills to manage the data lifecycle and turn raw data into actionable insights, setting the stage for advanced data analysis and AI applications.

11:30 am - 1:30 pm Data Visualization

Virtual conference

A Practical Introduction to Data Visualization for Data Scientists

Robert Kosara Data Visualization Developer at Observable

How does data visualization work, and what can it do for you? In this workshop, data visualization researcher and developer Robert Kosara will teach you the basics of how and why to visualize data, and show you how to create interactive charts using open-source tools. You'll learn…
- the fundamental building blocks of data visualization: visual variables, data mappings, etc.
- the difference between continuous and categorical data, and what it means for data visualization and the use of color
- what grammars of graphics are (the 'gg' in 'ggplot'!) and how they help make more interesting visualizations
- the basic chart types, how they work, and what they're best used for
- a few unusual chart types and when to use them
- how to prepare data for common data visualization tools
- how to build a simple interactive modeling tool that combines observed and modeled data in a single visualization
- when to use common charts vs. when to go for bespoke or unusual visualizations
We'll build all these visualizations using the open-source Observable Plot framework, but the concepts apply similarly to many others (such as ggplot, vega-lite, etc.). To follow along, you'll need a computer with an editor (such as Visual Studio Code) as well as a download of the project we provide (see the prerequisites).

11:30 am - 1:30 pm Machine Learning

Introduction to Machine Learning with Python

Sudip Shrestha, PhD Data Science Lead/ Sr. Manager at Asi Government

The "Introduction to Machine Learning with Python" training is designed for those seeking to understand the growing field of Machine Learning (ML), a key driver in today’s data-centric world. This training offers foundational knowledge in ML, emphasizing its importance in various industries for informed decision-making and technological advancements. Participants will learn about different ML types, including supervised and unsupervised learning, and explore the complete lifecycle of an ML model, from data preprocessing to deployment. The course highlights Python’s role in ML, introducing essential tools and libraries for algorithm implementation. A practical component involves hands-on implementation of an ML use case, consolidating theoretical knowledge with real-world application. Ideal for beginners, this course provides a comprehensive yet concise introduction to ML, equipping attendees with the skills to apply ML concepts effectively in diverse scenarios.

Machine Learning

Virtual conference

Data Primer Course (Self-paced)

ODSC Instructor

Data is the essential building block of Data Science, Machine Learning and AI. This course is the first in the series and is designed to teach you the foundational skills and knowledge required to understand, work with, and analyze data. It covers topics such as data collection, organization, profiling, and transformation as well as basic analysis. The course is aimed at helping people begin their AI journey and gain valuable insights that we will build up in subsequent SQL, programming, and AI courses.

Large Language Models

Virtual conference

Introduction to Large Language Models

ODSC Instructor

This hands-on course serves as a comprehensive introduction to Large Language Models (LLMs), covering a spectrum of topics from their differentiation from other language models to their underlying architecture and practical applications. It delves into the technical aspects, such as the transformer architecture and the attention mechanism, which are the cornerstones of modern language models. The course also explores the applications of LLMs, focusing on zero-shot learning, few-shot learning, and fine-tuning, which showcase the models’ ability to adapt and perform tasks with limited to no examples. Furthermore, it introduces the concept of flow chaining as a method to generate coherent and extended text, demonstrating its usefulness in tackling token limitations in real-world scenarios such as Q&A bots. Through practical examples and code snippets, participants are given hands-on experience in how to utilize and harness the power of LLMs across various domains. By utilizing the code notebooks included in this course, participants can code alongside the instructor for hands-on practice with LLMs.

Natural Language Processing

Virtual conference

Introduction to NLP

ODSC Instructor

Welcome to the Introduction to NLP workshop! In this workshop, you will learn the fundamentals of Natural Language Processing. From tokenization and stop word removal to advanced topics like deep learning and large language models, you will explore techniques for text preprocessing, word embeddings, classic machine learning, and cutting-edge NLP methods. Get ready to dive into the exciting world of NLP and its applications!

Machine Learning

Virtual conference

Introduction to R

ODSC Instructor

Dive into the world of R programming in this interactive workshop, designed to hone your data analysis and visualization skills. Begin with a walkthrough of the Colab interface, understanding cell manipulation and library utilization. Explore core R data structures like vectors, lists, and data frames, and learn data wrangling techniques to manipulate and analyze datasets. Grasp the basics of programming with iterations and function applications, transitioning into Exploratory Data Analysis (EDA) to derive insights from your data. Discover data visualization using ggplot2, unveiling the stories hidden within data. Lastly, get acquainted with RStudio, the robust Integrated Development Environment, enhancing your R programming journey. This workshop is your gateway to mastering R, catering to both novices and seasoned programmers.

Machine Learning

Virtual conference

Introduction to AI (Self-paced)

ODSC Instructor

This AI literacy course is designed to introduce participants to the basics of artificial intelligence (AI) and machine learning. We will first explore the various types of AI and then progress to understand fundamental concepts such as algorithms, features, and models. We will study the machine learning workflow and how it is used to design, build, and deploy models that can learn from data to make predictions. This will cover model training and types of machine learning, including supervised and unsupervised learning, as well as some of the most common models such as regression and k-means clustering. Upon completion, individuals will have a foundational understanding of machine learning and its capabilities and be well-positioned to take advantage of introductory-level hands-on training in machine learning and data science such as ODSC East’s Mini-Bootcamp.

9:30 am - 11:30 am LLMs

Should I Use RAG or Fine-Tuning? Building with Llama 3 and Arctic Embed

Chris Alexiuk Head of LLMs at AI Makerspace | Founding Machine Learning Engineer at Ox
Greg Loughnane Co-Founder & CEO at AI Makerspace

One question we get a lot as we teach students around the world to build, ship, and share production-grade LLM applications is “Should I use RAG or fine-tuning?” The answer is yes. You should use RAG AND fine-tuning, especially if you’re aiming at human-level performance in production. In 2024 you should be thinking about using agents too! To best understand exactly how and when to use RAG and Supervised Fine-Tuning (a.k.a. SFT or just fine-tuning), there are many nuances that we must consider! In this event, we’ll zoom in on prototyping LLM applications and describe how practitioners should think about leveraging the patterns of RAG, fine-tuning, and agentic reasoning. We’ll dive into RAG and how fine-tuned models and agents are typically leveraged within RAG applications. Specifically, we will break down Retrieval Augmented Generation into dense vector retrieval plus in-context learning. With this in mind, we’ll articulate the primary forms of fine-tuning you need to know, including task training, constraining the I-O schema, and language training in detail. We’ll also demystify the language behind the oft-confused terms agent, agent-like, and agentic by describing the simple meta-pattern of reasoning-action and its fundamental roots in if-then thinking. Finally, we’ll provide an end-to-end domain-adapted RAG application to solve a use case. All code will be demoed live, including what is necessary to build our RAG application with LangChain v0.1 and to fine-tune an open-source embedding model from Hugging Face!
You’ll learn:
- RAG and fine-tuning are not alternatives, but rather two pieces of the puzzle
- RAG, fine-tuning, and agents are not specific *things*. They are patterns.
- How to build a RAG application using fine-tuned domain-adapted embeddings
Who should attend the event?
- Any GenAI practitioner who has asked themselves “Should I use RAG or fine-tuning?”
- Aspiring AI Engineers looking to build and fine-tune complex LLM applications
- AI Engineering leaders who want to understand the primary patterns for GenAI prototypes
Module 1: The Patterns of GenAI
We will break down Retrieval Augmented Generation into dense vector retrieval plus in-context learning. With this in mind, we’ll articulate the primary forms of fine-tuning you need to know, including task training, constraining the I-O schema, and language training in detail. Finally, we’ll demystify the language behind the oft-confused terms agent, agent-like, and agentic by describing the simple meta-pattern of reasoning-action and its fundamental roots in if-then thinking.
Module 2: Building a simple RAG application with LangChain v0.1 and Llama 3
Leveraging LangChain Expression Language and LangChain v0.1, we’ll build a simple RAG prototype using OpenAI’s GPT 3.5 Turbo, OpenAI’s text-embedding-3-small, and a FAISS vector store!
Module 3: Fine-Tuning an Open-Source Embedding Model
Leveraging quantization via the bitsandbytes library, Low Rank Adaptation (LoRA) via the Hugging Face PEFT library, and the Massive Text Embedding Benchmark leaderboard, we’ll adapt the embedding space of our off-the-shelf model (Arctic Embed) to a particular domain!
Module 4: Constructing a Domain-Adapted RAG System
In the final module, we’ll assemble our domain-adapted RAG system, and discuss where we might leverage agentic reasoning if we kept building the system in the future!
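
The abstract above frames RAG as retrieval plus in-context learning. As a rough illustration only (not the presenters' code), the sketch below ranks a few documents by cosine similarity and pastes the best match into a prompt. The bag-of-words `embed` function is a toy stand-in for a learned dense embedding model, and `DOCS`, `retrieve`, and `build_prompt` are hypothetical names chosen for this example.

```python
import math
from collections import Counter

def embed(text, vocabulary):
    """Toy stand-in for an embedding model: a word-count vector (assumption)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    """Retrieval step: return the k documents most similar to the question."""
    vocabulary = sorted({w for d in documents + [question] for w in d.lower().split()})
    q_vec = embed(question, vocabulary)
    ranked = sorted(documents, key=lambda d: cosine(q_vec, embed(d, vocabulary)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """In-context learning step: place the retrieved text in front of the question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical mini knowledge base.
DOCS = [
    "LoRA adapts a model by training small low-rank matrices",
    "FAISS is a library for similarity search over dense vectors",
    "Boston hosts the conference in April",
]
```

In a real application the prompt returned by `build_prompt` would be sent to an LLM, and `embed` would be replaced by a trained (possibly fine-tuned) embedding model.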

9:30 am - 11:30 am Generative AI

Generative AI

Leonardo De Marchi VP of Labs at Thomson Reuters

Creativity is no longer exclusively human. This workshop is designed to explore how artificial intelligence can be used to generate creative outputs and to inspire technical audiences to use their skills in new and creative ways. The workshop will also include a series of code exercises designed to give participants hands-on experience working with AI models to generate creative outputs. Some of the exercises we will cover include:
- Generating poetry using NLP models like LSTM and Transformer.
- Creating digital art using computer vision models like Deep Dream and StyleGAN.
- Generating music using GANs and other AI models.
- Using reinforcement learning to generate creative outputs that match certain criteria or goals.
Overall, this workshop is ideal for technical audiences who are interested in exploring the creative possibilities of artificial intelligence. Participants should have a basic understanding of machine learning concepts and be comfortable coding in Python. Join us at odsc.com to discover new ways of using AI to create, innovate and inspire!
We will cover a variety of topics related to creativity in AI, including:
- Introduction to Creativity in AI: An overview of the different types of AI models and how they can be used to generate creative outputs.
- Natural Language Processing (NLP) for Creativity: A deep dive into how NLP can be used to generate creative outputs like poetry, song lyrics, and prose.
- Computer Vision for Creativity: How computer vision can be used to generate creative outputs like art and graphic design.
- Reinforcement Learning for Creativity: How reinforcement learning can be used to train AI models to generate creative outputs that match certain criteria or goals.
- Ethical and Legal Considerations in AI: The ethical implications of using AI to generate creative outputs and how to ensure that these models are used responsibly and ethically.
Tools: We will use OpenAI Gym to try our RL algorithms. OpenAI is a non-profit organization that wants to open source all their research on Artificial Intelligence. To foster innovation, OpenAI created a virtual environment, OpenAI Gym, where it’s easy to test reinforcement learning algorithms. In particular, we will look at popular techniques like Multi-Armed Bandit, SARSA and Q-Learning with practical Python examples.
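
For readers who have not met Q-Learning before, here is a minimal tabular sketch. It is an illustration under stated assumptions, not the workshop's material: instead of a Gym environment it uses a hypothetical five-state corridor defined inline (`step`), and the hyperparameters are arbitrary.

```python
import random

N_STATES = 5          # states 0..4 on a line; reaching state 4 ends the episode
ACTIONS = (-1, +1)    # action index 0 moves left, index 1 moves right

def step(state, action):
    """Hypothetical environment: move along the corridor, reward 1 at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(2)                       # explore
            else:
                a = 0 if q[state][0] > q[state][1] else 1  # exploit (ties go right)
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q
```

After training, the learned table prefers moving right in every non-terminal state, which is the shortest path to the goal in this toy corridor.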

11:00 am - 1:00 pm Machine Learning

Data Wrangling with SQL

Sheamus McGovern CEO and Software Architect, Data Engineer, and AI expert at ODSC

This SQL coding course teaches students the basics of Structured Query Language, which is a standard programming language used for managing and manipulating data and an essential tool in AI. The course covers topics such as database design and normalization, data wrangling, aggregate functions, subqueries, and join operations, and students will learn how to design and write SQL code to solve real-world problems. Upon completion, students will have a strong foundation in SQL and be able to use it effectively to extract insights from data. The ability to effectively access, retrieve, and manipulate data using SQL is essential for data cleaning, pre-processing, and exploration, which are crucial steps in any data science or machine learning project. Additionally, SQL is widely used in industry, making it a valuable skill for professionals in the field. This course builds upon the earlier data course in the series.
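
To make the listed topics concrete, the following sketch runs a join with aggregate functions and a subquery against an in-memory SQLite database from Python. The `customers` and `orders` tables and their rows are invented for illustration and are not taken from the course.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 10.0);
""")

# Join + aggregates: order count and total spend per customer.
# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL sum into 0.
totals = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
""").fetchall()

# Subquery: orders larger than the average order amount.
above_average = conn.execute("""
    SELECT id, amount FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()
```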

11:00 am - 4:30 pm Deep Learning

Deep Learning with PyTorch and TensorFlow

Dr. Jon Krohn Chief Data Scientist at Nebula.io

Deep Learning is ubiquitous today across data-driven applications as diverse as generative A.I., natural language processing, machine vision, and superhuman game-playing. This workshop is an introduction to Deep Learning that brings high-level theory to life with interactive examples featuring PyTorch, TensorFlow and Keras, all three of the principal Python libraries for Deep Learning. Essential theory will be covered in a manner that provides students with a complete intuitive understanding of Deep Learning’s underlying foundations. Paired with hands-on code demos in Jupyter notebooks as well as strategic advice for overcoming common pitfalls, this foundational knowledge will empower individuals with no previous understanding of artificial neural networks to train Deep Learning models following all of the latest best-practices.
Session Outline:
Lesson 1: The Unreasonable Effectiveness of Deep Learning
- Training Overview
- Introduction to Neural Networks and Deep Learning
- The Deep Learning Families and Libraries
Lesson 2: Essential Deep Learning Theory
- The Cart Before the Horse: A Shallow Neural Network
- Learning with Artificial Neurons
- TensorFlow Playground: Visualizing a Deep Net in Action
Lesson 3: Deep Learning with PyTorch and TensorFlow
- Revisiting our Shallow Neural Network
- Deep Nets
- Convolutional Neural Networks

12:00 pm - 2:00 pm Data Visualization

Visualization in Bayesian Workflow Using Python or R

Clinton Brownley, PhD Lead Data Scientist at Tala

Visualization can be a powerful tool to help you build better statistical models. In this tutorial, you will learn how to create and interpret visualizations that are useful in each step of a Bayesian regression workflow. A Bayesian workflow includes the three steps of (1) model building, (2) model interpretation, and (3) model checking/improvement, along with model comparison. Visualization is helpful in each of these steps – generating graphical representations of the model and plotting prior distributions aid model building, visualizing MCMC diagnostics and plotting posterior distributions aid interpretation, and plotting posterior predictive, counterfactual, and model comparisons aid model checking/improvement.

2:00 pm - 4:30 pm LLMs

Ben Needs a Friend - An intro to building Large Language Model applications

Benjamin Batorsky, PhD Data Science Consultant at d3lve

People say it’s difficult to make friends after college, impossible after grad school and just generally to give up after 30. Approaching 40, I’ve decided to take matters into my own hands. Rather than go outside and meet people, I’ve decided, like many top-tier companies, to replace all that manual work with AI. In this tutorial, I’ll show you how to make your own AI friend, powered by Large Language Models (LLMs). Along the way, we’ll cover some of the essential topics in LLM development. Our first step will be adjusting our new friend to our preferences based on prompt engineering and fine-tuning. Then, we will develop a “history” of our friendship using document embeddings and enable our friend to discuss that history (Retrieval-Augmented Generation). Finally, we will provide our friend with the tools it needs to be able to invite us to interesting local events. We’ll use the LangChain and transformers libraries to explore the pros and cons of different open and closed-source implementations in terms of cost and performance. The methods we’ll be using can be hosted locally and are either free or have minimal cost (e.g. OpenAI APIs). By the end of the tutorial, participants will have a basic familiarity with how to use the latest tools for LLM development and, for anything they’re not clear on, they can always ask their new AI friend for advice. We’ll conclude with a discussion of what our friend can and cannot do and why it may be better to just go outside more.

2:00 pm - 4:30 pm Machine Learning

Introduction to scikit-learn: Machine Learning in Python

Thomas J. Fan Senior Machine Learning Engineer at Union.ai

Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. We start this training by learning about scikit-learn's API for supervised machine learning. scikit-learn's API mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface helps to abstract away the algorithm, thus allowing us to focus on our domain-specific problems. First, we learn the importance of splitting your data into train and test sets for model evaluation. Then, we explore the preprocessing techniques on numerical, categorical, and missing data. We see how different machine learning models are impacted by preprocessing. For example, linear and distance-based models require standardization, but tree-based models do not. We explore how to use the Pandas output API, which allows scikit-learn's transformers to output Pandas DataFrames! The Pandas output API enables us to connect the feature names with the state of a machine learning model. Next, we learn about the Pipeline, which connects transformers with a classifier or regressor to build a data flow where the output of one model is the input of another. Lastly, we look at scikit-learn's Histogram-based Gradient Boosting model, which can natively handle numerical and categorical data with missing values. After this training, you will have the foundations to apply scikit-learn to your machine learning problems.
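
The fit/predict/transform interface and the Pipeline described above can be sketched in a few lines. This is a minimal example assuming scikit-learn is installed; the synthetic dataset and the choice of `StandardScaler` with `LogisticRegression` are illustrative and not taken from the training materials.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, split into train and test sets for honest evaluation.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline standardizes features (a transformer) and then fits a linear classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)            # fit: learn scaling statistics and coefficients

predictions = model.predict(X_test)    # predict: scaling is applied automatically
accuracy = model.score(X_test, y_test) # mean accuracy on the held-out data
```

Because the scaler lives inside the pipeline, its statistics are learned from the training split only, which avoids leaking test data into preprocessing.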

2:10 pm - 4:10 pm Generative AI

Generative AI, AI Agents, and AGI - How New Advancements in AI Will Improve the Products We Build

Martin Musiol Co-Founder and Instructor at Generative AI.net| Principal Data Science Manager at Infosys Consulting

This session is tailored for professionals seeking to master the fundamentals of generative AI. Our training covers a comprehensive range of topics, from the basics of text generation using advanced language models to the intricacies of image and 3-D object generation. Attendees will gain hands-on experience with cutting-edge tools, empowering them to become ten times more productive in their roles. A key component of our training is the exploration of autonomous agents. Participants will learn not only how these agents perform various tasks autonomously but also how to build one from the ground up. This segment paves the way to understanding the trajectory towards artificial general intelligence (AGI), a frontier in AI research. This session does not require prior experience in AI, making it accessible to a broad audience. However, it promises maximum knowledge gain, equipping attendees with practical skills and theoretical knowledge. By the end of the session, participants will be able to apply these insights directly to their roles, enhancing their contribution to the AI domain and their respective industries. It will be a comprehensive learning experience, ensuring attendees leave with a profound understanding of generative AI and its applications.

Machine Learning

Virtual conference

SQL Primer Course (Self-paced)

ODSC Instructor

This SQL coding course teaches students the basics of Structured Query Language, which is a standard programming language used for managing and manipulating data and an essential tool in AI. The course covers topics such as database design and normalization, data wrangling, aggregate functions, subqueries, and join operations, and students will learn how to design and write SQL code to solve real-world problems. Upon completion, students will have a strong foundation in SQL and be able to use it effectively to extract insights from data. The ability to effectively access, retrieve, and manipulate data using SQL is essential for data cleaning, pre-processing, and exploration, which are crucial steps in any data science or machine learning project. Additionally, SQL is widely used in industry, making it a valuable skill for professionals in the field. This course builds upon the earlier data course in the series.

Machine Learning

Virtual conference

Data Wrangling with Python (Self-paced)

ODSC Instructor

Data wrangling is the cornerstone of any data-driven project, and Python stands as one of the most powerful tools in this domain. In preparation for the ODSC conference, our specially designed course on “Data Wrangling with Python” offers attendees a hands-on experience to master the essential techniques. From cleaning and transforming raw data to making it ready for analysis, this course will equip you with the skills needed to handle real-world data challenges. As part of a comprehensive series leading up to the conference, this course not only lays the foundation for more advanced AI topics but also aligns with the industry’s most popular coding language. Upon completion of this short course, attendees will be fully equipped with the knowledge and skills to manage the data lifecycle and turn raw data into actionable insights, setting the stage for advanced data analysis and AI applications.

Prompt Engineering

Virtual conference

Prompt Engineering Fundamentals (self-paced)

ODSC Instructor

This workshop on Prompt Engineering explores the pivotal role of prompts in guiding Large Language Models (LLMs) like ChatGPT to generate desired responses. It emphasizes how prompts provide context, control output style and tone, aid in precise information retrieval, offer task-specific guidance, and ensure ethical AI usage. Through practical examples, participants learn how varying prompts can yield diverse responses, highlighting the importance of well-crafted prompts in achieving relevant and accurate text generation. Additionally, the workshop introduces temperature control to balance creativity and coherence in model outputs, and showcases LangChain, a Python library, to simplify prompt construction. Participants are equipped with practical tools and techniques to harness the potential of prompt engineering effectively, enhancing their interaction with LLMs across various contexts and tasks.
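
The effect of temperature mentioned above can be seen on a toy example. The sketch below is a generic softmax with temperature, not code from the workshop; the logits are made-up numbers standing in for a model's scores over three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # hypothetical scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)    # low temperature: more deterministic
flat = softmax_with_temperature(logits, 2.0)     # high temperature: more varied sampling
```

With the low temperature most probability mass sits on the top-scoring token, while the high temperature spreads it more evenly, which is the coherence versus creativity trade-off the description refers to.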

Large Language Models

Virtual conference

Build a Question & Answering Bot (self-paced)

ODSC Instructor

The workshop notebook delves into building a Question and Answering Bot based on a fixed knowledge base, covering the integration of concepts discussed in earlier notebooks about LLMs (Large Language Models) and prompting. Initially, it introduces a high-level architecture focusing on vector search, a method to retrieve similar items based on vector representations. The notebook explains the steps involved in vector search including vector representation, indexing, querying, similarity measurement, and retrieval, detailing various technologies used for vector search such as vector libraries, vector databases, and vector plugins. The example utilizes an open-source vector database, Chroma, to index data and uses State of the Union text data for the exercise. The notebook then transitions into the practical implementation, illustrating how text data is loaded, chunked into smaller pieces for effective vector search, and mapped into numeric vectors using the MPNetModel from the SentenceTransformer library via Hugging Face. Following this, the focus shifts to text generation, where LangChain chains are introduced. Chains, as described, allow for more complex applications by chaining several steps and models together into pipelines. A RetrievalQA chain is used to build a Q&A Bot application which utilizes an OpenAI chat model for text generation.
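
The chunking step described above can be illustrated without any vector database. The function below is a hedged sketch of fixed-size chunking with overlap; `chunk_words` and its default sizes are hypothetical and are not the notebook's actual splitter.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk would then be embedded and indexed; the overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.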

11:00 am - 1:00 pm LLMs

LLMs Meet Google Cloud: A New Frontier in Big Data Analytics

Rohan Johar Customer Engineer (AI)
Mohammad Soltanieh-ha, PhD Clinical Assistant Professor at Boston University

Dive into the world of cloud computing and big data analytics with Google Cloud's advanced tools and big data capabilities. Designed for industry professionals eager to master cloud-based big data tools, this workshop offers hands-on experience with various big data analytics tools, such as Dataproc, BigQuery, Cloud Storage, and Compute Engine. We will also dive into the new LLM capabilities of Google Cloud. Explore how these innovative AI models can extract deeper insights, generate creative text, and automate large-scale tasks, taking your big data analysis to the next level. Ideal for those new to cloud computing, the workshop includes lab offerings that provide practical experience in utilizing Google Cloud services. Participants must bring a laptop to access content and partake in hands-on labs. Upon completion, attendees will have a comprehensive understanding of fundamental cloud computing concepts, key Google Cloud services, and widely used big data tools such as Spark. Agenda: - Getting set up with GCP - Overview of educational resources - Hands-on labs - Cloud Compute and storage - BigQuery for data access and pre-processing - BigQuery for ML and LLM capabilities - Dataproc for distributed computing with Apache Spark - Conclusion

11:00 am - 1:00 pm Machine Learning

Developing Credit Scoring Models for Banking and Beyond

Aric LaBarr, PhD Associate Professor of Analytics at Institute for Advanced Analytics, NC State University

Classification scorecards are a great way to predict outcomes because the techniques used in the banking industry specialize in interpretability, predictive power, and ease of deployment. The banking industry has long used credit scoring to determine credit risk—the likelihood a particular loan will be paid back. However, the main aspect of credit score modeling is the strategic binning of variables that make up a credit scorecard. This strategic and analytical binning of variables provides benefits to any modeling in any industry that needs interpretable models. These scorecards are a common way of displaying the patterns found in a machine learning classification model—typically a logistic regression model, but any classification model will benefit from a scorecard layer. However, to be useful the results of the scorecard must be easy to interpret. The main goal of a credit score and scorecard is to provide a clear and intuitive way of presenting classification model results. This training will help the audience work through how to build successful credit scoring models in both R and Python. It will also teach the audience to layer the interpretable scorecard on top of these models for ease of implementation, interpretation, and decision making. After this training, the audience will have the knowledge to be able to build more complete models that are ready to be deployed and used for better decisions by executives.
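
The abstract does not say how binned variables are scored. One common choice in credit scoring, shown here purely as an assumption and not necessarily the approach taught in this training, is weight of evidence (WoE): the log ratio of a bin's share of good outcomes to its share of bad outcomes. The bin counts below are invented.

```python
import math

def weight_of_evidence(bins):
    """bins maps a bin label to (n_good, n_bad); returns WoE per bin.

    Uses the ln(share of goods / share of bads) convention; some texts flip the ratio.
    Assumes every bin has at least one good and one bad observation.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    return {
        label: math.log((good / total_good) / (bad / total_bad))
        for label, (good, bad) in bins.items()
    }

# Hypothetical binning of an income variable: (loans repaid, loans defaulted).
income_bins = {"low": (100, 50), "mid": (300, 40), "high": (600, 10)}
woe = weight_of_evidence(income_bins)
```

Positive values mark bins where good outcomes are over-represented, so a monotone pattern across ordered bins is easy to read and explain.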

11:00 am - 4:30 pm Generative AI

Generative A.I. with Open-Source LLMs: From Training to Deployment with Hugging Face and PyTorch Lightning

Dr. Jon Krohn Chief Data Scientist at Nebula.io

At an unprecedented pace, Large Language Models (like the GPT, Llama, Gemini and Gemma series) are transforming the world in general and the field of data science in particular. This training introduces deep learning transformer architectures, covering how LLMs are used for natural language processing, with a special focus on generative A.I. applications. Brought to life via hands-on code demos that leverage the Hugging Face Transformers and PyTorch Lightning Python libraries, this training covers the latest best-practices across the full lifecycle of LLM development, from training to production deployment. Session Outline: Module 1: Introduction to Large Language Models - Transformer Architectures Module 2: The Breadth of LLM Capabilities - OpenAI APIs, including GPT-4 Module 3: Training and Deploying LLMs - Hugging Face models - Training with PyTorch Lightning - Streaming data sets - Deployment considerations - Parameter-efficient fine-tuning (PEFT) with low-rank adaptation (LoRA) - Single-GPU models - Multiple GPUs Module 4: Getting Commercial Value from LLMs - Tasks that can be Automated - Tasks that can be Augmented - Guidance for Successful A.I. Teams and Projects Parts of this training will be accessible to anyone who would like to understand how to develop commercially-successful data products in the new paradigm unleashed by LLMs like GPT-4. To make the most of this training, attendees should be proficient in deep learning and Python programming.

2:00 pm - 4:30 pm Generative AI

Aligning Open-source LLMs Using Reinforcement Learning from Feedback

Sinan Ozdemir AI & LLM Expert | Author | Founder + CTO at LoopGenius

Unlock the full potential of open-source Large Language Models (LLMs) in our alignment workshop focused on using supervised fine-tuning (SFT) and reinforcement learning (RL) to optimize LLM performance. With LLMs like ChatGPT and Llama 2 revolutionizing the field of AI, mastering the art of fine-tuning these models for optimal human interaction has become crucial. Throughout the session, we will focus on the core concepts of LLM fine-tuning, with a particular emphasis on reinforcement learning mechanisms. Engaging in hands-on exercises, attendees will gain practical experience in data preprocessing, quality assessment, and implementing reinforcement learning techniques for manual alignment. This skill set is especially valuable for achieving instruction-following capabilities and much more. The workshop will provide a comprehensive understanding of the challenges and intricacies involved in aligning LLMs. By learning to navigate through data preprocessing and quality assessment, participants will gain insights into identifying the most relevant data for fine-tuning LLMs effectively. Moreover, the practical application of reinforcement learning techniques will empower attendees to tailor LLMs for specific tasks, ensuring enhanced performance and precision in real-world applications. By the workshop's conclusion, attendees will be well-equipped to harness the power of open-source LLMs effectively, tailoring their models to meet the specific demands of their industries or domains. Don't miss out on this opportunity to learn how to create your very own instruction-aligned LLM and enhance your AI applications like never before!
Lesson 1: Understanding Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF)
Attendees will be introduced to the mechanisms of Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF). Throughout this lesson, participants will grasp the core concepts and strategies involved in collecting and leveraging human feedback to align language models. At the end of this lesson, attendees will have a comprehensive understanding of how RLHF and RLAIF can be applied to optimize language models effectively.
Lesson 2: Aligning FLAN-T5 for Customized Summaries
Participants will be guided through the process of aligning FLAN-T5, a language model, to generate more customized summaries. We will see techniques for aligning the model with specific data or user instructions, enabling FLAN-T5 to produce summaries that cater to individual requirements. By the end of this lesson, participants will be equipped with the skills to create highly personalized summaries with FLAN-T5.
Lesson 3: Fine-Tuning Open Source Llama 2 for Instruction Following
In this lesson, participants will focus on aligning the open-source Llama 2 to follow instructions. We will see how to optimize Llama 2's behavior by manually aligning it with high-quality data and see the caveats on how pre-training data can affect an aligned model's behavior. By the end of this lesson, participants will have the practical knowledge to effectively fine-tune open source LLMs using supervised fine-tuning and reinforcement learning to improve their ability to understand and respond to instructions.
Learning Objectives:
- Understanding RLHF and RLAIF: Attendees will gain a clear understanding of the mechanisms behind Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF). They will see how these approaches can be leveraged to align language models.
- Aligning Open-Source LLMs for Customized Use: Participants will learn practical techniques for aligning models with specific data or user instructions to produce tailored generations.
- Applying RLHF and RLAIF Techniques: After the session, participants will be equipped to apply RLHF and RLAIF techniques to various language models. They will explore real-world use cases across different industries, discovering how to leverage these approaches for human feedback and iterative fine-tuning, leading to more specialized and efficient language models.
Open Source Tools: During the presentation, we will primarily use the following open-source tools:
- Hugging Face Transformers Library: This library offers a range of pre-trained language models, including FLAN-T5 and Llama 2, and allows fine-tuning and alignment for various natural language processing tasks.
- TRL Library: This library can be used for Reinforcement Learning and can be integrated with code using the Transformers library.
- Colab: We will conduct hands-on exercises and demonstrations using Colab, providing an interactive environment for attendees to follow along with the practical aspects of the session.
- GitHub: All code, examples, and resources used in the session will be made available on GitHub, allowing participants to access and refer back to the materials for further exploration and self-study after the workshop.
By the end of the session, attendees will be equipped with valuable knowledge and practical skills to align and fine-tune language models effectively using RLHF and RLAIF approaches, unlocking the full potential of these models for various language processing tasks and applications.
Background Knowledge:
- Loading and creating generations with LLMs using the Transformers library
- Fine-tuning LLMs using labeled data in a supervised manner

Large Language Models

Virtual conference

Retrieval-Augmented Generation (self-paced)

ODSC Instructor

Retrieval-Augmented Generation (RAG) is a powerful natural language processing (NLP) architecture introduced in this workshop notebook. RAG combines retrieval and generation models, enhancing language understanding and generation tasks. It consists of a retrieval component, which efficiently searches vast text databases for relevant information, and a generation component, often based on Transformer models, capable of producing coherent responses based on retrieved context. RAG’s versatility extends to various NLP applications, including question answering and text summarization. Additionally, this notebook covers practical aspects such as indexing content, configuring RAG chains, and incorporating prompt engineering, offering a comprehensive introduction to harnessing RAG’s capabilities for NLP tasks.

Large Language Models

Virtual conference

Parameter Efficient Fine tuning (self-paced)

ODSC Instructor

For the next workshop, our focus will be on parameter-efficient fine-tuning (PEFT) techniques in the field of machine learning, specifically within the context of large neural language models like GPT or BERT. PEFT is a powerful approach that allows us to adapt these pre-trained models to specific tasks while minimizing additional parameter overhead. Instead of fine-tuning the entire massive model, PEFT introduces compact, task-specific parameters known as “adapters” into the pre-trained model’s architecture. These adapters enable the model to adapt to new tasks without significantly increasing its size. PEFT strikes a balance between model size and adaptability, making it a crucial technique for real-world applications where computational and memory resources are limited, while still maintaining competitive performance. In this workshop, we will delve into the different PEFT methods, such as additive, selective, re-parameterization, adapter-based, and soft prompt-based approaches, exploring their characteristics, benefits, and practical applications. We will also demonstrate how to implement PEFT using the Hugging Face PEFT library, showcasing its effectiveness in adapting large pre-trained language models to specific tasks. Join us to discover how PEFT can make state-of-the-art language models more accessible and practical for a wide range of natural language processing tasks.
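
To give a feel for why PEFT saves parameters, the arithmetic below compares a full update of one weight matrix with a low-rank update in the style of LoRA, one of the re-parameterization methods. The layer size and rank are illustrative assumptions, and the helper names are hypothetical; this does not use the Hugging Face PEFT library.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is fine-tuned."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for a low-rank update B @ A, with B: d_out x rank and A: rank x d_in."""
    return rank * (d_out + d_in)

d = 4096                                  # assumed hidden size of one projection layer
full = full_update_params(d, d)           # 4096 * 4096
low_rank = lora_params(d, d, rank=8)      # 8 * (4096 + 4096)
ratio = low_rank / full                   # fraction of parameters that remain trainable
```

Under these assumptions the low-rank update trains 1/256 of the parameters of the full matrix, while the original weights stay frozen.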

Large Language Models

Virtual conference

LangChain Agents (self-paced)

ODSC Instructor

The “LangChain Agents” workshop delves into the “Agents” component of the LangChain library, offering a deeper understanding of how LangChain integrates Large Language Models (LLMs) with external systems and tools to execute actions. This workshop builds on the concept of “chains,” which can link multiple LLMs to tackle various tasks like classification, text generation, code generation, and more. “Agents” enable LLMs to interact with external systems and tools, making informed decisions based on available options. The workshop explores the different types of agents, such as “Zero-shot ReAct,” “Structured input ReAct,” “OpenAI Functions,” “Conversational,” “Self ask with search,” “ReAct document store,” and “Plan-and-execute agents.” It provides practical code examples, including initializing LLMs, defining tools, creating agents, and demonstrates how these agents can answer questions using external APIs, offering participants a comprehensive overview of LangChain’s agent capabilities.
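
The reason-then-act loop behind agents such as ReAct can be sketched without LangChain. Everything below is a toy under stated assumptions: `scripted_model` is a hard-coded stand-in for an LLM, `calculator` is a hypothetical tool, and none of this is LangChain's API.

```python
def calculator(expression):
    """Hypothetical tool: evaluate 'a op b' where op is +, - or *."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])

TOOLS = {"calculator": calculator}

def scripted_model(question, observations):
    """Stand-in for an LLM: choose a tool call first, then answer from the observation."""
    if not observations:
        return {"action": "calculator", "input": "6 * 7"}
    return {"final_answer": f"The result is {observations[-1]}"}

def run_agent(question, model, tools, max_steps=5):
    """Loop: ask the model what to do, run the chosen tool, feed the result back."""
    observations = []
    for _ in range(max_steps):
        decision = model(question, observations)
        if "final_answer" in decision:
            return decision["final_answer"]
        tool = tools[decision["action"]]
        observations.append(tool(decision["input"]))
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("What is 6 times 7?", scripted_model, TOOLS)
```

In a real agent the model's decision would come from an LLM prompted with the tool descriptions and the observations gathered so far.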

Large Language Models

Virtual conference

Fine Tuning an Existing LLM (self-paced)

ODSC Instructor

The workshop explores the process of fine-tuning Large Language Models (LLMs) for Natural Language Processing (NLP) tasks. It highlights the motivations for fine-tuning, such as task adaptation, transfer learning, and handling low-data scenarios, using a Yelp Review dataset. The notebook employs the HuggingFace Transformers library, including tokenization with AutoTokenizer, data subset selection, and model choice (BERT-based model). Hyperparameter tuning, evaluation strategy, and metrics are introduced. It also briefly mentions DeepSpeed for optimization and Parameter Efficient Fine-Tuning (PEFT) for resource-efficient fine-tuning, providing a comprehensive introduction to fine-tuning LLMs for NLP tasks.

Large Language Models

Virtual conference

Fine Tuning Embedding Models (self-paced)

ODSC Instructor

This workshop explores the importance of fine-tuning the embedding models used alongside Large Language Models (LLMs). Embedding models map natural language to vectors, and in pipelines that combine multiple models, fine-tuning them helps the pipeline adapt to the nuances of specific data. An example demonstrates fine-tuning an embedding model for legal text. The notebook discusses existing solutions and hardware considerations, emphasizing GPU usage for large data. The practical part of the notebook walks through fine-tuning the “distilroberta-base” model with the SentenceTransformer library. For training, it uses the QQP_triplets dataset from Quora, which is designed around semantic meaning. The notebook prepares the data, sets up a DataLoader, and employs Triplet Loss to encourage the model to map similar data points close together while pushing dissimilar ones apart. It concludes by noting the training duration and the resources needed for further improvements.
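The Triplet Loss mentioned above can be illustrated with a minimal pure-Python sketch. It assumes Euclidean distance and an arbitrary margin of 1.0, and it uses hand-written 2-D vectors in place of real sentence embeddings; the workshop notebook's actual settings are not stated in the abstract.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is; otherwise the size of the shortfall."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Toy vectors standing in for embeddings (assumed values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]    # distance 1.0 from the anchor

well_separated = triplet_loss(anchor, positive, [3.0, 4.0])  # negative at distance 5.0
too_close = triplet_loss(anchor, positive, [0.0, 1.5])       # negative at distance 1.5

print(well_separated, too_close)
```

The first triplet already satisfies the margin and contributes no loss, while the second is penalized because the negative sits only 0.5 farther away than the positive.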

Prompt Engineering

Virtual conference

Prompt Engineering with OpenAI (self-paced)

ODSC Instructor

This workshop on prompt engineering with OpenAI discusses best practices for using OpenAI models. We will review how to separate instructions and context using special characters, which improves instruction clarity, isolates context, and enhances control over the generation process. The workshop also includes code for installing the langchain library and demonstrates how to create prompts effectively, emphasizing the importance of clarity, specificity, and precision. Additionally, the workshop shows how to craft prompts for specific tasks, such as extracting entities from text. It provides prompt templates and highlights the value of specifying the desired output format through examples for improved consistency and customization. Lastly, the workshop addresses the importance of using prompts as safety guardrails. It introduces prompts that mitigate hallucination and jailbreaking risks by instructing the model to generate well-supported and verifiable information, thereby promoting responsible and ethical use of language models.
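As a small illustration of separating instructions from context with special characters, the sketch below assembles a prompt string for an entity-extraction task. The `###` delimiter, the `build_prompt` helper, and the sample text are assumptions made for this example; no model is called, and the workshop's own templates may differ.

```python
DELIMITER = "###"  # assumed delimiter; any distinctive marker would do


def build_prompt(instruction: str, context: str, output_format: str) -> str:
    """Put the instruction and desired format first, then fence the context
    between delimiters so it cannot be mistaken for instructions."""
    return (
        f"{instruction.strip()}\n"
        f"Desired format: {output_format.strip()}\n\n"
        f"Text: {DELIMITER}\n{context.strip()}\n{DELIMITER}"
    )


prompt = build_prompt(
    instruction="Extract the company names mentioned in the text below.",
    context="Avanade and Leidos both sent speakers to the session.",
    output_format="a comma-separated list of company names",
)
print(prompt)
```

Stating the output format explicitly, as the abstract recommends, makes the response easier to parse consistently.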

6:00 pm - 8:00 pm Restaurant

Mini-Bootcamp Dinner (bootcamp/vip/biz)

6:00 am - 7:00 am

Morning Run

10:00 am - 5:00 pm

AI Expo/Demo Hall

10:30 am - 11:00 am

Book Author: Effective Pandas: Patterns for Data Manipulation

Matt Harrison Python & Data Science Corporate Trainer | Consultant at MetaSnake

12:00 pm - 1:00 pm Only for invited attendees

Round-Table Discussion: Navigating the AI Revolution in Regulated Industries

As AI shifts from the lab to mainstream application, the regulatory landscapes of financial services and healthcare face both unprecedented opportunities and challenges. This round-table discussion invites data scientists and business managers to explore AI’s transformative impact within these highly regulated sectors. Like the mobile app journey to becoming a mainstream technology, we’ll engage in a dynamic conversation about leveraging AI’s emerging potential. Topics will range from regulatory navigation and AI ethics, to achieving significant ROI and enhancing cybersecurity, all aimed at fostering a holistic understanding of AI’s role in driving innovation while adhering to stringent regulations.

Session Goals:
  • To catalyze a dynamic exchange of ideas and experiences related to AI in regulated environments.
  • To outline actionable insights for navigating regulatory complexities with AI.
  • To discuss ethical considerations, technical challenges, and strategies for maximizing AI’s ROI.

Opening Remarks
  • Brief introduction to the session’s theme and objectives.
  • Quick round of introductions among participants to foster a collaborative atmosphere.

Navigating Regulatory Complexities
  • Sharing experiences with regulatory challenges and successes in AI implementations.
  • Discussing strategies for AI compliance and data protection in sensitive environments.

Technical and Ethical Challenges
  • Conversations around overcoming AI’s “black box” issue and ensuring transparent, accountable AI use.
  • Addressing AI bias, cybersecurity threats unique to AI, and integrating AI with legacy systems.

Achieving and Measuring ROI
  • Exchanging insights on the potential for ROI from AI initiatives and best practices for measuring success.
  • Sharing perspectives on the economic impact of AI adoption within regulated sectors.

Vision for the Future
  • Speculating on upcoming trends and the evolution of AI in financial services and healthcare.
  • Discussing the role of collaborative efforts and think tanks in guiding the future of AI adoption.

Closing Thoughts
  • Summarizing key insights and takeaways from the discussion.
  • Encouraging ongoing dialogue and networking among participants post-session.

2:00 pm - 4:00 pm

Women in Data Science Ignite: Sharing Insights and Networking Session

Yi-Chun Lai Student at North Carolina State University
DeAnna Duval Senior Manager, Intelligence Engagement at CompTIA
Roopa Vasan Chief AI Architect at Leidos
Madiha Shakil Mirza NLP Engineer at Avanade

This session is designed to empower and connect women working in the exciting field of data science. Whether you're a seasoned professional or just starting out, this is your chance to:
  • Gain valuable insights from experienced data scientists on a variety of topics relevant to the field (for example, career advice, overcoming challenges, and best practices in specific areas of data science).
  • Share your own experiences and perspectives in an open and supportive environment.
  • Build connections with other women in data science, fostering collaboration and mentorship opportunities.
This session is a great opportunity to learn, grow, and be inspired by the incredible women shaping the future of data science.

5:00 pm - 7:00 pm

Networking Reception (VIP)

6:00 pm - 8:00 pm City Table Restaurant & Bar

Ai Social

6:00 am - 7:00 am

Morning Run

10:30 am - 11:00 am

Book Author: Quick Start Guide to Large Language Models

Sinan Ozdemir AI & LLM Expert | Author | Founder + CTO at LoopGenius

11:00 am - 5:00 pm

AI Expo/Demo Hall

1:30 pm - 2:00 pm

Meet the Speaker

Allen Downey, PhD Curriculum Designer at Brilliant.org | Professor Emeritus at Olin College

3:00 pm - 3:30 pm

Book Author: Hands-On Data Analysis with Pandas - Second Edition: A Python Data Science Handbook for Data Collection, Wrangling, Analysis, and Visualization

Stefanie Molin Data Scientist, Software Engineer, Author of Hands-On Data Analysis with Pandas at Bloomberg

5:00 pm - 7:00 pm

Main Networking Reception

5:45 pm - 9:30 pm

Roulette Lightning Talks

Hailey Schoelkopf Research Scientist at EleutherAI
Matt Dzugan Director of Data at Muck Rack
Sinan Ozdemir AI & LLM Expert | Author | Founder + CTO at LoopGenius

The East 2024 Roulette Lightning Talks offer a unique twist on presentations, injecting a dose of surprise and fun into the conference: speakers must be prepared to speak about any slide in the presentation, whether or not it is their own.

6:00 pm - 8:00 pm

Data After Dark

Unwind after a stimulating day at ODSC with fellow AI and data science enthusiasts! Join us for casual drinks, happy hours, and AI Trivia. It’s a great opportunity to network in a relaxed setting, share insights, and forge new connections. Our venues include AI Trivia Night, Live Music & Networking, and Happy Hour Specials.

6:00 am - 7:00 am

Morning Run

6:00 pm - 8:00 pm City Table Restaurant & Bar

Ai Social

Additional 80+ Sessions Coming Soon!

Last Chance | Join Us Live

ODSC IS live

ODSC Pass Guide, Schedule Overview, and FAQs

Which Sessions Are Included in My Pass?

  • ODSC Talks/Keynotes are scheduled from Tuesday, April 23 – Thursday, April 25. In-person sessions are available to Silver, Gold, Platinum, Mini-Bootcamp, and VIP Pass holders. Business talks are available to Ai x Pass holders. Virtual Sessions are available to Virtual Premium, Virtual Platinum & Virtual Mini-Bootcamp pass holders.
  • ODSC Trainings are scheduled from Monday, April 22 – Wednesday, April 24. In-person sessions are available to Platinum, Mini-Bootcamp, and VIP Pass holders. Virtual Sessions are available to Virtual Platinum & Virtual Mini-Bootcamp pass holders.
  • ODSC Workshop/Tutorials are scheduled from Tuesday, April 23 – Thursday, April 25. All in-person sessions are available to VIP, Platinum, Mini-Bootcamp, and Gold pass holders. Silver Pass holders can attend only on Wednesday and Thursday. Virtual Sessions are available to Virtual Premium, Virtual Platinum & Virtual Mini-Bootcamp pass holders.
  • ODSC Bootcamp Sessions are scheduled VIRTUALLY on Monday, April 22nd, as pre-conference training. They are ONLY available to Mini-Bootcamp, VIP, and Virtual Mini-Bootcamp pass holders.

Only virtual sessions are recorded. If you have a virtual pass, please note that we will not live-stream any in-person sessions. All self-paced sessions are also available virtually.

Do I need to pre-register for the session?

Access to all sessions is on a first-come, first-served basis.* Once you have registered, no further action is required to register for specific sessions.

However, you can take steps to ensure that you are on time and in the right place, including:

    • Picking up your badge early during pre-registration
    • Arriving early to sessions
    • Using the app to view session times and rooms and receive real-time notifications of changes

Important note regarding Training Sessions: Training sessions will start on time as scheduled, and late arrivals may not be granted access to the room once the session is in progress. Please plan to arrive early.

*A limited number of seats in each session room will be reserved for VIP Pass holders.

Where Can I See the Schedule Overview?

The ODSC Schedule overview is available on this page

Event Questions?

Attendee guidelines and FAQs can be found HERE

Deep Dive Hands-on Training That Builds Certifiable Job-Ready Skills

     Expanded Training: 60+ hands-on tutorials, workshops, and training sessions

     Leading Experts: Taught by top instructors who are experienced practitioners in AI and ML

     More Choice: Choose from 2-, 3-, and 4-day passes that include IN-PERSON and VIRTUAL options

     Breadth and Depth: Beginner to expert sessions ensure we have all levels covered

Receive your Certification at our on-site Certification Desk

See Workshops and training
AI bootcamp : Learn more

Get Certified with ODSC West 2023

Showcase your new skill sets with certificate courses from ODSC and Ai+
  • ODSC West 2023 Mini-Bootcamp Certification of Completion
  • Ai+ Training LLM and Generative AI Certificate Course (Included in Bootcamp and VIP Passes)
  • Ai+ Training Machine Learning Certificate Course (Included in Bootcamp and VIP Passes)

ODSC Newsletter

Stay current with the latest news and updates in open source data science. In addition, we’ll inform you about our many upcoming virtual and in-person events in Boston, NYC, Sao Paulo, San Francisco, and London. Keep a lookout for special discount codes, available only to our newsletter subscribers!

Open Data Science
One Broadway
Cambridge, MA 02142
