Conference Time: ET (UTC-4). Please note that in-person sessions will not be live-streamed; only virtual sessions will be recorded.

9:30 am - 10:00 am Data Engineering

The 12 Factor App for Data

James Bowkett Technical Delivery Director at OpenCredo

DE Summit: Data is everywhere, and so too are data-centric applications. As the world becomes increasingly data-centric and data volumes grow, data engineering will become more and more important. If we're going to be dealing with petabytes of data, it is better to get the fundamentals in place before we start than to retrofit best practices onto mountains of data, which only makes a difficult job harder. The 12-factor app helped to define how we think about and design cloud native applications. In this presentation, I will discuss 12 principles of designing data-centric applications that have helped me over the years, across 4 categories: Architecture & Design, Quality & Validation (Observability), Audit & Explainability, and Consumption. These principles have led to our teams delivering data platforms that are both testable and well-tested. The 12 factors also enable platforms to be upgraded in a safe and controlled manner and help them get deployed quickly, safely and repeatedly. This talk will be filled with examples and counterexamples from the course of my career and the projects that my teams have seen over the years. It will incorporate software engineering best practices and how these apply to data-centric engineering. We hope that you can benefit from some of our experience to create higher quality data-centric applications that scale better and get into production quicker.

11:20 am - 11:50 am Generative AI

Generative AI for Social Good

Colleen Molloy Farrelly Chief Mathematician at Post Urban Ventures

This talk will focus on current generative AI methods, including image and text generation, with a focus on social good applications, including medical imaging applications, diversity training applications, public health initiatives, and underrepresented language applications. We'll start with an overview of common generative AI algorithms for image and text generation before launching into a series of case studies with more specific algorithm overviews and their successes on social good projects. We'll explore an algorithm called TopoGAN that is being used to augment medical image samples. We'll look at GPT-4 and open-source large language models (LLMs) that can generate cases of bias and fairness. We'll consider how language translation and image generators such as Stable Diffusion can quickly produce public health campaign material. Finally, we'll explore language generation with low-resource languages like Hausa and Swahili, highlighting the potential for language applications in the developing world to aid businesses, governments, and non-profits communicating with local populations. We'll end the talk with a discussion of ethical generative AI and potential for misuse. Learning outcomes include familiarity with common generative AI algorithms and sources, their uses in a variety of settings, and ethical considerations when developing generative AI algorithms. This will equip programming-oriented data scientists with a background to implement algorithms themselves and business-focused analytics professionals with a background to consider strategic initiatives that might benefit from generative AI.

12:10 pm - 12:40 pm Large Language Models

CodeLlama: Open Foundation Models for Code

Baptiste Roziere Research Scientist at Meta

In this session, we will present the methods used to train Code Llama, the performance we obtained, and show how you could use Code Llama in practice for many software development use cases. Code Llama is a family of open large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B, and now 70B parameters each. Code Llama reaches state-of-the-art performance among open models on several code benchmarks. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other open model on MultiPL-E. Code Llama was released under a permissive license that allows for both research and commercial use.

12:50 pm - 1:20 pm Data Visualization

Build-a-Byte: Constructing Your Data Science Toolkit

Jarai Carter, PhD Senior Manager at John Deere

Are you getting started in data science and want to know more about what tools are out there? Or maybe you have used the same tools for a while and you want to try something new? Then, get ready for a deep dive into building your very own data science toolkit! This talk is aimed at beginners who are entering this dynamic field and want to expand their data science tool knowledge. I will highlight 8 important components of a data science toolkit (since there are 8 bits in a byte), such as programming languages, integrated development environments (IDEs), text editors, online resources, and more. You will be guided through essential components of an effective toolkit, such as exploring the differences between your choice of coding playgrounds like Jupyter Notebooks and RStudio. I will navigate through foundational data manipulation libraries such as Pandas and dplyr, and showcase some of my favorite data visualization libraries including Seaborn and ggplot2. As I venture into the realm of machine learning, the audience will gain insights into some of the simplest ways to turn algorithms into models using scikit-learn and caret. The importance of stellar documentation will be emphasized because how else will we be effective technical communicators and collaborators if we cannot explain our work? At the end of the talk, you can expect to leave with not only a fundamental understanding of each toolkit component but also practical insights and resources so you can embark on your own toolkit journey. So, join me in constructing your data science toolkit and build those data masterpieces!

1:10 pm - 1:40 pm Data Engineering

From Research to the Enterprise: Leveraging Large Language Models for Enhanced ETL, Analytics, and Deployment

Ines Chami Co-founder and Chief Scientist at NUMBERS STATION AI

As Foundation Models (FMs) continue to grow in size and capability, data is often left behind in the rush toward solving problems involving documents, images, and videos. This talk will describe our research at Stanford University and Numbers Station AI on applying FMs to structured data and their applications in the modern data stack. Starting with ETL/ELT, we'll discuss our 2022 VLDB paper "Can Foundation Models wrangle your data?", the first line of work to use FMs to accelerate tasks like data extraction, cleaning and integration. We'll then move up the stack and discuss our work at Numbers Station to use FMs to accelerate data analytics workflows, by automating tasks like text-to-SQL generation, semantic catalog curation and data visualizations. We will then conclude this talk by discussing challenges and solutions for production deployment in the modern data stack.

1:30 pm - 2:00 pm Generative AI

Programming LLMs for Business Application is Way Better Than 'Tuning' Them

Tsvi Lev Managing Director of the NEC Research Center in Israel at NEC Corporation

In modern enterprises many employees routinely have to perform tasks related to text understanding and basic manipulation. These include, e.g., classification, routing, data extraction into spreadsheets/databases, and formatted summarization into a report. Typically, LLM based approaches to these tasks aim to 'train' or 'fine tune' the LLM to do the same, based on labeled data curated by the organisation itself and/or external open source datasets from the relevant industry. However, the preparation of such datasets is an expensive, time consuming process. Another issue is that when the tuned LLM makes mistakes, it is not clear how much new labeled data, and of which kind, is required to solve the issue. This makes actual commercial use difficult for many use cases. Humans, of course, can accomplish the same tasks by being given a written or oral policy. In the NEC research labs, we have created an LLM based process of converting policies into 'prompt ensembles' that are used to effectively 'program' the existing, untuned or lightly tuned LLM to give discrete, constrained-by-prompt answers which are then aggregated and filtered to yield the final result. Constraining the results by breaking up the policy into a 'prompt ensemble' prevents the problem of hallucinations, as the LLM answers are discrete or very short. At the same time, the accuracy level is increased by using partially overlapping/repeating prompts. When the resulting system does not implement the policy as desired, or the corporate policy itself is changed, it is easy to locate and modify/add the relevant prompts, and verify the mistake will not be repeated. We will show sample cases where we have successfully applied this method to several common problems, e.g. fine text classification of emails on the Enron dataset, and compliance verification on contracts and budget related corporate texts. The process is not strongly coupled to the use of a specific LLM, nor to a specific language. As LLMs become stronger, our method directly benefits from that.
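The aggregation idea in this abstract can be illustrated with a minimal sketch. It is not NEC's implementation: the labels, the prompts, the `classify` helper, the 0.5 threshold, and the `keyword_stub` stand-in for an LLM call are all hypothetical, chosen only to show how short, overlapping yes/no prompts might be filtered and combined by vote.

```python
from typing import Callable, Dict, List

# Hypothetical routing policy, split into short yes/no prompts per label.
# Prompts within a label overlap on purpose, so one bad answer can be outvoted.
PROMPT_ENSEMBLE: Dict[str, List[str]] = {
    "legal": [
        'Does the text mention "contract"? Answer yes or no.',
        'Does the text mention "agreement"? Answer yes or no.',
        'Does the text mention "counsel"? Answer yes or no.',
    ],
    "scheduling": [
        'Does the text mention "meeting"? Answer yes or no.',
        'Does the text mention "calendar"? Answer yes or no.',
        'Does the text mention "reschedule"? Answer yes or no.',
    ],
}


def classify(text: str, ask_llm: Callable[[str, str], str],
             threshold: float = 0.5) -> str:
    """Ask every prompt, keep only well-formed answers, and vote per label."""
    scores: Dict[str, float] = {}
    for label, prompts in PROMPT_ENSEMBLE.items():
        answers = [ask_llm(p, text).strip().lower() for p in prompts]
        # Filter step: anything other than a discrete yes/no is discarded.
        valid = [a for a in answers if a in ("yes", "no")]
        if valid:
            scores[label] = valid.count("yes") / len(valid)
    if not scores:
        return "unknown"
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "unknown"


def keyword_stub(prompt: str, text: str) -> str:
    """Stand-in for a real LLM call (assumption): matches the quoted keyword."""
    keyword = prompt.split('"')[1]
    return "yes" if keyword in text.lower() else "no"


if __name__ == "__main__":
    print(classify("Please review the contract with counsel.", keyword_stub))
    print(classify("Can we move the meeting? Check my calendar.", keyword_stub))
    print(classify("Lunch was great.", keyword_stub))
```

In practice `ask_llm` would wrap whichever model is in use; because each answer is a single token, a changed policy maps to adding or editing individual prompts rather than relabeling data.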

2:10 pm - 2:40 pm Data Engineering

Mastering Real-time Processing While Effectively Scaling Data Quality with Lambda or Kappa Architecture

Vipul Bharat Marlecha Senior Big Data Engineer at Netflix

In a world that creates 2.5 quintillion bytes of data every day, auditing data at scale becomes a challenge of unprecedented magnitude. ‘Mastering Real-time Processing while effectively scaling Data Quality with Lambda or Kappa Architecture’ provides a deep-dive into powerful methodologies, revealing design patterns that turn this challenge into an opportunity for businesses. Join us as we navigate the complexities of data audits and discover how leveraging these techniques can drive efficiency, reduce latency, and deliver actionable insights from your data - at any scale.

2:50 pm - 3:20 pm LLMs

Reasoning in Large Language Models

Maryam Fazel-Zarandi, PhD Research Engineering Manager, FAIR at Meta

Scaling language models has improved state-of-the-art performance on nearly every NLP benchmark, with large language models (LLMs) performing impressively as few-shot learners. Despite these achievements, even the largest of these models still struggle with tasks that require reasoning. Recent work has shown that prompting or fine-tuning LLMs to generate step-by-step rationales, or asking them to verify their final answer can lead to improvements on reasoning tasks. While these methods have proven successful in specific domains, there is still no general framework for LLMs to be capable of reasoning in a wide range of situations. In this talk, I will give an overview of some of the existing methods used for improving and eliciting reasoning in large language models, methods for evaluating reasoning in these models, and discuss limitations and challenges.

3:30 pm - 4:00 pm Responsible AI

Resisting AI

Dr. Dan McQuillan Lecturer in Creative and Social Computing at Goldsmiths, University of London

This session will introduce the arguments set out in the book 'Resisting AI'. The objective is to reframe the operations of AI as social and political as well as technical, to highlight their potential for amplifying social harms, and to equip participants with alternative ways to assess the social purpose of their work. Starting from the specific technical operations of deep learning and generative AI, the talk will explore AI's direct social impacts, its immersion in institutional and bureaucratic structures, and its resonance with key dynamics at a political, environmental and global level. The talk will challenge the sense that AI represents a sudden acceleration into a sci-fi future, and will draw out the different ways in which abstract computational operations are entangled with the same messy histories and politics as everything else in society. As such, it is an opportunity for participants to acknowledge the more uncomfortable aspects of the AI industry. The aim is to empower participants to call out solutionism, thoughtlessness or epistemic injustice and to steer their work away from instances of algorithmic violence or social and political exclusion. However, the talk will also set out proposals for positive futures in which complex computation is embedded in feminist and decolonial relationality and constitutes a technical practice for the common good.

9:00 am - 9:25 am

Social and Ethical Implications of Generative AI

Abeba Birhane Senior Fellow in Trustworthy AI at Mozilla Foundation | Adjunct Lecturer/Assistant Professor at Trinity College Dublin

As Artificial Intelligence systems pervade day-to-day life, the need for these systems to be robust, fair, accurate, and just has become urgent. As the foundational backbone of AI systems, large scale datasets play a crucial role in the performance, accuracy, robustness, fairness and trustworthiness of AI systems. In this talk, I: a) present work that highlights numerous concerns arising from large scale datasets, b) discuss the downstream impact of such datasets on models (including the exacerbation of societal biases and negative stereotypes) and c) review some approaches to both incremental improvements and broader structural change.

9:30 am - 9:55 am LLMs

Deep Reinforcement Learning in the Real World: From Chip Design to LLMs

Anna Goldie Senior Staff Research Scientist at Google DeepMind

Reinforcement learning (RL) is famously powerful but difficult to wield, and until recently, had demonstrated impressive results on games, but little real world impact. I will start the talk with a discussion of RL for Large Language Models (LLMs), including scalable supervision techniques to better align models with human preferences (Constitutional AI / RLAIF). Next, I will discuss RL for chip floorplanning, one of the first examples of RL solving a real world engineering problem. This learning-based method can generate placements that are superhuman or comparable on modern accelerator chips in a matter of hours, whereas the strongest baselines require human experts in the loop and can take several weeks. This method was published in Nature and used in production to generate superhuman chip layouts for the last four generations of Google’s flagship AI accelerator (TPU).

10:10 am - 10:40 am Multimodal and Deep Learning

End-to-End Speech Recognition: The Journey from Research to Production

Tara Sainath, PhD Principal Research Scientist at Google DeepMind

End-to-end (E2E) speech recognition has become a popular research paradigm in recent years, allowing the modular components of a conventional speech recognition system (acoustic model, pronunciation model, language model) to be replaced by one neural network. In this talk, we will discuss a multi-year research journey of E2E modeling for speech recognition at Google. This journey has resulted in E2E models that can surpass the performance of conventional models across many different quality and latency metrics, as well as the productionization of E2E models for Pixel 4, 5 and 6 phones. We will also touch upon future research efforts with E2E models, including multi-lingual speech recognition.

10:10 am - 10:40 am Data Visualization

How to Become a True Dataviz Pro

Nick Desbarats Globally recognized educator and best-selling author at Practical Reporting Inc.

Many analytics, data science and AI initiatives involve presenting data to stakeholders in charts. However, many charts are poorly designed and leave audiences confused, unmoved, or misled, even when the chart creator wasn’t trying to confuse or mislead anyone. Even charts from high tech companies, universities, government agencies and major news media outlets often suffer these fates. Why do charts so frequently flop with audiences? Often, the reasons are surprisingly mundane: poor chart type choices, poor scale formatting choices, poor color choices, and a host of other basic design problems. These charts are like potentially great documents that are ruined by basic spelling and vocabulary mistakes. Like any other language, the language of data visualization has a “spelling and vocabulary,” that is, a set of skills and best practices that must be learned in order to communicate effectively in that language. The “spelling and vocabulary of data visualization” includes knowing how to choose chart types, scale ranges, and colors (among many other design choices), as well as knowing how to make charts obvious by highlighting key elements, adding key insights as chart titles and annotations, and adding comparison or reference values. If a chart creator hasn’t learned these basic skills, they’re at high risk of producing ineffective (and potentially misleading) charts. In this eye-opening talk, globally recognized data visualization educator and best-selling author Nick Desbarats explains exactly what it takes to learn the basic “spelling and vocabulary of data visualization,” and how to become a true data visualization pro, able to design clear, compelling charts every time. Attendees should have experience creating basic charts in a data visualization application (Excel, Tableau, Qlik, etc.).

10:50 am - 11:20 am LLMs

Data Automation with LLM

Rami Krispin Senior Manager - Data Science and Engineering at Apple

In today's business environment, data plays a crucial role in decision-making. However, obtaining the required data can be challenging due to data engineering or data science resource constraints, leading to delays, inefficiency, and potential losses. This talk will focus on creating a self-serve bot (e.g., a Slack bot) that can serve data requests and support ad-hoc requests by leveraging LLM applications. This involves building a natural language to SQL engine using tools such as the OpenAI API or open-source models that leverage the Hugging Face API.

11:30 am - 12:00 pm NLP

Applying Responsible Generative AI in Healthcare

David Talby, PhD Chief Technology Officer at John Snow Labs

The past year has been filled with frameworks, tools, libraries, and services that aim to simplify and accelerate the development of Generative AI applications. However, a lot of them do not work in practice, on real use cases and datasets. This session surveys lessons learned from real-world projects in healthcare that created a compelling POC and only then uncovered major gaps from what a production-grade system will require: 1. Fragility and sensitivity of current LLMs to minor changes in both datasets and prompts, and the resulting accuracy impact. 2. Where guardrails and prompt engineering fall short in addressing critical bias, sycophancy, and stereotype risks. 3. The vulnerability of current LLMs to known medical cognitive biases such as anchoring, ordering, and attention bias. This session is intended for practitioners who are building Generative AI systems in Healthcare and need to be aware of the legal and reputation risks involved and what can be done to mitigate them.

12:20 pm - 12:50 pm MLOps

Shifting Gears to LLMOps: Understanding the Challenges in MLOps for LLMs

Noel Konagai UX Researcher at Google

With the rise of Generative AI we are increasingly confronted with a pertinent question: what about our MLOps (Machine Learning Operations) needs to change to accommodate LLMs (Large Language Models)? We argue that the principles of MLOps fundamentally still apply to LLMs, but the “how” of MLOps changes. While LLMs can be used in Classical ML tasks (e.g. sentiment analysis), what complicates MLOps for LLMs is a shift from model-centric thinking to application-centric thinking. A chatbot application may not only contain the LLM itself but might also use Retrieval Augmented Generation (RAG) with a knowledge base to reduce hallucinations, use a fine-tuning process to adjust the tone of the chatbot, and use plug-ins to execute tasks on a third-party platform. Challenges in LLM evaluation ensue: while in Classical ML we had industry standard quantitative metrics such as root-mean-square error that help assess model performance, with LLMs we enter an ambiguous space with new methods emerging to evaluate the end-user experience. All these additional components complicate running, tracking and evaluating experiments with LLMs. In this talk, we focus on LLMs used for generative purposes, such as chatbots, and present a five-step process that compares each step of MLOps (discovery, development, evaluation, deployment, and monitoring) for Classical ML with the new challenges of operationalizing LLMs for generative applications. Attendees can walk away with an increased understanding of the methods and frameworks to understand their LLM productionization process, better equipped to tackle the challenges of MLOps for LLMs.

1:00 pm - 1:30 pm MLOps

Abstracting ARM/x86 CPUs and NVIDIA/Neuron Hardware Accelerator Allocation for Containerized ML App

Yahav Biran, PhD Principal Architect at Amazon Web Services

The shortage of hardware accelerators delays model training for customers with computationally intensive and parallel processing capabilities. Moreover, applications’ lack of flexibility to support both general-purpose compute and high availability accelerators makes training jobs rigid and difficult to resume after unexpected host interruptions. Also, customers cannot deploy flexible inference services that enable cost, availability, latency, and performance tradeoffs, e.g., by defining compute priorities for inference across CPUs and hardware accelerators with different prices and locations. Until today, customers who trained models and offered model inference services had to manually configure compute infrastructure requirements that matched their application. If these resources could not be allocated, the job was delayed. Cube-scheduler allows more flexibility for machine learning jobs by automatically detecting and matching job specifications to processors and hardware accelerators. Cube-scheduler seamlessly invokes ML software packages on optimal resources by abstracting the underlying runtime packages such as Linux and Python.

1:40 pm - 2:10 pm ML Safety & Security

Navigating the Landscape of Responsible AI: Principles, Practices, and Real-World Applications

Rajiv Avacharmal Corporate Vice President at New York Life Insurance

As Artificial Intelligence (AI) becomes increasingly integrated into our daily lives and business, it is imperative that we develop and deploy AI systems responsibly. The rapid advancement of AI technologies presents both immense opportunities and significant challenges, particularly in ensuring that AI systems are ethical, transparent, and accountable. This session will delve into the critical aspects of Responsible AI: the principles, practices, and real-world applications of this essential field. We will begin by exploring the fundamental principles of Responsible AI, including fairness, transparency, accountability, and privacy. These principles serve as the foundation for developing AI systems that are unbiased, explainable, and aligned with societal values. We will discuss the ethical considerations that must be taken into account throughout the AI lifecycle, from data collection and model training to deployment and monitoring. The session will then focus on the practical strategies and tools for implementing Responsible AI. We will cover techniques for mitigating bias in AI models, such as diverse and inclusive datasets, algorithmic fairness metrics, and continuous testing and monitoring. Attendees will learn about the importance of transparency and explainability in AI, and how to incorporate these principles into the design and development of AI systems. We will also address the critical role of governance and regulation in ensuring Responsible AI. This includes discussing the current landscape of AI regulations and guidelines, such as the EU Ethics Guidelines for Trustworthy AI and the IEEE Ethically Aligned Design framework. We will explore how organizations can establish robust governance frameworks that ensure AI systems meet ethical standards and comply with legal requirements.

2:00 pm - 2:30 pm

Session by Suraj Subramanian

Suraj Subramanian Machine Learning Advocate at Facebook

Session Abstract Coming Soon!

2:20 pm - 2:50 pm MLOps

Cost Containment: A Critical Piece of your Data Team's ROI

Lindsay Murphy Head of Data at Secoda

Data teams spend a lot of time measuring and optimizing the effectiveness of other teams. Unfortunately, we're not so great at doing this for ourselves. In this talk, we will dive into a big blind spot that a lot of data teams operate with: not knowing how much they are costing their business (now and in the future). Given how easy it is to rack up expensive bills in pay-as-you-go tools across data stacks, this can become a big problem for data teams, very fast. We'll discuss tactical methods for building cost monitoring metrics and reporting (and why you should make this a priority), cover some of the challenges you will face along the way, and suggest ways to implement cost containment best practices into your workflows to drive cost accountability across your data team and company.

2:20 pm - 2:50 pm Machine Learning

The Promise of Edge ML: Bringing Your Model to Your Data

David Aronchick CEO at Expanso

At the intersection of machine learning (ML) and edge computing, this talk will explore the new opportunity in processing data with ML where it's generated. We'll discuss the advantages of edge ML, including immediate insights, privacy preservation, and reduced network demands. Challenges like resource constraints and the need for efficient model management will be addressed, emphasizing solutions such as lightweight architectures and robust MLOps practices. The session will briefly highlight the impact on industries like autonomous vehicles and smart manufacturing, and the environmental benefits of localized data processing. Attendees will understand how edge ML is a strategic necessity for harnessing data's full potential, ensuring privacy, and enhancing operational efficiency. Join us to discover how ML at the edge is driving the next wave of digital innovation.

3:00 pm - 3:30 pm Machine Learning

Leveraging Predictive Models and Data Science to Optimize Information Retrieval Systems

Vidhya Suresh Senior Software Engineer at Atlassian
Hareen Venigalla Applied Science Manager at Uber Inc

This presentation explores how data science and predictive modeling optimize the performance and scalability of information retrieval (IR) systems. We'll examine the impact of query analysis, document ranking, and result aggregation on user satisfaction. Our research demonstrates that techniques like keyword extraction, intent analysis, and custom deep ranking models can reduce irrelevant results by up to 26% while decreasing computing costs by more than 39%. We'll address the challenges of scaling IR systems to handle massive datasets and high query volumes, highlighting how predictive models streamline resource-intensive processes. Finally, we'll present optimization strategies leveraging distributed computing, multi-stage caching, and predictive ranking models to enhance throughput, reduce latency, and minimize computational overhead. This presentation offers valuable insights for those interested in the intersection of data science and information retrieval.

9:30 am - 9:55 am LLMs

Setting Up Text Processing Models for Success: Formal Representations versus Large Language Models

Carolyn Rosé, PhD Professor, Program Director for the Masters of Computational Data Science at Carnegie Mellon University

With increasingly vast storehouses of textual data readily available, the field of Natural Language Processing offers the potential to extract, organize, and repackage knowledge revealed either directly or indirectly. Though for decades one of the holy grails of the field has been the vision of accomplishing these tasks with minimal human knowledge engineering through machine learning, with each new wave of machine learning research, the same tensions are experienced between investment in knowledge engineering and integration know-how on the one hand and production of knowledge/insight on the other hand. This talk explores techniques for injecting insight into data representations to increase effectiveness in model performance, especially in a cross-domain setting. Recent work in neural-symbolic approaches to NLP is one such approach, in some cases reporting advances from incorporation of formal representations of language and knowledge and in other cases revealing challenges in identifying high utility abstractions and strategic exceptions that frequently require exogenous data sources and the interplay between these formal representations and bottom-up generalities that are apparent from endogenous sources. More recently, Large Language Models (LLMs) have been used to produce textual augmentations to data representations, with more success. Couched within these tensions, this talk reports on recent work towards increased availability of both formal and informal representations of language and knowledge as well as explorations within the space of tensions to use this knowledge in effective ways.

10:00 am - 10:30 am Generative AI

Build GenAI Systems, Not Models

Hugo Bowne-Anderson, PhD Head of Developer Relations at Outerbounds

This talk explores a framework for how data scientists can deliver value with Generative AI: How can you embed LLMs and foundation models into your pre-existing software stack? How can you do so using Open Source Python? What changes about the production machine learning stack and what remains the same? We motivate the concepts through generative AI examples in domains such as text-to-image (Stable Diffusion) and speech-to-text (Whisper) applications. Moreover, we’ll demonstrate how workflow orchestration provides a common scaffolding to ensure that your Generative AI and classical Machine Learning workflows alike are robust and ready to move safely into production systems. This talk is aimed squarely at (data) scientists and ML engineers who want to focus on the science, data, and modeling, but want to be able to access all their infrastructural, platform, and software needs with ease!

10:40 am - 11:10 am Responsible AI

Advancing Ethical Natural Language Processing: Towards Culture-Sensitive Language Models

Gopalan Oppiliappan Head, AI Centre of Excellence at Intel India

Natural Language Processing (NLP) systems play a pivotal role in various applications, from virtual assistants to content generation. However, the potential for biases and insensitivity in language models has raised concerns about equitable representation and cultural understanding. This talk explores the development of culture-sensitive large language models (LLMs) as a progressive step towards addressing these issues. The core principles involve diversifying training data to encompass a wide range of cultures, implementing bias detection and mitigation strategies, and fostering collaboration with cultural experts to enhance contextual understanding. Our approach emphasizes the importance of ethical guidelines for the development and deployment of LLMs, focusing on principles such as avoiding stereotypes, respecting cultural diversity, and handling sensitive topics responsibly. The models are designed to be customizable, allowing users to fine-tune them according to specific cultural requirements, fostering inclusivity and adaptability. The incorporation of multilingual capabilities ensures that the models cater to global linguistic diversity, acknowledging the richness of different languages and cultural expressions. Moreover, we propose a feedback mechanism where users can report instances of cultural insensitivity, establishing a continuous improvement loop. Transparency and explainability are prioritized to enable users to comprehend the decision-making process of the models, promoting accountability. Through this multidimensional approach, we aim to advance the field of NLP by developing culture-sensitive LLMs that not only understand and respect diverse cultural nuances but also contribute to a more inclusive and ethical use of language technology.

11:20 am - 11:50 am Generative AI

Beyond Theory: Effective Strategies for Bringing Generative AI into Production

Heiko Hotz Generative AI Global Blackbelt | Google

In the rapidly evolving landscape of artificial intelligence, foundation models like GPT-4 and DALL-E 3 and the broader world of generative AI have emerged as potential game-changers, offering unprecedented capabilities across a wide variety of domains and use cases. However, while these models showcase promising capabilities, the practical challenge of transitioning from conceptual research to full-scale production-level applications remains a major obstacle that many organizations and teams continue to face. This keynote presentation aims to help bridge this gap by taking a deep dive into pragmatic and actionable strategies and best practices for successfully integrating these cutting-edge AI technologies into real-world business environments. We will closely examine the critical concepts surrounding Foundation Model Operations and Large Language Model Operations (FMOps/LLMOps), delving into the practical intricacies and challenges involved in deploying, monitoring, maintaining and scaling generative AI models in enterprise production systems. The discussion will cover several critical topics such as optimal model selection, rigorous testing and evaluation, efficient training and fine-tuning techniques, retrieval augmented generation (RAG) architectures, and effective deployment strategies required for operationalization. Attendees will gain applicable insights into overcoming common obstacles frequently faced when attempting to deploy AI in live systems, including recommendations around managing resource-intensive models, ensuring ongoing model fairness and transparency, and strategically adapting to the fast-evolving AI landscape. To provide full perspective, the talk will also highlight relevant real-world examples and case studies, providing a comprehensive end-to-end view of the demanding practical requirements for true AI deployment.
This presentation has been tailored for a wide audience encompassing AI and machine learning professionals, technology leaders, IT and DevOps teams, and anyone generally interested in better understanding the operational side of taking AI technology live. Whether you're looking to implement generative AI capabilities in your own organisation or working to enhance existing AI operations, this discussion will equip you with directly actionable knowledge and tools to successfully meet the challenges in navigating the world of FMOps/LLMOps.

12:30 pm - 1:00 pm LLMs

Moving Beyond Statistical Parrots - Large Language Models and their Tooling

Ben Auffarth, PhD Author: Generative AI with LangChain | Lead Data Scientist at Hastings Direct

Large language models like GPT-4 and Codex have demonstrated immense capabilities in generating fluent text. However, simply scaling up data and compute results in statistical parroting without true intelligence. This talk explores frameworks and techniques to move beyond statistical mimicry. We discuss leveraging tools to retrieve knowledge, prompt engineering to steer models, monitoring systems to detect biases, and cloud offerings to deploy conversational agents. We also survey the emerging ecosystem of frameworks, services, and tooling that enables developers to build impactful applications powered by large language models. Even with complex mechanisms like function calling and Retrieval Augmented Generation, navigating towards meaningful outputs and applications requires an overarching focus on strong model governance frameworks that can ensure that biases and harmful ideologies embedded in the training data are duly mitigated, paving the way towards beneficial application development. Developers play a crucial role in this process and should be empowered with tools and knowledge to steer these models appropriately. Intentional use of these elements not only optimizes model governance but also enriches the experience for developers, allowing them to dig deeper and create substantial applications that are not mere parroting but sources of genuine value. From deploying conversational agents to crafting impactful applications across a swath of industries, such as healthcare and education, a comprehensive understanding and use of the vast array of LLM mechanisms can truly push the boundaries of NLP and AI, helping to usher in the age of AI in everyday life.

1:10 pm - 1:40 pm Deep Learning

AI Resilience: Upskilling in an AI Dominant Environment

Leondra Gonzalez Senior Data & Applied Scientist at Microsoft

The boom of generative AI and LLMs has taken the world by storm. This development has already disrupted various industries and roles, and data science is no exception to that rule. In a world of embeddings and transfer learning, one might ask "What should I learn next?" and "Where should I spend my time and energy for deep dives?". This talk aims to guide existing AI practitioners on how to maintain relevant skills in an increasingly automated world, and how to stand out in an oversaturated job market.

1:50 pm - 2:20 pm Data Engineering

Unlocking the Unstructured with Generative AI: Trends, Models, and Future Directions.

Jay Mishra Chief Operating Officer at Astera

DE Summit: The exponential growth in computational power, alongside the advent of powerful GPUs and advancements in cloud computing, has ushered in a new era of generative artificial intelligence (AI), transforming the landscape of unstructured data extraction. Traditional methods such as text pattern matching, optical character recognition (OCR), and named entity recognition (NER) have been plagued by challenges related to data quality, process inefficiency, and scalability. However, the emergence of large language models (LLMs) has provided a groundbreaking solution, enabling the automated, intelligent, and context-aware extraction of structured information from the vast oceans of unstructured data that dominate the digital world. This talk delves into the innovative applications of generative AI in natural language processing and computer vision, highlighting the technologies driving this evolution, including transformer architectures, attention mechanisms, and the integration of OCR for processing scanned documents. We will also talk about the future of generative AI in handling complex datasets. Participants will gain insights into: the fundamental challenges and solutions in unstructured data extraction; the operational dynamics of generative AI in extracting structured information; the future of generative AI in unstructured data extraction; and practical ways to leverage these technologies for real-world applications. Designed for data scientists, AI researchers, and industry professionals, the talk aims to equip attendees with the knowledge to harness the power of generative AI in transforming unstructured data into actionable insights, thereby driving innovation and efficiency across industries.

2:50 pm - 3:20 pm Data Engineering

Data Engineering in the Age of Data Regulations

Alex Gorelik Distinguished Engineer at LinkedIn

DE Summit: A continuous stream of data regulations such as GDPR, CCPA, DMA and many others is giving users control over how their data is used and imposing restrictions on what companies can do with user data. This talk will focus on LinkedIn's approach to converting these regulations into policies and integrating policy enforcement in data engineering practices using our Policy Based Access Control (PBAC) system. It will cover how to annotate data, features, pipelines and models; how to integrate model training and inference with the PBAC system; and how to enforce policies. It will describe the architecture and components of LinkedIn's governance system and various tools used to automate the annotation and enforcement process. LinkedIn plans to open source its PBAC system, but this probably will not happen by April. The talk will also reference the DataHub data catalog and FalDisco automated data classification projects.

3:10 pm - 3:40 pm Data Engineering

Deciphering Data Architectures (choosing between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh)

James Serra Data & AI architect at Microsoft

DE Summit: Data fabric, data lakehouse, and data mesh have recently appeared as viable alternatives to the modern data warehouse. These new architectures have solid benefits, but they’re also surrounded by a lot of hyperbole and confusion. In this presentation I will give you a guided tour of each architecture to help you understand its pros and cons. I will also examine common data architecture concepts, including data warehouses and data lakes. You’ll learn what data lakehouses can help you achieve, and how to distinguish data mesh hype from reality. Best of all, you’ll be able to determine the most appropriate data architecture for your needs. The content is derived from my book Deciphering Data Architectures: Choosing Between a Modern Data Warehouse, Data Fabric, Data Lakehouse, and Data Mesh.

3:30 pm - 4:00 pm Machine Learning

Machine Learning Across Multiple Imaging and Biomarker Modalities in the UK Biobank Improves Genetic Discovery for Liver Fat Accumulation

Sumit Mukherjee Staff Machine Learning Scientist at Insitro

Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD), a condition where the liver contains more than 5.5% fat, is a major risk factor for chronic liver disease, affecting an estimated 30% of people worldwide. Although MASLD is a genetically complex disease, large-scale case-control cohort studies based on MASLD diagnosis have shown only limited success in discovering genes responsible for MASLD. This is largely due to the challenges in accurately and efficiently measuring the disease characteristics, which is often expensive, time-consuming, and inconsistent. In this study, we showcase the power of machine learning (ML) in addressing these challenges. We used ML to predict the amount of fat in the liver using three different types of data from the UK Biobank: body composition data from dual-energy X-ray absorptiometry (DXA), plasma metabolites, and a combination of anthropometric and blood-based biochemical markers (biomarkers). For DXA-based predictions, we used deep learning models, specifically EfficientNet-B0, to predict fat content from DXA scans. For predictions based on metabolites and biomarkers, we used a gradient boosting model, XGBoost. Our ML models estimated that up to 29% of participants in the UK Biobank met the criteria for MASLD, while less than 10% received the clinical diagnosis. We then used these estimates to identify regions of the genome associated with liver fat, finding a total of 321 unique regions, including 312 new ones, significantly expanding our understanding of the genetic determinants of liver fat accumulation. Our ML-based genetic findings showed a high genetic correlation with clinically diagnosed MASLD, suggesting that the genetic regions we identified are also likely to be relevant for understanding and diagnosing the disease in a clinical setting. This strong correlation underscores the potential of our approach to contribute to real-world medical applications.
Our findings highlight the value of ML in identifying disease-related genes and predicting disease risk, demonstrating its potential to enhance our understanding of complex diseases like MASLD. This study highlights the potential of data science to help transform healthcare research and improve patient outcomes.

9:30 am - 11:30 am LLMs

LLM Best Practises: Training, Fine-Tuning and Cutting Edge Tricks from Research

Sanyam Bhutani Sr. Data Scientist and Kaggle Grandmaster

Large Language Models (LLMs) are still relatively new compared to "Traditional ML" techniques, and they come with many new ideas and best practices that differ from training ML models. Fine-tuning models can be really powerful to unlock use cases based on your domain, and AI Agents can be really powerful to unlock previously impossible ideas. In this workshop, you will learn the tips and tricks of creating and fine-tuning LLMs along with implementing cutting-edge ideas for building these systems from the best research papers. We will start by learning the foundations behind what makes an LLM, quickly moving into fine-tuning our own GPT and finally implementing some of the cutting-edge tricks of building these models. There is a lot of noise and signal in this domain right now, so we will focus on understanding the ideas that have been tried and tested. The workshop will also cover case studies spanning ideas that have worked in practice, and we will dive deep into the art and science of working with LLMs.

10:05 am - 11:05 am Generative AI

Mastering PrivateGPT: Tailoring GenAI for your unique applications

Dr. Daniel Gallego Vico Co-Founder at PrivateGPT | Zylon
Iván Martínez Toro Creator and Main Contributor | Co-founder at PrivateGPT

Tutorial: PrivateGPT, a well-recognized open-source project with 48K GitHub stars and a Discord community of more than 3K supporters, offers a robust framework for developing private, context-aware GenAI applications. Tailored to support real-world production scenarios, it provides a default set of functions that efficiently handle common tasks like ingestion of documents, contextual chat and completions, as well as embeddings generation. However, its true strength lies in its adaptability, enabling customization for specific applications. This tutorial guides you through various configuration options and extensions of PrivateGPT. You'll begin by gaining hands-on experience with its default API and functionalities using its Python SDK. Subsequently, you'll explore tweaking settings to adapt it to different setups, ranging from fully local, where everything runs on your computer, to multi-service, where the LLM, embedding model or vector database can be served by different services. The final segment of the tutorial will lead you through PrivateGPT's internal AI logic and architecture to learn how to extend its basic RAG functionalities. Upon completing this tutorial, you'll acquire the skills to customize PrivateGPT for any scenario, whether it be for personal use, intra-company initiatives, or as part of innovative commercial production setups.

12:00 pm - 1:00 pm NLP

Build AI Assistants with Large Language Models

Rafael Vasquez Software Developer at IBM
James Busche Senior Software Developer at IBM

Over the past year, there has been a surge in the popularity of Large Language Models (LLMs). However, how can we effectively leverage LLMs to augment our businesses? One example would be the integration of LLMs into existing business frameworks through the deployment of AI Assistants. These assistants serve as invaluable tools in addressing customer inquiries and minimizing the demand for technical support within organizations. In this session, we will dive into the practicalities of utilizing LLM-powered AI Assistants and seamlessly integrating them into established systems. This workshop provides an easy-to-follow guide on how to use LLMs, configure the settings for your first AI Assistant with LLMs, and seamlessly integrate the AI Assistant into an established system. Session Outline: 1. Learn about LLM basics. We will be using LLMs hosted on IBM Digital Self-Serve Co-Create Experience (DSCE), but you can also use models that are hosted on other platforms such as Hugging Face. 2. Configure the settings for your first AI Assistant with LLMs. Learn the basics of watsonx Assistant and create your first AI conversation with LLMs. Then apply this chatbot to an established system. Background Knowledge: The attendees will learn about the concept of building a chatbot, create an AI conversation, and integrate it into production.

12:00 pm - 1:00 pm Deep Learning

How to Practice Data-Centric AI and Have AI Improve its Own Dataset

Jonas Mueller Chief Scientist and Co-Founder at Cleanlab

DE Tutorial: In Machine Learning projects, one starts by exploring the data and training an initial baseline model. While it’s tempting to experiment with different modeling techniques right after that, an emerging science of data-centric AI introduces systematic techniques to utilize the baseline model to find and fix dataset issues. Improving the dataset in this manner, one can drastically improve the initial model’s performance without any change to the modeling code at all! These techniques work with any ML model and the improved dataset can be used to train any type of model (allowing modeling improvements to be stacked on top of dataset improvements). Such automated data curation has been instrumental to the success of AI organizations like OpenAI and Tesla. While data scientists have long been improving data through manual labor, data-centric AI studies algorithms to do this automatically. This tutorial will teach you how to operationalize fundamental ideas from data-centric AI across a wide variety of datasets (image, text, tabular, etc). We will cover recent algorithms to automatically identify common issues in real-world data (label errors, bad data annotators, outliers, low-quality examples, and other dataset problems that once identified can be easily addressed to significantly improve trained models). Open-source code to easily run these algorithms within end-to-end Data Science projects will also be demonstrated. After this tutorial, you will know how to use models to improve your data, in order to immediately retrain better models (and iterate this data/model improvement in a virtuous cycle).
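As a rough illustration of using a baseline model to surface dataset issues, the sketch below flags examples whose given label disagrees with a confident model prediction. It is a simplified, confidence-threshold heuristic written for this listing, not the tutorial's code or any library's API; the function name `flag_likely_label_errors` and the toy probabilities are hypothetical, and the predicted probabilities are assumed to come from out-of-sample (for example cross-validated) predictions.

```python
def flag_likely_label_errors(labels, pred_probs):
    """Return indices of suspicious labels, least trustworthy first.

    labels:     list of integer class ids, one per example
    pred_probs: list of per-class probability lists from a baseline model
                (assumed to be out-of-sample predictions)
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class k
    # over the examples that are labeled k.
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = [s / c if c else float("inf") for s, c in zip(sums, counts)]

    flagged = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        guess = max(range(n_classes), key=probs.__getitem__)
        # Suspicious when the model prefers another class and is at least
        # as confident in it as it typically is for that class.
        if guess != y and probs[guess] >= thresholds[guess]:
            flagged.append((probs[y], i))

    # Lowest confidence in the given label comes first.
    return [i for _, i in sorted(flagged)]


# Hypothetical toy data: the third example is labeled 0 but looks like class 1.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
             [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
suspects = flag_likely_label_errors(toy_labels, toy_probs)
```

Flagged examples would then be reviewed or relabeled before retraining, which is the data/model improvement loop the abstract describes.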

1:10 pm - 2:10 pm Gen AI

Deploying Trustworthy Generative AI

Krishnaram Kenthapadi Chief AI Officer & Chief Scientist at Fiddler AI

Tutorial: Generative AI models and applications are being rapidly deployed across several industries, but there are several ethical and social considerations that need to be addressed. These concerns include lack of interpretability, bias and discrimination, privacy, lack of model robustness, fake and misleading content, copyright implications, plagiarism, and environmental impact associated with training and inference of generative AI models. In this talk, we first motivate the need for adopting responsible AI principles when developing and deploying large language models (LLMs) and other generative AI models, and provide a roadmap for thinking about responsible AI for generative AI in practice. Focusing on real-world LLM use cases (e.g. evaluating LLMs for robustness, security, etc. using https://github.com/fiddler-labs/fiddler-auditor), we present practical solution approaches / guidelines for applying responsible AI techniques effectively and discuss lessons learned from deploying responsible AI approaches for generative AI applications in practice. By providing real-world generative AI use cases, lessons learned, and best practices, this talk will enable researchers & practitioners to build more reliable and trustworthy generative AI applications. Please take a look at our recent ICML/KDD/FAccT tutorial (https://sites.google.com/view/responsible-gen-ai-tutorial) for an expanded version of this talk.

2:00 pm - 3:00 pm Machine Learning

Idiomatic Pandas

Matt Harrison Python & Data Science Corporate Trainer | Consultant at MetaSnake

Pandas can be tricky, and there is a lot of bad advice floating around. This tutorial will cut through some of the biggest issues I've seen with Pandas code after working with the library for a while and writing three books on it. Are you confused or frustrated with Pandas? Or maybe you find your own Pandas code confusing or difficult to work with when you come back to it later. I've taught Pandas to thousands in corporate settings, at universities, and virtually. I've also seen the bad code that my students write and have strong opinions on how to correct it. This workshop assumes you want to apply idiomatic constructs to existing code. There will be some lecture and then breakout time to apply the constructs on your own. We will cover: * Types * Chaining * Mutation * Aggregation * Debugging. Tutorial Outline: * Introduction (5 min) * Loading data & Types (40 min) * Lab * Chaining (45 min) * Lab * Mutation (5 min) * Aggregation (40 min) * Lab * Debugging (45 min) * Lab
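To make the topics above concrete, here is a minimal sketch of what typed, chained, non-mutating Pandas code can look like. It is an illustration written for this listing, not material from the tutorial; the column names and values are made up.

```python
import pandas as pd

# Hypothetical raw data: prices arrive as strings, with one bad value.
raw = pd.DataFrame({
    "city": ["Boston", "Boston", "Austin", "Austin", "Austin"],
    "price": ["310", "290", "200", "n/a", "240"],
    "sqft": [1000, 900, 1100, 950, 1200],
})


def tidy(df):
    """Fix types in one chain; every step returns a new frame."""
    return (
        df
        # Types: parse prices, turning unparseable values into NaN.
        .assign(price=lambda d: pd.to_numeric(d["price"], errors="coerce"))
        .dropna(subset=["price"])
        .astype({"city": "category", "sqft": "int32"})
        # Derived column added with assign rather than in-place mutation.
        .assign(price_per_sqft=lambda d: d["price"] / d["sqft"])
    )


# Aggregation: named aggregations keep the output columns explicit.
summary = (
    tidy(raw)
    .groupby("city", observed=True)
    .agg(median_price=("price", "median"), homes=("price", "size"))
    .reset_index()
)
```

Because each step returns a new object, `raw` is left untouched, and a `.pipe(...)` call can be dropped between any two steps to inspect intermediate results while debugging.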

2:20 pm - 3:20 pm Generative AI

Stable Diffusion: Advancing the Text-to-Image Paradigm

Sandeep Singh Head of Applied AI/Computer Vision at Beans.ai

This session will introduce attendees to Stable Diffusion, a new text-to-image generation model that is more stable and efficient than previous models. Stable Diffusion is able to generate high-quality images from text descriptions, and it is well-suited for a variety of applications, such as creative content generation, product design, and marketing. Learning Outcomes: By the end of this session, attendees will be able to: - Understand the basics of Stable Diffusion and how it works. - Know the whole landscape of tools and libraries in the Stable Diffusion domain. - Generate images from text descriptions using Stable Diffusion. - Apply Stable Diffusion to their own projects and workflows. - Understand the process of fine-tuning open source models to achieve the task at hand. This session is relevant to practitioners in a variety of industries, including: Creative industries: Stable Diffusion can be used to generate images for marketing materials, product designs, and other creative projects. Technology industries: Stable Diffusion can be used to develop new applications for text-to-image generation, such as chatbots and virtual assistants. Research industries: Stable Diffusion can be used to conduct research on text-to-image generation and its applications.

2:20 pm - 3:20 pm Machine Learning

No-Code and Low-Code AI: A Practical Project Driven Approach to ML

Gwendolyn D. Stripling, PhD Lead AI & ML Content Developer at Google Cloud

Tutorial: No-code machine learning (ML) is a way to build and deploy ML models without having to write any code. Low-code ML is a way to build and deploy ML models with minimal coding. Both methods can be valuable for businesses and individuals who do not have the skills or resources to develop ML models themselves. By completing this workshop, you will develop an understanding of no-code and low-code frameworks, how they are used in the ML workflow, how they can be used for data ingestion and analysis, and for building, training, and deploying ML models. You will become familiar with Google’s Vertex AI for both no-code and low-code ML model training, and with Google’s Colab, a free Jupyter Notebook service for running Python and the Keras Sequential API, a simple and easy-to-use API that is well-suited for beginners. You will also become familiar with how to assess when to use low-code, no-code, and custom ML training frameworks. The primary audience for this workshop is aspiring citizen data scientists, business analysts, data analysts, students, and data scientists who seek to learn how to very quickly experiment, build, train, and deploy ML models.

10:00 am - 11:00 am Machine Learning

Causal AI: from Data to Action

Dr. Andre Franca CTO at connectedFlow

In this talk, we will explore and demystify the world of Causal AI for data science practitioners, with a focus on understanding cause-and-effect relationships within data to drive optimal decisions. We will focus on: * from Shapley to DAGs: the dangers of using post-hoc explainability methods as tools for decision making, and how traditional ML isn't suited to situations where we want to perform interventions on the system. * discovering causality: how do we figure out what is causal and what isn't, with a brief introduction to methods of structure learning and causal discovery * optimal decision making: by understanding causality, we can now accurately estimate the impact we can make on our system - how do we use this knowledge to derive the best possible actions to take? This talk is aimed at both data scientists and industry practitioners who have a working knowledge of traditional statistics and basic ML. This talk will also be practical: we will provide you with guidance to immediately start implementing some of these concepts in your daily work.

10:00 am - 11:00 am LLMs

Operationalizing Local LLMs Responsibly for MLOps

Noah Gift Pioneering MLOps Leader & Author, Veteran Startup CTO, Duke Data Science & AI EIR

I. Introduction to LLMs (5 mins): Defining the foundations of large language models; use cases like search, content generation, programming.
II. Architecting High-Performance LLM Pipelines (15 mins): Storing training data efficiently at scale; leveraging specialized hardware accelerators; optimizing hyperparameters for cost/accuracy; serving inferences with low latency.
III. Monitoring and Maintaining LLMs (10 mins): Tracking model accuracy and performance; retraining triggers to stay performant; evaluating inferences for bias indicators; adding human oversight loops.
IV. Building Ethical Guardrails for Local LLMs (10 mins): Auditing training data composition; establishing process transparency; benchmarking rigorously on safety; implementing accountability for production systems.
V. The Future of Responsible Local LLMs (5 mins): Advances that build trust and mitigate harms; policy considerations around generative models; promoting democratization through education.

11:10 am - 12:10 pm Generative AI

Everything About Large Language Models: Pre-training, Fine-tuning, RLHF & State of the Art

Chandra Khatri VP, Head of AI at Krutrim

Generative Large Language Models like GPT-4 have revolutionized the entire tech ecosystem. But what makes them so powerful? What are the secret components that make them generalize to a variety of tasks? In this talk, I will present how these foundation models are trained. What are the steps and core components behind these LLMs? I will also cover how smaller, domain-specific models can outperform general-purpose foundation models like ChatGPT on target use cases.

11:10 am - 12:10 pm NLP

Machine Learning using PySpark for Text Data Analysis

Bharti Motwani Clinical Associate Professor at University of Maryland, USA

In this session, unsupervised Machine Learning algorithms like Cluster Analysis and Recommendation Systems and supervised Machine Learning algorithms like Random Forest, Decision Tree, Bagging and Boosting will be discussed for doing analysis using PySpark. The main feature of this workshop will be the implementation of these algorithms on text data. Considering the volume of reviews and other text data available on social media platforms, the importance of text data analysis has grown multifold. The session will be particularly helpful for startups and existing businesses that want to use AI to improve performance.

12:30 pm - 1:30 pm LLMs

Prompt Engineering with Llama 3

Amit Sangani Director of Partner Engineering at Meta

This session aims to provide hands-on, engaging content that gives developers a basic understanding of Llama 3 models, how to access and use them, understand the architecture and build an AI chatbot using LangChain and Tools. The audience will also learn core concepts around Prompt Engineering and Fine-Tuning and programmatically implement them using Responsible AI principles. Lastly, we will conclude the talk by explaining how they can leverage this powerful tech, different use cases and what the future looks like. Section 1: Understanding Llama 3. Familiarize yourself with Llama 3 models and architecture, how to download, install and access them, and the basic use cases they can accomplish. Additionally, we will review basic completion, system prompts and responses in different formats. Section 2: Prompt Engineering and Chatbot. We will walk through the concepts of Prompt Engineering and chatbot architecture, including implementing single-turn and multi-turn chat requests, hallucinations and how to prevent them, augmenting external data using Retrieval Augmented Generation (RAG) principles and implementing all of this using LangChain. We will also review advanced concepts around Fine-Tuning. Section 3: Responsible AI and Future. We will discuss the basic Responsible AI considerations as you build your Generative AI strategy and applications, safety measures to address context-specific risks, best practices for mitigating potential risks and more. We will also discuss what the future holds in the Generative AI space and give you a glimpse of what to expect from Llama offerings. Prerequisites: basic knowledge of Python and LLMs.

12:30 pm - 1:30 pm Machine Learning

Introduction to Linear Regression using Spreadsheets with Real Estate Data

Roberto Reif CEO and Founder at ScholarU

Over the course of this session, we'll embark on a deep dive into the foundational principles of linear regression, a statistical machine learning model that aids in unraveling the intricate relationships between two or more variables. Our unique focus centers on the practical application of linear regression using real-world real estate data, offering a concrete context that will undoubtedly resonate with participants. The workshop kicks off with a thorough overview of linear regression concepts, ensuring a collective understanding of the fundamentals. As we progress, we transition into the practical realm, employing popular spreadsheet tools like Excel or Google Sheets to conduct insightful real estate data analyses. Participants will master the art of data input, application of regression formulas, model building, and interpretation of results, enriching their analytical toolkit. The workshop's core revolves around a hands-on exploration of a real-world scenario. Together, we'll dissect a data set featuring crucial real estate variables such as property prices, square footage, number of bedrooms and bathrooms, and location. This pragmatic approach empowers participants to directly apply linear regression concepts to authentic situations commonly encountered in the dynamic field of real estate. Engagement is key throughout our workshop, featuring interactive exercises, group discussions, and dedicated Q&A sessions to reinforce comprehension. By the workshop's conclusion, participants will wield the skills to adeptly leverage the fundamental machine learning model of linear regression for making informed and predictive decisions in the realm of real estate. Whether you're a novice seeking an introduction to regression analysis or a seasoned analyst aiming to refine your skills, this workshop guarantees a stimulating and enlightening experience.
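For readers who want to see the arithmetic behind a one-variable linear regression, the sketch below fits price against square footage with ordinary least squares. It is an illustration written for this listing, not workshop material, and the four data points are made up. The slope and intercept computed here are the same quantities the `SLOPE` and `INTERCEPT` functions return in Excel and Google Sheets.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical listings: square footage and price in thousands of dollars.
sqft = [1000, 1500, 2000, 2500]
price = [200, 275, 350, 425]

slope, intercept = fit_line(sqft, price)

# Predicted price (in thousands) for an 1,800 sq ft home under this fit.
predicted = intercept + slope * 1800
```

With real listings the points will not fall exactly on a line, so the fitted slope is read as the average change in price per additional square foot.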

1:40 pm - 2:40 pm Generative AI

Graphs: The Next Frontier of GenAI Explainability

Michelle Yi Board Member at Women In Data
Amy Hodler Founder, Consultant at GraphGeeks.org

In a world obsessed with making predictions and generative AI, we often overlook the crucial task of making sense of these predictions and understanding results. If we have no understanding of how and why recommendations are made, if we can’t explain predictions – we can’t trust our resulting decisions and policies. In the realm of predictions, explainability, and causality, graphs have emerged as a powerful model that has recently yielded remarkable breakthroughs. Graphs are purposefully designed to capture and represent the intricate connections between entities, offering a comprehensive framework for understanding complex systems. Leading teams use this framework today to surface directional patterns, compute complex logic, and as a basis for causal inference. This talk will examine the implications of incorporating graphs into the realm of generative AI, exploring the potential for even greater advancements. Learn about foundational concepts such as directed acyclic graphs (DAGs), Judea Pearl’s “do” operator, and keeping domain expertise in the loop. You’ll hear how the explainability landscape is evolving, comparisons of graph-based models to other methods, and how we can evaluate the different fairness models available. We’ll look into the open source PyWhy project for causal inference and the DoWhy method for modeling a problem as a causal graph with industry examples. By identifying the assumptions and constraints up front as a graph and applying that through each phase of modeling mechanisms, identifying targets, estimating causal effects, and refuting these with each inference – we can improve the validity of our predictions. We’ll also explore other open source packages that use graphs for counterfactual approaches, such as GeCo and Omega. Join us as we unravel the transformative potential of graphs and their impact on predictive modeling, explainability, and causality in the era of generative AI.
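As a small worked example of the "do" operator on a DAG, the sketch below compares a naive observational contrast with a backdoor-adjusted one for the graph Z → X, Z → Y, X → Y. It is an illustration written for this listing and does not use PyWhy or DoWhy; all probabilities are made-up assumptions chosen so the arithmetic is easy to follow.

```python
# Hypothetical model: Z confounds the effect of treatment X on outcome Y.
p_z = {0: 0.5, 1: 0.5}            # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}   # P(X = 1 | Z = z)


def p_y1(x, z):
    """P(Y = 1 | X = x, Z = z); the true effect of X is +0.2."""
    return 0.1 + 0.2 * x + 0.5 * z


def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]


def observational(x):
    """P(Y = 1 | X = x): conditioning, so Z's distribution shifts with x."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    joint = sum(p_y1(x, z) * p_x_given_z(x, z) * p_z[z] for z in p_z)
    return joint / p_x


def interventional(x):
    """P(Y = 1 | do(X = x)) via backdoor adjustment over Z."""
    return sum(p_y1(x, z) * p_z[z] for z in p_z)


naive_effect = observational(1) - observational(0)
causal_effect = interventional(1) - interventional(0)
```

Under these assumed numbers the naive contrast comes out at about 0.5 while the adjusted effect is about 0.2, which is the gap between seeing X = 1 and setting X = 1 that the "do" operator formalizes.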

1:40 pm - 2:40 pm Machine Learning

Feature Stores in Practice: Build and Deploy a Model with Featureform, Redis, Databricks, and Sagemaker

Simba Khadder Founder & CEO at Featureform

The term "Feature Store" often conjures a simplistic idea of a storage place for features. However, in reality, they serve as robust frameworks and orchestrators for defining, managing, and deploying feature pipelines. The veneer of simplicity often masks the significant operational gains organizations can achieve by integrating the right feature store into their ML platform. This session is designed to peel back the layers of ambiguity surrounding feature stores, delineating the three distinct types and their alignment within a broader ML ecosystem. Diving into a hands-on section, we will walk through the process of training and deploying an end-to-end fraud detection model utilizing Featureform, Redis, Databricks, and Sagemaker. The emphasis will be on real-world, applicable examples, moving beyond concepts and marketing talk. This session aims to do more than just explain the mechanics of feature stores. It provides a practical blueprint to efficiently harness feature stores within ML workflows, effectively bridging the chasm between theoretical understanding and actionable implementation. Participants will walk away with a solid grasp of feature stores, equipped with the knowledge to drive meaningful insights and enhancements in their real-world ML platforms and projects.

2:50 pm - 3:50 pm Deep Learning

Topological Deep Learning: Going Beyond Graph Data

Dr. Mustafa Hajij Assistant Professor at University of San Francisco

Over the past decade, deep learning has been remarkably successful at solving a massive set of problems on datatypes including images and sequential data. This success drove the extension of deep learning to other discrete domains such as sets, point clouds, graphs, 3D shapes, and discrete manifolds. While many of the extended schemes have successfully tackled notable challenges in each domain, the plethora of fragmented frameworks have created or resurfaced many long-standing problems in deep learning such as explainability, expressiveness and generalizability. Moreover, theoretical development proven over one discrete domain does not naturally apply to the other domains. Finally, the lack of a cohesive mathematical framework has created many ad hoc and inorganic implementations and ultimately limited the set of practitioners that can potentially benefit from deep learning technologies. This talk introduces the foundation of topological deep learning, a rapidly growing field that is concerned with the development of deep learning models for data supported on topological domains such as simplicial complexes, cell complexes, and hypergraphs, which generalize many domains encountered in scientific computations including images and sequence data. It introduces the main notions while maintaining intuitive conceptualization, implementation and relevance to a wide range of practical applications. It also demonstrates the practical relevance of this framework with practical applications ranging from drug discovery to mesh and image segmentation.

9:00 am - 11:00 am Machine Learning

Introduction to Math for Data Science

Thomas Nield Instructor at University of Southern California, Founder | Nield Consulting Group and Yawman Flight

With the availability of data, there is a growing demand for talent who can analyze and make sense of it. This makes practical math all the more important because it helps infer insights from data. However, mathematics comprises many topics, and it is hard to identify which ones are applicable and relevant for a data science career. Knowing these essential math topics is key to integrating knowledge across data science, statistics, and machine learning. It has become even more important with the prevalence of libraries like PyTorch and scikit-learn, which can create "black box" approaches where data science professionals use these libraries but do not fully understand how they work. In this training, Thomas Nield (author of the O'Reilly book "Essential Math for Data Science") will provide a crash course of carefully curated topics to jumpstart proficiency in key areas of mathematics. This includes probability, statistics, hypothesis testing, and linear algebra. Along the way you’ll integrate what you’ve learned and see practical applications for real-world problems. These examples include how statistical concepts apply to machine learning, and how linear algebra is used to fit a linear regression. We will also use Python to explore ideas in calculus and model-fitting, using a combination of libraries and from-scratch approaches.
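As an illustration of how linear algebra fits a linear regression, the sketch below solves the normal equations (XᵀX)b = Xᵀy for a single feature, from scratch. It is not taken from the training materials; the function name and the sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Solve the 2x2 normal equations (X^T X) b = X^T y for intercept and slope."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]; solve by Cramer's rule.
    det = n * sxx - sx * sx
    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Illustrative data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With more than one feature the same equations apply, but the system is larger and is usually solved with a library routine rather than by hand.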

9:00 am - 11:00 am Machine Learning

Data Wrangling with Python

Sheamus McGovern CEO and Software Architect, Data Engineer, and AI expert at ODSC

Data wrangling is the cornerstone of any data-driven project, and Python stands as one of the most powerful tools in this domain. In preparation for the ODSC conference, our specially designed course on “Data Wrangling with Python” offers attendees a hands-on experience to master the essential techniques. From cleaning and transforming raw data to making it ready for analysis, this course will equip you with the skills needed to handle real-world data challenges. As part of a comprehensive series leading up to the conference, this course not only lays the foundation for more advanced AI topics but also aligns with the industry’s most popular coding language. Upon completion of this short course attendees will be fully equipped with the knowledge and skills to manage the data lifecycle and turn raw data into actionable insights, setting the stage for advanced data analysis and AI applications.

11:30 am - 1:30 pm Data Visualization

A Practical Introduction to Data Visualization for Data Scientists

Robert Kosara Data Visualization Developer at Observable

How does data visualization work, and what can it do for you? In this workshop, data visualization researcher and developer Robert Kosara will teach you the basics of how and why to visualize data, and show you how to create interactive charts using open-source tools. You'll learn:
- the fundamental building blocks of data visualization: visual variables, data mappings, etc.
- the difference between continuous and categorical data, and what it means for data visualization and the use of color
- what grammars of graphics are (the 'gg' in 'ggplot'!) and how they help make more interesting visualizations
- the basic chart types, how they work, and what they're best used for
- a few unusual chart types and when to use them
- how to prepare data for common data visualization tools
- how to build a simple interactive modeling tool that combines observed and modeled data in a single visualization
- when to use common charts vs. when to go for bespoke or unusual visualizations
We'll build all these visualizations using the open-source Observable Plot framework, but the concepts apply similarly to many others (such as ggplot, vega-lite, etc.). To follow along, you'll need a computer with an editor (such as Visual Studio Code) as well as a download of the project we provide (see the prerequisites).

11:30 am - 1:30 pm Machine Learning

Introduction to Machine Learning with Python

Sudip Shrestha, PhD Data Science Lead / Sr. Manager at Asi Government

The "Introduction to Machine Learning with Python" course is designed for those seeking to understand the growing field of Machine Learning (ML), a key driver in today’s data-centric world. This training offers foundational knowledge in ML, emphasizing its importance in various industries for informed decision-making and technological advancements. Participants will learn about different ML types, including supervised and unsupervised learning, and explore the complete lifecycle of an ML model, from data preprocessing to deployment. The course highlights Python’s role in ML, introducing essential tools and libraries for algorithm implementation. A practical component involves hands-on implementation of an ML use case, consolidating theoretical knowledge with real-world application. Ideal for beginners, this course provides a comprehensive yet concise introduction to ML, equipping attendees with the skills to apply ML concepts effectively in diverse scenarios.

Machine Learning

Data Primer Course (Self-paced)

ODSC Instructor

Data is the essential building block of Data Science, Machine Learning and AI. This course is the first in the series and is designed to teach you the foundational skills and knowledge required to understand, work with, and analyze data. It covers topics such as data collection, organization, profiling, and transformation as well as basic analysis. The course is aimed at helping people begin their AI journey and gain valuable insights that we will build up in subsequent SQL, programming, and AI courses.

Machine Learning

Introduction to R

ODSC Instructor

Dive into the world of R programming in this interactive workshop, designed to hone your data analysis and visualization skills. Begin with a walkthrough of the Colab interface, understanding cell manipulation and library utilization. Explore core R data structures like vectors, lists, and data frames, and learn data wrangling techniques to manipulate and analyze datasets. Grasp the basics of programming with iterations and function applications, transitioning into Exploratory Data Analysis (EDA) to derive insights from your data. Discover data visualization using ggplot2, unveiling the stories hidden within data. Lastly, get acquainted with RStudio, the robust Integrated Development Environment, enhancing your R programming journey. This workshop is your gateway to mastering R, catering to both novices and seasoned programmers.

Machine Learning

Introduction to AI (Self-paced)

ODSC Instructor

This AI literacy course is designed to introduce participants to the basics of artificial intelligence (AI) and machine learning. We will first explore the various types of AI and then progress to understand fundamental concepts such as algorithms, features, and models. We will study the machine learning workflow and how it is used to design, build, and deploy models that can learn from data to make predictions. This will cover model training and types of machine learning, including supervised and unsupervised learning, as well as some of the most common models such as regression and k-means clustering. Upon completion, individuals will have a foundational understanding of machine learning and its capabilities and be well-positioned to take advantage of introductory-level hands-on training in machine learning and data science such as ODSC East’s Mini-Bootcamp.

Natural Language Processing

Introduction to NLP

ODSC Instructor

Welcome to the Introduction to NLP workshop! In this workshop, you will learn the fundamentals of Natural Language Processing. From tokenization and stop word removal to advanced topics like deep learning and large language models, you will explore techniques for text preprocessing, word embeddings, classic machine learning, and cutting-edge NLP methods. Get ready to dive into the exciting world of NLP and its applications!

Large Language Models

Virtual conference

Introduction to Large Language Models

ODSC Instructor

This hands-on course serves as a comprehensive introduction to Large Language Models (LLMs), covering a spectrum of topics from their differentiation from other language models to their underlying architecture and practical applications. It delves into the technical aspects, such as the transformer architecture and the attention mechanism, which are the cornerstones of modern language models. The course also explores the applications of LLMs, focusing on zero-shot learning, few-shot learning, and fine-tuning, which showcase the models’ ability to adapt and perform tasks with limited to no examples. Furthermore, it introduces the concept of flow chaining as a method to generate coherent and extended text, demonstrating its usefulness in tackling token limitations in real-world scenarios such as Q&A bots. Through practical examples and code snippets, participants get hands-on experience in how to harness the power of LLMs across various domains. Using the code notebooks included in this course, participants can code alongside the instructor for hands-on practice with LLMs.

9:30 am - 11:30 am LLMs

Should I Use RAG or Fine-Tuning? Building with Llama 3 and Arctic Embed

Chris Alexiuk Head of LLMs at AI Makerspace | Founding Machine Learning Engineer at Ox
Greg Loughnane Co-Founder & CEO at AI Makerspace

One question we get a lot as we teach students around the world to build, ship, and share production-grade LLM applications is “Should I use RAG or fine-tuning?” The answer is yes. You should use RAG AND fine-tuning, especially if you’re aiming at human-level performance in production. In 2024 you should be thinking about using agents too! To best understand exactly how and when to use RAG and Supervised Fine-Tuning (a.k.a. SFT or just fine-tuning), there are many nuances that we must consider! In this event, we’ll zoom in on prototyping LLM applications and describe how practitioners should think about leveraging the patterns of RAG, fine-tuning, and agentic reasoning. We’ll dive into RAG and how fine-tuned models and agents are typically leveraged within RAG applications. Specifically, we will break down Retrieval Augmented Generation into dense vector retrieval plus in-context learning. With this in mind, we’ll articulate the primary forms of fine-tuning you need to know, including task training, constraining the I-O schema, and language training in detail. Next, we’ll demystify the language behind the oft-confused terms agent, agent-like, and agentic by describing the simple meta-pattern of reasoning-action and its fundamental roots in if-then thinking. Finally, we’ll provide an end-to-end domain-adapted RAG application to solve a use case. All code will be demoed live, including what is necessary to build our RAG application with LangChain v0.1 and to fine-tune an open-source embedding model from Hugging Face!
You’ll learn:
- RAG and fine-tuning are not alternatives, but rather two pieces to the puzzle
- RAG, fine-tuning, and agents are not specific things. They are patterns.
- How to build a RAG application using fine-tuned domain-adapted embeddings
Who should attend the event?
- Any GenAI practitioner who has asked themselves “Should I use RAG or fine-tuning?”
- Aspiring AI Engineers looking to build and fine-tune complex LLM applications
- AI Engineering leaders who want to understand the primary patterns for GenAI prototypes
Module 1: The Patterns of GenAI
We will break down Retrieval Augmented Generation into dense vector retrieval plus in-context learning. With this in mind, we’ll articulate the primary forms of fine-tuning you need to know, including task training, constraining the I-O schema, and language training in detail. Finally, we’ll demystify the language behind the oft-confused terms agent, agent-like, and agentic by describing the simple meta-pattern of reasoning-action and its fundamental roots in if-then thinking.
Module 2: Building a simple RAG application with LangChain v0.1 and Llama 3
Leveraging LangChain Expression Language and LangChain v0.1, we’ll build a simple RAG prototype using OpenAI’s GPT 3.5 Turbo, OpenAI’s text-embedding-3-small, and a FAISS vector store!
Module 3: Fine-Tuning an Open-Source Embedding Model
Leveraging Quantization via the bitsandbytes library, Low Rank Adaptation (LoRA) via the Hugging Face PEFT library, and the Massive Text Embedding Benchmark leaderboard, we’ll adapt the embedding space of our off-the-shelf model (Arctic Embed) to a particular domain!
Module 4: Constructing a Domain-Adapted RAG System
In the final module, we’ll assemble our domain-adapted RAG system, and discuss where we might leverage agentic reasoning if we kept building the system in the future!

9:30 am - 11:30 am Generative AI

Generative AI

Leonardo De Marchi VP of Labs at Thomson Reuters

Creativity is no longer exclusive to humans. This workshop is designed to explore how artificial intelligence can be used to generate creative outputs and to inspire technical audiences to use their skills in new and creative ways. The workshop will also include a series of code exercises designed to give participants hands-on experience working with AI models to generate creative outputs. Some of the exercises we will cover include:
- Generating poetry using NLP models like LSTM and Transformer.
- Creating digital art using computer vision models like Deep Dream and StyleGAN.
- Generating music using GANs and other AI models.
- Using reinforcement learning to generate creative outputs that match certain criteria or goals.
Overall, this workshop is ideal for technical audiences who are interested in exploring the creative possibilities of artificial intelligence. Participants should have a basic understanding of machine learning concepts and be comfortable coding in Python. Join us at odsc.com to discover new ways of using AI to create, innovate and inspire!
We will cover a variety of topics related to creativity in AI, including:
- Introduction to Creativity in AI: An overview of the different types of AI models and how they can be used to generate creative outputs.
- Natural Language Processing (NLP) for Creativity: A deep dive into how NLP can be used to generate creative outputs like poetry, song lyrics, and prose.
- Computer Vision for Creativity: How computer vision can be used to generate creative outputs like art and graphic design.
- Reinforcement Learning for Creativity: How reinforcement learning can be used to train AI models to generate creative outputs that match certain criteria or goals.
- Ethical and Legal Considerations in AI: The ethical implications of using AI to generate creative outputs and how to ensure that these models are used responsibly and ethically.
Tools: We will use OpenAI Gym to try our RL algorithms. OpenAI is a non-profit organization that wants to open source all their research on Artificial Intelligence. To foster innovation, OpenAI created a virtual environment, OpenAI Gym, where it’s easy to test Reinforcement Learning algorithms. In particular, we will look at popular techniques like Multi-Armed Bandit, SARSA and Q-Learning with practical Python examples.
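For readers new to the techniques named above, the following is a minimal sketch of the tabular Q-Learning and SARSA update rules in plain Python. It does not use OpenAI Gym and is not the workshop's code; the two-state table, the learning rate, and the discount factor are assumptions for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in next_state."""
    best_next = max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])

def sarsa_update(q, state, action, reward, next_state, next_action, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action the agent actually takes next."""
    target = reward + gamma * q[next_state][next_action]
    q[state][action] += alpha * (target - q[state][action])

def make_table():
    # Hypothetical Q-table: two states, two actions each.
    return {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 2.0}}

q_off = make_table()
q_learning_update(q_off, "s0", "right", reward=1.0, next_state="s1")

q_on = make_table()
sarsa_update(q_on, "s0", "right", reward=1.0, next_state="s1", next_action="left")

print(q_off["s0"]["right"], q_on["s0"]["right"])
```

The only difference between the two rules is the bootstrap term: Q-Learning uses the maximum value in the next state, while SARSA uses the value of the action the policy actually chose.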

12:00 pm - 2:00 pm Data Visualization

Visualization in Bayesian Workflow Using Python or R

Clinton Brownley, PhD Lead Data Scientist at Tala

Visualization can be a powerful tool to help you build better statistical models. In this tutorial, you will learn how to create and interpret visualizations that are useful in each step of a Bayesian regression workflow. A Bayesian workflow includes the three steps of (1) model building, (2) model interpretation, and (3) model checking/improvement, along with model comparison. Visualization is helpful in each of these steps – generating graphical representations of the model and plotting prior distributions aid model building, visualizing MCMC diagnostics and plotting posterior distributions aid interpretation, and plotting posterior predictive, counterfactual, and model comparisons aid model checking/improvement.

2:10 pm - 4:10 pm Generative AI

Generative AI, AI Agents, and AGI - How New Advancements in AI Will Improve the Products We Build

Martin Musiol Co-Founder and Instructor at Generative AI.net | Principal Data Science Manager at Infosys Consulting

This session is tailored for professionals seeking to master the fundamentals of generative AI. Our training covers a comprehensive range of topics, from the basics of text generation using advanced language models to the intricacies of image and 3-D object generation. Attendees will gain hands-on experience with cutting-edge tools, empowering them to become ten times more productive in their roles. A key component of our training is the exploration of autonomous agents. Participants will learn not only how these agents perform various tasks autonomously but also how to build one from the ground up. This segment paves the way to understanding the trajectory towards artificial general intelligence (AGI), a frontier in AI research. This session does not require prior experience in AI, making it accessible to a broad audience. However, it promises maximum knowledge gain, equipping attendees with practical skills and theoretical knowledge. By the end of the session, participants will be able to apply these insights directly to their roles, enhancing their contribution to the AI domain and their respective industries. It will be a comprehensive learning experience, ensuring attendees leave with a profound understanding of generative AI and its applications.

Machine Learning

SQL Primer Course (Self-paced)

ODSC Instructor

This SQL coding course teaches students the basics of Structured Query Language, which is a standard programming language used for managing and manipulating data and an essential tool in AI. The course covers topics such as database design and normalization, data wrangling, aggregate functions, subqueries, and join operations, and students will learn how to design and write SQL code to solve real-world problems. Upon completion, students will have a strong foundation in SQL and be able to use it effectively to extract insights from data. The ability to effectively access, retrieve, and manipulate data using SQL is essential for data cleaning, pre-processing, and exploration, which are crucial steps in any data science or machine learning project. Additionally, SQL is widely used in industry, making it a valuable skill for professionals in the field. This course builds upon the earlier data course in the series.

Prompt Engineering

Prompt Engineering Fundamentals (self-paced)

ODSC Instructor

This workshop on Prompt Engineering explores the pivotal role of prompts in guiding Large Language Models (LLMs) like ChatGPT to generate desired responses. It emphasizes how prompts provide context, control output style and tone, aid in precise information retrieval, offer task-specific guidance, and ensure ethical AI usage. Through practical examples, participants learn how varying prompts can yield diverse responses, highlighting the importance of well-crafted prompts in achieving relevant and accurate text generation. Additionally, the workshop introduces temperature control to balance creativity and coherence in model outputs, and showcases LangChain, a Python library, to simplify prompt construction. Participants are equipped with practical tools and techniques to harness the potential of prompt engineering effectively, enhancing their interaction with LLMs across various contexts and tasks.
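The temperature control mentioned above can be sketched as a softmax over token scores divided by a temperature value. This is a generic illustration in plain Python, not the workshop's notebook or any provider's API; the logits and function name are assumptions.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At low temperature most of the probability mass sits on the top-scoring token, which favors coherence. At high temperature the mass spreads out, which makes sampled text more varied.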

Large Language Models

Build a Question & Answering Bot (self-paced)

ODSC Instructor

The workshop notebook delves into building a Question and Answering Bot based on a fixed knowledge base, covering the integration of concepts discussed in earlier notebooks about LLMs (Large Language Models) and prompting. Initially, it introduces a high-level architecture focusing on vector search, a method to retrieve similar items based on vector representations. The notebook explains the steps involved in vector search including vector representation, indexing, querying, similarity measurement, and retrieval, detailing various technologies used for vector search such as vector libraries, vector databases, and vector plugins. The example utilizes an open-source vector database, Chroma, to index data and uses State of the Union text data for the exercise. The notebook then transitions into the practical implementation, illustrating how text data is loaded, chunked into smaller pieces for effective vector search, and mapped into numeric vectors using the MPNetModel from the SentenceTransformer library via HuggingFace. Following this, the focus shifts to text generation where LangChain Chains are introduced. Chains, as described, allow for more complex applications by chaining several steps and models together into pipelines. A RetrievalQA chain is used to build a Q&A Bot application which utilizes an OpenAI chat model for text generation.
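The vector search steps listed above (representation, indexing, querying, similarity measurement, retrieval) can be sketched end to end in plain Python. This is not the notebook's code and does not use Chroma or SentenceTransformer; a bag-of-words count vector stands in for a real embedding model, and the sample chunks are assumptions.

```python
import math
from collections import Counter

def embed(text):
    """Toy vector representation: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Similarity measurement between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(chunks):
    """Indexing: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def search(index, query, k=1):
    """Querying and retrieval: embed the query, rank chunks by similarity, keep the top k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Hypothetical text chunks, as if produced by splitting a longer document.
chunks = [
    "the economy added jobs this year",
    "we will invest in clean energy",
    "our schools need more teachers",
]
index = build_index(chunks)
top = search(index, "clean energy investment", k=1)
print(top)
```

A vector database replaces the linear scan in `search` with an approximate nearest-neighbor index, and a trained embedding model replaces `embed`, but the flow of data is the same.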

Machine Learning

Data Wrangling with Python (Self-paced)

ODSC Instructor

Data wrangling is the cornerstone of any data-driven project, and Python stands as one of the most powerful tools in this domain. In preparation for the ODSC conference, our specially designed course on “Data Wrangling with Python” offers attendees a hands-on experience to master the essential techniques. From cleaning and transforming raw data to making it ready for analysis, this course will equip you with the skills needed to handle real-world data challenges. As part of a comprehensive series leading up to the conference, this course not only lays the foundation for more advanced AI topics but also aligns with the industry’s most popular coding language. Upon completion of this short course, attendees will be fully equipped with the knowledge and skills to manage the data lifecycle and turn raw data into actionable insights, setting the stage for advanced data analysis and AI applications.

Large Language Models

Retrieval-Augmented Generation (self-paced)

ODSC Instructor

Retrieval-Augmented Generation (RAG) is a powerful natural language processing (NLP) architecture introduced in this workshop notebook. RAG combines retrieval and generation models, enhancing language understanding and generation tasks. It consists of a retrieval component, which efficiently searches vast text databases for relevant information, and a generation component, often based on Transformer models, capable of producing coherent responses based on retrieved context. RAG’s versatility extends to various NLP applications, including question answering and text summarization. Additionally, this notebook covers practical aspects such as indexing content, configuring RAG chains, and incorporating prompt engineering, offering a comprehensive introduction to harnessing RAG’s capabilities for NLP tasks.

Large Language Models

Parameter Efficient Fine tuning (self-paced)

ODSC Instructor

For the next workshop, our focus will be on parameter-efficient fine-tuning (PEFT) techniques in the field of machine learning, specifically within the context of large neural language models like GPT or BERT. PEFT is a powerful approach that allows us to adapt these pre-trained models to specific tasks while minimizing additional parameter overhead. Instead of fine-tuning the entire massive model, PEFT introduces compact, task-specific parameters known as “adapters” into the pre-trained model’s architecture. These adapters enable the model to adapt to new tasks without significantly increasing its size. PEFT strikes a balance between model size and adaptability, making it a crucial technique for real-world applications where computational and memory resources are limited, while still maintaining competitive performance. In this workshop, we will delve into the different PEFT methods, such as additive, selective, re-parameterization, adapter-based, and soft prompt-based approaches, exploring their characteristics, benefits, and practical applications. We will also demonstrate how to implement PEFT using the Hugging Face PEFT library, showcasing its effectiveness in adapting large pre-trained language models to specific tasks. Join us to discover how PEFT can make state-of-the-art language models more accessible and practical for a wide range of natural language processing tasks.
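To give a sense of the parameter savings described above, here is a small arithmetic sketch using LoRA, one common re-parameterization method, as the example. It does not call the Hugging Face PEFT library; the layer size and rank are illustrative assumptions. LoRA freezes a weight matrix W of shape d_out × d_in and trains two small matrices B (d_out × r) and A (r × d_in) whose product is added to W.

```python
def full_finetune_params(d_out, d_in):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable weights for LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

# Assumed sizes for one projection matrix; real models vary.
d_out, d_in, rank = 768, 768, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
ratio = lora / full
print(full, lora, f"{ratio:.2%}")
```

For these assumed sizes, the low-rank pair trains roughly 2% of the weights that full fine-tuning of the same matrix would update.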

Large Language Models

LangChain Agents (self-paced)

ODSC Instructor

The “LangChain Agents” workshop delves into the “Agents” component of the LangChain library, offering a deeper understanding of how LangChain integrates Large Language Models (LLMs) with external systems and tools to execute actions. This workshop builds on the concept of “chains,” which can link multiple LLMs to tackle various tasks like classification, text generation, code generation, and more. “Agents” enable LLMs to interact with external systems and tools, making informed decisions based on available options. The workshop explores the different types of agents, such as “Zero-shot ReAct,” “Structured input ReAct,” “OpenAI Functions,” “Conversational,” “Self ask with search,” “ReAct document store,” and “Plan-and-execute agents.” It provides practical code examples, including initializing LLMs, defining tools, and creating agents, and demonstrates how these agents can answer questions using external APIs, offering participants a comprehensive overview of LangChain’s agent capabilities.

Large Language Models

Fine Tuning an Existing LLM (self-paced)

ODSC Instructor

The workshop explores the process of fine-tuning Large Language Models (LLMs) for Natural Language Processing (NLP) tasks. It highlights the motivations for fine-tuning, such as task adaptation, transfer learning, and handling low-data scenarios, using a Yelp Review dataset. The notebook employs the HuggingFace Transformers library, including tokenization with AutoTokenizer, data subset selection, and model choice (BERT-based model). Hyperparameter tuning, evaluation strategy, and metrics are introduced. It also briefly mentions DeepSpeed for optimization and Parameter Efficient Fine-Tuning (PEFT) for resource-efficient fine-tuning, providing a comprehensive introduction to fine-tuning LLMs for NLP tasks.

Large Language Models

Fine Tuning Embedding Models (self-paced)

ODSC Instructor

This workshop explores the importance of fine-tuning Language and Embedding Models (LLMs). It highlights how embedding models are used to map natural language to vectors, crucial for pipelines with multiple models to adapt to specific data nuances. An example demonstrates fine-tuning an embedding model for legal text. The notebook discusses existing solutions and hardware considerations, emphasizing GPU usage for large data. The practical part of the notebook shows the fine-tuning process of the “distilroberta-base” model from the SentenceTransformer library. It utilizes the QQP_triplets dataset from Quora for training, designed around semantic meaning. The notebook prepares the data, sets up a DataLoader, and employs Triplet Loss to encourage the model to map similar data points closely while distancing dissimilar ones. It concludes by mentioning the training duration and resources needed for further improvements.
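The Triplet Loss described above can be written in a few lines. This is a generic plain-Python sketch, not the SentenceTransformer implementation; the Euclidean distance, the margin of 1.0, and the two-dimensional vectors are assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # distance 1.0 from the anchor
far_negative = [3.0, 4.0]   # distance 5.0 from the anchor
near_negative = [0.0, 1.5]  # distance 1.5 from the anchor

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
print(easy, hard)
```

Only the triplet with the nearby negative produces a non-zero loss, so training effort goes to the cases where dissimilar items still sit too close to the anchor.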

Prompt Engineering

Prompt Engineering with OpenAI (self-paced)

ODSC Instructor

This workshop on prompt engineering with OpenAI discusses best practices for utilizing OpenAI models. We will review how to separate instructions and context using special characters, which helps improve instruction clarity and context isolation and enhances control over the generation process. The workshop also includes code for installing the langchain library and demonstrates how to create prompts effectively, emphasizing the importance of clarity, specificity, and precision in prompts. Additionally, the workshop shows how to craft prompts for specific tasks, such as extracting entities from text. It provides templates for prompts and highlights the significance of specifying the desired output format through examples for improved consistency and customization. Lastly, the workshop addresses the importance of using prompts as safety guardrails. It introduces prompts to mitigate hallucination and jailbreaking risks by instructing the model to generate well-supported and verifiable information, thereby promoting responsible and ethical use of language models.
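As a rough illustration of separating instructions from context with special characters, the sketch below assembles a prompt string in plain Python. It is not the workshop's template and makes no API call; the triple-quote delimiter, the function name, the guardrail wording, and the sample text are all assumptions.

```python
def build_prompt(instruction, context, delimiter='"""'):
    """Put the instruction first, then fence the context so the two cannot be confused."""
    return (
        f"{instruction.strip()}\n\n"
        f"Text: {delimiter}\n{context.strip()}\n{delimiter}\n\n"
        "If the answer is not supported by the text, reply \"I don't know\"."
    )

prompt = build_prompt(
    "Extract all company names from the text below as a comma-separated list.",
    "Acme Corp signed a deal with Globex last week.",
)
print(prompt)
```

The closing line is a simple guardrail of the kind described above: it asks the model to stay within the supplied text rather than guess.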



As part of the global data science community we value inclusivity, diversity, and fairness in the pursuit of knowledge and learning. We seek to deliver a conference agenda, speaker program, and attendee participation that moves the global data science community forward with these shared goals. Learn more on our code of conduct, speaker submissions, or speaker committee pages.


Open Data Science
One Broadway
Cambridge, MA 02142
