Neha Patki

Co-Founder & Head of Product at DataCebo

    Bio Coming Soon!

    All Sessions by Neha Patki

    Day 2 04/24/2024
    3:30 pm - 4:00 pm

    Metrics & Visualizations for Evaluating Synthetic Data Quality

    Machine Learning

    Synthetic data has shown great promise for solving a variety of problems, like addressing data scarcity for AI and overcoming barriers to data access. But the field of synthetic data generation is still extremely nascent, and the community has not yet converged on a common set of benchmarks for evaluating synthetic data quality. Our team originally came from MIT's Data-to-AI Lab, and we've spent years researching and collecting the best metrics for evaluating synthetic data quality, such as CategoricalCAP, Boundary Adherence, and more.

    Learning Objectives

    - Learn the basic approach to evaluating synthetic data: comparing its columns against your original data.
    - Most data in organizations and businesses is structured, relational, and tabular. Learn about the unique problems that synthetic data generation can solve, based on our experience helping thousands of individuals work with synthetic data.
    - Choosing the right synthetic data quality metrics isn't easy and is tied closely to the goal of your project. We'll showcase our recommended framework, which combines the context and expertise of domain experts with specific, interpretable statistical measures.
    - Learn which metrics and visualizations to use for each data type.
    - Learn the most common pitfalls and mistakes people make when generating synthetic data.

    Takeaways

    - Statistical measures are necessary but insufficient for evaluating synthetic data. Domain expertise is important for defining the business rules your data should follow, independent of the quality score itself.
    - Side-by-side visualizations of quality scores can help communicate synthetic data quality to your stakeholders and collaborators.
    - The goals of a project play a big role in how you evaluate the quality of synthetic data.
    - When evaluating synthetic data, avoid common statistical pitfalls. For example, it's tempting to rely on correlation between columns in the original and synthetic data, but the linearity assumption is often violated.

    Tools

    Plotly and SDMetrics, both completely open source (MIT licensed). Examples of visualizations we'll showcase are here and here.
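
    To make the column-comparison approach concrete, here is a minimal sketch using the open-source SDMetrics library named in the Tools section. It is not taken from the session itself: it computes Boundary Adherence, one of the metrics mentioned above, on a single numerical column, and the data and column name below are hypothetical.

```python
# Minimal sketch: scoring one synthetic column with SDMetrics.
# The data is made up for illustration; only the library and the
# metric name come from the session description.
import pandas as pd
from sdmetrics.single_column import BoundaryAdherence

# Hypothetical "age" column from the real and synthetic tables.
real_ages = pd.Series([23, 35, 41, 29, 55, 62, 38], name="age")
synthetic_ages = pd.Series([25, 33, 47, 30, 58, 61, 40], name="age")

# Boundary Adherence: the fraction of synthetic values that fall
# within the [min, max] range observed in the real data.
# A score of 1.0 means every synthetic value stays inside the
# boundaries of the real column.
score = BoundaryAdherence.compute(real_data=real_ages,
                                  synthetic_data=synthetic_ages)
print(f"Boundary Adherence: {score:.2f}")  # 1.00 for the values above
```

    SDMetrics also aggregates metrics like this into overall quality reports with Plotly-based side-by-side visualizations, which is the kind of stakeholder-facing view the takeaways describe.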
