Trial, Error, Triumph: Lessons Learned Using LLMs for Creating Machine Learning Training Data


We've all been in situations where we'd like to build a model but lack the labeled training data to do so. I plan to discuss how the advent of Large Language Models (LLMs) like GPT-4 has opened new avenues for generating training data. Traditionally, the creation of NLP datasets relied heavily on manual, crowdsourced hand-labeling, often through platforms like Mechanical Turk. This approach, while effective, presented significant challenges in terms of cost, time, and scalability.

In this talk, I will share a comprehensive narrative of our journey from initial trials and errors to eventual triumphs in using LLMs for NLP data generation. The shift from manual to AI-assisted data creation marks a pivotal change in how we approach NLP model training. My team and I navigated through various challenges, experimenting with different strategies and learning valuable lessons along the way.

I will discuss how we harnessed the power of LLMs to generate vast amounts of diverse, nuanced data, significantly reducing the time and cost compared to traditional methods. The talk will cover practical insights into fine-tuning these models for specific domains, ensuring data quality, and avoiding common pitfalls such as biases and overfitting.

Moreover, I will highlight how LLMs can be creatively used to simulate real-world scenarios, providing richer and more contextually relevant training data. This not only improves the performance of traditional NLP models but also opens up possibilities for exploring new problem spaces within NLP.

Attendees will leave with a deeper understanding of the potential and limitations of using LLMs in NLP data generation. They will gain actionable insights and strategies that can be applied in their own NLP projects, accelerating their journey from trial to triumph in the realm of AI-powered data science.


Matt Dzugan is the director of data at Muck Rack, the software platform enabling thousands of organizations including Google, Golin and Duolingo to build trust, tell their stories and demonstrate the unique value of earned media. In this role, he oversees teams of engineers furthering data insight and delivery architectures across Muck Rack’s various AI-powered data platforms. Additionally, he manages customer activity and workflow data to assist PR teams in harnessing data effectively to maximize outputs using AI. Before joining Muck Rack, Matt was the director of data science at project44, a supply chain technology company, where he dedicated two years to constructing and broadening the organization’s Data Science department. Prior to this role, he held the position of data science manager at Uptake Technologies and worked as a systems engineer at The Boeing Company. Matt holds master’s and bachelor’s degrees in Electrical Engineering and Computer Science from Northwestern University. He lives in Chicago.

Open Data Science
