Sculpting Data for ML: The first act of Machine Learning

Abstract: 

In today's machine learning landscape, “data is the new oil”. State-of-the-art ML algorithms can only work their magic on a strong foundation of relevant data. Large volumes of raw data are available on the web, and what we need are the skills to identify and extract meaningful datasets from it. This talk presents the power of the most fundamental aspect of machine learning, dataset curation, which often does not get its due limelight. It also walks the audience through the process of constructing good-quality datasets as done in formal settings, using a simple hands-on Python example (depending on the audience and session format). The goal is to instill the importance of data, especially in a usable format, and show the effect it has on building smart learning algorithms.

Session Outline

Introduction (10 minutes)
  Popularity of machine learning and its applications
  Significance of honing dataset-building skills
    Importance in academia: expanding the domains available for research, solving novel problems using ML, leading research efforts in this domain, etc.
    Importance in industry: plenty of raw data but no exact dataset available for training purposes, proactively identifying data to log to solve specific problems, etc.
Finding data source(s) (10 minutes)
  Guided search based on a problem definition: identifying essential data signals
  Unguided search with no problem definition in mind: dealing with ambiguity
  Tips on identifying data sources
Data Extraction: Hands-On Example (30-45 minutes, depending on audience level and time constraints)
  Live Python example implemented in a Jupyter Notebook
  Use of Python tools: BeautifulSoup and Selenium
  Step-by-step process to plan data extraction
  Nitty-gritty details about the tools and the extraction code itself
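
To give a flavor of the extraction step, here is a minimal sketch and not the code used in the session. It assumes a hypothetical page snippet (SAMPLE_HTML) in which each headline sits in a div with class "headline", and it uses the Python standard library's html.parser as a dependency-free stand-in for BeautifulSoup, which offers a more convenient interface for the same idea. The class and function names (HeadlineParser, extract_headlines) are made up for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical page snippet; in practice the HTML would be downloaded first.
SAMPLE_HTML = """
<div class="headline"><a href="/news/1">Local team wins title</a></div>
<div class="headline"><a href="/news/2">Rain expected this weekend</a></div>
<div class="ad"><a href="/promo">Buy now</a></div>
"""


class HeadlineParser(HTMLParser):
    """Collect link text and URL for anchors inside <div class="headline">."""

    def __init__(self):
        super().__init__()
        self.records = []          # extracted rows of the dataset
        self._in_headline = False  # True while inside a headline div
        self._current = None       # record being built for the open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "headline":
            self._in_headline = True
        elif tag == "a" and self._in_headline:
            self._current = {"url": attrs.get("href"), "title": ""}

    def handle_data(self, data):
        # Only keep text that appears inside a headline link.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.records.append(self._current)
            self._current = None
        elif tag == "div":
            self._in_headline = False


def extract_headlines(html):
    """Parse an HTML string and return a list of {"url", "title"} dicts."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.records


records = extract_headlines(SAMPLE_HTML)

# One JSON object per line is a common, easy-to-append dataset format.
dataset_lines = [json.dumps(record) for record in records]
```

The sketch mirrors the planning steps in the outline: pick the signal to keep (headline text and link), find the markup that identifies it, skip everything else (the "ad" block here), and store each record in a structured format.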

Bio: 

Rishabh Misra is a Machine Learning Engineer at Twitter, Inc. He developed a passion for identifying and tackling novel and practical problems using Machine Learning during his research internships at the Indian Institute of Technology Madras, which he further explored during his Master's in Computer Science from the University of California San Diego. He combines his past engineering experiences in designing large-scale systems, working at Amazon and Arcesium (a D.E. Shaw company), and research experiences in Applied Machine Learning to develop distributed Machine Learning relevance systems at Twitter.

Open Data Science

