From Zero to 100: Lakehouse Architecture for a Privacy Focused Search Engine

Abstract: 

Being data-driven improves decision-making outcomes and enables automation, but building data-driven tooling and culture is complex and challenging, especially for startups with limited resources. We will discuss how we created an analytics platform from scratch at [you.com](http://you.com/) that protects user privacy while driving decision-making across the organization.

The amount of data created daily is rising exponentially, and harnessing that data effectively and ethically is crucial for success in today’s world. We’ll talk about automatic data collection with privacy constraints and the infrastructure setup for data ingestion (Kafka), persistence (Delta Lake, Azure Databricks Lakehouse), and processing (Spark Batch, Spark SQL, Spark Streaming). We’ll walk through the lessons learned in bringing large volumes of data into a single platform for data analytics. Manual ETL processes that once took weeks can now be automated in 10 minutes or less. A deeper understanding of the data was cultivated using a rich data taxonomy borrowed from the medallion architecture (raw, refined, dataset). With this new architecture, You.com built the analytics and experimentation platform to drive adoption of, and confidence in, data-centric decision-making.
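
As a rough illustration of the raw, refined, dataset taxonomy mentioned above, here is a minimal sketch in plain Python. It is not the You.com pipeline: the event schema, the `PII_FIELDS` set, and the helper names `to_refined` and `to_dataset` are all hypothetical, and in-memory lists stand in for the Kafka, Delta Lake, and Spark components named in the abstract.

```python
from collections import Counter

# Hypothetical PII field names; the abstract does not describe the real schema.
PII_FIELDS = {"ip_address", "user_agent", "email"}


def to_refined(raw_events):
    """Raw -> refined: skip malformed events and strip PII fields."""
    refined = []
    for event in raw_events:
        if "event_type" not in event or "date" not in event:
            continue  # malformed record, leave it out of the refined tier
        refined.append({k: v for k, v in event.items() if k not in PII_FIELDS})
    return refined


def to_dataset(refined_events):
    """Refined -> dataset: aggregate event counts per (date, event_type)."""
    counts = Counter((e["date"], e["event_type"]) for e in refined_events)
    return [
        {"date": date, "event_type": event_type, "count": n}
        for (date, event_type), n in sorted(counts.items())
    ]


# Raw tier: events as they might arrive from ingestion (made-up sample data).
raw = [
    {"date": "2023-01-01", "event_type": "search", "ip_address": "203.0.113.7"},
    {"date": "2023-01-01", "event_type": "search", "email": "a@example.com"},
    {"date": "2023-01-01", "event_type": "click"},
    {"ip_address": "203.0.113.9"},  # malformed: no date or event_type
]

refined = to_refined(raw)
dataset = to_dataset(refined)
```

The point of the layering is that privacy constraints are applied once, at the raw-to-refined boundary, so every downstream aggregate is built only from fields that have already been filtered.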

Bio: 

Zairah is a Data Scientist at you.com, the AI search engine, where she leverages her expertise in statistical and machine-learning techniques to build analytics and experimentation platforms. She recently spoke at NeurIPS 2022 and shared her expertise on data-driven decision-making in a privacy-focused AI-first startup. Previously, Zairah was a Data Scientist at IBM Research, researching Natural Language Processing (NLP) and AI Fairness topics. She has published research and holds patents in these domains. Zairah obtained her M.S. in Computer Science from the University of Pennsylvania, where she researched scikit-learn model performance. Her findings have since been used as guidelines for applying machine learning to supervised classification tasks. Zairah has published her work in top AI conferences such as AAAI and has over 300 citations. Aside from work, Zairah enjoys adventure sports and poetry.

Open Data Science

