Data Science, Meet Data Mesh: What We Can Learn from Bioinformatics about the Power of Standardization in Distributed Systems

Abstract: 

Data science is hard enough without data collection and ETL issues making it more difficult. Data mesh architectures are a significant advance in the way we work with data. They provide a decentralized model built on domain-oriented data ownership, data as a product, self-service data access, and federated governance. To a data warehouse or data lake developer who knows how hard it is to get useful, reliable data into the hands of users, this may sound like an unattainable ideal. It’s not. Bioinformatics, which is essentially data science for genetics and proteomics, is built on a data sharing and stewardship model with features similar to those of a data mesh architecture. Recent advances in vaccine development would not have occurred at the pace they did without the well-established data-sharing model of bioinformatics.

This presentation argues for the large-scale adoption of data mesh principles to advance data science. Specifically, there is a need for domain-specific data standards, including well-defined data structures for key entities in the domain and metadata to support particular use cases. Examples will demonstrate how bioinformaticians create data pipelines that draw from data sources on gene (GenBank) and protein (UniProt) sequences, protein structures (Protein Data Bank), gene expression (Expression Atlas), bioactive molecules (ChEMBL), and metabolic and signaling pathways (KEGG Pathway Database). We will also review an example metadata standard for human pathogen genomic sequences and describe why domain-specific metadata is needed in addition to common metadata standards. The talk will conclude with tips on how to create data products within a data mesh architecture.

Background Knowledge
Attendees do not need to be familiar with the bioinformatics databases and tools mentioned in the abstract.

Bio: 

Dan Sullivan is a Principal Data Architect at 4 Mile Analytics. He has a PhD in Genetics, Bioinformatics and Computational Biology and is a former research scientist in infectious disease genomics. He is the author of Google Cloud Certification Study Guides for the Professional Data Engineer, Professional Architect, and Associate Cloud Engineer certifications, and an instructor on LinkedIn Learning and Udemy, where he teaches courses on data science, data modeling, and cloud computing.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
