Abstract: Data science is hard enough without data collection and ETL issues making it more difficult. Data mesh architectures are a significant advance in the way we work with data. They provide a decentralized model for domain-oriented data ownership, data as a product, self-service data access, and federated governance. To a data warehouse or data lake developer experienced in the challenges of getting useful and reliable data into the hands of users, this may sound like an unobtainable ideal. It's not. Bioinformatics, which is essentially data science for genetics and proteomics, has been built on a data sharing and stewardship model with features similar to those of a data mesh architecture. Recent advances in vaccine development would not have occurred at the pace they did without the well-established data-sharing model of bioinformatics.
This presentation argues for the large-scale adoption of data mesh principles to advance data science. Specifically, there is a need for domain-specific data standards, including well-defined data structures for key entities in the domain and metadata to support particular use cases. Examples will demonstrate how bioinformaticians create data pipelines that draw from data sources about gene (GenBank) and protein (UniProt) sequences, protein structures (Protein Data Bank), gene expression (Expression Atlas), bioactive molecules (ChEMBL), and metabolic and signaling pathways (KEGG Pathway Database). We will also review an example metadata standard for human pathogen genomic sequences and describe why domain-specific metadata is needed in addition to common metadata standards. The talk will conclude with tips on how to create data products within a data mesh architecture.
Attendees do not need to be familiar with these tools.
Bio: Dan Sullivan is a Principal Data Architect at 4 Mile Analytics. He has a PhD in Genetics, Bioinformatics and Computational Biology and is a former research scientist in infectious disease genomics. He is the author of Google Cloud Certification Study Guides for the Professional Data Engineer, Professional Architect, and Associate Cloud Engineer certifications and an instructor on LinkedIn Learning and Udemy, where he teaches courses on data science, data modeling, and cloud computing.