Abstract: Data scientists collectively spend over half their time collecting, cleaning, transforming, enriching, and organizing data. This work is necessary, but it is not a great use of their time and talents. Much of it happens on data stored in files, often in datasets that contain millions (or even billions) of files. At these scales, the storage and file utilities become a bottleneck in the pipeline: if it takes many minutes just to list the files in a directory, the whole workflow bogs down. In this talk, we’ll show how the RapidFile Toolkit, developed by Pure Storage, harnesses the power of parallel operations. Combined with high-performance NFS on the FlashBlade (which was designed from the ground up for concurrent operations), the RapidFile Toolkit can help accelerate your workflows, speed up gathering data subsets or samples, and keep teams productive at any scale.
Bio: Miroslav is a Field Solution Evangelist focused on AI and Analytics. He has extensive experience in storage systems, workload modeling, system architecture, tuning, and benchmarking. Miroslav loves both learning and teaching, and has translated those passions into helping organizations get value from innovative technologies. He has previously worked with top companies (including Facebook, NetApp, and Sun) and a number of startups, and is proud to share his experience. Much of Miroslav’s career has revolved around technical evangelism and performance, with a focus on system architecture and storage, balancing time with customers and time in the lab. He’s been at this for over 30 years now; his first paid computer gig involved DOS batch scripts the summer before college.