Abstract: Multimodal AI is a fast-growing field in which deep neural networks are trained on multiple types of input data simultaneously (e.g. text, image, video, audio). Multimodal models perform better in content understanding applications and are setting new standards for content generation in models such as DALL-E and Stable Diffusion. Building multimodal models is hard. In this session, we share more about multimodal AI, why you should care about it, some of the challenges you might face, and how TorchMultimodal, our new PyTorch domain library, eases the developer experience of building multimodal models.
Bio: Suraj is an ML engineer and developer advocate at Meta AI. In a previous life, he was a data scientist in personal finance. After being bitten by the deep learning bug, he worked in healthcare research (predicting patient risk factors) and behavioral finance (preventing overly risky trading). Outside of work, you can find him hiking barefoot in the Catskills or being tossed on the Aikido mat.