Abstract: In this talk, we present recent progress on large-scale learning of multimodal video representations. We begin by presenting VideoBERT, a joint model for video and language that repurposes the BERT model for multimodal data. This model achieves state-of-the-art results for zero-shot prediction and video captioning. Next, we introduce Vid2Seq, a model for dense video captioning that takes video and speech as input and simultaneously predicts temporal boundaries and textual descriptions. We then present an approach for video question answering and image captioning that relies on a retrieval-augmented visual language model, which learns to encode world knowledge into a large-scale memory and to retrieve from it to answer knowledge-intensive queries. We show that our approach achieves state-of-the-art results on visual question answering and image captioning. We conclude the presentation with recent work on vision-guided navigation and robot manipulation given language instructions. This work builds on and extends vision-language transformers by integrating action history and predicting actions. We outperform the state of the art on several vision-language navigation benchmarks and on RLBench, a benchmark for robot manipulation.
Bio: Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate in Computer Science from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis on "Local Greyvalue Invariants for Image Matching and Retrieval" received the best thesis award from INPG in 1996. She received the Habilitation degree in 2001 for her thesis entitled "From Image Matching to Learning Visual Models". Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University in 1996--1997. Since 1997 she has held a permanent research position at Inria, where she is a research director. Dr. Schmid is a member of the German National Academy of Sciences, Leopoldina, and a fellow of IEEE and the ELLIS society. She was awarded the Longuet-Higgins prize in 2006, 2014 and 2016 and the Koenderink prize in 2018, both for fundamental contributions in computer vision that have withstood the test of time. She received an ERC advanced grant in 2013, the Humboldt research award in 2015, the Inria & French Academy of Science Grand Prix in 2016, the Royal Society Milner award in 2020 and the PAMI distinguished researcher award in 2021. Dr. Schmid has been an associate editor for IEEE PAMI (2001--2005) and for IJCV (2004--2012), an editor-in-chief for IJCV (2013--2018), a program chair of IEEE CVPR 2005 and ECCV 2012, and a general chair of IEEE CVPR 2015, ECCV 2020 and ICCV 2023. Since 2018 she has held a joint appointment with Google Research.