Abstract: The vision transformer is a recent breakthrough in the area of computer vision. While transformer-based models have dominated the field of natural language processing since 2017, CNN-based models continued to demonstrate state-of-the-art performance in vision problems. Last year, a group of researchers from Google figured out how to make a transformer work on image recognition, calling the resulting model the "vision transformer". Follow-up work by the community has demonstrated the superior performance of vision transformers not only in recognition but also in other downstream tasks such as detection, segmentation, multi-modal learning, and scene text recognition, to mention a few.
In this talk, we will develop a deeper understanding of the model architecture of vision transformers. Most importantly, we will focus on the concept of self-attention and its role in vision. Then, we will present different model implementations that use the vision transformer as the main backbone.
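As background for the self-attention discussion, the core operation is scaled dot-product self-attention: every token (an image patch, in a vision transformer) is compared against every other token to produce a weighted mixture of values. A minimal NumPy sketch follows; the function name, dimensions, and random inputs are illustrative placeholders, not code from the talk.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of embeddings.

    x: (n_tokens, d_model) input embeddings (e.g. image patches in a ViT).
    w_q, w_k, w_v: (d_model, d_head) query/key/value projection matrices.
    """
    q = x @ w_q                                # queries
    k = x @ w_k                                # keys
    v = x @ w_v                                # values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (n_tokens, n_tokens)
    # Row-wise softmax: each token attends to all tokens, weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                         # (n_tokens, d_head)

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8            # e.g. 4 image patches
x = rng.normal(size=(n_tokens, d_model))
params = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
out = self_attention(x, *params)
print(out.shape)  # (4, 8)
```

In a full transformer block, this operation is repeated across multiple heads and interleaved with feed-forward layers and residual connections; the sketch shows only the single-head core.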
Since self-attention can be applied beyond transformers, we will also discuss a promising direction in building general-purpose model architectures: networks that can process a variety of data formats such as text, audio, image, and video.
Prerequisites: Working knowledge of deep learning for computer vision.
Bio: Rowel Atienza is a Professor and Scientist at the Electrical and Electronics Engineering Institute of the University of the Philippines, Diliman. He holds the Dado and Maria Banatao Institute Professorial Chair in Artificial Intelligence. He received his MEng from the National University of Singapore for his work on an AI-enhanced four-legged robot. He finished his Ph.D. at The Australian National University for his contributions to the field of active gaze tracking for human-robot interaction. Dr. Atienza is the author of Advanced Deep Learning with TensorFlow 2 and Keras. His current research focuses on robotics, computer vision, and AI.