The Vision Transformer (ViT) is a neural network architecture for computer vision (CV) that applies the Transformer architecture to visual data. This post walks through the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. [1], introducing the Vision Transformer model and evaluating its perfo..