<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Jon's Coding Life</title>
    <link>https://jonathanlee.tistory.com/</link>
    <description>Code &amp;amp; Coffee with Jon ☕️

Sharing insights, tips, and a bit of humor from the world of Computer Science.</description>
    <language>ko</language>
    <pubDate>Thu, 11 Jun 2026 00:21:23 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>Jonathan Lee</managingEditor>
    <image>
      <title>Jon's Coding Life</title>
      <url>https://tistory1.daumcdn.net/tistory/7340104/attach/7a50fc53bf3c42bc82436a1b03fee587</url>
      <link>https://jonathanlee.tistory.com</link>
    </image>
    <item>
      <title>Vision Transformer: What Is It &amp;amp; How Does It Work?</title>
      <link>https://jonathanlee.tistory.com/1</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Vision Transformer (ViT)&lt;/span&gt;&lt;/b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;, specifically designed for the Computer Vision (CV) field, is an AI architecture that utilizes the Transformer architecture to process visual data.&lt;/span&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;div&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;1024&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/AhjwF/btsKtJd4MBn/4VnX4N4moZMPT23MjE1sp0/img.webp&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/AhjwF/btsKtJd4MBn/4VnX4N4moZMPT23MjE1sp0/img.webp&quot; data-alt=&quot;Figure 1. Abstract Image Explaining ViT, Created by a Transformer-based Model (DALL-E) Itself.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/AhjwF/btsKtJd4MBn/4VnX4N4moZMPT23MjE1sp0/img.webp&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FAhjwF%2FbtsKtJd4MBn%2F4VnX4N4moZMPT23MjE1sp0%2Fimg.webp&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;419&quot; height=&quot;419&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;1024&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 1. Abstract Image Explaining ViT, Created by a Transformer-based Model (DALL-E) Itself.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/div&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;This post walks through the paper &quot;An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale&quot; by Dosovitskiy et al. [1], introducing the concept of the Vision Transformer model and evaluating its performance on CV tasks in detail. A code example of the model is also included.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Table of Contents:&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;1. Background&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; &amp;nbsp; 1-1. What Are Transformers and Self-Attention?&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; &amp;nbsp; 1-2. What is CNN and Why is it So Popular in the Computer Vision Field?&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; &amp;nbsp; 1-3. Attempts to Utilize Transformers in the Computer Vision Field&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;2. Vision Transformer (ViT)&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;&amp;nbsp; &amp;nbsp;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;2-1. Model Overview&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; &amp;nbsp; 2-2. Model Design Details&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; &amp;nbsp; 2-3. Code Implementation&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;&amp;nbsp; &amp;nbsp;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;2-4. Additional Information&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;3. Experiment and Performance Analysis&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;&amp;nbsp; &amp;nbsp;&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;3-1. Setup&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; &amp;nbsp; 3-2. Comparison to the State of the Art&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; &amp;nbsp; 3-3. Limitations&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;4. Conclusion&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;5. References&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;1. Background&lt;/b&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;1-1. What Are Transformers and Self-Attention?&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The Transformer is a neural network architecture based on the multi-head attention mechanism, first introduced in the 2017 paper &quot;Attention Is All You Need&quot; by Vaswani et al. [2]. The architecture first converts text into numerical units known as &quot;tokens&quot; and then analyzes the tokens to understand the context. During this process, important words are weighted more heavily, while less important words are weighted less. The architecture consists of two main parts: an encoder and a decoder. Both use stacked self-attention and point-wise, fully connected layers. The encoder processes the input data, and the decoder generates the output sequence (e.g., a translation of the input text) [2].&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1030&quot; data-origin-height=&quot;1470&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eg0tfj/btsKsuB1WPT/dMKN5g1tCs2ZpkjCLJcHtK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eg0tfj/btsKsuB1WPT/dMKN5g1tCs2ZpkjCLJcHtK/img.png&quot; data-alt=&quot;Figure 1: The Transformer - Model Architecture [2].&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eg0tfj/btsKsuB1WPT/dMKN5g1tCs2ZpkjCLJcHtK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Feg0tfj%2FbtsKsuB1WPT%2FdMKN5g1tCs2ZpkjCLJcHtK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;341&quot; height=&quot;487&quot; data-origin-width=&quot;1030&quot; data-origin-height=&quot;1470&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 1: The Transformer - Model Architecture [2].&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;br /&gt;A key advantage of transformers is that, unlike recurrent models, they do not have to process a sequence one element at a time; all positions can be processed in parallel. This makes the model faster to train and helps it handle much longer texts. ChatGPT is a famous example of a transformer-based model [3].&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Transformers rely on a mechanism called &quot;self-attention&quot; to understand longer texts. For each element in the sequence, self-attention looks at every element, including that element itself, and builds context by analyzing the relationships between them. As a result, self-attention lets transformers capture dependencies between elements that are far apart, so the model can understand the text as a whole [4]. The detailed implementation of the attention mechanism is thoroughly explained in the original paper (&quot;Attention Is All You Need&quot;).&lt;/p&gt;
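&lt;p data-ke-size=&quot;size16&quot;&gt;As a rough illustration of the idea above, the sketch below computes single-head scaled dot-product self-attention in plain Python. To keep it short, it assumes that the queries, keys, and values are the input vectors themselves (a real Transformer first applies learned projection matrices), and the function names are made up for this example.&lt;/p&gt;

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # tokens: list of equal-length vectors, one per sequence element.
    # Assumption: queries, keys, and values are the tokens themselves.
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this element to every element, including itself.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Weighted average of all value vectors.
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(sequence)
print(len(attended), len(attended[0]))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;Every output vector is a mixture of all input vectors, which is how an element can pick up information from positions far away in the sequence.&lt;/p&gt;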
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;1-2. What is CNN and Why is it So Popular in the Computer Vision Field?&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #333333; text-align: start;&quot;&gt;A convolutional neural network (CNN) is a deep learning model built from three main types of layers: convolutional layers, pooling layers, and fully connected layers. In the convolutional layers, the CNN extracts important features using small filters (kernels). In the pooling layers, the feature maps are downsampled so that only the strongest responses are kept, which reduces the computation cost. The fully connected layers combine the extracted features and make the final prediction.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;685&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/r7wvh/btsKtLJJo3B/g5enAIJQ4F1qfaK7EsvsV1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/r7wvh/btsKtLJJo3B/g5enAIJQ4F1qfaK7EsvsV1/img.jpg&quot; data-alt=&quot;Figure 2: CNN - Model Architecture [5].&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/r7wvh/btsKtLJJo3B/g5enAIJQ4F1qfaK7EsvsV1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fr7wvh%2FbtsKtLJJo3B%2Fg5enAIJQ4F1qfaK7EsvsV1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;725&quot; height=&quot;388&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;685&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 2: CNN - Model Architecture [5].&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
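&lt;p data-ke-size=&quot;size16&quot;&gt;To make the convolution and pooling steps concrete, here is a minimal plain-Python sketch. The helper names, the toy image, and the kernel are assumptions chosen for illustration; a real CNN learns its kernels and uses an optimized library.&lt;/p&gt;

```python
def conv2d(image, kernel):
    # Valid cross-correlation: slide the kernel over the image without padding.
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool2x2(feature_map):
    # Keep the strongest response in each non-overlapping 2x2 window.
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2
    return [[max(feature_map[2 * r][2 * c], feature_map[2 * r][2 * c + 1],
                 feature_map[2 * r + 1][2 * c], feature_map[2 * r + 1][2 * c + 1])
             for c in range(cols)]
            for r in range(rows)]

# A 5x5 image whose right part is bright, and a 2x2 vertical-edge kernel.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]
features = conv2d(image, kernel)   # 4x4 feature map, non-zero only at the edge
pooled = max_pool2x2(features)     # 2x2 map after pooling
print(pooled)
```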
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This powerful yet efficient design made the CNN the most widely used model in the CV field, and CNN-based models have achieved state-of-the-art (SOTA) performance [5].&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;1-3. Attempts to Utilize Transformers in the Computer Vision Field&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;There have been many attempts to incorporate transformers into the traditionally CNN-dominated CV field, mainly because of their outstanding performance in NLP. Unlike in NLP, however, applying the self-attention mechanism to every pixel in an image is unrealistic: each pixel would have to attend to every other pixel, so the cost grows quadratically with the number of pixels. Researchers have therefore employed different techniques to exploit the strengths of transformers while avoiding this limitation.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;1. Local Self-Attention:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Parmar et al. applied self-attention to only local neighborhoods for each query pixel [6].&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;2. Sparse Transformers:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Child et al. employed scalable approximations to global self-attention in order to reduce the computation cost and make it applicable to images [7].&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;3. Blocks of Varying Sizes:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Weissenborn et al. applied self-attention in blocks of varying sizes or even &lt;/span&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;only&lt;/span&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt; along individual axes to increase efficiency [8].&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;4. Combining CNN and Self-Attention:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Many researchers experimented with combining CNNs and self-attention by augmenting feature maps or further processing the output of CNN using self-attention.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;2. Vision Transformer (ViT)&lt;/b&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;2-1. Model Overview&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Unlike other attempts, ViT follows the original Transformer design as closely as possible, so that scalable NLP Transformer architectures and their efficient implementations can be used almost right away. ViT divides the image into small patches (instead of relying on traditional CNN filters) and treats the resulting sequence of patches like a sequence of text tokens in NLP.&lt;br /&gt;&lt;br /&gt;Simply put, instead of designing a specialized layer for image processing, ViT makes an image act like a block of text whose words are small patches. This approach allows the Transformer architecture used in NLP to be applied to CV tasks with minimal modifications.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A brief overview of the model:&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1474&quot; data-origin-height=&quot;1026&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cSMAAm/btsKuR4mVci/bVXP2YwNJ6a3Kkx7zxKE20/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cSMAAm/btsKuR4mVci/bVXP2YwNJ6a3Kkx7zxKE20/img.png&quot; data-alt=&quot;Figure 3: Vision Transformer Model Overview. Reillustrated referring to the diagram in the original paper to improve readability.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cSMAAm/btsKuR4mVci/bVXP2YwNJ6a3Kkx7zxKE20/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcSMAAm%2FbtsKuR4mVci%2FbVXP2YwNJ6a3Kkx7zxKE20%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1474&quot; height=&quot;1026&quot; data-origin-width=&quot;1474&quot; data-origin-height=&quot;1026&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 3: Vision Transformer Model Overview. Reillustrated referring to the diagram in the original paper to improve readability.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;br /&gt;1. Divide an image into small, equally sized patches, which act as tokens.&lt;br /&gt;2. Flatten the patches into vectors and embed them linearly.&lt;br /&gt;3. Prepend a learnable classification token to the sequence.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;4. Add positional embeddings to preserve spatial information.&lt;br /&gt;5. Feed the sequence of tokens into the classical Transformer encoder.&lt;br /&gt;6. Attach a classification head to the output of the [class] token to perform the classification task.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&amp;nbsp;&lt;/h4&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;2-2. Model Design Details&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Step 1. Divide the Image into Small Patches&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1514&quot; data-origin-height=&quot;942&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ckS20w/btsKtpBALj1/4zTaHJRVnxy8liVRCNa7nK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ckS20w/btsKtpBALj1/4zTaHJRVnxy8liVRCNa7nK/img.png&quot; data-alt=&quot;Figure 4: Model Diagram of Step 1.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ckS20w/btsKtpBALj1/4zTaHJRVnxy8liVRCNa7nK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FckS20w%2FbtsKtpBALj1%2F4zTaHJRVnxy8liVRCNa7nK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;311&quot; data-origin-width=&quot;1514&quot; data-origin-height=&quot;942&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 4: Model Diagram of Step 1.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Divide the input image into small patches of size $P \times P$, which play the role that tokens play in NLP. The number of patches is $N = HW / P^{2}$, where $(H, W)$ is the resolution of the original image. The number of patches is also the length of the input sequence for the Transformer.&lt;/p&gt;
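&lt;p data-ke-size=&quot;size16&quot;&gt;As a quick worked example, assume a $224 \times 224$ input image and a patch size of $P = 16$ (the $16 \times 16$ patches that the paper's title alludes to; both numbers are chosen here only for illustration): $$N = \frac{HW}{P^{2}} = \frac{224 \cdot 224}{16^{2}} = \frac{50176}{256} = 196$$ Under these assumptions, the Transformer receives a sequence of 196 patch tokens.&lt;/p&gt;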
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Step 2. Create Patch Embeddings&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1300&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/punTj/btsKtkHhRIx/kvtsA3sKMKnsPR0Tzl40bK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/punTj/btsKtkHhRIx/kvtsA3sKMKnsPR0Tzl40bK/img.png&quot; data-alt=&quot;Figure 5: Model Diagram of Step 2.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/punTj/btsKtkHhRIx/kvtsA3sKMKnsPR0Tzl40bK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FpunTj%2FbtsKtkHhRIx%2FkvtsA3sKMKnsPR0Tzl40bK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;311&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1300&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 5: Model Diagram of Step 2.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Flatten each patch into a vector of length $P^{2} \cdot C$ (where $C$ is the number of channels) and map it to $D$ dimensions with a trainable linear projection, because the Transformer uses a constant latent vector size $D$ through all of its layers. The outputs of this projection are called the patch embeddings.&lt;/p&gt;
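&lt;p data-ke-size=&quot;size16&quot;&gt;Steps 1 and 2 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names and toy sizes are made up, the image height and width are assumed to be divisible by the patch size, and the projection matrix is filled with random numbers, whereas in the real model it is learned during training.&lt;/p&gt;

```python
import random

def extract_patches(image, patch):
    # image: H x W x C nested lists; returns N flattened patches,
    # each of length patch * patch * C, in row-major patch order.
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [value
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)
                    for value in image[r][c]]
            patches.append(flat)
    return patches

def project(patches, weights):
    # weights: (patch * patch * C) x D matrix, playing the role of E.
    dim = len(weights[0])
    return [[sum(x * row[d] for x, row in zip(flat, weights))
             for d in range(dim)]
            for flat in patches]

random.seed(0)
H, W, C, P, D = 8, 8, 3, 4, 6   # toy sizes, chosen for illustration
image = [[[random.random() for _ in range(C)] for _ in range(W)]
         for _ in range(H)]
E = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(P * P * C)]

patches = extract_patches(image, P)   # N = H * W / P^2 = 4 patches of length 48
embeddings = project(patches, E)      # 4 patch embeddings of length D = 6
print(len(patches), len(patches[0]), len(embeddings), len(embeddings[0]))
```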
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Step 3. Add [class] Token&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1300&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dGWwDN/btsKto3MIU5/ib81tHmHolHequlBfwPUw0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dGWwDN/btsKto3MIU5/ib81tHmHolHequlBfwPUw0/img.png&quot; data-alt=&quot;Figure 6: Model Diagram of Step 3.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dGWwDN/btsKto3MIU5/ib81tHmHolHequlBfwPUw0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdGWwDN%2FbtsKto3MIU5%2Fib81tHmHolHequlBfwPUw0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;311&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1300&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 6: Model Diagram of Step 3.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Like the BERT model's [class] token, a special learnable embedding is prepended to the sequence of patch embeddings. This token allows the model to summarize the key characteristics of the image during training: as it passes through the layers of the Transformer encoder, it gradually gathers general information about the whole image. After the final encoder layer, layer normalization (LN) is applied, and the output of this token becomes the image representation $\mathbf{y}$ of the original image ($\mathbf{y} = \text{LN}(\mathbf{z}^{0}_{L})$).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Step 4. Add Position Embeddings&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1300&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/PJSAB/btsKtUA2LpH/Oa3Dm7thgfFSZSO4GHhkck/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/PJSAB/btsKtUA2LpH/Oa3Dm7thgfFSZSO4GHhkck/img.png&quot; data-alt=&quot;Figure 7: Model Diagram of Step 4.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/PJSAB/btsKtUA2LpH/Oa3Dm7thgfFSZSO4GHhkck/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FPJSAB%2FbtsKtUA2LpH%2FOa3Dm7thgfFSZSO4GHhkck%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;311&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1300&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 7: Model Diagram of Step 4.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;To retain positional information, position embeddings are added to the patch embeddings.&lt;/span&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt; Although more complex 2D-aware position embeddings exist, standard learnable 1D position embeddings are used because the 2D-aware variants did not yield significant performance gains. The resulting sequence is then fed into the Transformer encoder.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Creating a sequence of embedding vectors $\mathbf{z}_0$ to be fed into the Transformer encoder can be explained simply by using the following equation, introduced in the paper:&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;$$\mathbf{z}_0 = \left[ \mathbf{x}_{\text{class}} ; \mathbf{x}_p^1 \mathbf{E} ; \mathbf{x}_p^2 \mathbf{E} ; \cdots ; \mathbf{x}_p^N \mathbf{E} \right] + \mathbf{E}_{\mathit{pos}}, \quad \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}, \quad \mathbf{E}_{\mathit{pos}} \in \mathbb{R}^{(N+1) \times D}$$&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;The input sequence is created by prepending the [class] token $\mathbf{x}_{\text{class}}$ to the patch embeddings $\mathbf{x}_p^1 \mathbf{E} ; \mathbf{x}_p^2 \mathbf{E} ; \cdots ; \mathbf{x}_p^N \mathbf{E}$ and then adding the position embeddings $\mathbf{E}_{\mathit{pos}}$.&lt;/p&gt;
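&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;The equation above can be read as two list operations, sketched here in plain Python. The function name and the toy values are assumptions for illustration only; in the real model the [class] token and the position embeddings are learned parameters.&lt;/p&gt;

```python
def build_input_sequence(patch_embeddings, class_token, position_embeddings):
    # Prepend the [class] token, then add one position embedding per slot.
    sequence = [class_token] + patch_embeddings
    assert len(sequence) == len(position_embeddings)
    return [[value + offset for value, offset in zip(token, position)]
            for token, position in zip(sequence, position_embeddings)]

N, D = 4, 3   # toy sizes for illustration
patch_embeddings = [[float(n)] * D for n in range(1, N + 1)]
class_token = [0.0] * D                                       # learnable in a real model
position_embeddings = [[0.1 * i] * D for i in range(N + 1)]   # learnable as well

z0 = build_input_sequence(patch_embeddings, class_token, position_embeddings)
print(len(z0), len(z0[0]))
```

&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;The result has $N + 1$ rows of length $D$, matching the shape of $\mathbf{E}_{\mathit{pos}}$ in the equation.&lt;/p&gt;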
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Step 5. Passing Through the Transformer Encoder&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1934&quot; data-origin-height=&quot;1214&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/xnpIR/btsKtgESI15/MeXFm4xKKMtUv7pxgYcYZK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/xnpIR/btsKtgESI15/MeXFm4xKKMtUv7pxgYcYZK/img.png&quot; data-alt=&quot;Figure 8: Model Diagram of Step 5.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/xnpIR/btsKtgESI15/MeXFm4xKKMtUv7pxgYcYZK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxnpIR%2FbtsKtgESI15%2FMeXFm4xKKMtUv7pxgYcYZK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;314&quot; data-origin-width=&quot;1934&quot; data-origin-height=&quot;1214&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 8: Model Diagram of Step 5.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;The processed input is fed into the Transformer encoder, which consists of multiple stacked encoder layers. Each layer contains a multi-head self-attention (MSA) block followed by a multi-layer perceptron (MLP) block. Layer normalization (LN) is applied before each block to stabilize training, and a residual connection adds the input of each block to its output.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1034&quot; data-origin-height=&quot;546&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/sFs1T/btsKtKlbrrg/0a05uC2G2GUfntN17k0cT1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/sFs1T/btsKtKlbrrg/0a05uC2G2GUfntN17k0cT1/img.png&quot; data-alt=&quot;Figure 9: Transformer Encoder Architecture Overview. Reillustrated referring to the diagram in the original paper.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/sFs1T/btsKtKlbrrg/0a05uC2G2GUfntN17k0cT1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FsFs1T%2FbtsKtKlbrrg%2F0a05uC2G2GUfntN17k0cT1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;441&quot; height=&quot;233&quot; data-origin-width=&quot;1034&quot; data-origin-height=&quot;546&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 9: Transformer Encoder Architecture Overview. Reillustrated referring to the diagram in the original paper.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;The Transformer encoder can also be explained using the equations:&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;$$\mathbf{z}'_{\ell} = \text{MSA}(\text{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \quad \ell = 1 \dots L$$ $$\mathbf{z}_{\ell} = \text{MLP}(\text{LN}(\mathbf{z}'_{\ell})) + \mathbf{z}'_{\ell}, \quad \ell = 1 \dots L$$&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;The first equation describes the MSA sub-block of encoder layer $\ell$. The output of layer $\ell-1$ ($\mathbf{z}_{\ell-1}$) is normalized and fed into the MSA block, $\text{MSA}(\text{LN}(\mathbf{z}_{\ell-1}))$. The unnormalized input $\mathbf{z}_{\ell-1}$ is then added to the MSA output (residual connection), giving $\mathbf{z}'_{\ell}$. The second equation applies the same pattern to the MLP block, which consists of two layers with a GELU non-linearity between them.&lt;/span&gt;&lt;/p&gt;
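&lt;p data-ke-size=&quot;size16&quot;&gt;The two equations can be traced with a minimal, dependency-free sketch. It is only an illustration under strong simplifying assumptions: the attention is single-head with identity projections, the MLP is reduced to its GELU non-linearity with no weights, and layer normalization has no learned scale or shift. All function names are hypothetical and not taken from the paper.&lt;/p&gt;
&lt;pre data-ke-type=&quot;codeblock&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;
```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one token vector to zero mean and unit variance (no learned parameters).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def self_attention(tokens):
    # Toy single-head attention: queries, keys and values are the tokens themselves.
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out

def mlp(tokens):
    # Stand-in for the two-layer MLP: applies only the GELU non-linearity.
    return [[gelu(v) for v in tok] for tok in tokens]

def add(a, b):
    # Element-wise sum of two token sequences (the residual connection).
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]

def encoder_block(z_prev):
    normed = [layer_norm(t) for t in z_prev]
    z_mid = add(self_attention(normed), z_prev)  # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
    normed = [layer_norm(t) for t in z_mid]
    return add(mlp(normed), z_mid)               # z_l = MLP(LN(z'_l)) + z'_l

z0 = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.0, -2.0, 1.0]]
z1 = encoder_block(z0)
print(len(z1), len(z1[0]))  # prints: 3 4
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The block keeps the sequence shape unchanged (3 tokens of dimension 4 in, 3 tokens of dimension 4 out), which is what allows the same block to be stacked $L$ times.&lt;/p&gt;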
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Step 6. Output and Classification&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1298&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/F55lz/btsKtRRTBRm/pH1zduwFaIA7JcpMnM2TyK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/F55lz/btsKtRRTBRm/pH1zduwFaIA7JcpMnM2TyK/img.png&quot; data-alt=&quot;Figure 10: Model Diagram of Step 6.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/F55lz/btsKtRRTBRm/pH1zduwFaIA7JcpMnM2TyK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FF55lz%2FbtsKtRRTBRm%2FpH1zduwFaIA7JcpMnM2TyK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;310&quot; data-origin-width=&quot;2092&quot; data-origin-height=&quot;1298&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 10: Model Diagram of Step 6.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;The classification head is attached to the final-layer state of the [class] token, $\mathbf{z}^{0}_{L}$, during both pre-training and fine-tuning. This head maps the [class] token representation to the output categories. It is implemented as an MLP with one hidden layer during pre-training and as a single linear layer during fine-tuning.&lt;/span&gt;&lt;/p&gt;
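&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;In the notation of the encoder equations, the image representation $\mathbf{y}$ that is passed to the classification head can be written as the layer-normalized final state of the [class] token [1]:&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;$$\mathbf{y} = \text{LN}(\mathbf{z}^{0}_{L})$$&lt;/p&gt;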
&lt;h4 style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&amp;nbsp;&lt;/h4&gt;
&lt;h4 style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;2-3. Code Implementation&lt;/b&gt;&lt;/h4&gt;
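&lt;p data-ke-size=&quot;size16&quot;&gt;To see an example PyTorch implementation, the SimpleViT model from the vit-pytorch repository by lucidrains [9], click below. Note that this simplified variant uses fixed 2D sine-cosine position embeddings and mean pooling over the patch tokens instead of the learned position embeddings and [class] token of the original ViT:&lt;/p&gt;
&lt;div data-ke-type=&quot;moreLess&quot; data-text-more=&quot;Show more&quot; data-text-less=&quot;Close&quot;&gt;&lt;a class=&quot;btn-toggle-moreless&quot;&gt;Show more&lt;/a&gt;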
&lt;div class=&quot;moreless-content&quot;&gt;
&lt;pre id=&quot;code_1730395287272&quot; class=&quot;gml&quot; data-ke-type=&quot;codeblock&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;import torch
from torch import nn

from einops import rearrange
from einops.layers.torch import Rearrange

# helpers

def pair(t):
    return t if isinstance(t, tuple) else (t, t)

def posemb_sincos_2d(h, w, dim, temperature: int = 10000, dtype = torch.float32):
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing=&quot;ij&quot;)
    assert (dim % 4) == 0, &quot;feature dimension must be multiple of 4 for sincos emb&quot;
    omega = torch.arange(dim // 4) / (dim // 4 - 1)
    omega = 1.0 / (temperature ** omega)

    y = y.flatten()[:, None] * omega[None, :]
    x = x.flatten()[:, None] * omega[None, :]
    pe = torch.cat((x.sin(), x.cos(), y.sin(), y.cos()), dim=1)
    return pe.type(dtype)

# classes

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )
    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        inner_dim = dim_head *  heads
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.norm = nn.LayerNorm(dim)

        self.attend = nn.Softmax(dim = -1)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x):
        x = self.norm(x)

        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -&amp;gt; b h n d', h = self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)

        out = torch.matmul(attn, v)
        out = rearrange(out, 'b h n d -&amp;gt; b n (h d)')
        return self.to_out(out)

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Attention(dim, heads = heads, dim_head = dim_head),
                FeedForward(dim, mlp_dim)
            ]))
    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return self.norm(x)

class SimpleViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'

        patch_dim = channels * patch_height * patch_width

        self.to_patch_embedding = nn.Sequential(
            Rearrange(&quot;b c (h p1) (w p2) -&amp;gt; b (h w) (p1 p2 c)&quot;, p1 = patch_height, p2 = patch_width),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
            nn.LayerNorm(dim),
        )

        self.pos_embedding = posemb_sincos_2d(
            h = image_height // patch_height,
            w = image_width // patch_width,
            dim = dim,
        ) 

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim)

        self.pool = &quot;mean&quot;
        self.to_latent = nn.Identity()

        self.linear_head = nn.Linear(dim, num_classes)

    def forward(self, img):
        device = img.device

        x = self.to_patch_embedding(img)
        x += self.pos_embedding.to(device, dtype=x.dtype)

        x = self.transformer(x)
        x = x.mean(dim = 1)

        x = self.to_latent(x)
        return self.linear_head(x)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;2-4. Additional Information&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Inductive Bias&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;ViT has much less image-specific inductive bias than CNNs. In a CNN, locality, the 2D neighborhood structure of an image, and translation equivariance are built into every layer of the model. In ViT, only the MLP layers are local and translation-equivariant, while the self-attention layers are global. Therefore, ViT has to learn the spatial information and relational information of patches from scratch.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Hybrid Architecture&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;ViT can also take feature maps from a CNN as input instead of raw image patches. In this hybrid model, the patch embedding projection $\mathbf{E}$ is applied to patches extracted from the CNN feature map to convert them into a form suitable for ViT. As a special case, the patches can have a spatial size of 1x1, in which case the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting each position to the Transformer dimension. The embeddings required for the Transformer are then added as explained above.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Fine-Tuning and Higher Resolution&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;ViT is pre-trained on a large dataset and later fine-tuned on smaller downstream tasks. For fine-tuning, the pre-trained prediction head is removed and replaced with a zero-initialized $D \times K$ feed-forward layer, where $D$ is the hidden dimension and $K$ is the number of classes in the downstream task. If the resolution of the image increases while the patch size remains the same, the effective sequence length increases. In this case, the pre-trained position embeddings no longer match the new grid of patches. To resolve this problem, 2D interpolation of the pre-trained position embeddings is performed, according to their location in the original image.&lt;/span&gt;&lt;/p&gt;
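&lt;p data-ke-size=&quot;size16&quot;&gt;The effect of a higher resolution on the sequence length, and one possible way to resize a grid of position embeddings, can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the 224 resolution is only an example value, bilinear interpolation with aligned corners is an assumption (the text above only says 2D interpolation), and a single scalar per patch stands in for a full embedding vector, so a real model would repeat this for every embedding dimension.&lt;/p&gt;
&lt;pre data-ke-type=&quot;codeblock&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;
```python
def num_patches(image_size, patch_size):
    # Sequence length N = (H / P) * (W / P) for a square image and square patches.
    side = image_size // patch_size
    return side * side

def resize_grid(grid, new_side):
    # Bilinear interpolation of a square grid of scalars to new_side x new_side.
    # Assumes both the old and the new grid have at least 2 cells per side.
    old_side = len(grid)
    scale = (old_side - 1) / (new_side - 1)  # corner points of both grids coincide
    out = []
    for r in range(new_side):
        y = r * scale
        y0 = min(int(y), old_side - 2)  # clamp so that y0 + 1 stays inside the grid
        fy = y - y0
        row = []
        for c in range(new_side):
            x = c * scale
            x0 = min(int(x), old_side - 2)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x0 + 1] * fx
            bottom = grid[y0 + 1][x0] * (1 - fx) + grid[y0 + 1][x0 + 1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

# Same 16x16 patch size, two different resolutions.
print(num_patches(224, 16), num_patches(512, 16))  # prints: 196 1024

# A 2x2 grid of toy position values stretched to 3x3.
print(resize_grid([[0.0, 1.0], [2.0, 3.0]], 3))
# prints: [[0.0, 0.5, 1.0], [1.0, 1.5, 2.0], [2.0, 2.5, 3.0]]
```
&lt;/code&gt;&lt;/pre&gt;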
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;3. Experiment and Performance Analysis&lt;/b&gt;&lt;/h3&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;3-1. Setup&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Dataset&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;ViT was pre-trained on large datasets such as ImageNet-21k, which has 14M images and 21K classes, and JFT, which has 303M high-resolution images and 18K classes. Experiments showed that ViT performs extremely well when pre-trained on a large dataset, especially JFT-300M. This is because ViT has less image-specific inductive bias than CNNs. Since ViT has to learn the spatial information and relational information of patches from scratch, it requires a large amount of data to learn the image structure well. Pre-training ViT on large datasets and then fine-tuning it on smaller benchmark datasets, such as ImageNet and CIFAR-100, gave high performance, comparable to state-of-the-art (SOTA) CNNs.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;750&quot; data-origin-height=&quot;448&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OHTq7/btsKuQLc2v9/45KTLwqd3wxJKtzXVxiTNK/img.webp&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OHTq7/btsKuQLc2v9/45KTLwqd3wxJKtzXVxiTNK/img.webp&quot; data-alt=&quot;Figure 11: Visualization of the ImageNet Dataset in the Deep Lake UI [10].&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OHTq7/btsKuQLc2v9/45KTLwqd3wxJKtzXVxiTNK/img.webp&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOHTq7%2FbtsKuQLc2v9%2F45KTLwqd3wxJKtzXVxiTNK%2Fimg.webp&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;541&quot; height=&quot;323&quot; data-origin-width=&quot;750&quot; data-origin-height=&quot;448&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 11: Visualization of the ImageNet Dataset in the Deep Lake UI [10].&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Model Variants&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;ViT models are divided into three categories based on the size of the model: Base, Large, and Huge. The ViT-Base and ViT-Large configurations are adopted from BERT, while ViT-Huge is a newly added, larger model. Each model is denoted by its size and its input patch size. For example, the Large variant with a patch size of 16x16 is denoted ViT-L/16.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1113&quot; data-origin-height=&quot;242&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/QYXDu/btsKucIdRwk/yj4eBBnmjiGE38PVy6K5h1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/QYXDu/btsKucIdRwk/yj4eBBnmjiGE38PVy6K5h1/img.png&quot; data-alt=&quot;Table 1: Details of Vision Transformer Model Variants [1].&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/QYXDu/btsKucIdRwk/yj4eBBnmjiGE38PVy6K5h1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FQYXDu%2FbtsKucIdRwk%2Fyj4eBBnmjiGE38PVy6K5h1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;635&quot; height=&quot;138&quot; data-origin-width=&quot;1113&quot; data-origin-height=&quot;242&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Table 1: Details of Vision Transformer Model Variants [1].&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
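&lt;p data-ke-size=&quot;size16&quot;&gt;The naming scheme and the variants in Table 1 can be written down as a small lookup. This is a sketch for illustration: the layer, hidden size, MLP size, and head counts are transcribed from Table 1 [1] and should be checked against it, while the function names and the default 224 image size are hypothetical choices.&lt;/p&gt;
&lt;pre data-ke-type=&quot;codeblock&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;
```python
# Variant settings transcribed from Table 1 [1]; verify against the table before relying on them.
VARIANTS = {
    'B': {'name': 'Base', 'layers': 12, 'hidden': 768, 'mlp': 3072, 'heads': 12},
    'L': {'name': 'Large', 'layers': 24, 'hidden': 1024, 'mlp': 4096, 'heads': 16},
    'H': {'name': 'Huge', 'layers': 32, 'hidden': 1280, 'mlp': 5120, 'heads': 16},
}

def parse_variant(tag):
    # Split a tag such as 'ViT-L/16' into its size letter and patch size.
    model, patch = tag.split('/')
    size = model.split('-')[1]
    config = dict(VARIANTS[size])  # copy so the lookup table is not modified
    config['patch_size'] = int(patch)
    return config

def sequence_length(tag, image_size=224):
    # Number of patch tokens for a square image; the [class] token is not counted.
    patch = parse_variant(tag)['patch_size']
    return (image_size // patch) ** 2

print(parse_variant('ViT-L/16')['layers'])                        # prints: 24
print(sequence_length('ViT-L/16'), sequence_length('ViT-H/14'))   # prints: 196 256
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The second line of output shows why a smaller patch size is more expensive: at the same resolution, ViT-H/14 processes a longer sequence than ViT-L/16.&lt;/p&gt;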
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;The baseline CNN used in the experiments is a ResNet in which Batch Normalization is replaced with Group Normalization. Standardized convolutions are also used to improve transfer learning. For the hybrid models, feature maps from the CNN are fed into ViT with a patch size of one pixel.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Training and Fine-Tuning&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;color: #0e101a;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #0e101a;&quot; data-preserver-spaces=&quot;true&quot;&gt;Every model was trained with the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The batch size and weight decay were set to 4096 and 0.1, respectively. Although SGD is more commonly used to train ResNets, Adam performed better in this study. A linear learning rate warmup and decay schedule was used. In the fine-tuning step, SGD with momentum was used with a batch size of 512. On the ImageNet dataset, ViT-L/16 was fine-tuned at a resolution of 512 and ViT-H/14 at 518. Polyak &amp;amp; Juditsky averaging with a factor of 0.9999 was used to increase model stability.&lt;/span&gt;&lt;/p&gt;
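&lt;p data-ke-size=&quot;size16&quot;&gt;A linear warmup and decay schedule and weight averaging can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the base learning rate, and the step counts are hypothetical, and the averaging is written as an exponential moving average, which is one common way to read an averaging factor of 0.9999.&lt;/p&gt;
&lt;pre data-ke-type=&quot;codeblock&quot; data-ke-language=&quot;python&quot;&gt;&lt;code&gt;
```python
def learning_rate(step, base_lr, warmup_steps, total_steps):
    # Rises linearly from 0 to base_lr during warmup, then falls linearly back to 0.
    up = step / warmup_steps
    down = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, min(up, down))

def polyak_update(average, weights, factor=0.9999):
    # Exponential moving average of the weights (assumed form of the averaging).
    return [factor * a + (1.0 - factor) * w for a, w in zip(average, weights)]

# Hypothetical schedule: 10 warmup steps out of 100 total steps.
print(learning_rate(0, 1e-3, 10, 100))    # prints: 0.0
print(learning_rate(10, 1e-3, 10, 100))   # prints: 0.001
print(learning_rate(100, 1e-3, 10, 100))  # prints: 0.0

# A factor of 0.5 is used here only to make the arithmetic easy to follow.
print(polyak_update([1.0, 2.0], [0.0, 0.0], factor=0.5))  # prints: [0.5, 1.0]
```
&lt;/code&gt;&lt;/pre&gt;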
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;3-2. Comparison to State of the Art&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1878&quot; data-origin-height=&quot;584&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b7NKWc/btsKsgDMeYK/6JgpvE3k8ykc9BXEmzJUe1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b7NKWc/btsKsgDMeYK/6JgpvE3k8ykc9BXEmzJUe1/img.png&quot; data-alt=&quot;Table 2: Comparison With State of the Art Models on Popular Image Classification Benchmarks [1]. Numbers indicate the mean and standard deviation of the accuracies, averaged over three fine-tuning runs.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b7NKWc/btsKsgDMeYK/6JgpvE3k8ykc9BXEmzJUe1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb7NKWc%2FbtsKsgDMeYK%2F6JgpvE3k8ykc9BXEmzJUe1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1878&quot; height=&quot;584&quot; data-origin-width=&quot;1878&quot; data-origin-height=&quot;584&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Table 2: Comparison With State of the Art Models on Popular Image Classification Benchmarks [1]. Numbers indicate the mean and standard deviation of the accuracies, averaged over three fine-tuning runs.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;As can be seen in Table 2, the ViT model pre-trained on JFT-300M had higher accuracy than ResNet on every benchmark. Even the comparatively smaller ViT-L/16 already outperformed ResNet. The larger ViT-H/14 showed even higher performance on all the benchmarks, especially the more difficult ones such as ImageNet, CIFAR-100, and VTAB. The computational resources required to pre-train the ViT models were also much lower than those of ResNet and Noisy Student.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Although ViT-L/16 pre-trained on the smaller, public ImageNet-21k dataset is less accurate than ResNet, it still performs well considering how few computational resources it needs for pre-training. With this dataset, it can be trained in approximately 30 days on a standard cloud TPUv3 with 8 cores.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2058&quot; data-origin-height=&quot;878&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/caeD6b/btsKrMwrn1t/ueq0nWMsTa252k7D5G0rpk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/caeD6b/btsKrMwrn1t/ueq0nWMsTa252k7D5G0rpk/img.png&quot; data-alt=&quot;Figure 12: Performance Versus Pre-Training Compute for Different Architectures: Vision Transformers, ResNets, and Hybrids [1].&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/caeD6b/btsKrMwrn1t/ueq0nWMsTa252k7D5G0rpk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcaeD6b%2FbtsKrMwrn1t%2Fueq0nWMsTa252k7D5G0rpk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2058&quot; height=&quot;878&quot; data-origin-width=&quot;2058&quot; data-origin-height=&quot;878&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Figure 12: Performance Versus Pre-Training Compute for Different Architectures: Vision Transformers, ResNets, and Hybrids [1].&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Hybrid models also showed promising results at small model sizes, but their advantage over pure ViT became less significant as the model size grew. In summary, ViT models show better performance with fewer computational resources than CNN models, especially when pre-trained on large datasets.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;3-3. Limitations&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Unlike BERT in NLP, self-supervised pre-training of ViT underperforms supervised pre-training. Since self-supervised learning is one of the reasons why Transformers became so popular in NLP, successfully applying it to ViT would be extremely beneficial. Also, although ViT shows outstanding performance in classification tasks, more work is required to apply it to other CV tasks such as object detection and segmentation.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;4. Conclusion&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This research contributed a new method of applying the Transformer model directly to image classification tasks. Unlike other research that added image-specific inductive biases to the architecture, it divides an image into small patches and processes them as a sequence, like the tokens of a Transformer model in NLP. This methodology is extremely simple, yet very powerful when pre-trained on a large dataset. Consequently, ViT performs on par with SOTA CNN models in classification tasks at a relatively low training cost.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;As a ViT-based model (OmniVec) now holds state-of-the-art performance in image classification tasks (as of October 2024), understanding the architecture and underlying concepts of ViT will be beneficial for those interested in the field of Computer Vision. The continued integration of different models and ideas will allow the relatively stagnant CV field to advance once more, as CNN architectures once did.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;5. References&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In &lt;i&gt;ICLR&lt;/i&gt;, 2021.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In &lt;i&gt;NIPS&lt;/i&gt;, 2017.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[3] Amazon Web Services. What are Transformers in Artificial Intelligence? &lt;i&gt;AWS&lt;/i&gt;, 2024. URL &lt;a href=&quot;https://aws.amazon.com/what-is/transformers-in-artificial-intelligence/&quot;&gt;&lt;span&gt;https&lt;/span&gt;&lt;span&gt;://aws&lt;/span&gt;&lt;span&gt;.amazon&lt;/span&gt;&lt;span&gt;.com&lt;/span&gt;&lt;span&gt;/what&lt;/span&gt;&lt;span&gt;-is&lt;/span&gt;&lt;span&gt;/transformers&lt;/span&gt;&lt;span&gt;-in&lt;/span&gt;&lt;span&gt;-artificial&lt;/span&gt;&lt;span&gt;-intelligence/&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[4] H2O.ai. Self-Attention Mechanism in Neural Networks. &lt;i&gt;H2O.ai Wiki&lt;/i&gt;, 2024. URL &lt;a href=&quot;https://h2o.ai/wiki/self-attention/&quot;&gt;&lt;span&gt;https&lt;/span&gt;&lt;span&gt;://h2o&lt;/span&gt;&lt;span&gt;.ai&lt;/span&gt;&lt;span&gt;/wiki&lt;/span&gt;&lt;span&gt;/self&lt;/span&gt;&lt;span&gt;-attention/&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[5] Sumit Saha. A Comprehensive Guide to Convolutional Neural Networks &amp;mdash; The ELI5 Way. &lt;i&gt;Towards Data Science&lt;/i&gt;, 2018. URL &lt;a href=&quot;https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53&quot;&gt;&lt;span&gt;https&lt;/span&gt;&lt;span&gt;://towardsdatascience&lt;/span&gt;&lt;span&gt;.com&lt;/span&gt;&lt;span&gt;/a&lt;/span&gt;&lt;span&gt;-comprehensive&lt;/span&gt;&lt;span&gt;-guide&lt;/span&gt;&lt;span&gt;-to&lt;/span&gt;&lt;span&gt;-convolutional&lt;/span&gt;&lt;span&gt;-neural&lt;/span&gt;&lt;span&gt;-networks&lt;/span&gt;&lt;span&gt;-the&lt;/span&gt;&lt;span&gt;-eli5&lt;/span&gt;&lt;span&gt;-way&lt;/span&gt;&lt;span&gt;-3bd2b1164a53&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[6] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In &lt;i&gt;ICML&lt;/i&gt;, 2018.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. &lt;i&gt;arXiv&lt;/i&gt;, 2019.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[8] Dirk Weissenborn, Oscar T&amp;auml;ckstr&amp;ouml;m, and Jakob Uszkoreit. Scaling autoregressive video models. In &lt;i&gt;ICLR&lt;/i&gt;, 2019.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[9] Lucidrains. Simple ViT: A Simple Implementation of Vision Transformers in PyTorch. &lt;i&gt;GitHub&lt;/i&gt;, 2023. URL &lt;a href=&quot;https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/simple_vit.py&quot;&gt;&lt;span&gt;https&lt;/span&gt;&lt;span&gt;://github&lt;/span&gt;&lt;span&gt;.com&lt;/span&gt;&lt;span&gt;/lucidrains&lt;/span&gt;&lt;span&gt;/vit&lt;/span&gt;&lt;span&gt;-pytorch&lt;/span&gt;&lt;span&gt;/blob&lt;/span&gt;&lt;span&gt;/main&lt;/span&gt;&lt;span&gt;/vit_pytorch&lt;/span&gt;&lt;span&gt;/simple_vit&lt;/span&gt;&lt;span&gt;.py&lt;/span&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span&gt;[10] Activeloop. ImageNet Dataset. &lt;i&gt;Activeloop Datasets Documentation&lt;/i&gt;, 2024. URL &lt;a href=&quot;https://datasets.activeloop.ai/docs/ml/datasets/imagenet-dataset/&quot;&gt;&lt;span&gt;https&lt;/span&gt;&lt;span&gt;://datasets&lt;/span&gt;&lt;span&gt;.activeloop&lt;/span&gt;&lt;span&gt;.ai&lt;/span&gt;&lt;span&gt;/docs&lt;/span&gt;&lt;span&gt;/ml&lt;/span&gt;&lt;span&gt;/datasets&lt;/span&gt;&lt;span&gt;/imagenet&lt;/span&gt;&lt;span&gt;-dataset/&lt;/span&gt;&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;</description>
      <category>Literature Review</category>
      <author>Jonathan Lee</author>
      <guid isPermaLink="true">https://jonathanlee.tistory.com/1</guid>
      <comments>https://jonathanlee.tistory.com/1#entry1comment</comments>
      <pubDate>Mon, 28 Oct 2024 15:01:34 +0900</pubDate>
    </item>
  </channel>
</rss>