Implementation Of The Swin Transformer and Its Application In Image Classification



  • Rasha A. Dihin, Department of Computer Science, University of Kufa, Najaf, Iraq
  • Ebtesam N. Al Shemmary, IT Research and Development Center, University of Kufa, Najaf, Iraq
  • Waleed A. Mahmoud Al-Jawher, College of Engineering, Uruk University, Baghdad, Iraq

Computer vision and natural language processing differ substantially. In vision, objects vary widely in scale and images contain pixels at a far higher resolution than the words in a text, which makes adapting Transformers to vision difficult. Recently, Microsoft Research Asia introduced a vision Transformer, the Swin Transformer, which achieves state-of-the-art results on computer vision tasks. Its computational complexity is linear in the size of the input image because self-attention is computed separately within each local window. This yields hierarchical feature maps in the deeper layers, so the model can serve as a computer vision backbone for image classification and dense recognition applications. This work applies the Swin Transformer to a worked mathematical example with step-by-step analysis. In addition, extensive experiments were carried out on several standard datasets: CIFAR-10, CIFAR-100, and MNIST. The results showed that the Swin Transformer can achieve flexible memory savings. Test accuracy was 71.54% on CIFAR-10 and 46.1% on CIFAR-100. Similarly, when the Swin Transformer was applied to the MNIST dataset, its accuracy was higher than that of other vision transformer results.
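The linear-complexity claim can be illustrated with a minimal sketch. It assumes the attention cost estimates given in the Swin Transformer paper for an h x w feature map with C channels and window size M (softmax cost omitted); the function names and the example sizes are illustrative and do not come from this article.

```python
def global_msa_cost(h: int, w: int, c: int) -> int:
    """Approximate cost of global multi-head self-attention.

    Quadratic in the number of tokens h*w, because every token
    attends to every other token.
    """
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_msa_cost(h: int, w: int, c: int, m: int) -> int:
    """Approximate cost of window-based self-attention.

    Linear in h*w for a fixed window size m, because each token
    attends only to the m*m tokens inside its own window.
    """
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Illustrative sizes (assumed): 56x56 tokens, 96 channels, 7x7 windows.
small = window_msa_cost(56, 56, 96, 7)
large = window_msa_cost(112, 112, 96, 7)

# Doubling both sides gives 4x the tokens and exactly 4x the window cost.
print(large / small)
# Global attention grows faster than 4x for the same change.
print(global_msa_cost(112, 112, 96) / global_msa_cost(56, 56, 96))
```

With the window size held fixed, the windowed cost grows in direct proportion to the number of tokens, whereas the global cost contains a term that grows with its square.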


Image classification, Object detection, Swin transformer ST, Vision transformer ViT


Dihin, R. A., Al Shemmary, E. N., & Al-Jawher, W. A. M. (2023). Implementation Of The Swin Transformer and Its Application In Image Classification. Journal Port Science Research, 6(4), 318–331.

