Swin Transformer

this week I was studying a really interesting paper called Swin Transformer, from Microsoft Research Asia.

and the main point of the paper is simple:

how can we use transformers in computer vision in a really efficient way?

we already know several classic architectures based on convolutional neural networks, like AlexNet, VGG, ResNet, DenseNet and others.

for a long time, these models were used as backbones for computer vision tasks.

but what is a backbone?

when we talk about computer vision, we are usually talking about tasks like object detection, segmentation, pose estimation, image classification, image generation, style transfer and so on.

in almost all of these tasks, one of the first steps is to extract information from the image.

in other words, transform the image into a representation that the model can use later.

this representation is what we call features.

so the backbone is basically this:

a network that receives an image as input and transforms that image into useful features to solve a problem.

for example, in a network like VGG or ResNet, you pass an image through the model, and the network starts extracting patterns at different levels.

first edges and textures.

then shapes.

then parts of objects.

and at the end, it generates a more abstract representation of that image.

one common characteristic of CNN-based architectures is that, as we go deeper into the layers, the number of channels increases.

for example:

64, then 128, then 256, then 512.

this allows the network to capture more complex information.

and when we look at the Swin Transformer, we see that it tries to create something similar, but using transformers.

the architecture has four main parts:

patch partition

linear embedding

swin transformer block

patch merging

if we understand these four parts, we understand most of the idea of the paper.

first comes patch partition.

the image is divided into small pieces, called patches.

in the paper, they use patches of 4x4 pixels.

since an image has 3 channels, RGB, each patch has a dimension of 4x4x3, which means 48 values.

then comes linear embedding.

this step takes each 48-dimensional patch and transforms it into a vector with dimension C.

in the paper, this C can be 96, 128, or 192, depending on the version of the model.

in a simple way:

the model takes small pieces of the image and turns each piece into a more useful numerical representation.

then we have patch merging.

this part is important because it creates a hierarchical structure, similar to what happens in CNNs.

instead of keeping all patches separate all the time, Swin starts merging neighboring patches.

for example, it takes a group of 2x2 patches and turns them into one larger patch.

with this, the number of patches decreases.

but at the same time, the number of channels increases.

so basically:

fewer patches, but more information in each representation.

this is similar to the logic of CNNs, where the spatial resolution gets smaller and the channel depth gets larger.

now comes the most important part of the paper:

the Swin Transformer Block.

the problem with traditional transformers in computer vision is the computational cost.

in NLP, the transformer looks at all the tokens in a sentence.

but in images, if you divide an image into many patches, calculating attention between all patches becomes very expensive.

because each patch would need to pay attention to every other patch in the image.

this does not scale well.

so the Swin Transformer proposes a solution:

instead of calculating attention over the entire image, it divides the image into smaller windows.

inside each window, the patches calculate attention only with each other.

this is called Window Multi-Head Self-Attention, or W-MSA.

with this, the model becomes much more efficient.

but there is one problem.

if each window only looks inside itself, patches from different windows do not communicate with each other.

and this limits the model.

so the paper proposes a second idea:

shifted windows.

basically, the windows are shifted.

in one layer, the model calculates attention inside normal windows.

in the next layer, it shifts the windows and calculates attention again.

with this, patches that were in separate windows before can now connect with each other.

this is one of the biggest ideas of the Swin Transformer.

it keeps the computational cost low, but still allows communication between different regions of the image.

in a simple way:

first, it looks locally.

then, it shifts the windows to connect different regions better.

this reminds me a little of CNNs, where the kernel moves across the image and captures local patterns.

but here, this is done with attention.

when we put everything together, we get an architecture with four stages.

in each stage, the model processes the patches, applies the Swin Transformer blocks, and then uses patch merging.

with this, it creates a hierarchical representation of the image.

and this is exactly what makes Swin Transformer so strong as a backbone for computer vision.

it is not useful only for classification.

it can also be used for object detection, segmentation and other visual tasks.

in the results of the paper, Swin Transformer shows very strong performance compared to models like ViT, DeiT, ResNet and even EfficientNet in some scenarios.

in classification, it gets very close to the best convolutional models.

in object detection, using datasets like COCO, it improves the results when used as a backbone in frameworks like Mask R-CNN, Cascade Mask R-CNN, ATSS and Sparse R-CNN.

in segmentation, on ADE20K, it also shows very strong results.

so the big promise of Swin Transformer is this:

to create a transformer-based architecture that can work as a general backbone for computer vision.

not only for classification.

but for many different visual tasks.

for me, the most interesting part of this paper is that it shows an important transition.

for a long time, CNNs dominated computer vision.

but transformers started to take space in this field too.

but for this to work well with images, it is not enough to just take the traditional transformer and apply it directly.

you need to adapt the architecture.

and Swin does this really well.

it takes the attention idea from transformers, combines it with a hierarchical structure similar to CNNs, and solves part of the computational cost problem using shifted windows.

in the end, Swin Transformer is not just another model.

it is an attempt to create a new standard backbone for computer vision.

and maybe this is the most important point of the paper.

Antonioni Nascimento

Software, cybersecurity, data & AI. Building advanced systems at KRX. Writing about technology, venture capital, and how software shapes markets.