A Novel Swin Transformer Based Deep Learning Model for Building Extraction from Very High Resolution Images

Kavzoglu, T. and Yilmaz, E.O.

Gebze Technical University

Abstract

The extraction of building footprints from very high-resolution (VHR) remote sensing imagery plays a vital role in a wide range of geospatial applications, including spatial planning, crisis management, and the development of data-driven smart cities. Although deep learning-based approaches have significantly improved the accuracy and automation of this task in recent years, several challenges persist. They are especially prominent in densely built environments, where complex urban morphology and spectrally similar surface materials hinder precise segmentation and make building boundaries difficult to delineate. More robust and context-aware segmentation strategies are therefore needed. In this study, a novel Swin Transformer-based model is proposed for building extraction, and its performance is tested on a well-known benchmark dataset, namely the Massachusetts Building Dataset. The model aims to identify building boundaries accurately by capturing both local textural details and global contextual information through a multi-scale, window-based attention mechanism. Its performance is benchmarked against state-of-the-art deep learning architectures, namely DeepLabV3+, SegFormer, UPerNet, and PAN, which were trained and tested on the same dataset with the same parameter settings. The results revealed that the proposed model outperformed these architectures in terms of the evaluation metrics. Specifically, the proposed model achieved a precision of 87.98%, a recall of 86.03%, an IoU of 77.94%, and an overall accuracy of 92.54%. By comparison, SegFormer, UPerNet, DeepLabV3+, and PAN achieved IoU scores of 75.41%, 75.66%, 73.36%, and 69.78%, respectively. The findings indicate that the proposed model is capable of delineating more precise building boundaries, particularly in areas of high-density construction, and demonstrates a strong capacity for generalization.
Moreover, the results show that transformer-based architectures offer a powerful alternative for remote sensing and geospatial artificial intelligence applications, providing more lightweight, accurate, and scalable solutions for building extraction.
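
For readers unfamiliar with the four metrics quoted above, the following is a minimal sketch of how precision, recall, IoU, and overall accuracy are commonly computed for a binary building mask. It is an illustration only, not the authors' evaluation code: the function names and the toy masks are hypothetical, and the abstract does not state whether the reported figures are computed pixel-wise over the whole test set or averaged per image or per class.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equally sized binary masks (1 = building)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # building predicted, building present
            elif p:
                fp += 1      # building predicted, background present
            elif t:
                fn += 1      # building missed
            else:
                tn += 1      # background correctly predicted
    return tp, fp, fn, tn


def building_metrics(pred, truth):
    """Return the four metrics as fractions in [0, 1] (hypothetical helper)."""
    tp, fp, fn, tn = confusion_counts(pred, truth)

    def ratio(num, den):
        # Guard against empty denominators, e.g. a mask with no buildings.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "overall_accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Toy 3x4 masks, made up for illustration (not data from the study).
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]

metrics = building_metrics(pred, truth)
```

In the toy example there are 3 true positives, 1 false positive, 1 false negative, and 7 true negatives, so precision and recall are both 0.75 and IoU is 0.6. IoU is never higher than precision or recall because it penalizes false positives and false negatives together.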

Keywords: Building footprint extraction, semantic segmentation, Swin Transformer, VHR imagery, remote sensing.

Topic: Topic B: Applications of Remote Sensing

ACRS 2025 Conference