BuildingWorld: A Global Structured 3D Building Dataset for Urban Foundation Models

New dataset, available free of charge to researchers, includes five million LoD2 buildings spread over five continents, as well as Cyber City, a fully procedural virtual city engine.

BuildingWorld¹ is the first large-scale global LoD2 building reconstruction dataset, designed to provide a unified, systematic, and highly diverse benchmark for 3D urban modeling and building reconstruction research. The dataset integrates real and high-quality simulated aerial lidar point clouds from multiple continents, encompassing diverse urban morphologies and architectural styles, and aligns them with structured building models in a one-to-one manner. This unprecedented scale and geographic diversity support both the training and the comprehensive evaluation of learning-based approaches to 3D building reconstruction. A paper describing the BuildingWorld dataset has been accepted by the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)².

Introduction

When a city is struck by flooding or a large-scale power outage, the decisions made in the subsequent minutes can determine millions of dollars in economic losses and, in some cases, the trajectory of public safety. Even under normal conditions, cities must continuously address complex operational challenges: how to coordinate transportation during extreme cold events, how to plan the deployment of solar energy infrastructure, or how to monitor aging urban facilities.


Figure 1: Statistical overview of the BuildingWorld dataset. Area bars indicate scene sizes, while percentage area and height metrics highlight the diversity of building structures.

Digital twins offer a promising solution by replicating the operational state of a city in virtual space, enabling predictive analysis and informed decision-making for urban management. At the core of any urban digital twin system lies a large-scale three-dimensional city model, with accurate building geometry serving as its fundamental backbone. When building shapes or spatial layouts are imprecise or misaligned, the effectiveness and reliability of digital twin systems are significantly compromised.

With the rapid advance of artificial intelligence, the primary challenge in constructing city-scale digital twin systems has gradually shifted from algorithmic design to data availability. Today, the true bottleneck lies in the lack of high-fidelity 3D building data that can be directly used for training learning-based models. This data scarcity severely limits the generalization ability of existing modeling approaches across different cities and scenarios and also hinders progress in related research fields.

BuildingWorld is a precise response to this challenge. The project focuses on building a large-scale, high-quality, and globally diverse 3D building dataset to meet the growing demands of next-generation world models for data scale, structural consistency, and geographic diversity. By providing reliable and structured 3D building data, BuildingWorld aims to support the development of urban-scale AI systems and advance the practical deployment of digital twin technologies.

High-quality 3D models

BuildingWorld aggregates approximately five million LoD2 building models from 44 cities across five continents (North America, Europe, Asia, Africa, and Oceania), making it one of the largest and most geographically diverse 3D building datasets currently available. Each building is represented in a structured geometric form using a unified data format and includes detailed roof structures, providing critical support for building reconstruction, urban simulation, and AI-driven data analysis and processing.


Figure 2: Illustration of the construction process of the BuildingWorld dataset. A glimpse of the LoD2 digital city model of Boston is shown. The zoomed-in downtown area illustrates simulated aerial lidar point clouds, generated using a predefined airborne platform, lidar sensor, and flight trajectory.

As illustrated in Figure 1, the variations in building footprint area and height distributions across different cities clearly reflect their distinct spatial morphologies and urban structures. These differences also highlight the rich diversity of building types and architectural forms captured by the dataset. Owing to its extensive coverage, BuildingWorld not only serves as a solid geometric foundation for constructing high-fidelity digital twin systems but also provides essential data conditions for the training and evaluation of next-generation 3D foundation models, enabling robust performance across cities, regions, and even continents.
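
The kind of per-city statistics summarized in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' tooling: the dictionary keys "footprint" and "height", the function names, and the sample values are all hypothetical, and the footprint is assumed to be a simple polygon with planar coordinates in metres.

```python
from statistics import mean, median


def footprint_area(ring):
    """Shoelace formula: area of a simple polygon given as (x, y) vertices."""
    twice_area = 0.0
    # Pair each vertex with the next one, wrapping around to close the ring.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def summarize(buildings):
    """Summarize footprint area and height for a list of building records.

    Each record is a dict with hypothetical 'footprint' and 'height' keys;
    the real BuildingWorld format may differ.
    """
    areas = [footprint_area(b["footprint"]) for b in buildings]
    heights = [b["height"] for b in buildings]
    return {
        "count": len(buildings),
        "mean_area_m2": mean(areas),
        "median_height_m": median(heights),
    }


# Invented sample records, for illustration only.
sample = [
    {"footprint": [(0, 0), (10, 0), (10, 8), (0, 8)], "height": 6.5},
    {"footprint": [(0, 0), (20, 0), (20, 20), (0, 20)], "height": 45.0},
    {"footprint": [(0, 0), (6, 0), (3, 4)], "height": 9.0},
]
stats = summarize(sample)
print(stats)
```

Running the same summary per city and comparing the results is one simple way to see the morphological differences the figure describes.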

In addition to the LoD2 data, BuildingWorld includes LoD3 building models for selected cities such as Hong Kong and Greater Geelong, offering enhanced geometric detail. These higher-fidelity models further support tasks that require more precise building reconstruction and fine-grained evaluation.

Real and simulated lidar

BuildingWorld provides both real-world and high-quality simulated aerial lidar point clouds. As illustrated in Figure 2, the simulator performs virtual laser scanning over city models by moving a simulated lidar sensor along predefined flight trajectories at specified speeds. During simulation, key factors are systematically taken into account, including occlusion effects, sensor viewing angles, flight speed, pulse repetition frequency, scan rate, and flight altitude. As a result, the generated simulated point clouds closely reproduce structural patterns and incompleteness commonly observed in real aerial lidar data, such as spatial distribution characteristics, point sparsity, and local missing regions.
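
To give a feel for how the flight parameters listed above interact, the following Python sketch estimates the nominal point density and point spacing of a simulated survey. It is a back-of-the-envelope model, not the BuildingWorld simulator: it assumes flat terrain, one return per pulse, a linear scan pattern, and a sensor field of view, which the article does not specify. The function name and all numeric values are hypothetical.

```python
import math


def scan_pattern(altitude_m, speed_mps, prf_hz, scan_rate_hz, fov_deg):
    """Nominal ground sampling of an airborne line scanner over flat terrain.

    Assumptions (not from the article): one return per pulse, a linear
    scan with `scan_rate_hz` lines per second, and no strip overlap.
    """
    # Swath width on the ground for a symmetric field of view.
    swath_m = 2.0 * altitude_m * math.tan(math.radians(fov_deg) / 2.0)
    # Pulses per second divided by ground area swept per second.
    density = prf_hz / (speed_mps * swath_m)
    # Distance flown between consecutive scan lines.
    along_track = speed_mps / scan_rate_hz
    # Swath width divided by the number of pulses in one scan line.
    across_track = swath_m / (prf_hz / scan_rate_hz)
    return {
        "swath_m": swath_m,
        "density_pts_per_m2": density,
        "along_track_spacing_m": along_track,
        "across_track_spacing_m": across_track,
    }


# Illustrative values only; they are not the settings used for BuildingWorld.
example = scan_pattern(
    altitude_m=1000.0,
    speed_mps=60.0,
    prf_hz=300_000.0,
    scan_rate_hz=100.0,
    fov_deg=40.0,
)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this simplified model, doubling the flight speed or the altitude halves the nominal density, which is one reason these parameters shape the sparsity patterns of the resulting point clouds.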


Figure 3: Cyber City consists of four main components: terrain, road and building footprints, buildings, and vegetation.

Furthermore, when applicable, tree models are incorporated into the urban scenes to simulate building occlusions and environmental interference more realistically. This design enables the simulated point clouds to reflect real-world scanning conditions more faithfully, further narrowing the gap between simulated and real data.
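
The occlusion effect described above can be sketched with a simple ray-casting example in Python. This is an illustration under strong simplifications, not the simulator's implementation: buildings and tree canopies are reduced to axis-aligned boxes, the canopy is treated as opaque (real lidar often records multiple returns through vegetation), and all names and coordinates are hypothetical.

```python
import math


def ray_box_hit(origin, direction, box_min, box_max):
    """Slab test: distance at which the ray enters the box, or None on a miss."""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near


def first_return(origin, direction, scene):
    """Return (label, distance) for the closest box hit by the ray, or None."""
    best = None
    for label, box_min, box_max in scene:
        t = ray_box_hit(origin, direction, box_min, box_max)
        if t is not None and (best is None or t < best[1]):
            best = (label, t)
    return best


# Hypothetical scene: a 12 m tall building and a canopy overhanging its roof edge.
scene = [
    ("building", (0.0, 0.0, 0.0), (10.0, 10.0, 12.0)),
    ("tree", (8.0, 2.0, 6.0), (14.0, 8.0, 16.0)),
]
down = (0.0, 0.0, -1.0)  # nadir-looking pulse

under_canopy = first_return((9.0, 5.0, 500.0), down, scene)
clear_roof = first_return((2.0, 5.0, 500.0), down, scene)
print(under_canopy, clear_roof)
```

The pulse aimed at the roof edge is intercepted by the canopy before it reaches the building, so that part of the roof receives no points. Gaps of this kind are what the tree models are meant to introduce into the simulated data.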

In addition to the simulated data, several government agencies have released real aerial lidar point clouds for the same cities. However, these scans are often collected at different times and are not always spatially aligned with the corresponding building models, which makes them unsuitable for direct use in supervised deep learning. BuildingWorld therefore performs unified preprocessing and standardization on these real-world point clouds, enabling their use in domain adaptation, model validation, and reinforcement learning scenarios. This integration further enhances the practical applicability and robustness of learning-based methods in real-world deployment settings.

Cyber City: Generating unlimited virtual urban worlds

Beyond real-world city models, BuildingWorld introduces Cyber City, a fully procedural virtual city engine designed to recombine globally diverse architectural styles and generate novel urban configurations. As illustrated in Figure 3, Cyber City consists of multiple procedurally generated components, including terrain, road networks and parcel layouts, building placements, and vegetation.


Figure 4: Examples from the BuildingWorld dataset.

It is important to emphasize that the goal of Cyber City is not to create new digital cities that aim to replace real-world urban environments. Instead, by systematically varying terrain morphology, vegetation distributions, and building layout strategies, Cyber City deliberately constructs point cloud distributions with a high degree of diversity. This design substantially expands the diversity of training data in terms of spatial and structural distributions, thereby strengthening the adaptability of AI models to real-world scenarios and improving the robustness of 3D building reconstruction algorithms under varying environmental conditions.
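
The idea of systematically varying scene parameters can be sketched as a small seeded generator in Python. This is a toy illustration and not Cyber City's algorithm: the parameter names, value ranges, and the grid-of-parcels layout are all assumptions made for the example.

```python
import random


def generate_scene(seed, blocks=4, block_size=40.0):
    """Sample one toy scene configuration; the same seed gives the same scene.

    All parameters and ranges are hypothetical, chosen only to show how a
    procedural engine can vary terrain, vegetation, and building layout.
    """
    rng = random.Random(seed)
    scene = {
        "terrain_relief_m": rng.uniform(0.0, 30.0),    # terrain morphology
        "trees_per_m2": rng.uniform(0.0, 0.02),        # vegetation distribution
        "buildings": [],
    }
    for i in range(blocks):
        for j in range(blocks):
            if rng.random() < 0.2:
                continue  # leave roughly one parcel in five empty
            width = rng.uniform(8.0, 0.8 * block_size)
            depth = rng.uniform(8.0, 0.8 * block_size)
            scene["buildings"].append({
                "parcel": (i, j),
                # Random offset that keeps the footprint inside its parcel.
                "x": i * block_size + rng.uniform(0.0, block_size - width),
                "y": j * block_size + rng.uniform(0.0, block_size - depth),
                "width": width,
                "depth": depth,
                "height": rng.uniform(3.0, 60.0),
            })
    return scene


# Different seeds yield different layouts; a fixed seed is reproducible.
for seed in (1, 2, 3):
    s = generate_scene(seed)
    print(seed, len(s["buildings"]), round(s["terrain_relief_m"], 1))
```

Seeding the generator keeps every sampled scene reproducible, which is useful when a simulated point cloud has to be regenerated or traced back to its configuration.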

Dataset overview

Figure 4 shows one building model from each of several cities in the BuildingWorld dataset, revealing the wide stylistic and functional diversity of the structures, which are drawn from dense residential neighborhoods, urban commercial districts, industrial areas, and leisure-oriented facilities.

Shangfeng Huang is currently working on his PhD. As a member of the Intelligent Geospatial Data Mining Lab at the University of Calgary, he focuses his research on 3D building reconstruction from point clouds.

Dr. Ruisheng Wang is a distinguished professor at Shenzhen University. He was a professor in the Department of Geomatics Engineering at the University of Calgary from 2012 to 2024. Before that, he worked from 2008 as an industrial researcher at HERE Technologies (formerly NAVTEQ) in Chicago, USA. His primary research focus there was mobile lidar data processing for next-generation map-making and navigation. Dr. Wang holds a PhD in Electrical and Computer Engineering from McGill University, an MScE in Geomatics Engineering from the University of New Brunswick, and a BEng in Photogrammetry and Remote Sensing from Wuhan University.

Dr. Xin Wang is a Professor and Schulich Research Chair in the Department of Geomatics Engineering at the University of Calgary. She joined the department in July 2007. She holds a BSc in Computer Science and an MSc in Software Engineering from Northwest University, China, and a PhD in Computer Science from the University of Regina. Her current research interests are spatial databases and spatial data mining; big spatio-temporal data analytics; artificial intelligence and machine learning for spatial applications; data mining for transportation and for oil and gas; ontology and knowledge engineering in GIS; web GIS; and location-based social networks. She was an executive member and treasurer of the Canadian Artificial Intelligence Association (CAIAC) from 2015 to 2021.

1 https://szusic.github.io/BuildingWorld

2 Huang, S., R. Wang and X. Wang, 2026. BuildingWorld: A structured 3D building dataset for urban foundation models, 40th AAAI Conference on Artificial Intelligence (AAAI 2026), Singapore, January 20-27.