Topology has long been a key GIS requirement for data management and integrity. In general, a topological data model manages spatial relationships by representing spatial objects (point, line, and area features) as an underlying graph of topological primitives—nodes, faces, and edges. These primitives, together with their relationships to one another and to the features whose boundaries they represent, are derived by decomposing the feature geometries into a planar graph of topological elements.
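As a rough illustration only (not the geodatabase's internal schema), the primitives of such a planar graph and the relationships among them can be sketched as simple record types in Python; all names and fields here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    node_id: int
    x: float
    y: float

@dataclass
class Edge:
    edge_id: int
    from_node: int              # node where the edge starts
    to_node: int                # node where the edge ends
    left_face: int              # face on the left when walking from_node -> to_node
    right_face: int             # face on the right
    shape_points: List[Tuple[float, float]] = field(default_factory=list)

@dataclass
class Face:
    face_id: int
    boundary_edges: List[int] = field(default_factory=list)  # signed edge IDs; sign encodes direction
```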
Topology is fundamentally used to ensure the quality of spatial relationships in the data and to aid in data compilation. Topology is also used to analyze spatial relationships in many situations, such as dissolving the boundaries between adjacent polygons with the same attribute values or traversing a network of the elements in a topology graph.
Topology can also be used to model how the geometry from a number of feature classes can be integrated. Some refer to this as vertical integration of feature classes.
Generally, topology is employed to do the following:
- Manage coincident geometry (constrain how features share geometry). For example, adjacent polygons, such as parcels, have shared edges; street centerlines and the boundaries of census blocks have coincident geometry; and adjacent soil polygons share edges.
- Define and enforce data integrity rules (such as no gaps should exist between parcel features, parcels should not overlap, road centerlines should connect at their endpoints).
- Support topological relationship queries and navigation (for example, to provide the ability to identify adjacent and connected features, find the shared edges, and navigate along a series of connected edges).
- Support sophisticated editing tools that enforce the topological constraints of the data model (such as the ability to edit a shared edge and update all the features that share the common edge).
- Construct features from unstructured geometry (e.g., the ability to construct polygons from lines, sometimes referred to as "spaghetti"); see the sketch following this list.
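As a small illustration of that last point, here is a sketch of building polygons from unstructured lines using the open-source Shapely library rather than the ArcGIS tools; it is included only to make the idea concrete:

```python
# Build polygons from "spaghetti" line work: four segments that happen to
# close a unit square are assembled into one polygon.
from shapely.geometry import LineString
from shapely.ops import polygonize

lines = [
    LineString([(0, 0), (1, 0)]),
    LineString([(1, 0), (1, 1)]),
    LineString([(1, 1), (0, 1)]),
    LineString([(0, 1), (0, 0)]),
]

polygons = list(polygonize(lines))
print(len(polygons))      # 1
print(polygons[0].area)   # 1.0
```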
NOTE: Reading this large topic is not necessary to implement geodatabase topologies. However, you may want to spend some time reading this if you are interested in the historical evolution and motivations for how topology is managed in the geodatabase.
The genesis of "Arc-node" and "Geo-relational"
ArcInfo coverage users have a long history and appreciation for the role that topology plays in maintaining the spatial integrity of their data.
The main elements of the ArcInfo coverage data model are described below.
In a coverage, the feature boundaries and points were stored in a few main files that were managed and owned by ArcInfo Workstation. The "ARC" file held the linear or polygon boundary geometry as topological edges, which were referred to as "arcs." The "LAB" file held point locations, which were used as label points for polygons or as individual point features such as for a wells feature layer. Other files were used to define and persist the topological relationships between each of the edges and the polygons.
For example, one file called the "PAL" file ("Polygon-arc list") listed the order and direction of the arcs in each polygon. In ArcInfo, software logic was used to assemble the coordinates for each polygon for display, analysis, and query operations. The ordered list of edges in the PAL file was used to look up and assemble the edge coordinates held in the ARC file. The polygons were assembled during run time when needed.
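A rough sketch of that run-time assembly is shown below. The dictionaries stand in for the ARC and PAL files and are deliberately simplified; they are not the actual coverage file formats.

```python
# ARC: each arc ID maps to its ordered coordinate list (the edge geometry).
# PAL: each polygon ID maps to the ordered list of arcs forming its boundary;
#      a negative ID means the arc is traversed in reverse.
ARC = {
    1: [(0, 0), (4, 0)],
    2: [(4, 0), (4, 3), (0, 3)],
    3: [(0, 3), (0, 0)],
}
PAL = {
    101: [1, 2, 3],
}

def assemble_polygon(poly_id):
    """Stitch a polygon's arcs together into a single closed coordinate ring."""
    ring = []
    for arc_id in PAL[poly_id]:
        coords = ARC[abs(arc_id)]
        if arc_id < 0:
            coords = list(reversed(coords))   # negative ID: walk the arc backward
        if ring:
            coords = coords[1:]               # skip the node shared with the previous arc
        ring.extend(coords)
    return ring

print(assemble_polygon(101))
# [(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)]
```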
The coverage model had several advantages:
- It used a simple structure to maintain topology.
- It enabled edges to be digitized and stored only once and shared by many features.
- It could represent polygons of enormous size (with thousands of coordinates) because polygons were really defined as an ordered set of edges (called arcs).
- The topology storage structure of the coverage was intuitive. Its physical topological files were readily understood by ArcInfo users.
NOTE: An interesting historical fact: "Arc," when coupled with the table manager named "Info," was the genesis of the product name ArcInfo and hence all subsequent "Arc" products in the ESRI product family—ArcView, ArcIMS, ArcGIS, etc.
Coverages also had some disadvantages:
- Some operations were slow because many features had to be assembled on the fly when they needed to be used. This included all polygons and multipart features such as regions (the coverage term for multipart polygons) and routes (the term for multipart line features).
- Topological features (such as polygons, regions, and routes) were not ready to use until the coverage topology was built. If edges were edited, the topology had to be rebuilt. (Note: "Partial processing" was eventually used, which required rebuilding only the changed portions of the coverage topology.) In general, when edits are made to features in a topological dataset, a geometric analysis algorithm must be executed to rebuild the topological relationships regardless of the storage model.
- Coverages were limited to single-user editing. Because of the need to ensure that the topological graph was in synchronization with the feature geometries, only a single user could update a topology at a time. Users would tile their coverages and maintain a tiled database for editing. This enabled individual users to "lock down" and edit one tile at a time. For general data use and deployment, users would append copies of their tiles into a mosaicked data layer. In other words, the tiled datasets they edited were not directly used across the organization. They had to be converted, which meant extra work and extra time.
Shapefiles and simple geometry storage
In the early 1980s, coverages were seen as a major improvement over the older polygon- and line-based systems in which polygons were held as complete loops. In those older systems, all of the coordinates for a feature were stored in that feature's geometry. Before the coverage and ArcInfo came along, these simple polygon and line structures were widely used. They were simple but had the disadvantage of double-digitized boundaries: two copies of the coordinates along the shared edge of adjacent polygons were stored, once in each polygon's geometry. The main disadvantage was that GIS software at the time could not maintain shared-edge integrity. In addition, storage costs were enormous and each byte of storage came at a premium. During the early 1980s, a 300 MB disk drive was the size of a washing machine and cost $30,000! Holding two or more copies of the same coordinates was expensive, and the extra computation was costly as well. Thus, the use of a coverage topology had real advantages.
During the mid-1990s, interest in simple geometric structures grew because disk storage and hardware costs in general were coming down while computational speed was growing. At the same time, existing GIS datasets were more readily available, and the work of GIS users was evolving from primarily data compilation activities to include data use, analysis, and sharing.
Users wanted faster performance for data use (for example, rather than spending computer time deriving polygon geometries on demand, deliver the stored coordinates of the 1,200 requested polygons as fast as possible). Having the full feature geometry readily available was more efficient. Thousands of geographic information systems were in use, and numerous datasets were readily available.
Around this time, ESRI developed and published its shapefile format. Shapefiles used a very simple storage model for feature coordinates. Each shapefile represented a single feature class (of points, lines, or polygons) and stored the feature coordinates directly. Shapefiles could be easily created from ArcInfo coverages as well as from many other GIS systems. They were widely adopted as a de facto standard and are still widely used and deployed to this day.
A few years later, ArcSDE pioneered a similar simple storage model in relational database tables. A feature table could hold one feature per row with the geometry in one of its columns along with other feature attribute columns.
Consider a feature table of state polygons: each row represents a state, and the SHAPE column holds the polygon geometry of that state, as sketched below.
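A minimal, hypothetical rendering of such a table in Python (with the geometry written as WKT purely for readability, and rough, made-up boundary coordinates):

```python
# One row per feature; the SHAPE column carries the polygon geometry and sits
# alongside ordinary attribute columns. Coordinates are coarse placeholders.
state_rows = [
    {"OBJECTID": 1, "STATE_NAME": "Colorado",
     "SHAPE": "POLYGON ((-109.05 41.0, -102.05 41.0, -102.05 37.0, -109.05 37.0, -109.05 41.0))"},
    {"OBJECTID": 2, "STATE_NAME": "Wyoming",
     "SHAPE": "POLYGON ((-111.05 45.0, -104.05 45.0, -104.05 41.0, -111.05 41.0, -111.05 45.0))"},
]

for row in state_rows:
    print(row["STATE_NAME"], row["SHAPE"][:30], "...")
```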
This simple features model fits the SQL processing engine very well. Through the use of relational databases, we began to see GIS data scale to unprecedented sizes and numbers of users without degrading performance. We were beginning to leverage RDBMS for GIS data management.
Shapefiles became ubiquitous and, using ArcSDE, this simple features mechanism became the fundamental feature storage model in RDBMSs. (To support interoperability, ESRI was the lead author of the OGC and ISO simple features specification.)
Simple feature storage had clear advantages:
- The complete geometry for each feature is held in one record. No assembly is required.
- The data structure (physical schema) is very simple, fast, and scalable.
- It is easy for programmers to write interfaces.
- It is interoperable. Many wrote simple converters to move data in and out of these simple geometries from numerous other formats. Shapefiles were widely applied as a data use and interchange format.
Its main disadvantage was that the data integrity readily provided by topology was not as easy to implement for simple features. As a consequence, users applied one data model for editing and maintenance (such as coverages) and another for deployment (such as shapefiles or ArcSDE layers).
Users began to rely on this hybrid approach for editing and data deployment. For example, users would edit their data in coverages, CAD files, or other formats and then convert the data into shapefiles for deployment and use. Thus, even though the simple features structure was an excellent direct use format, it did not support topological editing and the management of shared geometry. Direct use databases held the simple structures, while another, topological form was used for editing. This had advantages for deployment, but the deployed data would become out of date and have to be refreshed. It worked, but there was a lag time for information update. Bottom line: topology was missing.
What GIS required, and what the geodatabase topology model now implements, is a mechanism that stores features using simple feature geometry yet enables topologies to be used on this simple, open data structure. This means that users can have the best of both worlds: a transactional data model that enables topological query, shared geometry editing, rich data modeling, and data integrity, and also a simple, highly scalable data storage mechanism based on open, simple feature geometry.
This direct use data model is fast, simple, and efficient. It can also be directly edited and maintained by any number of simultaneous users.
The topology framework in ArcGIS
In effect, topology has been treated as more than a data storage problem. The complete solution includes the following (a brief scripting sketch follows this list):
- A complete data model (objects, integrity rules, editing and validation tools, a topology and geometry engine that can process datasets of any size and complexity, and a rich set of topological operators, map display, and query tools)
- An open storage format using a set of record types for simple features and a topological interface to query simple features, retrieve topological elements, and navigate their spatial relationships (e.g., find adjacent areas and their shared edge, route along connected lines)
- The ability to provide the features (points, lines, and polygons) as well as the topological elements (nodes, edges, and faces) and their relationships to one another
- A mechanism that can support:
  - Massively large datasets with millions of features
  - Editing and maintenance by many simultaneous editors
  - Ready-to-use, always-available feature geometry
  - Topological integrity and behavior
- A system that is fast and scales to many users and many editors
- A system that is flexible and simple
- A system that leverages the RDBMS SQL engine and transaction framework
- A system that can support multiple editors, long transactions, historical archiving, and replication
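A minimal scripting sketch of how these pieces come together when a geodatabase topology is defined, assuming an ArcPy environment. The geodatabase, feature class, and rule names are hypothetical, and the geoprocessing tool signatures are given as I understand the Data Management toolbox; check the current documentation before relying on them.

```python
import arcpy

# Hypothetical feature dataset containing simple feature classes.
fds = r"C:\data\landbase.gdb\Cadastral"

# Create a topology on the feature dataset with a cluster tolerance.
arcpy.management.CreateTopology(fds, "Cadastral_Topology", 0.001)
topo = fds + r"\Cadastral_Topology"

# Register the simple feature classes that participate in the topology
# (the numeric arguments are the x,y and z coordinate ranks).
arcpy.management.AddFeatureClassToTopology(topo, fds + r"\Parcels", 1, 1)
arcpy.management.AddFeatureClassToTopology(topo, fds + r"\RoadCenterlines", 1, 1)

# Integrity rules are expressed against the features themselves,
# not against topological primitives.
arcpy.management.AddRuleToTopology(topo, "Must Not Overlap (Area)", fds + r"\Parcels")
arcpy.management.AddRuleToTopology(topo, "Must Not Have Gaps (Area)", fds + r"\Parcels")
arcpy.management.AddRuleToTopology(topo, "Must Not Have Dangles (Line)", fds + r"\RoadCenterlines")
```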
In a geodatabase topology, the validation process identifies shared coordinates between features (both in the same feature class and across feature classes). A clustering algorithm is used to ensure that the shared coordinates have the same location. These shared coordinates are stored as part of each feature's simple geometry.
This enables very fast and scalable lookup of topological elements (nodes, edges, and faces). It also has the advantage of working well with, and scaling alongside, the RDBMS's SQL engine and transaction management framework.
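The snippet below is a toy illustration of the clustering idea only (it is not ESRI's algorithm): vertices from different features that fall within a cluster tolerance of one another are snapped to a single shared location, so the shared coordinates stored in each feature's simple geometry end up identical.

```python
def snap_vertices(vertices, tolerance):
    """Snap near-coincident (x, y) vertices to a single representative location."""
    clusters = []   # representative coordinate for each cluster found so far
    snapped = []
    for x, y in vertices:
        for cx, cy in clusters:
            if abs(x - cx) <= tolerance and abs(y - cy) <= tolerance:
                snapped.append((cx, cy))    # reuse the shared location
                break
        else:
            clusters.append((x, y))         # start a new cluster
            snapped.append((x, y))
    return snapped

# Two vertices from adjacent parcels that should share an edge endpoint:
print(snap_vertices([(10.0, 5.0), (10.0004, 5.0003), (20.0, 5.0)], 0.001))
# [(10.0, 5.0), (10.0, 5.0), (20.0, 5.0)]
```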
During editing and update, as features are added, they are directly usable. The updated areas on the map, called "dirty areas," are flagged and tracked as updates are made to each feature class. At any time, users can choose to topologically analyze and validate the dirty areas to generate clean topology. Only the topology for the dirty areas needs rebuilding, saving processing time.
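In scripting terms, validation is a single call, again assuming an ArcPy environment and a hypothetical topology path; only the dirty areas need to be re-analyzed, which keeps validation after small edits fast.

```python
import arcpy

# Hypothetical path to the topology created earlier.
topo = r"C:\data\landbase.gdb\Cadastral\Cadastral_Topology"

# Re-run the cluster/validation analysis; only the flagged dirty areas
# require rebuilding, as described above.
arcpy.management.ValidateTopology(topo)
```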
The result is that topological primitives (nodes, edges, faces) and their relationships to one another and to their features can be efficiently discovered and assembled. This has several advantages:
- Simple feature geometry storage is used for features. This storage model is open, efficient, and scales to large sizes and numbers of users.
- This simple features data model is transactional and multiuser. By contrast, the older topological storage models do not scale and have difficulty supporting multiple editor transactions and numerous other GIS data management workflows.
- Geodatabase topologies fully support all the long transaction and versioning capabilities of the geodatabase. Geodatabase topologies need not be tiled, and many users can simultaneously edit the topological database—even their individual versions of the same features if necessary.
- Feature classes can grow to any size (hundreds of millions of features) with very strong performance.
- This topology implementation is additive. You can typically add a topology to an existing schema of spatially related feature classes. The alternative would be to redefine and convert all of your existing feature classes to new data schemas holding topological primitives.
- There need only be one data model for geometry editing and data use, not two or more.
- It is interoperable because all feature geometry storage adheres to simple features specifications from the OpenGIS Consortium and ISO.
- Data modeling is more natural because it is based on user features (such as parcels, streets, soil types, and watersheds) instead of topological primitives (such as nodes, edges, and faces). Users will begin to think about the integrity rules and behavior of their actual features instead of the integrity rules of the topological primitives. For example, how do parcels behave? This will enable stronger modeling for all kinds of geographic features. It will improve our thinking about streets, soil types, census units, watersheds, rail systems, geology, forest stands, landforms, physical features, and on and on.
- Geodatabase topologies provide the same information content as persisted topological implementations—either you store a topological line graph and discover the feature geometry (like ArcInfo coverages) or you store the feature geometry and discover the topological elements and relationships (like geodatabases).
In cases where users want to store the topological primitives themselves, it is easy to create the primitives and their relationships and post them to tables for various analytic and interoperability purposes (for example, users who want to post their features into an Oracle Spatial warehouse, which stores tables of topological primitives).
At a pragmatic level, the ArcGIS topology implementation works. It scales to extremely large geodatabases and multiuser systems without loss of performance. It includes rich validation and editing tools for building and maintaining topologies in geodatabases. It includes rich and flexible data modeling tools that enable users to assemble practical, working systems on file systems, in any relational database, and on any number of schemas.