Metadata education suggestions and materials for: Data Types
Many different types of data are associated with geographic information technologies (GIS, GPS, remote sensing). The power of GIS, in particular, lies in its ability to integrate data of many different types for analysis. For this integration to take place, however, the GIS needs a description of the data's characteristics, at the very least its organization and reference information. GIS software is typically designed to handle data in several popular structures, or data models (such as vector and raster), and can convert between different software formats. However, there is a fundamental difference between whether a GIS can accept certain data (a matter of format) and whether that data is acceptable for a given use. For a GIS to accept data, one set of information must be known: organization and reference system. For data to be acceptable, another set of information, much more difficult to collect, should be known: quality (accuracy, completeness, consistency) and lineage (sources, development process). Metadata is critical for obtaining this information about different data types in order to determine whether they are acceptable for integration with each other.
Data types may be divided into two broad categories: geospatial (referenced to the earth) and non-geospatial. Non-geospatial data typically includes attributes associated with geospatial data, as well as other databases, graphics, and reports, not to mention the metadata itself, which is also considered a form of data.
Basic types of geospatial data can be described using a metaphor from the construction industry. A building must have a solid foundation of concrete or other material. Then a framework of wood or steel beams is connected to the foundation to create a structure to support the remainder of the building. The National Spatial Data Infrastructure (NSDI) consists of the following three "foundation" geospatial databases: 1) geodetic control, 2) elevation and 3) digital orthorectified imagery. Foundation spatial data are the minimal directly observable or recordable data from which other spatial data are referenced and sometimes compiled. The NSDI also incorporates the following four "framework" data themes: transportation, hydrography, governmental units (boundaries), and cadastral information (ownership units and/or parcel information). Finally, there are numerous other themes of spatial information that are collected or derived for specific purposes including cultural and demographic data, vegetation, wetland, soils, and geology to name a few. The importance of the foundation spatial datasets is that they provide a basis for registration of all other spatial data, therefore making it much easier to utilize and share the spatial information.
Though efforts exist to produce foundation datasets according to standards (see the Federal Geographic Data Committee's standards page), it may be some years before standardized foundation datasets, to which other datasets can be referenced, are readily available for all areas of the U.S. In the meantime, existing datasets or datasets under development (outside of rigorously controlled federal data production programs) are usually not produced according to standards. As a result, metadata is of critical importance for evaluating and understanding the wide variety of data types and the even wider variety of procedures used to produce them. Metadata is the first step in evaluating whether a dataset meets the criteria (area, scale, quality, time period, spatial reference, spatial organization, attributes) necessary for it to be integrated with other datasets for management and analysis. Without metadata, the time required to find, obtain, import, explore and test each potential dataset to determine its fitness-for-use would be prohibitive.
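The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MetadataSummary` fields, the `screen` function, and the sample values are all hypothetical, and only four of the criteria (area, scale, time period, spatial reference) are checked, since quality and attribute suitability usually call for human judgment.

```python
from dataclasses import dataclass


@dataclass
class MetadataSummary:
    """A few fields copied from a metadata record (field names are illustrative)."""
    title: str
    west: float
    east: float
    south: float
    north: float
    scale_denominator: int  # 24000 means a source scale of 1:24,000
    begin_year: int
    end_year: int
    datum: str


def screen(meta, area, max_scale_denominator, year, datum):
    """Return the reasons a dataset fails screening; an empty list means it passes."""
    west, east, south, north = area
    problems = []
    # Area: the dataset's bounding coordinates must overlap the study area.
    if meta.east < west or meta.west > east or meta.north < south or meta.south > north:
        problems.append("does not cover the study area")
    # Scale: a larger denominator means a coarser source.
    if meta.scale_denominator > max_scale_denominator:
        problems.append("source scale is too coarse")
    # Time period: the year of interest must fall inside the stated range.
    if not meta.begin_year <= year <= meta.end_year:
        problems.append("time period does not include the year of interest")
    # Spatial reference: flag a datum mismatch so a conversion can be planned.
    if meta.datum != datum:
        problems.append("datum differs from the project datum")
    return problems


# Hypothetical dataset and study area (west, east, south, north), in decimal degrees.
roads = MetadataSummary("County roads", -106.0, -105.0, 41.0, 42.0,
                        100000, 1990, 1995, "NAD27")
study_area = (-105.5, -105.2, 41.2, 41.4)

for reason in screen(roads, study_area, 24000, 1999, "NAD83"):
    print(reason)
```

Returning a list of reasons, rather than a single yes or no, keeps the result useful: a dataset that fails only on datum may still be worth obtaining and converting.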
Even among the three foundation datasets (geodetic control, elevation and orthoimagery), there is a vast difference in data characteristics. These are three fundamentally different data types. Geodetic control points are marked locations of known position, with very high levels of horizontal and vertical accuracy. Elevation data is typically derived by algorithms from topographic map contours, and orthoimagery is derived from scanned aerial photographs in a process that involves rectification (removing errors of distortion) and registration. The sources, procedures, and final data structures for these three datasets illustrate how widely the available data types differ. The range of differences expands even more when derived or modeled datasets are taken into consideration, for instance deriving vegetation datasets from satellite imagery, depth-to-groundwater contours from well information, socioeconomic patterns from census data and census units, or customer locations from road and address information.
Nine of the 12 public uses of spatial data require geocoded address files (NAPA 1998). Address information is important to assessors, appraisers, real estate agents, 911 services, mortgage lenders, redistricting efforts, and other users. In fact, the billion-dollar business geographics industry is founded on the concept that an address can be assigned to topologically correct geographic coordinates and that the address can be used to navigate to the correct location. Despite this apparently high use of address-matched data, a search performed on the NSDI clearinghouse in January 2000 (with 182 different clearinghouse nodes available) returned only three instances of metadata containing reference to address-matched datasets, and none of these metadata records contained specifics on the procedures used. There are many different methods and algorithms for arriving at geo-referenced address-matched data. Characteristics of this data type are further complicated by the fact that address-matched data is derived data: its quality depends on the quality of the source data (road networks and address attributes) used to create it. Data created by address-matching should have metadata describing its sources in detail, the routine and parameters used for the address-matching process, the methods used to correct errors and unsuccessful matches, and, ideally, the evaluation procedures used to determine the quality of the resulting data.
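To make the dependence on source data concrete, here is a minimal sketch of one common address-matching approach: linear interpolation of a house number along a street segment that carries an address range. It is one of many possible methods, not the one used by any particular dataset or product. The function name, its parameters, and the sample segment are hypothetical, and real routines also handle details ignored here, such as odd and even sides of the street and offsets from the centerline.

```python
def interpolate_address(house_number, from_number, to_number, start_xy, end_xy):
    """Place a house number along a straight street segment by linear interpolation.

    from_number and to_number are the address range stored on the segment;
    start_xy and end_xy are the coordinates of the segment's two ends.
    """
    low, high = sorted((from_number, to_number))
    if not low <= house_number <= high:
        raise ValueError("house number is outside the segment's address range")
    if from_number == to_number:
        fraction = 0.5  # single-address segment: assume the midpoint
    else:
        fraction = (house_number - from_number) / (to_number - from_number)
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical segment: addresses 100 to 200 along a line 100 units long.
location = interpolate_address(150, 100, 200, (0.0, 0.0), (100.0, 0.0))
print(location)
```

The computed location is only as good as the segment geometry and the address range it is built from, which is why the metadata should record both the sources and the parameters of the matching routine.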
Basic information about the dataset (abstract, purpose) is found in the Identification Information section of metadata. This section should also contain a description of the area the data covers (if geospatial) and the area's bounding coordinates. It also contains a description (date or range of dates) of the time period covered by the dataset.
Data Quality Information, Spatial Organization Information and Spatial Reference Information are three major sections of metadata that contain vital information about the data's characteristics. In addition, the Entity/Attribute Information and Distribution Information sections may also contain relevant data.
Spatial Organization tells you how the data is organized: what data model it is stored in (the most common models are vector and raster). For vector data, it gives the number of features (points, lines, polygons) in the dataset; for raster (pixel) data, it gives the number of rows and columns and the cell size. For a more detailed discussion of the importance of this information, see the topic on Data models.
Example metadata for Spatial Organization:
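As a purely hypothetical illustration, the kind of information held in this section could be represented as in the Python sketch below. The values are invented and describe no real dataset, and the key names are chosen for readability rather than taken from any metadata standard. The `describe` helper is also an assumption, included only to show how the counts and cell size can be read.

```python
# Hypothetical values; the key names are illustrative, not standard element names.
vector_layer = {
    "data_model": "vector",
    "feature_type": "polygon",
    "feature_count": 1250,
}
raster_layer = {
    "data_model": "raster",
    "rows": 400,
    "columns": 600,
    "cell_size": 30.0,  # ground units per cell side, for example meters
}


def describe(layer):
    """Summarize a spatial organization record in one short phrase."""
    if layer["data_model"] == "vector":
        return f"{layer['feature_count']} {layer['feature_type']} features"
    # For raster data, rows, columns and cell size together give the ground coverage.
    width = layer["columns"] * layer["cell_size"]
    height = layer["rows"] * layer["cell_size"]
    return (f"{layer['rows']} x {layer['columns']} cells covering "
            f"{width:g} by {height:g} ground units")


print(describe(vector_layer))
print(describe(raster_layer))
```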
Spatial Reference tells you what coordinate system (reference system and geodetic datum) the data is stored in. There are hundreds of different projections and regional grid coordinate systems, most of them with varying parameters. For data to be accurately integrated, all of it must be in the same system and datum, with the same parameters. For a more detailed discussion of the importance of this information, see the topic on Projections and Scale.
Example metadata for Spatial Reference:
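Again as a hypothetical illustration, with invented values and key names that are not taken from any metadata standard, two spatial reference records can be compared field by field before datasets are combined. The `reference_differences` function is an assumption of this sketch.

```python
def reference_differences(a, b):
    """List the keys on which two spatial reference records disagree."""
    keys = sorted(set(a) | set(b))
    return [key for key in keys if a.get(key) != b.get(key)]


# Hypothetical records for two layers that are to be overlaid.
parcels = {"system": "UTM", "zone": 13, "datum": "NAD27", "units": "meters"}
streams = {"system": "UTM", "zone": 13, "datum": "NAD83", "units": "meters"}

print(reference_differences(parcels, streams))
```

Here the two layers share a system, zone and units but differ in datum, so one of them would need to be converted before the layers could be accurately integrated.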
Data Quality tells you how accurate, complete, and consistent the data is, and describes its lineage (sources and procedures used in development). This information is probably the most important, though most often overlooked, part of data characteristics. It is overlooked because getting data to "work" in a GIS requires information about spatial organization and reference, but not about data quality and lineage. Please see the topic on Data quality for a discussion of evaluating a dataset's fitness-for-use.
Entity/Attribute Information describes the database structure of the data. Most GIS data, or at least the attribute information associated with it, is stored in a relational database, though object-oriented databases and data structures may become more common in the future. Most datasets do not come with readily interpretable attributes such as "Middle Road, two lane county road, asphalt"; instead, the attributes are often coded and normalized (for instance, each segment of Middle Road is given a code, which is then referenced in a separate table with the full name and description). This section of metadata describes the different tables, table attributes (fields) and codes needed to interpret and use the database. In addition, some high-quality datasets contain information about the attributes themselves (who collected the information, when it was collected, how it was collected). This kind of information is often referred to as "feature-level metadata". For more examples of Entity/Attribute metadata, as well as a description of how this section of metadata is organized, see the topic on database principles.
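The coded and normalized arrangement described above can be sketched as follows. The table contents, the codes (`R017`, `R042`), and the `decode` function are hypothetical, and plain Python dictionaries stand in for relational tables.

```python
# Hypothetical segment table: each road segment carries only a code.
segments = [
    {"segment_id": 1, "road_code": "R017"},
    {"segment_id": 2, "road_code": "R017"},
    {"segment_id": 3, "road_code": "R042"},
]

# Hypothetical lookup table: each code is described once.
road_lookup = {
    "R017": {"name": "Middle Road", "description": "two lane county road, asphalt"},
}


def decode(segment, lookup):
    """Attach the name and description for a segment's code, if the code is known."""
    entry = lookup.get(segment["road_code"])
    if entry is None:
        # A code missing from the lookup table cannot be interpreted.
        return {**segment, "name": None, "description": None}
    return {**segment, **entry}


for segment in segments:
    print(decode(segment, road_lookup))
```

Without the lookup table, or metadata describing it, a code such as the third segment's cannot be interpreted, which is the situation Entity/Attribute metadata is meant to prevent.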
Distribution Information describes how the dataset is transferred: the transfer format and sometimes the transfer size. This can be useful because some GIS software can accept data only in certain formats.