Metadata Education Project

Metadata education suggestions and materials for:

Data Quality / Error and Uncertainty



Learning Outcomes

Conviction

Motivation

Skills

General skills:

Knowledge


Preparatory topics:


Complementary topics:


Vocabulary

Vocabulary definitions

General:


Material for this topic


The importance of documentation in data quality

There has always been awareness of the fact that the quality of data varies, especially in terms of its accuracy and precision, but until recently there has been very little consideration of the actual effect of these factors on GIS solutions. In part, this is because "GIS are quite capable of lulling the user into a false sense of accuracy and precision unwarranted by the actual data.... It is now generally recognized that error, inaccuracy and imprecision can make or break many types of GIS projects... every time a new dataset is imported, the GIS also inherits its errors. These may combine with the errors already in the database in unpredictable ways." (GC notes).

"The value of any geographic data set depends less on its cost, and more on its fitness for a particular purpose. A critical measure of that fitness is data quality. When used in GIS analysis, a dataset's quality significantly affects confidence in the results. Unknown data quality leads to tentative decisions, increased liability and loss of productivity. Decisions based on data of known quality are made with greater confidence and are more easily explained and defended." (LMIC 1999).

Two examples of how datasets of unknown data quality affected GIS projects:

Detecting and avoiding errors and error propagation in GIS is the responsibility of both data producers and data users. Data producers need to become more responsible in documenting their datasets; likewise, data users need to acquire the ability to interpret the documentation correctly to determine the dataset's fitness-for-use.

Scientific research in the fields of chemistry, biology, physics, geology and other sciences has always involved careful documentation of methods, materials, and measurements used. For instance, all principal investigators in the US Global Change Research Program (USGCRP) are required to submit: (Beard 1996)

Documentation in scientific research is assumed because of the necessity of repetition and exchange of information that frequently occurs. Data collected via GIS, remote sensing, or GPS technologies for spatial analysis should be no less rigorously documented. If GIS data is created and used by a city planner for strategic planning, is it fundamentally less "scientific" than data collected for a scientific research project? The gray line dividing scientific analysis from analysis for management or planning purposes is further blurred by the increasingly common practice of integrating scientific models within the GIS analysis environment. This adds a new dimension to the abstraction of real world features and processes into a GIS database. Without splitting hairs on the semantics of the term "science", data quality and documentation should be as much of an issue in the GIS realm as they have been in traditional scientific realms.


Definitions and Components of Data Quality

Definitions
There are several different approaches to defining data quality:

Components

The components are taken from the Spatial Data Quality section of the Spatial Data Transfer Standard (Federal Information Processing Standard 173, Dept. of Commerce). These components conform to the metadata elements of the Data Quality section of the FGDC's Content Standard for Digital Geospatial Metadata (CSDGM). Conceptual and Temporal accuracy are not specifically included in the CSDGM, though they should be noted as necessary.

Conceptual Accuracy

"GIS depend upon the abstraction and classification of real-world phenomena. The user determines what amount of information is used and how it is classified into appropriate categories. Sometimes users may use inappropriate categories or misclassify information." (from GC notes)

Temporal Accuracy

Geographical data includes the three dimensions of space, time, and theme (where-when-what). The Data Quality section of the FGDC's Content Standard for Digital Geospatial Metadata includes elements for describing positional accuracy (where) and attribute or thematic accuracy (what), but no specific element for recording temporal accuracy. Temporal coordinates are often only implicit in geographical data, e.g., a time stamp indicating that the entity was valid at some time. Often this is applied to the entire database (e.g., a map dated 1995) (NCGIA CC notes). Temporal information for the dataset is recorded in the Identification Information section under the sub-section Time Period of Content, in which a single date/time or a range of dates/times can be recorded. In addition, this sub-section contains an element, Currentness Reference, which indicates whether the dates/times refer to ground condition (for example, the date/time an aerial photograph was taken) or to publication date (the date a map or dataset was published). The Lineage sub-section of the metadata, found within the Data Quality section, contains more information that indirectly relates to temporal accuracy: each source contributing to the dataset has its own Time Period of Content sub-section. Often a published date is given for the dataset as a whole, but a more realistic description of its temporal content is contained within the individual sources.
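As an illustration of where these temporal elements sit in a metadata record, here is a minimal Python sketch. It assumes the record is stored as XML using the CSDGM short tag names (idinfo, timeperd, timeinfo, current, dataqual, lineage, srcinfo, srctime, srccurr); the sample record, its dates, and the function names are hypothetical and are not taken from any real dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical record trimmed to its temporal elements. The tag names are
# assumed to be the CSDGM short names as commonly serialized in XML.
SAMPLE_RECORD = """<metadata>
  <idinfo>
    <timeperd>
      <timeinfo><sngdate><caldate>1995</caldate></sngdate></timeinfo>
      <current>publication date</current>
    </timeperd>
  </idinfo>
  <dataqual>
    <lineage>
      <srcinfo>
        <srctime>
          <timeinfo>
            <rngdates><begdate>1989</begdate><enddate>1992</enddate></rngdates>
          </timeinfo>
          <srccurr>ground condition</srccurr>
        </srctime>
      </srcinfo>
    </lineage>
  </dataqual>
</metadata>"""


def read_dates(timeinfo):
    """Return (begin, end) for a single date or a date range, else None."""
    if timeinfo is None:
        return None
    single = timeinfo.findtext("sngdate/caldate")
    if single is not None:
        return (single.strip(), single.strip())
    begin = timeinfo.findtext("rngdates/begdate")
    end = timeinfo.findtext("rngdates/enddate")
    if begin is None or end is None:
        return None
    return (begin.strip(), end.strip())


def temporal_summary(xml_text):
    """Collect dataset-level and per-source time periods with their currentness."""
    root = ET.fromstring(xml_text)
    dataset = (
        read_dates(root.find("idinfo/timeperd/timeinfo")),
        (root.findtext("idinfo/timeperd/current") or "").strip(),
    )
    sources = [
        (
            read_dates(src.find("srctime/timeinfo")),
            (src.findtext("srctime/srccurr") or "").strip(),
        )
        for src in root.findall("dataqual/lineage/srcinfo")
    ]
    return {"dataset": dataset, "sources": sources}


summary = temporal_summary(SAMPLE_RECORD)
print("Dataset:", summary["dataset"])
for source in summary["sources"]:
    print("Source: ", source)
```

In this made-up record the dataset carries a 1995 publication date while its only source reflects ground conditions from 1989 to 1992, which is the kind of gap the Lineage sub-section can reveal.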


Sources of inaccuracy, imprecision and other "fitness-for-use" issues

A statement of estimated accuracy is an obvious element of data quality to look for, but others are less obvious or more difficult to discern. Furthermore, a dataset's "quality" is not always the same thing as its "fitness-for-use", since data of a particular quality may be acceptable for some uses and not for others. Nor are the elements that should be considered when evaluating a dataset's quality and fitness-for-use always found under the Data Quality section.

Check the Data Quality information, Positional Accuracy sub-section for this information.

(modified from GC notes)


Skills: assessing and describing data quality

The FGDC's Content Standard for Digital Geospatial Metadata contains a section for describing the data quality of a dataset, with these sub-sections:

These sub-sections are typically used to describe the dataset as a whole. The new ISO standard for geospatial metadata (in draft form as of 2/2000) has the same sub-sections for describing data quality, but in addition it will allow separate data quality descriptions for data series, datasets, and features of the dataset. For instance, some features of the dataset may have been collected at a certain time using a certain method, with a specific data quality description. At a later time, more features may have been added using different methods, resulting in a different data quality description for attribute accuracy, though positional accuracy might remain the same. The ISO standard allows for a more structured description of different data quality aspects based on data granularity (the dataset as a whole, or different portions or even individual features of the dataset).

Attribute or Thematic Accuracy: There are three methods of determining accuracy:

Examples:

Horizontal Positional Accuracy:
The National Standard for Spatial Data Accuracy (NSSDA) was approved in 1998 to address the growing need for quality spatial data and to provide a common language for reporting accuracy.

"How the positional accuracy of map features is best estimated has been debated since the early days of cartography. Until recently, existing accuracy standards such as the National Map Accuracy Standards focused on testing paper maps, not digital data. The NSSDA is one in a suite of standards dealing with the accuracy of geographic datasets, and is one of the most recent standards to be issued by the Federal Geographic Data Committee" (LMIC 1999).

There are seven steps in applying the NSSDA:

  1. Determine if the test involves horizontal accuracy, vertical accuracy, or both
  2. Select a set of test points from the dataset to be evaluated
  3. Select an independent dataset of higher accuracy that corresponds to the dataset being tested
  4. Collect measurements from identical points from each of these two sources
  5. Calculate a positional accuracy statistic using either the horizontal or vertical accuracy statistic worksheet
  6. Prepare an accuracy statement in a standardized report form
  7. Include that report in a comprehensive description of the dataset called metadata

At least twenty points are required to conduct a statistically significant accuracy evaluation at the 95% confidence level. Coordinate values for the test points are collected from both the test dataset and the independent dataset, and three statistics are computed for each pair of points:

The result of applying the NSSDA to a dataset is an accuracy statement such as: "Tested 0.181 meters horizontal accuracy at 95% confidence interval using the NSSDA." In the horizontal positional accuracy section of the metadata, it is also important to describe the independent dataset used for comparison, its source(s) and associated accuracy. For more information about using the NSSDA, see LMIC 1999.
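The horizontal calculation can be sketched in a few lines of Python. This is a minimal sketch and not a reproduction of the LMIC worksheet: it assumes the per-point statistics are the squared differences in x and in y and their sum, and it uses the NSSDA multiplier of 1.7308 applied to the radial root-mean-square error, which the standard gives for the case where the errors in x and y are of similar magnitude. The function name and the sample coordinates are hypothetical.

```python
import math


def nssda_horizontal_accuracy(test_points, reference_points):
    """Return the NSSDA horizontal accuracy statistic (95% confidence level).

    Both arguments are sequences of (x, y) coordinates for the same points,
    in the same order and in the same ground units.
    """
    if len(test_points) != len(reference_points):
        raise ValueError("each test point needs a matching reference point")
    if len(test_points) < 20:
        raise ValueError("at least 20 points are required")

    sum_squared_error = 0.0
    for (x_test, y_test), (x_ref, y_ref) in zip(test_points, reference_points):
        dx_squared = (x_test - x_ref) ** 2     # squared difference in x
        dy_squared = (y_test - y_ref) ** 2     # squared difference in y
        sum_squared_error += dx_squared + dy_squared

    # Radial root-mean-square error over all point pairs.
    rmse_r = math.sqrt(sum_squared_error / len(test_points))

    # 1.7308 is the NSSDA factor for the case where RMSE in x and y are
    # (approximately) equal; this sketch does not check that assumption.
    return 1.7308 * rmse_r


# Hypothetical data: 20 reference points, each test point offset by (0.3, 0.4).
reference = [(float(i), float(i)) for i in range(20)]
test = [(x + 0.3, y + 0.4) for x, y in reference]
print(round(nssda_horizontal_accuracy(test, reference), 3))
```

Every made-up test point here is 0.5 units from its reference position, so the radial RMSE is 0.5 and the reported statistic is 1.7308 times that value.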

In some cases, independent datasets of higher accuracy are not available to perform the NSSDA accuracy calculations. Additional data may be collected for accuracy assessment, involving field work and/or GPS collection. If resources do not exist for collecting additional data for the purposes of accuracy assessment, the Spatial Data Transfer Standard describes three alternatives for determining positional accuracy:

Here are some factors to consider when using the deductive estimate to describe positional accuracy, based on production steps:

Positional accuracy is less of an issue for some datasets. For instance, the 1:100,000-scale land cover map for Wyoming was interpreted from satellite imagery with a specific minimum mapping unit (100 ha). Because 100 ha units are too generalized to correspond to actual vegetation boundaries, the positional accuracy of the individual polygon shapes was not evaluated. However, the overall accuracy of the dataset was still evaluated to make sure that the boundaries of the dataset were within the acceptable range of accuracy for a 1:100,000-scale map.

Examples:

Logical Consistency: does not usually apply to point data or raster data. For line or polygon data, check for these types of errors:

Completeness:
Completeness can be measured in space, time, or theme. Consider a database of buildings in Minnesota that have been placed on the National Register of Historic Places as of the end of 1995 (NCGIA CC notes).

Lineage:
This section includes information on the sources used and the processes used to create or modify the data. It is important for determining source scales / resolution, density of observations / sampling methods, interpolation or classification methods, and many other types of processing functions that could affect the data's quality or fitness-for-use.

Examples of source information:

Examples of process steps:


Example exercises for incorporating metadata

Web-exercise: have students search the web (for instance, using the NSDI Geospatial Data Clearinghouse) for three examples of metadata documents. In a paper or as a class discussion, have them discuss the strengths and weaknesses of the information in the Data Quality section. What kind of tests (if any) were used to determine the data's accuracy? What procedures could be used to determine an estimate of accuracy, if one does not exist? Is the development procedure of the data clearly described in the Process Steps, or is the section just a collection of commands? Would a potential data user be able to determine the data's fitness-for-use given the information in the Data Quality section?
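Students comfortable with scripting could start the review with a quick check of which Data Quality sub-sections a record actually fills in. The Python sketch below is one possible approach, assuming the metadata is available as XML with the CSDGM short tag names (dataqual, attracc, logic, complete, posacc, lineage); the sample record and the function name are hypothetical. It only reports whether a sub-section contains any text, so it cannot stand in for reading and judging what that text says.

```python
import xml.etree.ElementTree as ET

# Data Quality sub-sections mapped to their assumed CSDGM short tag names.
DATA_QUALITY_PARTS = {
    "Attribute Accuracy": "attracc",
    "Logical Consistency": "logic",
    "Completeness": "complete",
    "Positional Accuracy": "posacc",
    "Lineage": "lineage",
}

# Hypothetical record: attribute accuracy is absent and completeness is empty.
SAMPLE_RECORD = """<metadata>
  <dataqual>
    <logic>Polygon topology present.</logic>
    <complete></complete>
    <posacc><horizpa><horizpar>Not assessed.</horizpar></horizpa></posacc>
    <lineage>
      <procstep><procdesc>Digitized from source map.</procdesc></procstep>
    </lineage>
  </dataqual>
</metadata>"""


def data_quality_coverage(xml_text):
    """Report, per sub-section, whether the record contains any text for it."""
    root = ET.fromstring(xml_text)
    dataqual = root.find("dataqual")
    coverage = {}
    for label, tag in DATA_QUALITY_PARTS.items():
        element = dataqual.find(tag) if dataqual is not None else None
        # Join the text of the element and all of its descendants.
        text = "".join(element.itertext()).strip() if element is not None else ""
        coverage[label] = bool(text)
    return coverage


for label, filled in data_quality_coverage(SAMPLE_RECORD).items():
    print(f"{label}: {'present' if filled else 'missing or empty'}")
```

A "present" result for Positional Accuracy in this sample still corresponds to the text "Not assessed.", which is exactly the kind of weakness the discussion questions are meant to surface.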

Apply the National Standard for Spatial Data Accuracy (NSSDA): provide a dataset for students to evaluate for positional accuracy, and an independent dataset of higher accuracy as a basis of comparison. The Minnesota Land Management Information Center has a handbook on applying the NSSDA as well as worksheets for calculating the statistics. Choosing appropriate data for this exercise is important; the NSSDA is not readily applied to all datasets. For instance, it can be very difficult to identify test points for comparison on vegetation datasets. Datasets with distinct points, corners or intersections are the best candidates for this type of exercise. Digital orthophotos produced by the USGS are good candidates for the independent dataset, since these digital photos have identifiable road intersections and are required to conform to National Map Accuracy Standards at 1:12,000 scale, or +/- 33.3 feet. Require students to look up the positional accuracy statement of the independent dataset in its metadata. In addition, students can be required to write a positional accuracy statement for the test dataset based on their results. This is also a good exercise for classes incorporating GPS exercises, as students could collect the minimum 20 points required for the comparison using differentially-corrected GPS locations.
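The +/- 33.3 feet figure follows from the National Map Accuracy Standards tolerance of 1/30 inch at map scale, which applies to maps at scales larger than 1:20,000 (1/50 inch applies at 1:20,000 and smaller). A short Python sketch of that arithmetic, with a hypothetical function name:

```python
def nmas_horizontal_tolerance_feet(scale_denominator):
    """Ground distance in feet matching the NMAS horizontal tolerance.

    NMAS allows 1/30 inch at map scale for scales larger than 1:20,000
    and 1/50 inch for scales of 1:20,000 or smaller.
    """
    map_inches = 1 / 30 if scale_denominator < 20000 else 1 / 50
    ground_inches = map_inches * scale_denominator
    return ground_inches / 12  # 12 inches per foot


print(round(nmas_horizontal_tolerance_feet(12000), 1))   # 1:12,000 orthophoto
print(round(nmas_horizontal_tolerance_feet(24000), 1))   # 1:24,000 map
```

At 1:12,000, 1/30 inch on the map is 400 inches on the ground, or about 33.3 feet. Note that NMAS is a pass/fail threshold on 90% of tested points, so this value is not directly comparable to an NSSDA statistic reported at the 95% confidence level.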

For more exercise ideas, see the example exercises in the topic on Data Sources (determining fitness-for-use).


Advanced data quality issues: handling error

Under construction


References and additional sources

Beard, K. 1996. A structure for organizing metadata collection. National Center for Geographic Information and Analysis, Maine.

Minnesota Land Management Information Center (LMIC). 1999. Positional Accuracy Handbook: Using the National Standard for Spatial Data Accuracy to measure and report geographic data quality.

Veregin, H. 1998. Data Quality Measurement and Assessment from NCGIA's Core Curriculum for Geographic Information Science, posted March 23, 1998.

Error, Accuracy and Precision from the Geographer's Craft

Handling Uncertainty from the Geographer's Craft

