Metadata education suggestions and materials for: Data Quality / Error and Uncertainty
Learning Outcomes
Conviction
- Always question data quality: garbage in = garbage out
Motivation
- Know ahead of time the error/uncertainty associated with data sets before investing time/money in using them or updating them
Skills
General skills:
- Where to find information about data quality in the metadata content standard
- How to evaluate the information in the Data Quality section: is enough information provided or are additional quality tests required?
Advanced skills:
- How to enter Data Quality Information including sources and process steps in the metadata content standard
- How to create data with feature-level metadata containing information about sources and quality of sources
Knowledge
- Knowledge of sources of error in spatial data
- Knowledge of techniques for data quality assessment, both quantitative and qualitative
- Awareness of difference between quality control and truth-in-labeling paradigms
- Awareness of case examples where insufficient information about data quality resulted in inappropriate use or faulty analysis
- Awareness of different types of accuracy (conceptual, spatial, temporal and thematic)
- Awareness of techniques for handling error, such as feature-level metadata, and sensitivity analysis
- Awareness of potential for software to automatically read metadata and communicate data quality visually to users/decision makers
Preparatory topics:
Complementary topics:
Vocabulary
General:
- Positional Accuracy
- Attribute Accuracy
- Logical Consistency
- Completeness
- Precision
- Content Standard for Digital Geospatial Metadata (CSDGM)
- National Standard for Spatial Data Accuracy (NSSDA)
- ISO
Material for this topic
The importance of documentation in data quality
There has always been awareness of the fact that the quality of data varies, especially in terms of its accuracy and precision, but until recently there has been very little consideration of the actual effect of these factors on GIS solutions. In part, this is because "GIS are quite capable of lulling the user into a false sense of accuracy and precision unwarranted by the actual data.... It is now generally recognized that error, inaccuracy and imprecision can make or break many types of GIS projects... every time a new dataset is imported, the GIS also inherits its errors. These may combine with the errors already in the database in unpredictable ways." (GC notes).
"The value of any geographic data set depends less on its cost, and more on its fitness for a particular purpose. A critical measure of that fitness is data quality. When used in GIS analysis, a dataset's quality significantly affects confidence in the results. Unknown data quality leads to tentative decisions, increased liability and loss of productivity. Decisions based on data of known quality are made with greater confidence and are more easily explained and defended." (LMIC 1999).
Two examples of how datasets of unknown data quality affected GIS projects:
- Wyoming well locations: unknown accuracy caused delays in GIS project
- Minnesota GPS locations: inaccurate positions resulted in confusion over which of two datasets compared was actually in error
Detecting and avoiding errors and error propagation in GIS is the responsibility of both data producers and data users. Data producers need to become more responsible in documenting their datasets; likewise, data users need to acquire the ability to interpret the documentation correctly to determine the dataset's fitness-for-use.
Scientific research in the fields of chemistry, biology, physics, geology and other sciences has always involved careful documentation of methods, materials, and measurements used. For instance, all principal investigators in the US Global Change Research Program (USGCRP) are required to submit (Beard 1996):
- pre-data collection plans that describe in detail the sampling, collection, and analysis methodologies
- a detailed inventory of all measurements actually made along with documentation of the measurement techniques used to produce the data
- an estimate of the accuracy and precision of each measurement along with procedures used to correct errors, remove noise, or otherwise modify the data
- documentation of the physical setting of the location under study
- correction and improvements in data made subsequent to analysis conclusions or final submission of data
Documentation in scientific research is assumed because of the necessity of repetition and exchange of information that frequently occurs. Data collected via GIS, remote sensing, or GPS technologies for spatial analysis should be no less rigorously documented. If GIS data is created and used by a city planner for strategic planning, is it fundamentally less "scientific" than data collected for a scientific research project? The gray line dividing scientific analysis from analysis for management or planning purposes is further blurred by the increasingly common practice of integrating scientific models within the GIS analysis environment. This adds a new dimension to the abstraction of real world features and processes into a GIS database. Without splitting hairs on the semantics of the term "science", data quality and documentation should be as much of an issue in the GIS realm as they have been in traditional scientific realms.
Definitions and Components of Data Quality
Definitions
There are several different approaches to defining data quality:
- Traditionally, data quality is a description of the lineage, completeness and consistency and the accuracy and precision of the data
- Conformance to expectations: data quality can be described based on arbitrary thresholds. "What is good enough for our needs"
- Following established procedures: standards for quality of geodetic locations are based on conformance to procedures (as with previous definition) that have been formalized
- Truth in labeling: the data producer provides as much information as possible about the data, but it is the data user's responsibility to decide the data's quality or fitness-for-use
Components
From the Spatial Data Quality section of the Spatial Data Transfer Standard (Federal Information Processing Standard 173, Dept. of Commerce). These components conform to the metadata elements of the Data Quality section of the FGDC's Content Standard for Digital Geospatial Metadata (CSDGM). Conceptual and Temporal accuracy are not specifically included in the CSDGM, though they should be noted as necessary.
- Lineage: narrative of source materials used and procedures used to produce the product
- Positional accuracy: description of the expected error or range of error. Error encompasses both inaccuracy and imprecision
- Accuracy: difference between measured quantities and "true" or accepted values and how these values were arrived at
- Precision: number of significant digits used to record measurements, or exactness
- Attribute accuracy: error as reported by percentage of correct/incorrect label values or a misclassification matrix, usually based on a sample of the data
- Logical consistency: description of the fidelity of relationships encoded in the data structure, such as permissible values, occurrences (no duplications) and relationships (e.g. intersections, overshoots, undershoots)
- Completeness: selection criteria used (exhaustive, or generalized) and thresholds used such as minimum mapping units (area and/or width)
Conceptual Accuracy
"GIS depend upon the abstraction and classification of real-world phenomena. The user determines what amount of information is used and how it is classified into appropriate categories. Sometimes users may use inappropriate categories or misclassify information." (from GC notes)
Temporal Accuracy
Geographical data includes the three dimensions of space, time, and theme (where-when-what). The Data Quality section of the FGDC's Content Standard for Digital Geospatial Metadata includes elements for describing positional accuracy (where) and attribute or thematic accuracy (what) but no specific element for recording temporal accuracy. Temporal coordinates are often only implicit in geographical data, e.g., a time stamp indicating that the entity was valid at some time. Often this is applied to the entire database (e.g., a map dated 1995) (NCGIA CC notes). Temporal information for the dataset is recorded in the Identification Information section under the sub-section Time Period of Content, in which a single date/time or range of dates/times can be recorded. In addition, this sub-section contains an element, Currentness Reference, which indicates whether the dates/times refer to ground condition (as in, the date/time an aerial photograph was taken) or to publication date (the date a map or dataset was published). The Lineage sub-section of metadata, found within the Data Quality section, contains more information that indirectly relates to temporal accuracy. For each of the sources contributing to the dataset, there is also a Time Period of Content sub-section. Many times a published date is given for a dataset as a whole, but a more realistic description of its temporal content is contained within the individual sources.
Sources of inaccuracy, imprecision and other "fitness-for-use" issues
A statement of estimated accuracy is an obvious element of data quality to look for, but others are less obvious or more difficult to discern. Furthermore, a dataset's "quality" is not always the same thing as its "fitness-for-use" as data of a particular quality may be acceptable for some uses and not for other uses. Elements that should be considered when evaluating a dataset's quality and fitness-for-use may not always be found under the Data Quality section, either.
- Age of data: data sources may be too old to be useful or relevant to current GIS projects. Past collection standards may be unknown, non-existent, or not currently acceptable. Check the Time Period of Content sections for both the dataset (Identification Information section) and the dataset's sources (Data Quality section, Lineage: Source Information sub-section)
- Areal cover: ideally, data sources for a particular application should have uniform coverage. Different themes of data for the same area may have slightly different borders, depending on their sources.
- Map scale: this is very important for determining data's fitness for use. Please see the topic on Projections and scale. Check the Data Quality section, Lineage: Source Information sub-section.
- Density of observations (or level/method of sampling): an insufficient number of observations may not provide the level of resolution (related to scale) required for a particular application. Sources of some GIS or map data may have a higher level of sampling than is portrayed in map form, since data is sometimes generalized to make it more "readable." Check the Data Quality section, Lineage: Source Information and Lineage: Process steps sub-sections.
- Relevance: very often a detailed data dictionary, or listing of the dataset's features and attributes (Entity and Attribution Information section), can provide enough information to determine if the data is relevant for a specific use. For instance, a dataset of roads will not be relevant for address matching if the road segments are only classified by type and not by name and address. Relevance is also a factor when a desired dataset does not exist and surrogate data has to be considered. For instance, satellite imagery is often used as a surrogate for field vegetation surveys. The sensor on the satellite does not "see" vegetation, only certain types of digital signatures typical of different types of vegetation, which can be variable. The user must understand the limitations and assumptions inherent with surrogate data.
- Positional accuracy: there are a large number of factors that contribute to positional accuracy, including source materials used and digitizing methods, which are discussed in more detail in following sections. In addition, many spatial phenomena have indeterminate boundaries and cannot be described as distinct features with associated positional and attribute accuracy. Precipitation is a good example. Land cover and elevation are usually also indeterminate or "fuzzy" since it is impractical to measure these phenomena at a scale large enough to capture them at determinate measurements. In such cases, features are often an estimate or interpretation of the cartographer.
- Sources of variation in data: sometimes one dataset is compiled by more than one person. There can be a range of measurement error introduced by faulty observation, biased observers, or the use of different equipment (e.g. changes in calibration or settings). Check the Data Quality section, Lineage: Source Information and Lineage: Process steps sub-sections.
- Numerical errors: all GIS data is composed of coordinates, which are stored as numbers with a certain level of precision in the computer. Variation in the number of significant digits handled by the computer or by the GIS software can cause rounding errors that may not be noticeable during processing but can become apparent when results are tabulated. For instance, an insufficient number of significant digits can result in incorrect area and length calculations. Lack of sufficient coordinate precision can also result in "coordinate shift" during processing. If points or nodes are located very close together, so that a large number of significant digits are required to define their specific location, these locations can sometimes end up "snapping" together into one location defined by fewer significant digits. Check the Data Quality information, Positional Accuracy sub-section for this information.
- Topological errors: overlaying multiple layers of data in GIS can result in problems such as "sliver polygons", or virtual data which may be difficult to distinguish from real data. This is especially true in cases where data with the same borders or boundaries, digitized at different accuracy levels, are combined. The sliver polygons result in incorrect area and length calculations. Check the Data Quality section, Lineage: Source Information and Lineage: Process steps sub-sections.
- Classification problems: defining appropriate class intervals can be a subjective process that lends itself easily to bias. Entire books have been written about "lying with maps". It is usually best to get data in an unclassified format. Interpolation methods are another form of classification that users should be aware of. Interpolation uses statistics and mathematical equations to translate sampled data into contours or other types of surface representations. Background information about the interpolation methods used and their limitations should be recorded in the metadata. Check the Data Quality section, Lineage: Source Information and Lineage: Process steps sub-sections, as well as the Entity and Attribute Information section.
- Digitizing and geocoding errors: these include a range of errors such as
- poor source map material (paper versus more stable media)
- faulty registration (inaccurate registration marks, not enough registration marks used)
- physiological errors of the digitizer (sloppy digitizing caused by involuntary muscle contractions or drifting attention)
- scanning errors and means for correcting distortions in scanned materials or aerial/remotely sensed images.
Check the Data Quality information, Positional Accuracy sub-section for this information.
(modified from GC notes)
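The "coordinate shift" described under numerical errors can be reproduced in a few lines. This is only an illustrative sketch: it assumes coordinates are stored as 4-byte single-precision floats (one possible storage choice, not something the notes above specify), and the two coordinate values are made up.

```python
import struct


def to_single_precision(value):
    """Round a Python float (double precision) to the nearest 4-byte float."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Two hypothetical northings, in meters, that are 9 cm apart.
node_a = 4567890.12
node_b = 4567890.21

stored_a = to_single_precision(node_a)
stored_b = to_single_precision(node_b)

# Single precision keeps roughly seven significant digits, so at this
# magnitude both nodes are stored as the same number and "snap" together.
print(stored_a, stored_b, stored_a == stored_b)
```

Storing the same values in double precision keeps the two nodes distinct, which is why the precision used for coordinates is worth recording in the metadata.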
Skills: assessing and describing data quality
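The FGDC's Content Standard for Digital Geospatial Metadata contains a section for describing the data quality of a dataset. Its main sub-sections, each discussed below, are Attribute Accuracy, Positional Accuracy, Logical Consistency, Completeness, and Lineage.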
These sub-sections are typically used to describe the dataset as a whole. The new ISO standard for geospatial metadata (in draft form as of 2/2000) has the same sub-sections for describing data quality, but in addition this standard will allow separate data quality descriptions for data series, datasets, and features of the dataset. For instance, some features of the dataset may have been collected at a certain time and using a certain method, with a specific data quality description. At a later time, more features may have been added using different methods, resulting in a different data quality description for attribute accuracy, though positional accuracy might remain the same. The ISO standard allows for a more structured description of different data quality aspects based on data granularity (the dataset as a whole, or different portions or even individual features of the dataset).
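One way to picture quality descriptions at different levels of granularity is a structure in which each feature may carry its own description and otherwise inherits the dataset's. The sketch below is hypothetical throughout: the class names, field names, and sample values are invented for illustration and are not element names from the CSDGM or the ISO draft.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualityDescription:
    positional_accuracy: str
    attribute_accuracy: str
    collected: str  # free-text collection date or period


@dataclass
class Feature:
    feature_id: int
    quality: Optional[QualityDescription] = None  # None means "inherit"


@dataclass
class Dataset:
    name: str
    quality: QualityDescription
    features: list

    def quality_of(self, feature_id):
        """Return the feature's own quality description, else the dataset's."""
        for feature in self.features:
            if feature.feature_id == feature_id:
                return feature.quality or self.quality
        raise KeyError(feature_id)


# Hypothetical dataset: feature 2 was added later by a different method.
dataset_quality = QualityDescription("compared to source", "field checked", "first collection")
later_quality = QualityDescription("compared to source", "not checked", "later addition")
wells = Dataset("wells", dataset_quality, [Feature(1), Feature(2, later_quality)])

print(wells.quality_of(1).attribute_accuracy)
print(wells.quality_of(2).attribute_accuracy)
```

The fallback from feature to dataset keeps the common case short: only features that differ from the dataset-wide description need their own record.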
Attribute or Thematic Accuracy: There are three methods of determining accuracy:
- visual check ("not checked", "20% of attributes checked", etc): best to plot transparencies to overlay against source attributes
- compare attributes to a separate data source of larger scale/greater accuracy (this can be difficult in some cases to evaluate because of generalization that occurs in smaller scale datasets)
- independent sampling (field checks)
Examples:
- good example: rare case where accuracy was determined by field-checking
- good example: an example of how updates to the attribute accuracy were incorporated (with dates) as the data was improved.
- special example: in modeled datasets, attribute accuracy sometimes does not apply since the process is subjective or derivative.
- special example: reference is made to individual source data sets within the Lineage section. This is acceptable, but not ideal.
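When attributes are checked against an independent sample, the results are often summarized as the misclassification matrix and percentage correct mentioned under Components. The following is a minimal sketch of that tabulation; the function name and the sample labels are hypothetical, and a real assessment would also need a defensible sampling design.

```python
from collections import Counter


def misclassification_matrix(mapped, observed):
    """Tabulate mapped labels against field-observed labels.

    Returns (classes, matrix, proportion_correct), where matrix[i][j] counts
    samples observed as classes[i] but mapped as classes[j].
    """
    if len(mapped) != len(observed):
        raise ValueError("each sample needs one mapped and one observed label")
    pairs = Counter(zip(observed, mapped))
    classes = sorted(set(mapped) | set(observed))
    matrix = [[pairs[(obs, mp)] for mp in classes] for obs in classes]
    correct = sum(pairs[(c, c)] for c in classes)
    return classes, matrix, correct / len(mapped)


# Hypothetical sample of five field-checked locations.
mapped = ["forest", "forest", "water", "urban", "water"]
observed = ["forest", "water", "water", "urban", "water"]

classes, matrix, proportion_correct = misclassification_matrix(mapped, observed)
print(classes)
for row in matrix:
    print(row)
print(f"{proportion_correct:.0%} of sampled labels correct")
```

The off-diagonal cells show which classes are confused with one another, which a single percentage cannot convey.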
Horizontal Positional Accuracy:
The National Standard for Spatial Data Accuracy (NSSDA) was approved in 1998 to address the growing need for quality spatial data and to provide a common language for reporting accuracy.
"How the positional accuracy of map features is best estimated has been debated since the early days of cartography. Until recently, existing accuracy standards such as the National Map Accuracy Standards focused on testing paper maps, not digital data. The NSSDA is one in a suite of standards dealing with the accuracy of geographic datasets, and is one of the most recent standards to be issued by the Federal Geographic Data Committee" (LMIC 1999).
There are seven steps in applying the NSSDA:
- Determine if the test involves horizontal accuracy, vertical accuracy, or both
- Select a set of test points from the dataset to be evaluated
- Select an independent dataset of higher accuracy that corresponds to the dataset being tested
- Collect measurements from identical points from each of these two sources
- Calculate a positional accuracy statistic using either the horizontal or vertical accuracy statistic worksheet
- Prepare an accuracy statement in a standardized report form
- Include that report in a comprehensive description of the dataset called metadata
At least twenty points are required to conduct a statistically significant accuracy evaluation at the 95% confidence level. Coordinate values for each test point are collected from both the test dataset and the independent dataset, and the following statistics are computed from the paired points:
- the sum of the squared differences between the paired coordinates
- the average of that sum, obtained by dividing it by the number of test points
- the root mean square error (RMSE), which is simply the square root of the average
- the NSSDA statistic, which is determined by multiplying the RMSE by a value that represents the standard error of the mean at the 95% confidence level: 1.7308 for horizontal accuracy and 1.9600 for vertical accuracy.
The result of applying the NSSDA to a dataset is an accuracy statement such as this: "Tested 0.181 meters horizontal accuracy at 95% confidence interval using the NSSDA." In the horizontal positional accuracy section of the metadata, it is also important to describe the independent dataset used for comparison, its source(s) and associated accuracy. For more information about using the NSSDA, see LMIC 1999.
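The horizontal calculation described above can be sketched as follows. This is a simplified illustration rather than a substitute for the LMIC worksheets: the function name and the coordinates are hypothetical, and using the 1.7308 multiplier on the combined RMSE assumes the error in x and y is of similar size.

```python
import math

HORIZONTAL_FACTOR = 1.7308  # multiplier quoted above for horizontal accuracy
MIN_TEST_POINTS = 20


def nssda_horizontal(test_points, reference_points):
    """Return the NSSDA horizontal accuracy statistic for paired (x, y) points."""
    if len(test_points) != len(reference_points):
        raise ValueError("each test point needs a matching reference point")
    n = len(test_points)
    if n < MIN_TEST_POINTS:
        raise ValueError("at least 20 test points are required")
    # Sum of squared differences in x and y over all point pairs.
    sum_sq = sum(
        (xt - xr) ** 2 + (yt - yr) ** 2
        for (xt, yt), (xr, yr) in zip(test_points, reference_points)
    )
    mean_sq = sum_sq / n          # average squared difference
    rmse = math.sqrt(mean_sq)     # root mean square error
    return rmse * HORIZONTAL_FACTOR


# Hypothetical data: every test point is offset 3 cm east and 4 cm south.
reference = [(100.0 * i, 50.0 * i) for i in range(20)]
test = [(x + 0.03, y - 0.04) for x, y in reference]

accuracy = nssda_horizontal(test, reference)
print(f"{accuracy:.3f} meters horizontal accuracy at the 95% confidence level")
```

A vertical version would follow the same pattern with elevation differences and the 1.9600 multiplier.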
In some cases, independent datasets of higher accuracy are not available to perform the NSSDA accuracy calculations. Additional data may be collected for accuracy assessment, involving field work and/or GPS collection. If resources do not exist for collecting additional data for the purposes of accuracy assessment, the Spatial Data Transfer Standard describes three alternatives for determining positional accuracy:
- deductive estimate: based on knowledge of errors in each production step and assumptions concerning error propagation.
- internal evidence: tested based on repeated measurement and redundancy such as closure of traverse or residuals from an adjustment
- comparison to source: when using graphic inspection of results (check plots), the geometric tolerances applied shall be reported and the method of registration shall also be described
Here are some factors to consider when using the deductive estimate to describe positional accuracy, based on production steps:
- GPS data: the type of GPS equipment (mapping grade, survey grade), settings, number of satellites used, logging intervals of position, and post-processing techniques (such as differential correction) all contribute to the final accuracy of the data.
- Images: specifications of aerial photography including type of photograph/sensor, calibration test(s) used, aerotriangulation technique, ground control reliability, photogrammetric characteristics
- Maps or images: note the RMS (root mean square) error, the type of registration marks and the number of registration marks used.
- In addition to registration error, error should be quantified, or at the very least estimated, for each of these steps in the production process:
- Inherited error from source maps (for example, USGS topographic maps must comply with National Map Accuracy Standards, which require a 1:24,000-scale map to have +/- 40 feet accuracy). This can also include a description of scribing precision and printing limitations, and error resulting from poor quality source media (e.g. folded or warped paper, or photocopies used instead of the original source map)
- Error from snapping tolerance. This tolerance should be set based on consideration of the scale of the source data (larger scale maps should have smaller snapping tolerances).
- Error from coordinate shift resulting from insufficient precision (significant digits) used to store coordinates. Coordinate shift can result from certain GIS operations such as overlays, clips, and projections.
- Error from line shift. This is the distance that digitized lines are off from source lines, resulting from human error. Creating check plots and overlaying them on the source maps checks this error.
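A deductive estimate can be roughed out by listing the error contributed at each production step and combining the figures. The sketch below makes two assumptions that are not in the notes above and should be stated in any resulting metadata: that the National Map Accuracy Standards tolerance is 1/30 inch at map scale for scales larger than 1:20,000 and 1/50 inch otherwise, and that the error components are independent and can be combined as a root sum of squares. The registration and line-shift values are invented.

```python
import math


def nmas_tolerance_feet(scale_denominator):
    """Ground tolerance in feet implied by the assumed NMAS map-scale rule."""
    divisor = 30 if scale_denominator < 20000 else 50
    ground_inches = scale_denominator / divisor
    return ground_inches / 12


def combined_error(components):
    """Root sum of squares of error components assumed to be independent."""
    return math.sqrt(sum(value ** 2 for value in components))


# Hypothetical production steps for data digitized from a 1:24,000 map (feet).
steps = {
    "inherited from source map": nmas_tolerance_feet(24000),
    "registration RMS": 12.0,   # invented value
    "line shift": 10.0,         # invented value
}

estimate = combined_error(steps.values())
print(f"Deductive estimate: about {estimate:.0f} feet")
```

Treating a map accuracy tolerance and an RMS value as comparable magnitudes is itself a simplification, so an estimate of this kind is best reported together with the assumptions behind it.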
Positional accuracy is less of an issue for some datasets. For instance, the 1:100,000-scale land cover map for Wyoming was interpreted from satellite imagery with a specific minimum mapping unit (100 ha). Because 100 ha units do not apply to actual vegetation boundaries (they are too generalized), positional accuracy of the individual polygon shapes of the data was not evaluated. However, the overall accuracy of the dataset was still evaluated to make sure that the boundaries of the dataset were within the acceptable range of accuracy for a 1:100,000-scale map.
Logical Consistency: does not usually apply to point data or raster data. For line or polygon data, check for these types of errors:
- dangles/unclosed polygons
- slivers
- duplicate lines
- intersections without nodes
- polygons with more than one label
- adjacent polygons with the same attribute
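Some of these checks are easy to automate. As a minimal sketch, assuming line work is available as lists of (x, y) vertex tuples with exactly matching coordinates at shared nodes, the functions below flag dangling endpoints and duplicate lines. The function names and the sample geometry are hypothetical; GIS packages provide their own topology tools for this.

```python
from collections import Counter


def find_dangles(lines):
    """Return endpoints that are touched by only one line end."""
    ends = Counter()
    for line in lines:
        ends[line[0]] += 1
        ends[line[-1]] += 1
    return sorted(point for point, count in ends.items() if count == 1)


def find_duplicates(lines):
    """Return lines that occur more than once, ignoring digitizing direction."""
    seen = Counter(min(tuple(line), tuple(reversed(line))) for line in lines)
    return sorted(list(key) for key, count in seen.items() if count > 1)


# Hypothetical line work: a closed square, one dangling stub, and one
# side of the square digitized a second time in the opposite direction.
lines = [
    [(0, 0), (1, 0)],
    [(1, 0), (1, 1)],
    [(1, 1), (0, 1)],
    [(0, 1), (0, 0)],
    [(0, 0), (-1, -1)],
    [(1, 1), (1, 0)],
]

print("dangles:", find_dangles(lines))
print("duplicates:", find_duplicates(lines))
```

Exact coordinate matching is the weak point of this sketch: nodes that differ in the last decimal place would be reported as dangles, which ties this check back to coordinate precision and snapping tolerance.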
Completeness:
Completeness can be measured in space, time, or theme. Consider a database of buildings in Minnesota that have been placed on the National Register of Historic Places as of the end of 1995 (NCGIA CC notes)
- Spatial completeness: The list contains only buildings in Hennepin County (one county in Minnesota, rather than all of Minnesota).
- Temporal completeness: The list contains only buildings placed on the Register by June 30, 1995.
- Thematic completeness: The list contains only residential buildings.
Lineage:
This section includes information on sources used and processes used to create or modify the data. It is important for determining source scales / resolution, density of observations / sampling methods, interpolation or classification methods, and many other types of processing functions that could have potential impact on the data's quality or fitness-for-use.
Examples of source information:
- good example: includes detailed citation, source-scale denominator and source contribution. Only field missing is type of source media (e.g. digital, paper, magnetic)
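Because metadata records are structured, software can pull out the source information for review, including gaps such as a missing type of source media. The sketch below assumes the record is available as XML using the short tag names commonly seen in CSDGM records (dataqual, lineage, srcinfo, srcscale, typesrc, srccontr); treat those names as assumptions to verify against the standard, and note that the sample record is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample record; the second source omits its type of source media.
SAMPLE = """<metadata>
  <dataqual>
    <lineage>
      <srcinfo>
        <srcscale>24000</srcscale>
        <typesrc>paper</typesrc>
        <srccontr>road centerlines</srccontr>
      </srcinfo>
      <srcinfo>
        <srcscale>100000</srcscale>
        <srccontr>county boundaries</srccontr>
      </srcinfo>
    </lineage>
  </dataqual>
</metadata>"""


def summarize_sources(xml_text):
    """List scale, media, and contribution for each source in the lineage."""
    root = ET.fromstring(xml_text)
    summary = []
    for source in root.findall("./dataqual/lineage/srcinfo"):
        summary.append({
            "scale": source.findtext("srcscale"),
            "media": source.findtext("typesrc"),  # None when not recorded
            "contribution": source.findtext("srccontr"),
        })
    return summary


for entry in summarize_sources(SAMPLE):
    missing = [name for name, value in entry.items() if value is None]
    print(entry["scale"], entry["contribution"], "missing:", missing)
```

A listing like this makes it quick to see whether the smallest source scale is adequate for an intended use and which source fields were left blank.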
Example exercises for incorporating metadata
Web-exercise: have students search the web (for instance, using the NSDI Geospatial Data Clearinghouse) for three examples of metadata documents. In a paper or as a class discussion, have them discuss the strengths and weaknesses of the information in the Data Quality section. What kind of tests (if any) were used to determine the data's accuracy? What procedures could be used to determine an estimate of accuracy, if one does not exist? Is the development procedure of the data clearly described in the Process Steps, or is the section just a collection of commands? Would a potential data user be able to determine the data's fitness-for-use given the information in the Data Quality section?
Apply the National Standard for Spatial Data Accuracy (NSSDA): provide a dataset for students to evaluate for positional accuracy, and an independent dataset of higher accuracy as a basis of comparison. The Minnesota Land Management Information Center has a handbook on applying the NSSDA as well as worksheets for calculating the statistics. Choosing appropriate data for this exercise is important; the NSSDA is not readily applied to all datasets. For instance, it can be very difficult to identify test points for comparison on vegetation datasets. Datasets with distinct points, corners or intersections are the best candidates for this type of exercise. Digital orthophotos produced by the USGS are good candidates for the independent dataset, since these digital photos have identifiable road intersections and are required to conform to National Map Accuracy Standards of 1:12,000 or +/- 33.3 feet. Require students to look up the positional accuracy statement of the independent dataset in its metadata. In addition, students can be required to write a positional accuracy statement for the test dataset based on their results. This is also a good exercise for classes incorporating GPS exercises, as students could collect the minimum 20 points required for the comparison using differentially-corrected GPS locations.
For more exercise ideas, see the example exercises in the topic on Data Sources (determining fitness-for-use).
Advanced data quality issues: handling error
Under construction
References and additional sources
Beard, K. 1996. A structure for organizing metadata collection. National Center for Geographic Information and Analysis, Maine.
Minnesota Land Management Information Center (LMIC). 1999. Positional Accuracy Handbook: Using the National Standard for Spatial Data Accuracy to measure and report geographic data quality.
Veregin, H. 1998. Data Quality Measurement and Assessment from NCGIA's Core Curriculum for Geographic Information Science, posted March 23, 1998.
Error, Accuracy and Precision from the Geographer's Craft
Handling Uncertainty from the Geographer's Craft