Metadata Education Project

Metadata education suggestions and materials for:

Data Sources
determining fitness-for-use



Learning Outcomes

Conviction

Motivation

Skills


Preparatory topics:


Complementary topics:


Vocabulary

Vocabulary definitions


Material for this topic


Finding data

"Not so long ago, most GIS projects had to rely almost exclusively upon data available only in printed form. Much of the data available for use is still published on paper, but a great deal of information is now distributed in digital formats. The ever-increasing pace of this transformation from paper to digital sources has many repercussions for GIS. Data already produced in digital format will certainly ease the work and speed the process of developing GIS, but only if users learn how to employ these new sources effectively." (GC notes)

Metadata supports the effective use of data sources in two major ways: finding data, and evaluating the data's fitness-for-use. It is essential for evaluating and understanding the wide variety of data types and the even wider variety of procedures used to produce them. (See the topic on data types.) Reviewing metadata is the first step in determining whether a dataset meets the criteria (area, scale, quality, time period, spatial reference, spatial organization, and attributes) needed to integrate it with other datasets for management and analysis. Without metadata, the time required to find, obtain, import, explore and test each potential dataset to determine its fitness-for-use would be prohibitive.

Structured metadata is also very helpful for data mining on the internet. Increasingly, web pages use "meta tags" to help search engines locate information more specifically and accurately. The National Spatial Data Infrastructure (NSDI) Clearinghouse program is designed to improve searches specifically for geospatial data. Metadata produced according to the FGDC's Content Standard for Digital Geospatial Metadata (CSDGM) can be considered an expanded, more comprehensive version of "meta tags", enabling more powerful and refined searches. (See the topic on clearinghouse concepts for more details.)
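To make the "expanded meta tags" idea concrete, the following is a minimal sketch of how a search tool might read a few searchable fields from a CSDGM record encoded as XML. The element short names (idinfo, citeinfo, themekey, westbc and so on) follow the CSDGM short-name convention; the record itself, the agency name, and the helper functions summarize and matches are hypothetical and written only for illustration. This is not how the NSDI Clearinghouse itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated CSDGM record (made-up values).
# Real records carry many more elements.
SAMPLE_RECORD = """
<metadata>
  <idinfo>
    <citation><citeinfo>
      <origin>Example Agency</origin>
      <title>Example County Soils</title>
    </citeinfo></citation>
    <spdom><bounding>
      <westbc>-117.5</westbc><eastbc>-116.0</eastbc>
      <northbc>47.5</northbc><southbc>46.0</southbc>
    </bounding></spdom>
    <keywords><theme>
      <themekt>None</themekt>
      <themekey>soils</themekey>
      <themekey>agriculture</themekey>
    </theme></keywords>
  </idinfo>
</metadata>
"""

def summarize(record_xml):
    """Pull the title, theme keywords and bounding box out of a record."""
    root = ET.fromstring(record_xml)
    bbox = {
        tag: float(root.findtext(f"idinfo/spdom/bounding/{tag}"))
        for tag in ("westbc", "eastbc", "northbc", "southbc")
    }
    return {
        "title": root.findtext("idinfo/citation/citeinfo/title"),
        "keywords": [k.text for k in root.findall("idinfo/keywords/theme/themekey")],
        "bbox": bbox,
    }

def matches(summary, keyword, lon, lat):
    """True if the record has the keyword and its bounding box contains the point."""
    b = summary["bbox"]
    return (keyword in summary["keywords"]
            and b["westbc"] <= lon <= b["eastbc"]
            and b["southbc"] <= lat <= b["northbc"])

summary = summarize(SAMPLE_RECORD)
print(summary["title"], matches(summary, "soils", -117.0, 47.0))
```

A plain web-page "meta tag" typically carries only keywords and a description; the structured bounding coordinates are what allow the search to be narrowed by location as well as by theme.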

Skills: Using the NSDI Clearinghouse

The number of nodes participating in the NSDI Clearinghouse has continued to grow since its inception in 1996, but many sources of data are still not "catalogued" by their metadata through this official network. Data sources may be available through clearinghouses that do not participate in the NSDI Clearinghouse, as well as through regional and local governments, educational institutions, private institutions and businesses. In these cases, though metadata may not help to locate data, it is still an important element in evaluating the data's fitness-for-use.


Evaluating fitness-for-use: questions to ask about data

"When looking for spatial data, the GIS user should consider the application and how the data will be used. Through this process, the user may want to address the following issues:" (NC-CGIA)

Using data from many different sources introduces a problem summed up by the adage "a chain is only as strong as its weakest link": the maps and models produced by a GIS are only as accurate as the least accurate data source.
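The "weakest link" rule can be illustrated with a few lines of code. This is a simplified sketch: the layer names and accuracy figures are made up, and the helper limiting_layer is hypothetical. It treats the worst reported horizontal accuracy as the limit on any product that combines the layers, and does not attempt to model how errors actually combine.

```python
# Hypothetical layers with reported horizontal accuracy in metres (made-up values).
layers = {"roads": 12.0, "soils": 125.0, "parcels": 2.5}

def limiting_layer(accuracy_by_layer):
    """Return (name, accuracy) of the least accurate layer, i.e. the largest error."""
    name = max(accuracy_by_layer, key=accuracy_by_layer.get)
    return name, accuracy_by_layer[name]

name, accuracy = limiting_layer(layers)
print(f"Overlay results are no better than '{name}' (+/- {accuracy} m)")
```

In this example the parcel layer's 2.5 m accuracy is of little benefit once it is overlaid on soils data reported at 125 m.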

More on data sources and data sharing is available from a page created for the Fundamental of GIS course at Harvard School of Design, taught by Paul Cote.

Checking on Pedigree and Quality: Becoming a Smart Shopper (credits to GC notes)

"As you consider available digital data, become a smart, well-informed shopper. It is said that an undocumented dataset is a worthless dataset. There is much truth in this assertion because, if you do not know what is in a dataset--its pedigree and quality, for example--you, the user, have to spend time checking it yourself. These days, you should expect to receive with your data some sort of "data quality report" from the vendor or provider. This report will provide a description of exactly what is in the file, how the information was compiled (and from what sources), and how the data was checked. The documentation for some products is quite extensive and much of the detailed information may be published separately, as it is for USGS digital products."

If documentation is limited, it is important for you to consider the following questions:

Located in the Identification information section:

Located in the Data Quality section:

Located in the Entity and Attribute section:

Located in the Spatial Organization section:

Located in the Spatial Reference section:

Located in the Distribution section:

The largest number of questions typically comes from the Data Quality section of metadata. There are many different ways of evaluating data accuracy, in terms of both its spatial and its attribute (thematic) content. Accuracy is often related to the sources and procedures used to develop the data, so the lineage section of metadata (source information, process steps) is also very important to review. For specifics related to these elements of data quality, please see the topic on Data Quality.
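As a rough sketch of what reviewing the Data Quality and lineage elements can look like in practice, the code below pulls the horizontal accuracy report, the source scale denominators, and the process step descriptions from a CSDGM record encoded as XML. The element short names (dataqual, posacc, lineage, srcinfo, srcscale, procstep) follow the CSDGM convention; the record content and the helper quality_summary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated Data Quality section (made-up values).
SAMPLE = """
<metadata>
  <dataqual>
    <posacc><horizpa><horizpar>Digitized from 1:24,000 quads.</horizpar></horizpa></posacc>
    <lineage>
      <srcinfo><srcscale>24000</srcscale></srcinfo>
      <srcinfo><srcscale>100000</srcscale></srcinfo>
      <procstep><procdesc>Digitized source maps.</procdesc><procdate>1998</procdate></procstep>
    </lineage>
  </dataqual>
</metadata>
"""

def quality_summary(record_xml):
    """Summarize accuracy and lineage; return None if there is no Data Quality section."""
    dq = ET.fromstring(record_xml).find("dataqual")
    if dq is None:
        return None
    scales = [int(s.text) for s in dq.findall("lineage/srcinfo/srcscale")]
    return {
        "horizontal_accuracy_report": dq.findtext("posacc/horizpa/horizpar"),
        "source_scale_denominators": scales,
        # A larger denominator means a smaller (coarser) map scale.
        "coarsest_source_scale": max(scales) if scales else None,
        "process_steps": [p.findtext("procdesc") for p in dq.findall("lineage/procstep")],
    }

print(quality_summary(SAMPLE))
```

A result of None, or a summary with empty lists, is itself informative: it signals that the questions about accuracy and lineage cannot be answered from the metadata alone.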

Metadata does not have to be in a standard format to support an evaluation of fitness-for-use, though a standard format helps. Some data providers supply documentation in the form of a quality report, dataset "pedigree" or technical report. If information about the data's sources, development procedures and quality is not available, the data should not be used without first testing it against other sources of known quality.


Exercises to demonstrate the importance of metadata

Many class exercises and projects involve the integration of several different data sources. In some cases, the same data theme may even be available from different sources. For instance, the Natural Resources Conservation Service provides two types of soil datasets: the STATSGO dataset produced at 1:250,000 scale and the SSURGO dataset produced at 1:24,000 scale. Metadata documents are available for both datasets. Have the students review the metadata in order to decide which dataset is most appropriate for a particular exercise or project. For instance, for statewide or regional strategic analysis and planning, STATSGO provides a broad overview; the more detailed SSURGO data is not always available for continuous large areas and can be prohibitively large and difficult to analyze because of its level of detail.

Another example is vegetation or land cover. The National Gap Analysis program has coordinated vegetation mapping activities across the country. Many states have used different methods for creating statewide vegetation/land cover datasets. Most vegetation datasets have been produced at 1:100,000 scale, but some states invested additional time and money to produce more detailed datasets. In addition, some vegetation datasets were created in raster format; others were developed as vector datasets. What are the implications of choosing one data model over the other for a particular application or analysis, and what are the potential issues in converting datasets from a raster to a vector model or vice versa? (See the topic on data models for more details.) Have students select two or more states and use the metadata documents to compare the source scales/resolution, data models, and methods used for the vegetation datasets. (CA, ID, NH, NM, NV, UT, VT, WA and WY currently available.)

For both the NRCS soils datasets and the National Gap vegetation datasets, it is also important to review the reasons why the datasets were created (their specific intended purpose). This gives valuable insight into other potential uses for the data.

Exam and/or discussion question:
In some cases, the only data available for a needed application is questionable. For instance, detailed vegetation data (1:24,000) can be used to map forest stands. The forest stands should be further characterized by precipitation for good management practices, but the only precipitation data available is at a very small scale, 1:2,000,000. Ask the students whether it is acceptable to combine these two sources of data for planning purposes. What kind of tests could be performed on the data to determine its fitness-for-use? For instance, the University of Washington's Introduction to GIS in Forest Resources class used this example with a fitness-for-use test that compares the actual line width of features from the vegetation map with that of features from the precipitation contour map.
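One way to carry out the line-width comparison is to convert a drawn line width into the ground distance it covers at each map scale. The sketch below assumes a 0.5 mm line width; that figure and the helper ground_width_m are illustrative assumptions, not values taken from the University of Washington exercise.

```python
def ground_width_m(line_width_mm, scale_denominator):
    """Ground distance in metres covered by a drawn line of the given width."""
    return line_width_mm * scale_denominator / 1000.0

LINE_WIDTH_MM = 0.5  # assumed pen width, not from the source

for label, denom in (("vegetation", 24_000), ("precipitation", 2_000_000)):
    width = ground_width_m(LINE_WIDTH_MM, denom)
    print(f"{label}: 1:{denom:,} -> {width:,.0f} m on the ground")
```

Under that assumption, a line on the 1:24,000 vegetation map covers 12 m on the ground, while a contour line on the 1:2,000,000 precipitation map covers 1,000 m, which may be wider than many of the forest stands it is meant to characterize.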


References

Data sources and acquisition from the Geographer's Craft

North Carolina Center for Geographic Information and Analysis, "Accessing Data Tutorial" http://cgia.cgia.state.nc.us:80/tutorials/index.html

