What is metadata?
Metadata is data about your data. It helps users, both internal and external, understand, organize, and classify data. Metadata can range from simple explanations to nuanced information about the data.
There are four “types” of metadata:
- Administrative metadata is the most common type. It is produced during data collection, production, publication, and archiving.
- Structural metadata describes a dataset’s structure, including its format, organization, and variable definitions. This type is in highest demand among researchers and academics.
- Reference/descriptive metadata is a broad term that mostly involves descriptions of methodology, sampling, and quality.
- Behavioral metadata records the reactions and behaviors of the dataset’s users, such as ratings or user analytics.
Why create metadata guidelines?
As an organization publishes more data, it becomes increasingly critical that the data is well organized and easily understood by its users, both internal and external.
Having a documented metadata policy can help a data program in a number of ways:
- Standardization - Ensuring that, across an organization, the same metadata is collected in the same format
- Alignment with industry standards - Utilizing industry practices and standards helps users compare the data with other data
- Improving data discoverability and categorization - As a program grows, helping users both internal and external locate the right datasets easily and quickly is paramount
- Improving data quality - By setting guidelines and required metadata early in a program, with regular usage reviews, an organization can foster a culture of high-quality and complete metadata
- Improving and deepening analytics - By providing reference metadata such as methodology and sampling, users are empowered to know exactly how the data was obtained, so that analysts can use common methodologies
Creating a metadata policy at the outset of a program will allow teams to easily create and maintain metadata across a platform. It is important to note that this will be an iterative process - no policy will be perfect on day one, and no policy should remain unchanged for long.
The 5 steps to creating metadata guidelines
- Pick a metadata standard (if desired)
- Determine key inputs to drive metadata
- Create a metadata dictionary to document key metadata fields and formatting
- Document metadata collection process
- Iterate as needed
1) Select a Metadata Standard (If Desired)
A number of metadata standards and best practices have already been developed and can serve as a starting point. These generally define not only the fields, but also the format of their values.
There are many metadata standards, covering a vast array of uses, dataset foci, target audiences, and other nuances. The list below is not exhaustive; it covers the most prominent standards, which can act as strong starting points for your organization.
Project Open Data was created by the U.S. Government. It contains a very thorough field set, along with industry-standard formatting of data (e.g., ISO 8601 format for dates). Setting up your datasets with Project Open Data also allows your datasets to appear on Data.gov, the United States government’s centralized search engine for open data.
- Here is an example of how a Socrata dataset can show up on Data.gov when metadata is properly formatted
Covers dataset-level metadata, as well as other non-metadata best practices for architecture and structuring
Metadata standard created by the UK government; separates elements into four categories: Mandatory, Mandatory if applicable, Recommended, and Optional.
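To illustrate how the four requirement categories above might be applied in practice, here is a minimal Python sketch. The field names and their assigned levels are hypothetical and are not taken from any of the standards listed; a real policy would substitute its own.

```python
from enum import Enum


class Requirement(Enum):
    MANDATORY = "Mandatory"
    MANDATORY_IF_APPLICABLE = "Mandatory if applicable"
    RECOMMENDED = "Recommended"
    OPTIONAL = "Optional"


# Hypothetical field assignments, for illustration only.
FIELD_LEVELS = {
    "title": Requirement.MANDATORY,
    "description": Requirement.MANDATORY,
    "license": Requirement.MANDATORY_IF_APPLICABLE,
    "keywords": Requirement.RECOMMENDED,
    "contact_email": Requirement.OPTIONAL,
}


def is_filled(value):
    """Treat None and whitespace-only strings as not filled in."""
    return value is not None and str(value).strip() != ""


def missing_fields(record, level):
    """Return the names of fields at the given level that the record leaves blank."""
    return sorted(
        name
        for name, requirement in FIELD_LEVELS.items()
        if requirement is level and not is_filled(record.get(name))
    )
```

A check like `missing_fields(record, Requirement.MANDATORY)` could run before a dataset is published, while the same call with `Requirement.RECOMMENDED` could feed a softer reminder to the data owner.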
2) Determine key inputs to drive metadata
Determining what metadata to collect depends on a number of inputs - there is no one “best” set of metadata, but for any given program it is possible to identify a good starting place.
One key principle is that this is an iterative process - consider the inputs below, but leave breathing room so that the choices can be revisited in six months to determine whether too much or too little information was collected.
Who will be the primary audience of a data program? Who will the primary audience of individual datasets be? Remember that it is possible to set certain metadata fields to “Private” so that only users internal to an organization can view them. This is useful, for example, if you want to add an internal contact for a dataset but don’t want the general public to be able to see it.
Also factor in data owners, and strike a balance that avoids placing an undue burden on those who want to submit data to the program.
Take a look at existing data collection and publishing processes - what are the strengths? Possibly more importantly, what are the weaknesses? Are there any legal requirements or reporting obligations that apply to the data?
3) Document metadata fields and formats in a data dictionary
Once the inputs that go into selecting metadata fields have been determined, it’s time to document the decisions.
Creating a metadata dictionary helps an organization answer, in a standardized way, the two basic questions of metadata for each dataset on a platform:
- What fields should be displayed?
- How should these fields be populated?
These are, in theory, simple questions, but there’s nuance to them. A metadata policy addresses that nuance and preemptively answers questions from users, both internal and external.
For example, there may be a list of 10-20 fields that should be populated when a dataset is first uploaded. One of these fields may be “Date of Last Data Update.” This seems simple enough, but what should data owners put in the field? They can choose from:
- December 31st, 2017
A documented dictionary that can be distributed, showing that the “Date of Last Data Update” field should be in the format “MM/DD/YYYY”, saves time and effort by removing guesswork within an organization.
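As a rough sketch of how a dictionary entry like this could be made machine-checkable, the Python below stores the expected format alongside the field name and validates submitted values against it. The dictionary structure, the `guidance` text, and the function name are assumptions made for illustration, not part of any platform's API.

```python
from datetime import datetime

# Hypothetical metadata dictionary: field name -> expected format and guidance.
METADATA_DICTIONARY = {
    "Date of Last Data Update": {
        "format": "%m/%d/%Y",  # MM/DD/YYYY
        "guidance": "Date the underlying data was last refreshed.",
    },
}


def is_valid_date(field, value):
    """Check that value matches the date format documented for the field."""
    fmt = METADATA_DICTIONARY[field]["format"]
    try:
        parsed = datetime.strptime(value, fmt)
    except ValueError:
        return False
    # strptime tolerates unpadded input such as "1/5/2017", so round-trip
    # the parsed date to insist on the zero-padded MM/DD/YYYY form.
    return parsed.strftime(fmt) == value
```

With this in place, `is_valid_date("Date of Last Data Update", "12/31/2017")` passes, while a value such as "December 31st, 2017" is rejected and can be sent back to the data owner with the guidance text.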
4) Document the Metadata Collection Process
Now that it's been determined what metadata to collect and how to format it, it’s time to figure out how it will be collected from data owners and ingested into your organization’s data platform.
This step will work closely with the “Publishing” pillar of the DCoE, because collection will often happen as part of the internal data solicitation and publishing process. Having this process documented is critical so that the DCoE task force/committee understands not just the process itself, but who is in charge of each step.
5) Review and Update as needed
All of the above steps can, will, and should evolve over time. A data program’s needs when it launches will likely change, so make sure metadata collection procedures are set up with the ability and expectation to evolve.
Develop a schedule for reviewing the program - quarterly or every six months should be frequent enough. What metadata has been valuable to your internal and external users?
By using the Socrata platform’s Asset Inventory, you can see which metadata fields are generally not filled out by data owners, and determine whether to put more effort into collecting that metadata or to stop collecting it.
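The same kind of review can be sketched in a few lines of Python. This example assumes the metadata has already been exported as a list of dictionaries, one per dataset; it does not call the Asset Inventory or any Socrata API, and the field names and the 50% threshold are hypothetical.

```python
def is_filled(value):
    """Treat None and whitespace-only strings as not filled in."""
    return value is not None and str(value).strip() != ""


def fill_rates(records, fields):
    """Return {field: fraction of records that have a non-blank value}."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for record in records if is_filled(record.get(field)))
        rates[field] = filled / total if total else 0.0
    return rates


def rarely_filled(records, fields, threshold=0.5):
    """List the fields whose fill rate falls below the threshold."""
    rates = fill_rates(records, fields)
    return sorted(field for field, rate in rates.items() if rate < threshold)
```

Fields that show up in `rarely_filled` at each scheduled review are candidates for either better guidance to data owners or removal from the metadata dictionary.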