Data Quality Standards
The goal of data quality standards is to provide consistently high-quality data that supports valid stories and to ensure the data driving those stories aligns with the organization’s goals. This pillar has two branches: the Standard Data Governance that should be applied to the organization, and the Standard Data Measures that should be applied to the data itself.
Standard Data Measures
Once the data has been approved by the DCoE, ask whether the current state of the data contents is standardized in a way that will provide value to users. The themes, or measures, to consider are data standardization, readiness, and completeness. Examining these measures and using them to improve data are the final steps on the path to quality data.
Data Standardization Checklist
Standardized data is much easier to analyze and use for data-driven decisions because it has been validated against quality controls. This means criteria and trigger points must be established for each dataset to determine whether the data can be considered validated, or quality, data. Consider asking the following questions of the data to build criteria, and weigh how important it is that every record meets them:
Does your data have good hygiene?
- Has it been inspected for outliers?
- Is there any Personally Identifiable Information (PII)?
- Are your placeholders consistent? (N/A, n/a, UNK)
- Has whitespace been removed?
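The hygiene checks above can be sketched in code. This is a minimal illustration, not a prescribed implementation: the placeholder list and the canonical `N/A` value are assumptions an organization would set for itself.

```python
import re

# Illustrative placeholder variants an organization might normalize;
# the set and the canonical value are assumptions for this sketch.
PLACEHOLDERS = {"n/a", "na", "unk", "unknown", "null", ""}

def clean_value(value: str, placeholder: str = "N/A") -> str:
    """Collapse extra whitespace and standardize placeholder values."""
    cleaned = re.sub(r"\s+", " ", value).strip()
    if cleaned.lower() in PLACEHOLDERS:
        return placeholder
    return cleaned

clean_value("  n/a ")         # → "N/A"
clean_value(" 123  Main St")  # → "123 Main St"
```

Running every incoming value through one function like this keeps placeholders consistent (N/A, n/a, UNK all collapse to one form) and strips stray whitespace before the data is analyzed.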
Does your data pass a gut check?
- Do simple statistics (mean, median, max, min, mode, range) raise any flags?
- Do outliers provide an insight?
- Can the data be trusted?
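A gut check can be automated with the standard library. In this sketch, the z-score threshold of 2.0 is an illustrative assumption; each organization should tune what counts as a flag-raising value.

```python
import statistics

def gut_check(values, z_threshold=2.0):
    """Summarize simple statistics and flag values far from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        # Flag values more than z_threshold standard deviations from the mean.
        "outliers": [v for v in values
                     if stdev and abs(v - mean) / stdev > z_threshold],
    }

stats = gut_check([10, 12, 11, 13, 12, 500])  # flags 500 as an outlier
```

A flagged outlier is not automatically bad data; it is a prompt to ask the next question in the list: does the outlier provide an insight, or undermine trust?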
Does your data play by the rules?
- Do values match against a lookup list?
- If business rules can be applied, does the data pass?
- Are the data type, case, and precision as prescribed?
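These rule checks can be combined into one record validator. Everything in this sketch is illustrative: the status lookup list, the business rule, and the two-decimal cost precision stand in for whatever rules a given dataset prescribes.

```python
# Hypothetical lookup list for a status field; in practice this would
# come from the organization's reference data.
ALLOWED_STATUSES = {"Open", "Closed", "Pending"}

def validate_record(record: dict) -> list:
    """Return a list of rule violations for one record (empty = passes)."""
    errors = []
    # Lookup-list check: the value must match an allowed entry exactly.
    if record.get("status") not in ALLOWED_STATUSES:
        errors.append(f"status {record.get('status')!r} not in lookup list")
    # Business-rule check: closed records must carry a close date.
    if record.get("status") == "Closed" and not record.get("closed_date"):
        errors.append("Closed record missing closed_date")
    # Precision check: cost stored to exactly two decimal places.
    cost = record.get("cost", "0.00")
    if "." not in cost or len(cost.split(".")[1]) != 2:
        errors.append(f"cost {cost!r} not to two decimal places")
    return errors

validate_record({"status": "Open", "cost": "10.50"})  # → [] (passes)
```

Returning the full list of violations, rather than failing on the first one, gives data stewards a complete picture of what needs fixing in each record.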
Does your data create a time warp?
- Are all dates in the same format?
- Do dates and times have the same precision?
- Does the date range make sense?
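The date checks above reduce to two tests: does the value parse in the single agreed format, and does it fall in a sensible range? The format string and the 1900–2100 bounds in this sketch are assumptions a dataset owner would replace with their own.

```python
from datetime import datetime

# Assumed organization-wide standard format and sanity range.
DATE_FORMAT = "%Y-%m-%d"
EARLIEST, LATEST = datetime(1900, 1, 1), datetime(2100, 1, 1)

def check_date(raw: str) -> bool:
    """True when the value parses in the standard format and is in range."""
    try:
        parsed = datetime.strptime(raw, DATE_FORMAT)
    except ValueError:
        # Wrong format (or an impossible date) fails the check.
        return False
    return EARLIEST <= parsed <= LATEST

check_date("2023-05-01")  # → True
check_date("05/01/2023")  # → False: different format creates a "time warp"
```

Enforcing one format at validation time also settles the precision question: every date carries exactly the fields the format string names, no more and no less.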
Does your data get frequent checkups?
- When is your data validated?
- What happens if your data does not pass validation?
- Who does your data validation?
- Is your validation process repeatable? Should it be automated?
- Can you look upstream to improve source data?
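A repeatable checkup process can be as simple as a registry of named checks that runs the same way every time, whether triggered by hand or by a scheduler. The two checks here (no blank values, unique IDs) are illustrative placeholders.

```python
# Each check is a named function over the rows, so the same suite can be
# re-run on every refresh and new checks slot in without changing the runner.
def no_blanks(rows):
    return all(all(str(v).strip() for v in row.values()) for row in rows)

def unique_ids(rows):
    ids = [row["id"] for row in rows]
    return len(ids) == len(set(ids))

CHECKS = {"no_blanks": no_blanks, "unique_ids": unique_ids}

def run_validation(rows):
    """Run every registered check and report pass/fail by name."""
    return {name: check(rows) for name, check in CHECKS.items()}

rows = [{"id": "1", "name": "Main St"}, {"id": "1", "name": " "}]
run_validation(rows)  # → {"no_blanks": False, "unique_ids": False}
```

The named report answers two of the questions above directly: it documents when the data was validated and exactly which rule it failed, which is the evidence needed to look upstream and improve the source.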
Validating data against these criteria can be done in a number of ways, including database tools, Microsoft Excel, FME, custom scripts, a GIS platform, or any other preferred tool. To ensure clean, standard data is created in the future, work with data owners to keep garbage out of the source data so that garbage never comes out. These controls may include:
- Multiple choice entries instead of free text
- Data types are required for data entry (e.g. only numbers can be input into a cost field)
- Numeric/date validation requiring values to fall within valid ranges
- Fields requiring a specified level of precision
- Running statistical checks at regular intervals to look for outliers
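Several of these entry controls can be sketched as a single gatekeeper on the cost field. The field name, the 0–1,000,000 range, and the two-decimal precision are illustrative assumptions, not prescribed values.

```python
def validate_cost_entry(raw: str) -> float:
    """Accept only numeric cost values within a valid range,
    rounded to the required precision, before they enter the source data."""
    try:
        cost = float(raw)
    except ValueError:
        # Data-type control: only numbers can be input into a cost field.
        raise ValueError(f"cost must be numeric, got {raw!r}")
    if not 0 <= cost <= 1_000_000:
        # Range control: values must be within a valid range.
        raise ValueError(f"cost {cost} outside valid range 0-1,000,000")
    # Precision control: store to the specified number of decimal places.
    return round(cost, 2)

validate_cost_entry("19.999")  # → 20.0
```

Rejecting bad entries at the form or API boundary is cheaper than cleansing them later: the garbage never reaches the source data in the first place.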
Data Readiness Checklist
The concept of data readiness overlaps the other concepts of a good data flow; it is an overarching element that should be examined at both the macro and micro scales. It encompasses all of the steps not just to get data from source to end user, but to make the data appropriate for the end user.
Has the data flow been mapped?
- Is your source system accessible?
- What (if any) transformation needs to happen from the source?
- What approvals or validation need to occur before publishing?
- Once the data is approved or validated, what happens next?
Is your data accessible?
- Does the data need to be decoded for the layperson?
- Do column names make sense?
- Do more silos need to be broken down to create robust data?
- Are the correct data types used to present the data and allow easy visualization and analysis?
What happens to data after it is published for the first time?
- Does it need to be automated?
- How often should the data be validated/updated?
- Who will steward the published data and the metadata?
These questions don’t all need to be answered at once, or have the best possible answer right away. The processes that govern the responses will take time and iteration to improve data readiness. Among the things an organization can do is institute uniform methods throughout its program, from source to end user, including:
- Implementing standard column names across the organization
  - e.g., Address vs. Full Address vs. ADDRESS
- Use of shared data inputs to avoid duplication of effort or compounding errors
  - e.g., addresses are pulled from a validated database so syntax is consistent and invalid addresses are not used
- Shared data coding/decoding practices to instill consistency
  - e.g., address locations are obfuscated in the same way, phone numbers are stored and displayed the same way, and common systems use the same data dictionaries
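A standard column-name practice can be enforced mechanically by mapping known variants onto one canonical name. The mapping below is illustrative; an organization would maintain its own as part of its data dictionary.

```python
# Hypothetical organization-wide mapping of column-name variants
# (Address vs Full Address vs ADDRESS) onto one canonical form.
CANONICAL_COLUMNS = {
    "address": "Address",
    "full address": "Address",
    "addr": "Address",
    "phone": "Phone",
    "phone number": "Phone",
}

def standardize_columns(columns):
    """Rename variant column names to the organization-wide standard,
    leaving unrecognized names unchanged for steward review."""
    return [CANONICAL_COLUMNS.get(c.strip().lower(), c) for c in columns]

standardize_columns(["ADDRESS", "Phone Number", "Cost"])
# → ["Address", "Phone", "Cost"]
```

Because unmapped names pass through untouched, the function never silently invents a standard; new variants surface for a steward to add to the shared mapping.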
Data Completeness Checklist
The more complete the data, the more questions it will be able to answer. Ensuring the content, lifespan, and accuracy an audience needs will give users an enhanced picture of the topic at hand. You should ask yourself:
Does the data meet user needs?
- Will the data answer users’ questions?
- Does the data cover a continuous time series or have full spatial distribution?
- Is the data’s granularity meaningful?
- Will it be useful into the future?
Are there gaps in the data?
- Are there missing events that users should know about?
- Are placeholders used correctly (N/A vs. 0 vs. blank)?
- Does metadata provide users with context for data accuracy and precision?
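Gaps in a time series can be detected programmatically. This sketch assumes the dataset should be a continuous daily series; the same idea applies at other granularities (hourly, monthly) by changing the step.

```python
from datetime import date, timedelta

def missing_days(dates):
    """Return the days absent between the earliest and latest observed day
    of what should be a continuous daily time series."""
    observed = set(dates)
    day, gaps = min(dates), []
    while day <= max(dates):
        if day not in observed:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

missing_days([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)])
# → [date(2024, 1, 3)]
```

A list of missing days is exactly the kind of context metadata should carry: it tells users whether an absence means "no events occurred" or "no data was collected."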
Is the data well structured?
- Is there a schema that allows the data to be benchmarked against other organizations?
- Are correct data types applied?
- Can the records be cleaner?
The Data Quality Standards pillar looks at the state of data from source to end user. Though good data quality starts at the source, there are various stages at which data can be modified before it reaches the end user. Enterprise-level interventions pay greater dividends the closer to the data source they are implemented. When they are not practical or scalable, transformations can happen at an appropriate place within the process.
When possible, it’s reasonable to duplicate what other organizations have done, especially when making changes at the department or even enterprise level. Industry leaders often share best practices on a given subject and have plenty of lessons learned to offer. When focusing on particular datasets, it’s helpful to consult industry subject matter experts to discover whether a data quality standard already exists for that dataset. The Open Data Standards Directory has compiled many government-related data schemas that can help an organization gain more insight into its data and allow for data comparison across multiple organizations.