A publishing process creates the transparency and consistent collaboration needed to support a successful data platform. The goal is up-to-date data sharing backed by a solid workflow that includes automation, a strong architecture, and good documentation. Data publishing should not be seen as just the "upload and automate" steps, but as an end-to-end process that begins with identifying candidate datasets to review for publication.
For the purposes of this document, we will break this topic into two high-level buckets:
- Data Review: the initial proposal, metadata identification, and internal reviews
- Data Publishing: the more technical Extraction, Transformation, and Load process
Before beginning to think about how to upload and automate an organization’s data, it is critical that the organization develop a structured, documented process for departments to provide all possible information, including a sample of the data, to the DCoE for review. This ensures that by the time the data is presented to the technical team that will be doing the ingress, not only are all of the technical questions answered, but there is greater internal awareness of the data and alignment of the data with the organization’s goals.
Developing a Data Submission Template
Before publishing data to a shared server or location, there is the need to create a Data Submission Template. This template will be used by dataset owners to provide an organization with four main types of information about the data, or metadata: Internal Review Metadata, Public-facing Metadata, Internal-facing Metadata, and a sample of the data.
Metadata for Internal Review Purposes
Think of this as the abstract or brief of the dataset for the owners to present to the DCoE:
- What is this dataset?
- What will it show?
- Who is the target audience?
- How does it align with your organization’s goals?
Public-Facing Metadata for Publishing
This is information about the dataset that will help end users understand the data and use it most effectively. How often will the dataset be updated? What does each row of data signify? Who should users contact with questions? What does each column name mean? Collecting this information up front from the data owners will help them think through their data, and will reduce confusion and misinterpretation of the data by end users.
Internal-Facing Metadata for Publishing
Similar to the public-facing information, collect metadata that would be useful for internal users, but may be either unnecessary or too sensitive for the public to see.
A Sample of the Data
Note the key word “sample” - don’t make data owners pull a full dataset right off the bat. The goal here is to identify what the general structure of the dataset will look like, including the number of columns, column names, and the formatting of each column. It should also include a small selection of data (think 50-100 rows).
In addition to the structure, pulling a small sample will help identify challenges that may occur - Personally Identifiable Information (PII), incomplete columns, and improperly formatted data can be spotted from the onset, which will help the data team be able to plan and adjust the dataset. Once the dataset is approved for publishing, then the team can focus on pulling the entire dataset.
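The sample review described above can be sketched as a short script. This is a minimal sketch, assuming the sample arrives as a CSV file; the PII patterns below are illustrative and should be tuned to an organization’s own data.

```python
import csv
import re
from collections import Counter

# Crude patterns for spotting likely PII in a sample; illustrative only.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def profile_sample(path):
    """Summarize a small CSV sample: structure, blanks, and possible PII."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        blanks = Counter()       # columns with missing values
        pii_hits = Counter()     # column:pattern pairs that matched
        rows = 0
        for row in reader:
            rows += 1
            for col, value in row.items():
                if not (value or "").strip():
                    blanks[col] += 1
                else:
                    for label, pattern in PII_PATTERNS.items():
                        if pattern.search(value):
                            pii_hits[f"{col}:{label}"] += 1
    return {"rows": rows, "columns": columns,
            "incomplete_columns": dict(blanks),
            "possible_pii": dict(pii_hits)}
```

A report like this gives the DCoE the column names, formats, and red flags it needs before anyone commits to pulling the full dataset.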
Tips for Designing a Data Submission Template
The main thing to consider when designing the template is to make it as easy as possible for data owners to fill out - if it’s too complex, data owners may not want to go through the trouble. There is then, understandably, a balance between collecting enough information, and collecting too much information.
One way to work through this is not to simply prompt the user for the values of a given field, but to ask more naturally phrased questions, so the template feels less like a form and more like a survey. For example, instead of prompts for “Contact Name” and “Update Frequency”, ask “How often will this data be updated?” and “What position will be in charge of updating it?”. This is less intimidating to users, and can help the data owners think through their data processes.
Also, make sure to refer to positions rather than people where possible. This way, continuity of ownership is easier if someone who is in charge of a dataset leaves the organization.
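One way to structure such a survey-style template is a simple mapping from metadata fields to naturally phrased questions. This is a sketch: the field names and questions are illustrative, not a prescribed schema, and note that the contact question refers to a position rather than a person.

```python
# Survey-style prompts keyed to the metadata fields they populate.
# Field names and questions are illustrative; adapt them to your template.
SUBMISSION_QUESTIONS = {
    "update_frequency": "How often will this data be updated?",
    "contact_position": "What position will be in charge of updating it?",
    "row_meaning": "What does each row of data signify?",
    "audience": "Who is the target audience for this dataset?",
    "goal_alignment": "How does it align with your organization's goals?",
}

def collect_submission(answers):
    """Pair each field with its question and the owner's answer."""
    return {
        field: {"question": question, "answer": answers.get(field, "").strip()}
        for field, question in SUBMISSION_QUESTIONS.items()
    }

def missing_fields(submission):
    """Flag unanswered fields for the initial reviewer to chase down."""
    return [f for f, entry in submission.items() if not entry["answer"]]
```

Keeping the field-to-question mapping in one place also makes it easy for the DCoE to iterate on the wording between dry runs.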
Submitting and reviewing the template
Once the template is complete, data owners should fill it out for any dataset they submit for consideration. A designated contact should review each submission to catch immediate gaps (missing fields, incorrect formats, etc.) before it goes to the larger group.
After this initial review, the submission should be reviewed by the larger DCoE. This review should give departments the opportunity to identify potential inter-departmental use cases for the data, and should end with the DCoE either approving the data or asking the data owners to add (or remove) information.
This process will be iterative. At first, data owners will likely need a few tries to figure out how the process works and what information is critical. To get ahead of this, prior to the first “official” approval process, go through a number of “dry runs” with datasets deemed exemplary or necessary for inclusion. This gives submitters a positive example to reference, and lets the DCoE discover gaps in the process before it is opened to the rest of the organization.
Once the DCoE has established what a dataset will look like, including the metadata, it can approach the data custodians about publishing. The data custodians are the ones with the keys to the data: they can unlock it from its source system and help with publication. Successful data deployment to users depends on publishing data flows consistently and in an automated fashion. Publishing encompasses the entire flow from a source system to an enterprise data platform. Three specific considerations to focus on when creating a data publishing process are:
- Data Extraction
- Data Transformation
- Data Loading (publishing)
This process is also referred to as Extract, Transform, Load, or ETL. This is the process by which data is Extracted from a source system in a raw format, Transformed into something more meaningful to users, and Loaded into a data portal or an asset that powers an application. Though one may not conduct all of these steps, this is the likely procedure one will take to publish data.
There are many ways to ETL, and the right approach depends on the source data, the necessary transformations, and the resources available. When deciding on the best ETL tool(s) for an organization, it’s important to consider the existing workflow from source to end user. With that information, it is recommended to lead a discussion on how best to ingress data to the enterprise platform in the future. This discussion should uncover:
- Current source systems
- Source update frequencies
- Currently available ETL tools
- Systems already on IT’s, analysts’ or others’ wish lists
- Expertise and personnel available to operate these tools
This discussion may naturally lead to an in-depth understanding of the source systems that house legacy data, or expose knowledge gaps around them; both should be documented. Source systems vary by entity, and within each entity every department may have multiple source systems. If changing source systems is not being discussed along with the ETL process, the DCoE should think about how to leverage current source systems using the following:
- Custom scripts (Python, R, etc.)
- ETL tools (FME, Pentaho Kettle, etc)
- DataSync (for Socrata only)
- Operating System specific automation tools
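As a sketch of the custom-script option, the skeleton below is written to run unattended under cron or Windows Task Scheduler: it logs each run and exits nonzero on failure so the scheduler can flag it. The file paths and the body of `run_pipeline` are placeholders for an organization’s real extract/transform/load steps.

```python
import logging
import sys
from pathlib import Path

# Illustrative locations; point these at your real source export and output.
SOURCE_EXPORT = Path("exports/source_dump.csv")
STAGED_OUTPUT = Path("staging/ready_to_publish.csv")

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run_pipeline():
    """Placeholder for the real extract/transform/load steps."""
    if not SOURCE_EXPORT.exists():
        raise FileNotFoundError(f"source export missing: {SOURCE_EXPORT}")
    STAGED_OUTPUT.parent.mkdir(parents=True, exist_ok=True)
    STAGED_OUTPUT.write_text(SOURCE_EXPORT.read_text())
    logging.info("staged %s for publishing", STAGED_OUTPUT)

def main():
    try:
        run_pipeline()
    except Exception:
        # A nonzero exit code lets cron/Task Scheduler flag the failed run.
        logging.exception("publishing run failed")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The logging and exit-code conventions matter more than the pipeline body: they are what let a scheduled job fail loudly instead of silently skipping an update.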
After learning about the tools available and the resources required to operate them, it is possible to develop the organization’s customized publishing workflow based on user needs, including:
- Volume of data to be published
  - Consider the number of assets to be published/updated
  - Consider the file size and whether it is unwieldy for the end user
- Level of transformation between the source and your data platform
  - Hiding PII
  - Deriving useful data from the source (e.g., deriving day of week from a date field as an extra column)
  - Aggregation where needed
- Required metadata updates for the assets
  - Not all assets may need the same metadata schema
  - All levels of appropriate metadata are maintained
- Publication cadence for each asset
  - Should not be more frequent than the source data is updated
- New views/filters/visualizations to be created following publication
  - Ensure data types are chosen properly
  - Data is validated for expected outputs
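The transformation items above (hiding PII, deriving a day-of-week column, and enforcing expected types) can be illustrated on a single record. A minimal sketch; the field names are hypothetical:

```python
from datetime import datetime

def prepare_row(row):
    """Apply the transformations listed above to one record:
    drop a PII field, derive a day-of-week column, and validate a type.
    Field names are illustrative, not a real schema."""
    out = dict(row)
    out.pop("reporting_party", None)                       # hide PII
    incident = datetime.strptime(out["incident_date"], "%Y-%m-%d")
    out["day_of_week"] = incident.strftime("%A")           # derived column
    out["block_number"] = int(out["block_number"])         # enforce type
    return out
```

Because `strptime` and `int` raise on malformed values, running this over the full dataset doubles as the "validated for expected outputs" check: bad rows fail loudly instead of publishing silently.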
This process is living and breathing; it must be maintained and reviewed based on resource availability and user needs. During future procurement, the process should be considered, as some systems have extra costs for custom data extracts, API usage, etc. As new assets are created and existing ones grow, the process must be reconsidered if it is not sufficiently scalable.
Automation brings benefits to data publishing by reducing person-hours, removing human error, accomplishing tasks outside of working hours, and more. However, not all data publishing processes should be automated. Setting up a successful automation process takes resources, and it should be considered whether these costs outweigh the benefits. Consider the following questions when thinking about automating a publishing process:
- What is the level of effort to automate the process?
- What does manual processing cost?
- Are there resources to complete the task manually?
- How often does the task need to be carried out?
- How long will it take to test and debug the automated process?
- What is the life expectancy of the automation process?
- Is there a large chance of human error that can be eliminated by automation?
- Does this task need to be done manually because a machine cannot successfully do it without human intervention (validation, requires SME, etc)?
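Several of these questions reduce to a rough break-even calculation: hours saved per year versus the up-front build effort. A small sketch, with all of the numbers supplied by the reader; it is a sizing aid, not a substitute for judgment about error risk or validation needs.

```python
def automation_break_even(build_hours, manual_hours_per_run,
                          automated_hours_per_run, runs_per_year):
    """Return (hours saved per year, years until the build effort pays
    for itself). Returns None for the payback period when automation
    never wins on hours alone."""
    saved_per_year = (manual_hours_per_run - automated_hours_per_run) * runs_per_year
    if saved_per_year <= 0:
        return saved_per_year, None
    return saved_per_year, build_hours / saved_per_year
```

For example, a weekly task that takes 2 hours manually and 15 minutes automated saves 91 hours a year, so a 40-hour build pays for itself in under six months; a quarterly task with the same numbers would take years.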
The data pipeline and publishing process may be housed centrally and flow through a single coordinated team responsible for the cradle-to-grave process, or it may involve a number of disparate teams all contributing to multiple repositories. Anywhere along this spectrum, the data pipeline should not be tied to IT hardware that is exclusive to a single person, like a laptop or desktop computer.
Working with IT, the pipeline should live on a networked server so that automated processes have sufficient:
- Hardware resources
- Scheduling capabilities
- Network permissions
- Software licenses
- Reliable IT support
Diagramming a publishing process is an exercise to see the who/what/where/why/how revolving around the data pipeline. It creates the transparency and consistent collaboration needed for a successful process. It tells a visual story of how data is taken and directed to an audience, and it brings other parties to the table to identify missing links that may have been forgotten. This diagram should include:
- Source system(s)
- Mechanism to deliver the raw data to an analyst/SME
- Processes carried out by an analyst/SME including transformation or validation
- Approvals required prior to publication
- Method to update the data asset
- Metadata updates
- Required resources (personnel and software) needed to complete the above tasks
How To Document
Documenting the publishing process is essential to the longevity of a data process; if the methodology is lost, the outputs used by users will soon follow. Documenting the data publishing process creates a blueprint for others to follow when adding assets to the data portfolio, onboarding new staff, and finding areas for improvement. Quality documentation includes the following topics and the “why” for each of them:
- Updated list of personnel and their responsibilities
- Documented source systems
- Required transformations
- Destination data assets
- Automation requirement
This is a real example of how a team of crime analysts within a city police department reviewed their publishing process from start to finish and revised it to be more streamlined, allowing them to actually analyze data instead of just reporting it. Source data for the police department comes from multiple sources, is often free text, and is unchecked before entering the data flow. From there, analysts had limited queries to extract the data, and even then the data had to be thoroughly cleansed and manual outputs created for use in a COMPSTAT report, which was delivered weekly to the police chief.
Producing this COMPSTAT took up most of the analysts’ work week, leaving them little time to analyze data. The limited queries they had access to did not provide them with adequate data, and there was no support from management to change the process that governed the poor data quality and lack of access. Though the analysts could reinsert cleansed data into the source system, they could not control how it entered the system the first time.
After gathering the analysts, their manager, the open data executive sponsor, and the police chief in the same room, it became clear that this publishing system was ineffective and did not let the analysts do high-quality work. The outcome was to give the crime analysts more input on the controls that regulate the incoming data pipeline and to move the data into a data portal that gave them unlimited access. With that data, they could create an automated dashboard to function as their COMPSTAT and have time to analyze data and create actionable intelligence that leads to police activity.
In the future, this data-driven police activity can be monitored through performance metrics, and analysts can target specific crimes and areas that need attention. By removing the barriers between the crime analysts and the source data pipeline, they can iterate on this process as their needs and the department’s needs change.