Upgrading the open data infrastructure

It’s been nice to see some recent online discussion around the power of blockchain technologies and how they might play a role in the open data movement. Replicating Bitcoin may not be the right path, but some of the benefits of blockchain, peer-to-peer technologies, smart contracts, and MP3s could be applied to enrich open data technologies, improving measurement, authenticity, distribution, metadata, and more.

A more complex approach can provide more value

Investments in open data — particularly by governments — are leaps of faith supported by anecdotal stories of value and a desire to be innovative. Measuring impact remains a major challenge, and requires even more investment so that reliable best practices can be developed. Understanding the impact of open data means obtaining insights into where the data goes, how it is used, and the changes that result from its use. New approaches to data sharing, such as smart contracts with opt-out phone-home reporting, can help embed measurement into the very fabric of the open data movement.

Trusting open data is relatively easy at the moment; when we get it from a trusted source (data portal, API, etc.), we assume that the information is authentic. When a friend or colleague passes along a data file, we usually trust that they haven’t manipulated it, but the only way to really verify that is to compare it with the official source. Similarly, when data has been intentionally manipulated by, for example, removing rows that aren’t relevant for a specific analysis, it’s usually not possible to reconstruct what was done without comparing it to an unmodified copy. To make matters more complicated, original sources are periodically updated (sometimes every 5 minutes!), so the version a copy was taken from may no longer be available to compare against. This is a challenge that a variation of blockchain can help solve.
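
To make that a little more concrete, here is a minimal sketch of one possible "variation of blockchain": a hash-chained ledger of published versions. It is only an illustration under my own assumptions, not a description of any existing portal or product. The function names (`append_version`, `ledger_is_intact`, `find_version`) and the entry layout are hypothetical, and a real system would also need the publisher to sign entries.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry (assumption)


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def entry_digest(body: dict) -> str:
    # Hash a canonical JSON form so an entry cannot be altered unnoticed.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))


def append_version(ledger: list, content: bytes, note: str) -> dict:
    """Add an entry that commits to the file content and to the previous entry."""
    body = {
        "version": len(ledger) + 1,
        "content_hash": sha256_hex(content),
        "prev_hash": ledger[-1]["entry_hash"] if ledger else GENESIS,
        "note": note,
    }
    entry = {**body, "entry_hash": entry_digest(body)}
    ledger.append(entry)
    return entry


def ledger_is_intact(ledger: list) -> bool:
    """Check that every entry links to its predecessor and has not been edited."""
    prev_hash = GENESIS
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev_hash or entry_digest(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True


def find_version(ledger: list, content: bytes):
    """Return the published version a copy matches, or None if it matches none."""
    digest = sha256_hex(content)
    for entry in ledger:
        if entry["content_hash"] == digest:
            return entry["version"]
    return None
```

With something like this, a file passed along by a colleague could be checked against the ledger even after the source has moved on to a newer version: `find_version` returns the version the copy corresponds to, or `None` if the copy was modified.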

Solving the impact and trust challenges allows the way open data is hosted and distributed to change. Under the current approach, data providers need to operate their own web sites (portals) to provide some degree of authenticity, but with a new approach it wouldn’t be necessary. Operating models, such as regional and academic partnerships, library hosting, informal provisioning, conglomeration, or even distributed networks would become easier to set up and sustain. We can leverage the benefits of tools like Gnutella and BitTorrent to make open data flow to anywhere on the planet where it will be used.

If we’re going to re-engineer how data is distributed, authenticated, and metered, we shouldn’t forget about metadata. GIS professionals in particular have been sharing data for decades, and metadata is required just to interpret the data correctly. But CSVs are a lot like raw audio files: the best to be hoped for is a meaningful filename and helpful column headers (or JSON object names). We should be taking inspiration from modern digital music formats, which carry their metadata inside the file, or better yet, Data Packages.
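
As a rough illustration of what travelling metadata could look like, the sketch below builds a small descriptor for a CSV, loosely modeled on the Data Package idea of a JSON file that sits next to the data. The helper names (`infer_type`, `describe_csv`) and the naive type guessing are my own assumptions, and the exact descriptor layout should be checked against the Data Package specification rather than taken from here.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its values: integer, number, or string."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "integer"
    if values and all_parse(float):
        return "number"
    return "string"


def describe_csv(csv_text, name, path):
    """Build a descriptor dict for a CSV with a header row.

    Assumes every data row has as many cells as the header.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    fields = [
        {"name": column, "type": infer_type([row[i] for row in data])}
        for i, column in enumerate(header)
    ]
    return {
        "name": name,
        "resources": [
            {"name": name, "path": path, "schema": {"fields": fields}}
        ],
    }
```

Serializing the result with `json.dumps` and shipping it alongside the CSV gives a consumer the column names and types without having to guess them from the file itself.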

A more complex approach will reduce open data use

The biggest obstacle to getting all these benefits is leaving behind technologies which are much easier to use. Temporarily setting aside the valid concerns about the data divide, why would data consumers and suppliers move away from plain-text access and basic web access mechanisms to something more complex? Here are a few thoughts:

  1. Portability, authenticity, provenance, and metadata aren’t challenges for the open data community alone. If we did take them on, we’d be changing the entire data industry and ultimately creating a fundamental architecture for the Internet of the Future™. The current growth of distributed digital ledgers in the financial sector is an example of this wave of change on the horizon.
  2. Providers and consumers of larger volumes of data benefit from transaction streams rather than repeated requests for the full content. Once the hurdle of getting a baseline copy is passed, only changes to the data need to be applied and verified.
  3. The benefits of making open data more technically complex might be so valuable that suppliers and consumers are willing to go the extra mile until the new technologies become mainstream. Specialized software was needed to listen to MP3s when they rose to popularity, but practically every operating system has built-in support for them now.
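
The second point, applying and verifying only the changes once a baseline copy exists, can be sketched in a few lines. This is a toy model under my own assumptions: rows are held in a dict keyed by a string ID, the change format (`upsert` and `delete` operations) is hypothetical, and the publisher is assumed to announce a hash of the expected result alongside each batch of changes.

```python
import hashlib
import json


def state_hash(rows: dict) -> str:
    """Hash a canonical JSON form of the dataset so two copies can be compared."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def apply_changes(rows: dict, changes: list) -> dict:
    """Return a new dataset with the changes applied; the input is left untouched."""
    updated = dict(rows)
    for change in changes:
        op, key = change["op"], change["key"]
        if op == "upsert":
            updated[key] = change["row"]
        elif op == "delete":
            updated.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return updated


def sync(rows: dict, changes: list, expected_hash: str) -> dict:
    """Apply a batch of changes and accept it only if the result matches the publisher's hash."""
    updated = apply_changes(rows, changes)
    if state_hash(updated) != expected_hash:
        raise ValueError("local copy does not match the published state")
    return updated
```

A consumer holding the baseline would call `sync(baseline, changes, expected_hash)` for each batch. A mismatch means a change was missed or altered in transit, which is the signal to fall back to fetching a fresh full copy.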

What do you think? Do the benefits outweigh the disadvantages? Are there other ways to lower the barriers which come with increased technical complexity? Are there other models we can incorporate to ensure open data provides the greatest value possible to the broadest audience?