Data influences our lives in a myriad of ways every day. We use it to choose restaurants, get us to destinations faster, combat gun violence, optimize our food and water supplies, and predict the likelihood of reincarceration. Electronic information is increasingly a north star for personal decisions, companies, and policymakers whose decisions have long-lasting effects on every member of our society. But as we continue to place our lives firmly in the hands of the technologies we create, it is important not just to ensure the algorithms work well, but also that the data which feeds them is good enough to keep them reliable. Even with that oh-so-rare perfectly working algorithm, poor data quality means poor decisions.
Where did it all go wrong?
It’s not really anyone’s fault, but there are a few common causes of poor data quality.
Spreadsheets, spreadsheets, spreadsheets. The moment you put “soon” or “next month” into a column you intended to use for due dates, you reduce its quality and make the data far less useful, especially if that same column is used to generate graphs, drive formulas, or support other analyses. A further challenge with spreadsheets is that they are commonly used to produce reports designed for the screen or printed page. Adding rows for subtotals and totals in the middle of your data makes it much harder to use for other purposes. Nearly all spreadsheet tools are designed to be powerful and flexible, but that very flexibility makes it easy to produce very poor quality data.
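For instance, a quick validation pass can catch free-text entries hiding in a date column before they break downstream formulas or charts. Here is a minimal sketch; the column values and the `YYYY-MM-DD` format are assumptions:

```python
from datetime import datetime

def find_non_dates(values, fmt="%Y-%m-%d"):
    """Return (row_number, value) pairs that fail to parse as dates."""
    bad = []
    for row, value in enumerate(values, start=1):
        try:
            datetime.strptime(str(value), fmt)
        except ValueError:
            bad.append((row, value))
    return bad

# A due-date column contaminated with free text (hypothetical data).
due_dates = ["2024-05-01", "soon", "2024-06-15", "next month"]
print(find_non_dates(due_dates))  # [(2, 'soon'), (4, 'next month')]
```

Running a check like this before anyone analyzes the sheet turns a silent quality problem into a visible, fixable one.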
Multiple “master” lists of the same thing. Urban planning teams, tax collectors, and building inspectors all use real estate data (addresses, parcels, etc.) to do their jobs. However, it’s quite common for each department to keep copies of that data in separate technology silos. Complicating matters even more, each silo has its own lingo for describing things, and those differences are reflected in their information systems. For example, an urban planner might classify a 16-unit property as multi-unit residential; a tax collector might have 16 records, one for each of the condominiums; and the building inspector might track that there are 4 apartments on each of 4 floors.
Our needs change over time, but our data doesn’t. During my years managing an IT shop, we set up a simple, open-source tool to track help desk requests. We created a list of common types of work and used those to categorize our calls. It’s no secret that IT changes quickly, so it’s not surprising that the types of work we chose didn’t meet our needs for very long. We split some types, merged others together, and of course, created completely new ones. When we did this, however, it wasn’t practical to go back into our older records and revise them to fit with the newer types. Since we used these request types to monitor trends, understanding our organizational needs over longer periods of time became very challenging.
What can we do to improve?
The most important step you can take is to start treating your data as a strategic asset. Just like a flat tire stops an ambulance from responding to an emergency, poor quality data affects people’s lives. Although there may be people in your organization dedicated to supporting data systems, keeping data quality high takes collaboration from everyone.
If you are entering information into a spreadsheet, ask yourself if someone else could work with that spreadsheet if you weren’t there to explain it. Microsoft Excel is one of the most commonly used tools to collect, manage, and analyze data, but even Microsoft recommends using different software if you need more accurate information. If you must keep data in a spreadsheet, include a data dictionary and follow these practices.
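To make that concrete, a data dictionary can be as simple as a structured description of each column and its allowed values, kept alongside the spreadsheet so anyone can interpret it without the original author present. The columns and codes below are invented for illustration:

```python
# A minimal data dictionary for a request-tracking sheet (hypothetical columns).
DATA_DICTIONARY = {
    "due_date": {
        "type": "date (YYYY-MM-DD)",
        "description": "Date the request must be completed; never free text.",
    },
    "status": {
        "type": "code",
        "description": "One of: open, in_progress, closed.",
        "allowed_values": ["open", "in_progress", "closed"],
    },
}

def check_value(column, value):
    """Flag values that violate the dictionary's allowed-values rule."""
    allowed = DATA_DICTIONARY[column].get("allowed_values")
    return allowed is None or value in allowed

print(check_value("status", "pending"))  # False: not a documented code
```

Even this small amount of documentation answers the “could someone else work with this?” question, and it gives you something to validate against.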
Resolving the challenge of multiple master lists often requires the coordination and collaboration of the various teams who work with separate copies of the same data. The goal of this work should be to ensure that the isolated data sets can be sustainably reconciled, and if possible, merged together. Doing so allows all the other data related to those master lists to be connected as well. IT staff are valuable to this process, as they usually understand the strengths and weaknesses of the involved software or systems.
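As an illustration, if the teams can agree on one shared key, such as a parcel ID, reconciliation can start by folding each silo’s record into a single master record and then cross-checking the silos against each other. This is only a sketch; the silo contents and field names are invented:

```python
# Three departmental silos describing the same parcel (hypothetical records).
planning = {"P-1001": {"use": "multi-unit residential"}}
tax = {"P-1001": {"units_billed": 16}}
inspections = {"P-1001": {"floors": 4, "units_per_floor": 4}}

def merge_silos(*silos):
    """Fold records that share a parcel ID into one master record per parcel."""
    master = {}
    for silo in silos:
        for parcel_id, record in silo.items():
            master.setdefault(parcel_id, {}).update(record)
    return master

master = merge_silos(planning, tax, inspections)
record = master["P-1001"]
# A cross-silo consistency check: billed units should match inspected units.
assert record["units_billed"] == record["floors"] * record["units_per_floor"]
```

The hard part in practice is agreeing on the shared key and the field definitions, which is why this is as much an organizational effort as a technical one.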
Finally, although it’s difficult to predict how you will need to use your data in future years, documenting your choices for categories, types, or codes is critical from the beginning. Even if you don’t go back and revise your older data to conform to modern needs (and in many cases that may not be acceptable anyway), at least there will be materials to rely on when it comes to developing reliable analyses and insights.
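One way to keep older records usable without rewriting them is to document a mapping from retired categories to current ones and apply it at analysis time. A minimal sketch, with invented help-desk categories:

```python
# Retired request types mapped to the current taxonomy (hypothetical categories).
CATEGORY_MAP = {
    "email": "communications",
    "phones": "communications",
    "desktop": "hardware",
}

def count_by_current_type(requests, mapping=CATEGORY_MAP):
    """Tally requests using current categories; older records stay untouched."""
    counts = {}
    for request in requests:
        current = mapping.get(request["type"], request["type"])
        counts[current] = counts.get(current, 0) + 1
    return counts

history = [{"type": "email"}, {"type": "phones"}, {"type": "network"}]
print(count_by_current_type(history))  # {'communications': 2, 'network': 1}
```

Keeping the mapping as data, rather than buried in someone’s memory, is exactly the kind of documentation that makes long-term trend analysis possible.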
If you think data quality affects your organization’s ability to provide critical services, start by listening to our podcast episode and office hours with Stephanie Singer for more information. Also check out the new Data Quality Guide for Governments, and related toolkits which Stephanie has assembled.